84276 Commits

Author SHA1 Message Date
Minchan Kim
48b4800a1c zsmalloc: page migration support
This patch introduces run-time migration feature for zspage.

For migration, the VM uses the page.lru field, so zsmalloc should not
use the page.next field, which is unified with page.lru, for its own
purpose.  To that end, we first compute the first object offset of the
page at runtime instead of storing it in page.index, so page.index can
serve as the link for page chaining instead of page.next.

A huge object stores its handle in page.index instead of the next link
for page chaining, because a huge object doesn't need a next link.  So
get_next_page needs to identify a huge object and return NULL.  For
that, this patch uses the PG_owner_priv_1 page flag.

For migration, it supports three functions

* zs_page_isolate

It isolates the zspage containing the subpage the VM wants to migrate
from its class, so nobody can allocate a new object from that zspage.

The VM may try to isolate a zspage once per subpage, so subsequent
isolation attempts on other subpages of the same zspage shouldn't fail.
For that, we introduce the zspage.isolated count.  With it,
zs_page_isolate can tell whether the zspage is already isolated for
migration; if it is, a subsequent isolation attempt succeeds without
isolating the zspage again.

* zs_page_migrate

First of all, it holds the write-side zspage->lock to prevent migration
of other subpages in the zspage.  Then it locks all objects in the page
the VM wants to migrate.  All objects in the page must be locked because
of a race between zs_map_object and zs_page_migrate.

  zs_map_object				zs_page_migrate

  pin_tag(handle)
  obj = handle_to_obj(handle)
  obj_to_location(obj, &page, &obj_idx);

					write_lock(&zspage->lock)
					if (!trypin_tag(handle))
						goto unpin_object

  zspage = get_zspage(page);
  read_lock(&zspage->lock);

If zs_page_migrate doesn't do trypin_tag, the page zs_map_object looked
up can become stale through migration, and it crashes.

If it locks all of the objects successfully, it copies the content from
the old page to the new one and finally creates a new zspage chain with
the new page.  If that was the last isolated subpage in the zspage, it
puts the zspage back into its class.

* zs_page_putback

It returns the isolated zspage to the right fullness_group list if
migration of a page fails.  If it finds that the zspage is ZS_EMPTY, it
queues the zspage freeing to a workqueue.  See below about async zspage
freeing.

This patch introduces asynchronous zspage freeing.  It is needed because
clearing PG_movable requires page_lock, but the zs_free path must be
atomic, so the approach is to try to grab page_lock.  If it gets the
page_lock of all pages successfully, it can free the zspage immediately.
Otherwise, it queues a free request and frees the zspage via a workqueue
in process context.
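
The trylock-or-defer idea described above can be sketched in userspace.
This is only a model under stated assumptions: struct model_page, struct
model_zspage and every model_* function are hypothetical names, a plain
bool stands in for page_lock, and the "deferred" flag stands in for
queueing work to a workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the zsmalloc data structures. */
struct model_page { bool locked; };

struct model_zspage {
	struct model_page *pages;
	size_t nr_pages;
	bool freed;	/* set when the zspage was freed immediately */
	bool deferred;	/* set when freeing was handed to a worker */
};

static bool model_trylock_page(struct model_page *p)
{
	if (p->locked)
		return false;
	p->locked = true;
	return true;
}

static void model_unlock_page(struct model_page *p)
{
	p->locked = false;
}

/* Take every page lock or none: on failure, drop the locks already taken. */
static bool model_trylock_zspage(struct model_zspage *z)
{
	for (size_t i = 0; i < z->nr_pages; i++) {
		if (!model_trylock_page(&z->pages[i])) {
			while (i-- > 0)
				model_unlock_page(&z->pages[i]);
			return false;
		}
	}
	return true;
}

/* Atomic-context free path: never sleeps waiting for a page lock. */
static void model_free_zspage(struct model_zspage *z)
{
	assert(z != NULL);
	if (!model_trylock_zspage(z)) {
		z->deferred = true;	/* a worker would retry in process context */
		return;
	}
	/* With all page locks held, the movable marking could be cleared here. */
	for (size_t i = 0; i < z->nr_pages; i++)
		model_unlock_page(&z->pages[i]);
	z->freed = true;
}
```

If any single page is already locked, the model leaves the other pages
unlocked and only marks the zspage as deferred, which mirrors the
all-or-nothing behaviour the text describes.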

If zs_free finds the zspage isolated when it tries to free it, it delays
the freeing until zs_page_putback finds the zspage, which then frees it.

In this patch, we expand fullness_list to run from ZS_EMPTY to ZS_FULL.
First of all, it uses the ZS_EMPTY list for delayed freeing.  Adding the
ZS_FULL list also makes it possible to tell whether a zspage is isolated
via a list_empty(&zspage->list) test.

[minchan@kernel.org: zsmalloc: keep first object offset in struct page]
  Link: http://lkml.kernel.org/r/1465788015-23195-1-git-send-email-minchan@kernel.org
[minchan@kernel.org: zsmalloc: zspage sanity check]
  Link: http://lkml.kernel.org/r/20160603010129.GC3304@bbox
Link: http://lkml.kernel.org/r/1464736881-24886-12-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim
b1123ea6d3 mm: balloon: use general non-lru movable page feature
Now, VM has a feature to migrate non-lru movable pages so balloon
doesn't need custom migration hooks in migrate.c and compaction.c.

Instead, this patch implements the page->mapping->a_ops->
{isolate|migrate|putback} functions.

With that, we could remove hooks for ballooning in general migration
functions and make balloon compaction simple.

[akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim
bda807d444 mm: migrate: support non-lru movable page migration
Until now we have allowed migration of LRU pages only, and that was
enough to make high-order pages.  But recently, embedded systems (e.g.,
webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
so we have seen several reports of trouble with small high-order
allocations.  There have been several efforts to fix the problem (e.g.,
enhanced compaction algorithm, SLUB fallback to 0-order pages, reserved
memory, vmalloc and so on), but if there are lots of non-movable pages
in the system, those solutions are void in the long run.

So this patch adds a facility to turn non-movable pages into movable
ones.  For that, it introduces migration-related functions in
address_space_operations as well as some page flags.

If a driver wants to make its own pages movable, it should define three
functions, which are function pointers in struct
address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What the VM expects of the driver's isolate_page function is that it
returns *true* if the driver isolates the page successfully.  On
returning true, the VM marks the page as PG_isolated so that concurrent
isolation on several CPUs skips the page.  If a driver cannot isolate
the page, it should return *false*.

Once a page is successfully isolated, the VM uses the page.lru fields,
so the driver shouldn't expect values in those fields to be preserved.

2. int (*migratepage) (struct address_space *mapping,
		struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, the VM calls the driver's migratepage with the isolated
page.  The job of migratepage is to move the content of the old page to
the new page and set up the fields of struct page newpage.  Keep in mind
that if you migrated the oldpage successfully and return 0, you should
indicate to the VM that the oldpage is no longer movable via
__ClearPageMovable() under page_lock.  If the driver cannot migrate the
page at the moment, it can return -EAGAIN.  On -EAGAIN, the VM retries
the page migration after a short time because it interprets -EAGAIN as a
"temporary migration failure".  On any error other than -EAGAIN, the VM
gives up on the page migration without retrying this time.

The driver shouldn't touch the page.lru field, which the VM is using, in
these functions.

3. void (*putback_page)(struct page *);

If migration fails on an isolated page, the VM should return the
isolated page to the driver, so it calls the driver's putback_page with
the page that failed to migrate.  In this function, the driver should
put the isolated page back into its own data structure.
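
How the three callbacks fit together can be sketched with a small
userspace model.  Everything here is an assumption made for
illustration: struct model_page, struct model_ops, model_migrate_one(),
the demo_* driver and the retry budget MODEL_MAX_RETRIES are
hypothetical, and the callback signatures are simplified compared to the
address_space_operations prototypes quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page { bool isolated; int data; };

/* Simplified stand-in for the three driver callbacks. */
struct model_ops {
	bool (*isolate_page)(struct model_page *page);
	int (*migratepage)(struct model_page *newpage, struct model_page *oldpage);
	void (*putback_page)(struct model_page *page);
};

#define MODEL_MAX_RETRIES 3	/* hypothetical retry budget */

/* Returns 0 on success or a negative errno when the page stays put. */
static int model_migrate_one(const struct model_ops *ops,
			     struct model_page *oldpage,
			     struct model_page *newpage)
{
	int rc = -EAGAIN;

	assert(ops && oldpage && newpage);
	if (!ops->isolate_page(oldpage))
		return -EBUSY;
	oldpage->isolated = true;	/* stands in for PG_isolated */

	/* -EAGAIN means "temporary failure": retry a bounded number of times. */
	for (int i = 0; i < MODEL_MAX_RETRIES && rc == -EAGAIN; i++)
		rc = ops->migratepage(newpage, oldpage);

	if (rc != 0)
		ops->putback_page(oldpage);	/* hand the page back to the driver */
	oldpage->isolated = false;
	return rc;
}

/* Demo driver: reports -EAGAIN demo_busy_rounds times, then copies the data. */
static int demo_busy_rounds;
static int demo_putbacks;

static bool demo_isolate(struct model_page *page)
{
	(void)page;
	return true;
}

static int demo_migrate(struct model_page *newpage, struct model_page *oldpage)
{
	if (demo_busy_rounds > 0) {
		demo_busy_rounds--;
		return -EAGAIN;
	}
	newpage->data = oldpage->data;
	return 0;
}

static void demo_putback(struct model_page *page)
{
	(void)page;
	demo_putbacks++;
}

static const struct model_ops demo_ops = {
	.isolate_page = demo_isolate,
	.migratepage = demo_migrate,
	.putback_page = demo_putback,
};
```

With demo_busy_rounds set to 1 the model succeeds on the second attempt
and never calls putback; with a value larger than the retry budget it
gives up with -EAGAIN and returns the page through demo_putback().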

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

Driver should use the below function to make page movable under
page_lock.

	void __SetPageMovable(struct page *page, struct address_space *mapping)

It needs an address_space argument for registering the family of
migration functions that the VM will call.  Strictly speaking,
PG_movable is not a real flag of struct page.  Rather, the VM reuses the
lower bits of page->mapping to represent it.

	#define PAGE_MAPPING_MOVABLE 0x2
	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so the driver shouldn't access page->mapping directly.  Instead, the
driver should use page_mapping, which masks off the low two bits of
page->mapping, so it gets the right struct address_space.
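
The low-bit tagging described above can be modelled in userspace.  This
is a sketch, not the kernel implementation: the struct layouts and all
model_* names are hypothetical, and only the value 0x2 for
PAGE_MAPPING_MOVABLE and the "low two bits" mask come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_MAPPING_MOVABLE	0x2UL
#define MODEL_MAPPING_LOW_BITS	0x3UL	/* the low two bits, per the text */

/* Hypothetical stand-ins for the kernel structures. */
struct address_space { int id; };
struct page { uintptr_t mapping; };

/* Store the pointer and the movable tag in one word. */
static void model_set_page_movable(struct page *page,
				   struct address_space *mapping)
{
	/* The pointer must be at least 4-byte aligned so the low bits are free. */
	assert(((uintptr_t)mapping & MODEL_MAPPING_LOW_BITS) == 0);
	page->mapping = (uintptr_t)mapping | PAGE_MAPPING_MOVABLE;
}

/* Cheap test: only looks at the tag bit. */
static bool model_page_movable(const struct page *page)
{
	return (page->mapping & PAGE_MAPPING_MOVABLE) != 0;
}

/* Mask off the low two bits to recover the real pointer. */
static struct address_space *model_page_mapping(const struct page *page)
{
	return (struct address_space *)(page->mapping & ~MODEL_MAPPING_LOW_BITS);
}
```

Dereferencing the tagged word directly would yield a misaligned, wrong
pointer, which is why the model only hands out the pointer through
model_page_mapping().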

To test for a non-lru movable page, the VM provides the __PageMovable
function.  However, it doesn't guarantee to identify a non-lru movable
page because the page->mapping field is unified with other variables in
struct page.  Also, if the driver releases the page after isolation by
the VM, page->mapping doesn't have a stable value even though it has
PAGE_MAPPING_MOVABLE (look at __ClearPageMovable).  But __PageMovable is
a cheap way to tell whether a page is LRU or non-lru movable once the
page has been isolated, because LRU pages can never have
PAGE_MAPPING_MOVABLE in page->mapping.  It is also good for just peeking
to test for non-lru movable pages before the more expensive check with
lock_page during pfn scanning to select a victim.

To guarantee a non-lru movable page, the VM provides the PageMovable
function.  Unlike __PageMovable, PageMovable validates page->mapping and
mapping->a_ops->isolate_page under lock_page.  The lock_page prevents
page->mapping from being destroyed suddenly.

A driver using __SetPageMovable should clear the flag via
__ClearPageMovable under page_lock before releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, the VM marks an
isolated page as PG_isolated under lock_page.  So if a CPU encounters a
PG_isolated non-lru movable page, it can skip it.  The driver doesn't
need to manipulate the flag because the VM sets and clears it
automatically.  Keep in mind that if the driver sees a PG_isolated page,
the page has been isolated by the VM, so the driver shouldn't touch the
page.lru field.  PG_isolated is an alias of the PG_reclaim flag, so the
driver shouldn't use that flag for its own purpose.

[opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
  Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Aneesh Kumar K.V
a54f9aebaa include/linux/mmdebug.h: add VM_WARN which maps to WARN()
This enables us to do VM_WARN(condition, "warn message");

Link: http://lkml.kernel.org/r/1464692688-6612-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vladimir Davydov
2a966b77ae mm: oom: add memcg to oom_control
It's a part of oom context just like allocation order and nodemask, so
let's move it to oom_control instead of passing it in the argument list.

Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vladimir Davydov
798fd75695 mm: zap ZONE_OOM_LOCKED
Not used since oom_lock was introduced.

Link: http://lkml.kernel.org/r/1464358093-22663-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Reza Arbab
df429ac039 memory-hotplug: more general validation of zone during online
When memory is onlined, we are only able to rezone from ZONE_MOVABLE to
ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE.

To be more flexible, use the following criteria instead; to online
memory from zone X into zone Y,

* Any zones between X and Y must be unused.
* If X is lower than Y, the onlined memory must lie at the end of X.
* If X is higher than Y, the onlined memory must lie at the start of X.

Add zone_can_shift() to make this determination.
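
The three criteria can be written down as a small userspace check.  This
is a sketch under assumptions: struct model_zone, model_zone_can_shift()
and its parameters are hypothetical, zones are indexed from lowest to
highest, a zone with zero spanned pages counts as unused, and X equal to
Y is treated as trivially allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zone description: a zone with no pages is "unused". */
struct model_zone {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

/*
 * Can the range [pfn, pfn + nr_pages) that currently lies in zone x
 * be onlined into zone y instead?
 */
static bool model_zone_can_shift(const struct model_zone *zones, int x, int y,
				 unsigned long pfn, unsigned long nr_pages)
{
	int lo = x < y ? x : y;
	int hi = x < y ? y : x;

	assert(zones != NULL);

	/* Any zones between X and Y must be unused. */
	for (int i = lo + 1; i < hi; i++)
		if (zones[i].spanned_pages != 0)
			return false;

	/* X lower than Y: the range must lie at the end of X. */
	if (x < y)
		return pfn + nr_pages ==
		       zones[x].start_pfn + zones[x].spanned_pages;

	/* X higher than Y: the range must lie at the start of X. */
	if (x > y)
		return pfn == zones[x].start_pfn;

	return true;	/* same zone, nothing to shift */
}
```

The end-of-zone and start-of-zone conditions keep each zone contiguous:
memory can only move across a zone boundary from the edge that touches
the target side.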

Link: http://lkml.kernel.org/r/1462816419-4479-3-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Alexey Dobriyan
91c6a05f72 mm: faster kmalloc_array(), kcalloc()
When both arguments to kmalloc_array() or kcalloc() are known at compile
time, their product is known at compile time too, but the search for the
kmalloc cache still happens at runtime rather than at compile time.
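
One plausible way to let the compiler resolve the constant case is to
branch on __builtin_constant_p(), a GCC/Clang builtin.  The sketch below
is a userspace model, not the patch itself: model_alloc_array() and the
two model_alloc_*_size() helpers are hypothetical names, and both
helpers simply forward to malloc() where the kernel would pick a
size-specific path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for an allocation whose size the compiler can see. */
static inline void *model_alloc_const_size(size_t bytes)
{
	return malloc(bytes);
}

/* Stand-in for an allocation whose size is only known at runtime. */
static void *model_alloc_runtime_size(size_t bytes)
{
	return malloc(bytes);
}

static inline void *model_alloc_array(size_t n, size_t size)
{
	/* Refuse requests whose byte count would overflow size_t. */
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;

	/* Both factors constant: the product is a compile-time constant too. */
	if (__builtin_constant_p(n) && __builtin_constant_p(size))
		return model_alloc_const_size(n * size);

	return model_alloc_runtime_size(n * size);
}
```

Which branch the compiler keeps depends on inlining and optimization
level, so the model only promises the same result on both paths, not a
particular dispatch.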

Link: http://lkml.kernel.org/r/20160627213454.GA2440@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Thomas Garnier
210e7a43fa mm: SLUB freelist randomization
Implements freelist randomization for the SLUB allocator.  It was
previously implemented for the SLAB allocator.  Both use the same
configuration option (CONFIG_SLAB_FREELIST_RANDOM).

The list is randomized during initialization of a new set of pages.  The
order on different freelist sizes is pre-computed at boot for
performance.  Each kmem_cache has its own randomized freelist.
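
The idea of a pre-computed random order that is later used to link the
free objects can be sketched in userspace.  All names here are
hypothetical, the xorshift generator merely stands in for the kernel's
entropy source, and the modulo step carries a small bias that a real
implementation may want to avoid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tiny PRNG (xorshift32); not the kernel's random source. */
static uint32_t model_rand(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/* Pre-compute a random permutation of 0..count-1 (Fisher-Yates shuffle). */
static void model_shuffle_order(unsigned int *order, size_t count,
				uint32_t seed)
{
	uint32_t state = seed ? seed : 1;	/* xorshift must not start at 0 */

	for (size_t i = 0; i < count; i++)
		order[i] = (unsigned int)i;
	for (size_t i = count; i > 1; i--) {
		size_t j = model_rand(&state) % i;
		unsigned int tmp = order[i - 1];

		order[i - 1] = order[j];
		order[j] = tmp;
	}
}

/*
 * Link the objects of a fresh slab into a singly linked freelist that
 * follows the pre-computed order.  next[i] is the index of the object
 * after object i, or -1 at the end.  Returns the index of the list head.
 */
static int model_build_freelist(const unsigned int *order, size_t count,
				int *next)
{
	assert(order != NULL && next != NULL);
	if (count == 0)
		return -1;
	for (size_t i = 0; i + 1 < count; i++)
		next[order[i]] = (int)order[i + 1];
	next[order[count - 1]] = -1;
	return (int)order[0];
}
```

Because the permutation is computed once and reused, building the
freelist for a new slab stays a single linear pass.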

This security feature reduces the predictability of the kernel SLUB
allocator against heap overflows, rendering attacks much less stable.

For example these attacks exploit the predictability of the heap:
 - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
 - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

Performance results:

slab_test impact is between 3% and 4% on average for 100000 attempts
without smp.  This is a very focused test; kernbench shows the overall
impact on the system is much lower.

Before:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
  100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
  100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
  100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
  100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
  100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
  100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
  100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
  100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
  100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
  2. Kmalloc: alloc/free test
  100000 times kmalloc(8)/kfree -> 70 cycles
  100000 times kmalloc(16)/kfree -> 70 cycles
  100000 times kmalloc(32)/kfree -> 70 cycles
  100000 times kmalloc(64)/kfree -> 70 cycles
  100000 times kmalloc(128)/kfree -> 70 cycles
  100000 times kmalloc(256)/kfree -> 69 cycles
  100000 times kmalloc(512)/kfree -> 70 cycles
  100000 times kmalloc(1024)/kfree -> 73 cycles
  100000 times kmalloc(2048)/kfree -> 72 cycles
  100000 times kmalloc(4096)/kfree -> 71 cycles

After:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
  100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
  100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
  100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
  100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
  100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
  100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
  100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
  100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
  100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
  2. Kmalloc: alloc/free test
  100000 times kmalloc(8)/kfree -> 66 cycles
  100000 times kmalloc(16)/kfree -> 66 cycles
  100000 times kmalloc(32)/kfree -> 66 cycles
  100000 times kmalloc(64)/kfree -> 66 cycles
  100000 times kmalloc(128)/kfree -> 65 cycles
  100000 times kmalloc(256)/kfree -> 67 cycles
  100000 times kmalloc(512)/kfree -> 67 cycles
  100000 times kmalloc(1024)/kfree -> 64 cycles
  100000 times kmalloc(2048)/kfree -> 67 cycles
  100000 times kmalloc(4096)/kfree -> 67 cycles

Kernbench, before:

  Average Optimal load -j 12 Run (std deviation):
  Elapsed Time 101.873 (1.16069)
  User Time 1045.22 (1.60447)
  System Time 88.969 (0.559195)
  Percent CPU 1112.9 (13.8279)
  Context Switches 189140 (2282.15)
  Sleeps 99008.6 (768.091)

After:

  Average Optimal load -j 12 Run (std deviation):
  Elapsed Time 102.47 (0.562732)
  User Time 1045.3 (1.34263)
  System Time 88.311 (0.342554)
  Percent CPU 1105.8 (6.49444)
  Context Switches 189081 (2355.78)
  Sleeps 99231.5 (800.358)

Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Thomas Garnier
7c00fce98c mm: reorganize SLAB freelist randomization
The kernel heap allocators use a sequential freelist, making their
allocation predictable.  This predictability makes kernel heap overflows
easier to exploit.  An attacker can carefully prepare the kernel heap to
control which chunk follows the overflowed one.

For example these attacks exploit the predictability of the heap:
 - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
 - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

***Problems that needed solving:
 - Randomize the Freelist (singly linked) used in the SLUB allocator.
 - Ensure good performance to encourage usage.
 - Get best entropy in early boot stage.

***Parts:
 - 01/02 Reorganize the SLAB Freelist randomization to share elements
   with the SLUB implementation.
 - 02/02 The SLUB Freelist randomization implementation. Similar approach
   to the SLAB one but tailored to the singly linked freelist used in SLUB.

***Performance data:

slab_test impact is between 3% and 4% on average for 100000 attempts
without smp.  This is a very focused test; kernbench shows the overall
impact on the system is much lower.

Before:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
  100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
  100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
  100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
  100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
  100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
  100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
  100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
  100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
  100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
  2. Kmalloc: alloc/free test
  100000 times kmalloc(8)/kfree -> 70 cycles
  100000 times kmalloc(16)/kfree -> 70 cycles
  100000 times kmalloc(32)/kfree -> 70 cycles
  100000 times kmalloc(64)/kfree -> 70 cycles
  100000 times kmalloc(128)/kfree -> 70 cycles
  100000 times kmalloc(256)/kfree -> 69 cycles
  100000 times kmalloc(512)/kfree -> 70 cycles
  100000 times kmalloc(1024)/kfree -> 73 cycles
  100000 times kmalloc(2048)/kfree -> 72 cycles
  100000 times kmalloc(4096)/kfree -> 71 cycles

After:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
  100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
  100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
  100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
  100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
  100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
  100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
  100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
  100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
  100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
  2. Kmalloc: alloc/free test
  100000 times kmalloc(8)/kfree -> 66 cycles
  100000 times kmalloc(16)/kfree -> 66 cycles
  100000 times kmalloc(32)/kfree -> 66 cycles
  100000 times kmalloc(64)/kfree -> 66 cycles
  100000 times kmalloc(128)/kfree -> 65 cycles
  100000 times kmalloc(256)/kfree -> 67 cycles
  100000 times kmalloc(512)/kfree -> 67 cycles
  100000 times kmalloc(1024)/kfree -> 64 cycles
  100000 times kmalloc(2048)/kfree -> 67 cycles
  100000 times kmalloc(4096)/kfree -> 67 cycles

Kernbench, before:

  Average Optimal load -j 12 Run (std deviation):
  Elapsed Time 101.873 (1.16069)
  User Time 1045.22 (1.60447)
  System Time 88.969 (0.559195)
  Percent CPU 1112.9 (13.8279)
  Context Switches 189140 (2282.15)
  Sleeps 99008.6 (768.091)

After:

  Average Optimal load -j 12 Run (std deviation):
  Elapsed Time 102.47 (0.562732)
  User Time 1045.3 (1.34263)
  System Time 88.311 (0.342554)
  Percent CPU 1105.8 (6.49444)
  Context Switches 189081 (2355.78)
  Sleeps 99231.5 (800.358)

This patch (of 2):

This commit reorganizes the previous SLAB freelist randomization to
prepare for the SLUB implementation.  It moves functions that will be
shared to slab_common.

The entropy functions are changed to align with the SLUB implementation,
now using get_random_(int|long) functions.  These functions were chosen
because they provide a bit more entropy early on boot and better
performance when specific arch instructions are not available.

[akpm@linux-foundation.org: fix build]
Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Brian Foster
9a46b04f16 fs/fs-writeback.c: inode writeback list tracking tracepoints
The per-sb inode writeback list tracks inodes currently under writeback
to facilitate efficient sync processing.  In particular, it ensures that
sync only needs to walk through a list of inodes that were cleaned by
the sync.

Add a couple tracepoints to help identify when inodes are added/removed
to and from the writeback lists.  Piggyback off of the writeback
lazytime tracepoint template as it already tracks the relevant inode
information.

Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Dave Chinner
6c60d2b574 fs/fs-writeback.c: add a new writeback list for sync
wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty ones to wait on during sync.  This is highly inefficient
and wastes a lot of CPU when there are lots of clean cached inodes that
we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for.  We do this by
adding inodes to a writeback list on the sb when the mapping is first
tagged as having pages under writeback.  wait_sb_inodes() can then walk
this list of "inodes under IO" and wait specifically just for the inodes
that the current sync(2) needs to wait for.

Define a couple helpers to add/remove an inode from the writeback list
and call them when the overall mapping is tagged for or cleared from
writeback.  Update wait_sb_inodes() to walk only the inodes under
writeback due to the sync.
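
The add/remove helpers can be pictured with a small userspace model.
This is a sketch under assumptions: struct model_inode, struct model_sb
and all model_* functions are hypothetical, and a per-inode page counter
stands in for "the mapping is tagged as having pages under writeback".

```c
#include <assert.h>
#include <stddef.h>

struct model_inode {
	int pages_under_writeback;
	struct model_inode *prev, *next;	/* links on the sb writeback list */
};

struct model_sb {
	struct model_inode wb_list;	/* sentinel head of the circular list */
};

static void model_sb_init(struct model_sb *sb)
{
	sb->wb_list.pages_under_writeback = 0;
	sb->wb_list.prev = sb->wb_list.next = &sb->wb_list;
}

static void model_inode_init(struct model_inode *inode)
{
	inode->pages_under_writeback = 0;
	inode->prev = inode->next = inode;	/* not on any list */
}

/* The first page under writeback puts the inode on the list. */
static void model_start_page_writeback(struct model_sb *sb,
				       struct model_inode *inode)
{
	if (inode->pages_under_writeback++ == 0) {
		inode->prev = sb->wb_list.prev;
		inode->next = &sb->wb_list;
		sb->wb_list.prev->next = inode;
		sb->wb_list.prev = inode;
	}
}

/* The last page leaving writeback takes the inode off the list. */
static void model_end_page_writeback(struct model_inode *inode)
{
	assert(inode->pages_under_writeback > 0);
	if (--inode->pages_under_writeback == 0) {
		inode->prev->next = inode->next;
		inode->next->prev = inode->prev;
		inode->prev = inode->next = inode;
	}
}

/* What a sync would walk: only inodes with writeback in flight. */
static size_t model_inodes_to_wait_on(const struct model_sb *sb)
{
	size_t n = 0;

	for (const struct model_inode *i = sb->wb_list.next;
	     i != &sb->wb_list; i = i->next)
		n++;
	return n;
}
```

The cost of the walk then scales with the number of inodes under
writeback instead of the number of cached inodes.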

With this change, filesystem sync times are significantly reduced for
fs' with largely populated inode caches and otherwise no other work to
do.  For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
than 0.1s when the filesystem is fully clean.

Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Randy Dunlap
17359a80b9 debugobjects.h: fix trivial kernel doc warning
Add ':' to fix trivial kernel-doc warning in <linux/debugobjects.h>:

  ..//include/linux/debugobjects.h:63: warning: No description found for parameter 'is_static_object'

Link: http://lkml.kernel.org/r/575B01B8.5060600@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ross Zwisler
6b524995a7 dax: remove unused fault wrappers
Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
dax_pmd_fault() respectively, and update all callers.

The dax_fault() and dax_pmd_fault() wrappers were initially intended to
capture some filesystem independent functionality around page faults
(calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
and ctime).

However, the following commits:

   5726b27b09cc ("ext2: Add locking for DAX faults")
   ea3d7209ca01 ("ext4: fix races between page faults and hole punching")

added locking to the ext2 and ext4 filesystems after these common
operations but before __dax_fault() and __dax_pmd_fault() were called.
This means that these wrappers are no longer used, and are unlikely to
be used in the future.

XFS has had locking analogous to what was recently added to ext2 and
ext4 since DAX support was initially introduced by:

   6b698edeeef0 ("xfs: add DAX file operations support")

Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Linus Torvalds
e65805251f Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "The irq department delivers:

   - new core infrastructure to allow better management of multi-queue
     devices (interrupt spreading, node aware descriptor allocation ...)

   - a new interrupt flow handler to support the new fangled Intel VMD
     devices.

   - yet another new interrupt controller driver.

   - a series of fixes which addresses sparse warnings, missing
     includes, missing static declarations etc from Ben Dooks.

   - a fix for the error handling in the hierarchical domain allocation
     code.

   - the usual pile of small updates to core and driver code"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  genirq: Fix missing irq allocation affinity hint
  irqdomain: Fix irq_domain_alloc_irqs_recursive() error handling
  irq/Documentation: Correct result of echoing 5 to smp_affinity
  MAINTAINERS: Remove Jiang Liu from irq domains
  genirq/msi: Fix broken debug output
  genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors
  genirq/msi: Make use of affinity aware allocations
  genirq: Use affinity hint in irqdesc allocation
  genirq: Add affinity hint to irq allocation
  genirq: Introduce IRQD_AFFINITY_MANAGED flag
  genirq/msi: Remove unused MSI_FLAG_IDENTITY_MAP
  irqchip/s3c24xx: Fixup IO accessors for big endian
  irqchip/exynos-combiner: Fix usage of __raw IO
  irqdomain: Fix disposal of mappings for interrupt hierarchies
  irqchip/aspeed-vic: Add irq controller for Aspeed
  doc/devicetree: Add Aspeed VIC bindings
  x86/PCI/VMD: Use untracked irq handler
  genirq: Add untracked irq handler
  irqchip/mips-gic: Populate irq_domain names
  irqchip/gicv3-its: Implement two-level(indirect) device table support
  ...
2016-07-25 21:35:03 -07:00
Linus Torvalds
55392c4c06 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "This update provides the following changes:

   - The rework of the timer wheel which addresses the shortcomings of
     the current wheel (cascading, slow search for next expiring timer,
     etc).  That's the first major change of the wheel in almost 20
     years since Finn implemented it.

   - A large overhaul of the clocksource drivers init functions to
     consolidate the Device Tree initialization

   - Some more Y2038 updates

   - A capability fix for timerfd

   - Yet another clock chip driver

   - The usual pile of updates, comment improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
  tick/nohz: Optimize nohz idle enter
  clockevents: Make clockevents_subsys static
  clocksource/drivers/time-armada-370-xp: Fix return value check
  timers: Implement optimization for same expiry time in mod_timer()
  timers: Split out index calculation
  timers: Only wake softirq if necessary
  timers: Forward the wheel clock whenever possible
  timers/nohz: Remove pointless tick_nohz_kick_tick() function
  timers: Optimize collect_expired_timers() for NOHZ
  timers: Move __run_timers() function
  timers: Remove set_timer_slack() leftovers
  timers: Switch to a non-cascading wheel
  timers: Reduce the CPU index space to 256k
  timers: Give a few structs and members proper names
  hlist: Add hlist_is_singular_node() helper
  signals: Use hrtimer for sigtimedwait()
  timers: Remove the deprecated mod_timer_pinned() API
  timers, net/ipv4/inet: Initialize connection request timers as pinned
  timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
  timers, drivers/tty/metag_da: Initialize the poll timer as pinned
  ...
2016-07-25 20:43:12 -07:00
Linus Torvalds
8e466955d6 Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Intel-SoC enhancements (Andy Shevchenko)

   - Intel CPU symbolic model definition rework (Dave Hansen)

   - ... other misc changes"

* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  x86/sfi: Enable enumeration of SD devices
  x86/pci: Use MRFLD abbreviation for Merrifield
  x86/platform/intel-mid: Make vertical indentation consistent
  x86/platform/intel-mid: Mark regulators explicitly defined
  x86/platform/intel-mid: Rename mrfl.c to mrfld.c
  x86/platform/intel-mid: Enable spidev on Intel Edison boards
  x86/platform/intel-mid: Extend PWRMU to support Penwell
  x86/pci, x86/platform/intel_mid_pci: Remove duplicate power off code
  x86/platform/intel-mid: Add pinctrl for Intel Merrifield
  x86/platform/intel-mid: Enable GPIO expanders on Edison
  x86/platform/intel-mid: Add Power Management Unit driver
  x86/platform/atom/punit: Enable support for Merrifield
  x86/platform/intel_mid_pci: Rework IRQ0 workaround
  x86, thermal: Clean up and fix CPU model detection for intel_soc_dts_thermal
  x86, mmc: Use Intel family name macros for mmc driver
  x86/intel_telemetry: Use Intel family name macros for telemetry driver
  x86/acpi/lss: Use Intel family name macros for the acpi_lpss driver
  x86/cpufreq: Use Intel family name macros for the intel_pstate cpufreq driver
  x86/platform: Use new Intel model number macros
  x86/intel_idle: Use Intel family macros for intel_idle
  ...
2016-07-25 19:15:35 -07:00
Linus Torvalds
36e635cb21 Merge branch 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 stackdump update from Ingo Molnar:
 "A number of stackdump enhancements"

* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/dumpstack: Add show_stack_regs() and use it
  printk: Make the printk*once() variants return a value
  x86/dumpstack: Honor supplied @regs arg
2016-07-25 18:18:04 -07:00
Linus Torvalds
0f657262d5 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "Various x86 low level modifications:

   - preparatory work to support virtually mapped kernel stacks (Andy
     Lutomirski)

   - support for 64-bit __get_user() on 32-bit kernels (Benjamin
     LaHaise)

   - (involved) workaround for Knights Landing CPU erratum (Dave Hansen)

   - MPX enhancements (Dave Hansen)

   - mremap() extension to allow remapping of the special VDSO vma, for
     purposes of user level context save/restore (Dmitry Safonov)

   - hweight and entry code cleanups (Borislav Petkov)

   - bitops code generation optimizations and cleanups with modern GCC
     (H. Peter Anvin)

   - syscall entry code optimizations (Paolo Bonzini)"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  x86/mm/cpa: Add missing comment in populate_pdg()
  x86/mm/cpa: Fix populate_pgd(): Stop trying to deallocate failed PUDs
  x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2
  x86/smp: Remove unnecessary initialization of thread_info::cpu
  x86/smp: Remove stack_smp_processor_id()
  x86/uaccess: Move thread_info::addr_limit to thread_struct
  x86/dumpstack: Rename thread_struct::sig_on_uaccess_error to sig_on_uaccess_err
  x86/uaccess: Move thread_info::uaccess_err and thread_info::sig_on_uaccess_err to thread_struct
  x86/dumpstack: When OOPSing, rewind the stack before do_exit()
  x86/mm/64: In vmalloc_fault(), use CR3 instead of current->active_mm
  x86/dumpstack/64: Handle faults when printing the "Stack: " part of an OOPS
  x86/dumpstack: Try harder to get a call trace on stack overflow
  x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
  x86/mm/cpa: In populate_pgd(), don't set the PGD entry until it's populated
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  x86/mm: Use pte_none() to test for empty PTE
  x86/mm: Disallow running with 32-bit PTEs to work around erratum
  x86/mm: Ignore A/D bits in pte/pmd/pud_none()
  x86/mm: Move swap offset/type up in PTE to work around erratum
  x86/entry: Inline enter_from_user_mode()
  ...
2016-07-25 15:34:18 -07:00
Linus Torvalds
766fd5f6cd Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull NOHZ updates from Ingo Molnar:

 - fix system/idle cputime leaked on cputime accounting (all nohz
   configs) (Rik van Riel)

 - remove the messy, ad-hoc irqtime account on nohz-full and make it
   compatible with CONFIG_IRQ_TIME_ACCOUNTING=y instead (Rik van Riel)

 - cleanups (Frederic Weisbecker)

 - remove unnecessary irq disablement in the irqtime code (Rik van Riel)

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
  sched/cputime: Reorganize vtime native irqtime accounting headers
  sched/cputime: Clean up the old vtime gen irqtime accounting completely
  sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
  sched/cputime: Count actually elapsed irq & softirq time
2016-07-25 14:43:00 -07:00
Linus Torvalds
cca08cd66c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - introduce and use task_rcu_dereference()/try_get_task_struct() to fix
   and generalize task_struct handling (Oleg Nesterov)

 - do various per entity load tracking (PELT) fixes and optimizations
   (Peter Zijlstra)

 - cputime virt-steal time accounting enhancements/fixes (Wanpeng Li)

 - introduce consolidated cputime output file cpuacct.usage_all and
   related refactorings (Zhao Lei)

 - ... plus misc fixes and enhancements

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set
  sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together
  sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show()
  sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums
  sched/fair: Rework throttle_count sync
  sched/core: Fix sched_getaffinity() return value kerneldoc comment
  sched/fair: Reorder cgroup creation code
  sched/fair: Apply more PELT fixes
  sched/fair: Fix PELT integrity for new tasks
  sched/cgroup: Fix cpu_cgroup_fork() handling
  sched/fair: Fix PELT integrity for new groups
  sched/fair: Fix and optimize the fork() path
  sched/cputime: Add steal time support to full dynticks CPU time accounting
  sched/cputime: Fix prev steal time accouting during CPU hotplug
  KVM: Fix steal clock warp during guest CPU hotplug
  sched/debug: Always show 'nr_migrations'
  sched/fair: Use task_rcu_dereference()
  sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
  sched/idle: Optimize the generic idle loop
  sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()
2016-07-25 13:59:34 -07:00
Linus Torvalds
7e4dc77b28 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "With over 300 commits it's been a busy cycle - with most of the work
  concentrated on the tooling side (as it should).

  The main kernel side enhancements were:

   - Add per event callchain limit: Recently we introduced a sysctl to
     tune the max-stack for all events for which callchains were
     requested:

       $ sysctl kernel.perf_event_max_stack
       kernel.perf_event_max_stack = 127

     Now this patch introduces a way to configure this per event, i.e.
     this becomes possible:

       $ perf record -e sched:*/max-stack=2/ -e block:*/max-stack=10/ -a

     allowing finer tuning of how much buffer space callchains use.

     This uses a u16 from the reserved space at the end, leaving
     another u16 for future use.

     There has been interest in even finer tuning, namely to control the
     max stack for kernel and userspace callchains separately.  Further
     discussion is needed, we may for instance use the remaining u16 for
     that and when it is present, assume that the sample_max_stack
     introduced in this patch applies for the kernel, and the u16 left
     is used for limiting the userspace callchain (Arnaldo Carvalho de
     Melo)

   - Optimize AUX event (hardware assisted side-band event) delivery
     (Kan Liang)

   - Rework Intel family name macro usage (this is partially x86 arch
     work) (Dave Hansen)

   - Refine and fix Intel LBR support (David Carrillo-Cisneros)

   - Add support for Intel 'TopDown' events (Andi Kleen)

   - Intel uncore PMU driver fixes and enhancements (Kan Liang)

   - ... other misc changes.

  Here's an incomplete list of the tooling enhancements (but there's
  much more, see the shortlog and the git log for details):

   - Support cross unwinding, i.e.  collecting '--call-graph dwarf'
     perf.data files in one machine and then doing analysis in another
     machine of a different hardware architecture.  This makes it
     possible, for instance, to run:

       $ perf record -a --call-graph dwarf

     on an x86-32 or aarch64 system and then do 'perf report' on it on a
     x86_64 workstation (He Kuang)

   - Allow reading from a backward ring buffer (one set up via
     sys_perf_event_open() with perf_event_attr.write_backward = 1)
     (Wang Nan)

   - Finish merging initial SDT (Statically Defined Traces) support, see
     cset comments for details about how it all works (Masami Hiramatsu)

   - Support attaching eBPF programs to tracepoints (Wang Nan)

   - Add demangling of symbols in programs written in the Rust language
     (David Tolnay)

   - Add support for tracepoints in the python binding, including an
     example, that sets up and parses sched:sched_switch events,
     tools/perf/python/tracepoint.py (Jiri Olsa)

   - Introduce --stdio-color to set up the color output mode selection
     in 'annotate' and 'report', allowing these tools to emit color
     escape sequences when their output is redirected (Arnaldo Carvalho
     de Melo)

   - Add 'callindent' option to 'perf script -F', to indent the Intel PT
     call stack, making this output more ftrace-like (Adrian Hunter,
     Andi Kleen)

   - Allow dumping the object files generated by llvm when processing
     eBPF scriptlet events (Wang Nan)

   - Add stackcollapse.py script to help generate flame graphs (Paolo
     Bonzini)

   - Add --ldlat option to 'perf mem' to specify load latency for loads
     event (e.g. cpu/mem-loads/ ) (Jiri Olsa)

   - Tooling support for Intel TopDown counters, recently added to the
     kernel (Andi Kleen)"
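
The per-event callchain limit described above can be sketched from the
userspace side as follows.  This is only an illustration, assuming kernel
headers that already carry the sample_max_stack field named in the pull
message; the helper name make_callchain_attr() and its parameters are
hypothetical, and the sketch only fills in the attribute structure without
calling perf_event_open().

```c
#include <string.h>
#include <linux/perf_event.h>

/*
 * Hypothetical helper (not part of perf): build an attribute block for a
 * tracepoint event whose callchains are capped at max_stack frames.
 * tracepoint_id is assumed to come from the tracepoint's "id" file.
 */
static struct perf_event_attr make_callchain_attr(unsigned long long tracepoint_id,
						  unsigned short max_stack)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.config = tracepoint_id;
	attr.sample_period = 1;
	/* Request callchains with every sample ... */
	attr.sample_type = PERF_SAMPLE_CALLCHAIN;
	/* ... and limit them for this event only (the u16 mentioned above). */
	attr.sample_max_stack = max_stack;

	return attr;
}
```

A structure built this way would then be handed to perf_event_open(); a
value of 2 corresponds to the sched:*/max-stack=2/ event in the command
line quoted above.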

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (303 commits)
  perf tests: Add is_printable_array test
  perf tools: Make is_printable_array global
  perf script python: Fix string vs byte array resolving
  perf probe: Warn unmatched function filter correctly
  perf cpu_map: Add more helpers
  perf stat: Balance opening and reading events
  tools: Copy linux/{hash,poison}.h and check for drift
  perf tools: Remove include/linux/list.h from perf's MANIFEST
  tools: Copy the bitops files accessed from the kernel and check for drift
  Remove: kernel unistd*h files from perf's MANIFEST, not used
  perf tools: Remove tools/perf/util/include/linux/const.h
  perf tools: Remove tools/perf/util/include/asm/byteorder.h
  perf tools: Add missing linux/compiler.h include to perf-sys.h
  perf jit: Remove some no-op error handling
  perf jit: Add missing curly braces
  objtool: Initialize variable to silence old compiler
  objtool: Add -I$(srctree)/tools/arch/$(ARCH)/include/uapi
  perf record: Add --tail-synthesize option
  perf session: Don't warn about out of order event if write_backward is used
  perf tools: Enable overwrite settings
  ...
2016-07-25 13:20:41 -07:00
Linus Torvalds
c86ad14d30 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The locking tree was busier in this cycle than the usual pattern - a
  couple of major projects happened to coincide.

  The main changes are:

   - implement the atomic_fetch_{add,sub,and,or,xor}() API natively
     across all SMP architectures (Peter Zijlstra)

   - add atomic_fetch_{inc/dec}() as well, using the generic primitives
     (Davidlohr Bueso)

   - optimize various aspects of rwsems (Jason Low, Davidlohr Bueso,
     Waiman Long)

   - optimize smp_cond_load_acquire() on arm64 and implement LSE based
     atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
     on arm64 (Will Deacon)

   - introduce smp_acquire__after_ctrl_dep() and fix various barrier
     mis-uses and bugs (Peter Zijlstra)

   - after discovering ancient spin_unlock_wait() barrier bugs in its
     implementation and usage, strengthen its semantics and update/fix
     usage sites (Peter Zijlstra)

   - optimize mutex_trylock() fastpath (Peter Zijlstra)

   - ... misc fixes and cleanups"
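
The difference between the new fetch-style operations and the existing
*_return() style can be illustrated with a userspace analogy.  The sketch
below uses C11 <stdatomic.h> rather than the kernel's atomic_t API, so it
is an assumption-laden illustration and not kernel code; the helper names
fetch_add_demo() and add_return_demo() are hypothetical.  The point is only
that a fetch operation hands back the value from before the update, while
an add-and-return operation hands back the value after it.

```c
#include <stdatomic.h>

/*
 * Fetch style: atomically add delta and return the OLD value,
 * analogous in spirit to the kernel's atomic_fetch_add().
 */
static int fetch_add_demo(atomic_int *counter, int delta)
{
	return atomic_fetch_add(counter, delta);
}

/*
 * Return style: atomically add delta and return the NEW value,
 * analogous in spirit to the kernel's atomic_add_return().
 * The new value is derived from the old one, not re-read.
 */
static int add_return_demo(atomic_int *counter, int delta)
{
	return atomic_fetch_add(counter, delta) + delta;
}
```

The fetch form is the more general of the two: the new value can always be
computed from the old one, whereas for operations such as "or" the old
value cannot be recovered from the result.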

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
  locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API
  locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire()
  locking/static_keys: Fix non static symbol Sparse warning
  locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()
  locking/atomic, arch/tile: Fix tilepro build
  locking/atomic, arch/m68k: Remove comment
  locking/atomic, arch/arc: Fix build
  locking/Documentation: Clarify limited control-dependency scope
  locking/atomic, arch/rwsem: Employ atomic_long_fetch_add()
  locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire()
  locking/atomic, arch/mips: Convert to _relaxed atomics
  locking/atomic, arch/alpha: Convert to _relaxed atomics
  locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions
  locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
  locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
  locking/atomic: Fix atomic64_relaxed() bits
  locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}()
  locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
  locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
  locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
  ...
2016-07-25 12:41:29 -07:00
Linus Torvalds
a2303849a6 Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
 "The biggest changes in this cycle were SGI/UV related changes that
  clean up and fix UV boot quirks and problems.

  There's also various smaller cleanups and refinements"

* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: Reorganize the GUID table to make it easier to read
  x86/efi: Remove the unused efi_get_time() function
  x86/efi: Update efi_thunk() to use the the arch_efi_call_virt*() macros
  x86/uv: Update uv_bios_call() to use efi_call_virt_pointer()
  efi: Convert efi_call_virt() to efi_call_virt_pointer()
  x86/efi: Remove unused variable 'efi'
  efi: Document #define FOO_PROTOCOL_GUID layout
  efibc: Report more information in the error messages
2016-07-25 12:30:01 -07:00
Linus Torvalds
df00ccca72 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - documentation updates

   - miscellaneous fixes

   - minor reorganization of code

   - torture-test updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  rcu: Correctly handle sparse possible cpus
  rcu: sysctl: Panic on RCU Stall
  rcu: Fix a typo in a comment
  rcu: Make call_rcu_tasks() tolerate first call with irqs disabled
  rcu: Disable TASKS_RCU for usermode Linux
  rcu: No ordering for rcu_assign_pointer() of NULL
  rcutorture: Fix error return code in rcu_perf_init()
  torture: Inflict default jitter
  rcuperf: Don't treat gp_exp mis-setting as a WARN
  rcutorture: Drop "-soundhw pcspkr" from x86 boot arguments
  rcutorture: Don't specify the cpu type of QEMU on PPC
  rcutorture: Make -soundhw a x86 specific option
  rcutorture: Use vmlinux as the fallback kernel image
  rcutorture/doc: Create initrd using dracut
  torture: Stop onoff task if there is only one cpu
  torture: Add starvation events to error summary
  torture:  Break online and offline functions out of torture_onoff()
  torture: Forgive lengthy trace dumps and preemption
  torture: Remove CONFIG_RCU_TORTURE_TEST_RUNNABLE, simplify code
  torture: Simplify code, eliminate RCU_PERF_TEST_RUNNABLE
  ...
2016-07-25 12:04:11 -07:00
Linus Torvalds
dd95069545 hwmon updates for v4.8
New drivers for FTS BMC "Teutates", TI INA3221, and Sensirion SHT3x.
 Added support for Microchip MCP9808 and TI TMP461.
 Cleanup and minor fixes in various drivers.

Merge tag 'hwmon-for-linus-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull hwmon updates from Guenter Roeck:

 - New drivers for FTS BMC "Teutates", TI INA3221, and Sensirion SHT3x.

 - Added support for Microchip MCP9808 and TI TMP461.

 - Cleanup and minor fixes in various drivers.

* tag 'hwmon-for-linus-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: (37 commits)
  Documentation: dtb: xgene: Add hwmon dts binding documentation
  hwmon: (ftsteutates) Remove unused including <linux/version.h>
  hwmon: (adt7411) set bit 3 in CFG1 register
  hwmon: Add driver for FTS BMC chip "Teutates"
  hwmon: (sht3x) add humidity heater element control
  hwmon: (jc42) Add support for generic JC-42.4 devicetree binding
  dt/bindings: Add bindings for JC-42.4 compatible temperature sensors
  hwmon: (tmp102) Convert to use regmap, and drop local cache
  hwmon: (tmp102) Rework chip configuration
  hwmon: (tmp102) Improve handling of initial read delay
  hwmon: (lm90) Drop unnecessary else statements
  hwmon: (lm90) Use bool for valid flag
  hwmon: (lm90) Read limit registers only once
  hwmon: (lm90) Simplify read functions
  hwmon: (lm90) Use devm_hwmon_device_register_with_groups
  hwmon: (lm90) Use devm_add_action for cleanup
  hwmon: (lm75) Convert to use regmap
  hwmon: (lm75) Add update_interval attribute
  hwmon: (lm75) Drop lm75_read_value and lm75_write_value
  hwmon: (lm75) Handle cleanup with devm_add_action
  ...
2016-07-24 21:10:30 -07:00
Linus Torvalds
b7545b79a1 USB update for 4.8-rc1
Here's the big USB driver update for 4.8-rc1.  Lots of the normal stuff
 in here, musb, gadget, xhci, and other updates and fixes.  All of the
 details are in the shortlog.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'usb-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB updates from Greg KH:
 "Here's the big USB driver update for 4.8-rc1.  Lots of the normal
  stuff in here, musb, gadget, xhci, and other updates and fixes.  All
  of the details are in the shortlog.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'usb-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (169 commits)
  cdc-acm: beautify probe()
  cdc-wdm: use the common CDC parser
  cdc-acm: cleanup error handling
  cdc-acm: use the common parser
  usbnet: move the CDC parser into USB core
  usb: musb: sunxi: Simplify dr_mode handling
  usb: musb: sunxi: make unexported symbols static
  usb: musb: cppi41: add dma channel tracepoints
  usb: musb: cppi41: move struct cppi41_dma_channel to header
  usb: musb: cleanup cppi_dma header
  usb: musb: gadget: add usb-request tracepoints
  usb: musb: host: add urb tracepoints
  usb: musb: add tracepoints to dump interrupt events
  usb: musb: add tracepoints for register access
  usb: musb: dsps: use musb register read/write wrappers instead
  usb: musb: switch dev_dbg to tracepoints
  usb: musb: add tracepoints support for debugging
  usb: quirks: Add no-lpm quirk for Elan
  phy: rcar-gen3-usb2: fix mutex_lock calling in interrupt
  phy: rockhip-usb: use devm_add_action_or_reset()
  ...
2016-07-24 17:22:18 -07:00
Linus Torvalds
721413aff2 TTY/Serial driver update for 4.8-rc1
Here is the big tty and serial driver update for 4.8-rc1.
 
 Lots of good cleanups from Jiri on a number of vt and other tty related
 things, and the normal driver updates.  Full details are in the
 shortlog.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'tty-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial driver updates from Greg KH:
 "Here is the big tty and serial driver update for 4.8-rc1.

  Lots of good cleanups from Jiri on a number of vt and other tty
  related things, and the normal driver updates.  Full details are in
  the shortlog.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'tty-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (90 commits)
  tty/serial: atmel: enforce tasklet init and termination sequences
  serial: sh-sci: Stop transfers in sci_shutdown()
  serial: 8250_ingenic: drop #if conditional surrounding earlycon code
  serial: 8250_mtk: drop !defined(MODULE) conditional
  serial: 8250_uniphier: drop !defined(MODULE) conditional
  earlycon: mark earlycon code as __used iif the caller is built-in
  tty/serial/8250: use mctrl_gpio helpers
  serial: mctrl_gpio: enable API usage only for initialized mctrl_gpios struct
  serial: mctrl_gpio: add modem control read routine
  tty/serial/8250: make UART_MCR register access consistent
  serial: 8250_mid: Read RX buffer on RX DMA timeout for DNV
  serial: 8250_dma: Export serial8250_rx_dma_flush()
  dmaengine: hsu: Export hsu_dma_get_status()
  tty: serial: 8250: add CON_CONSDEV to flags
  tty: serial: samsung: add byte-order aware bit functions
  tty: serial: samsung: fixup accessors for endian
  serial: sirf: make fifo functions static
  serial: mps2-uart: make driver explicitly non-modular
  serial: mvebu-uart: free the IRQ in ->shutdown()
  serial/bcm63xx_uart: use correct alias naming
  ...
2016-07-24 17:14:37 -07:00
Linus Torvalds
25a0dc4be8 Staging / IIO driver update for 4.8-rc1
Here is the big Staging and IIO driver update for 4.8-rc1.
 
 We ended up adding more code than removing, again, but it's not all that
 bad.  Lots of cleanups all over the staging tree, and new IIO drivers,
 full details in the shortlog.
 
 All of these have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'staging-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging and IIO driver updates from Greg KH:
 "Here is the big Staging and IIO driver update for 4.8-rc1.

  We ended up adding more code than removing, again, but it's not all
  that bad.  Lots of cleanups all over the staging tree, and new IIO
  drivers, full details in the shortlog.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'staging-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (417 commits)
  drivers:iio:accel:mma8452: removed unwanted return statements
  drivers:iio:accel:mma8452: added cleanup provision in case of failure.
  iio: Add iio.git tree to MAINTAINERS
  iio:st_pressure: clean useless static channel initializers
  iio:st_pressure:lps22hb: temperature support
  iio:st_pressure:lps22hb: open drain support
  iio:st_pressure: temperature triggered buffering
  iio:st_pressure: document sampling gains
  iio:st_pressure: align storagebits on power of 2
  iio:st_sensors: align on storagebits boundaries
  staging:iio:lis3l02dq drop separate driver
  iio: accel: st_accel: Add lis3l02dq support
  iio: adc: add missing of_node references to iio_dev
  iio: adc: ti-ads1015: add indio_dev->dev.of_node reference
  iio: potentiometer: Fix typo in Kconfig
  iio: potentiometer: mcp4531: Add device tree binding
  iio: potentiometer: mcp4531: Add device tree binding documentation
  iio: potentiometer: mcp4531: Add support for MCP454x, MCP456x, MCP464x and MCP466x
  iio:imu:mpu6050: icm20608 initial support
  iio: adc: max1363: Add device tree binding
  ...
2016-07-24 16:55:23 -07:00
Linus Torvalds
9d0be76f52 Char/Misc driver patches for 4.8-rc1
Here is the big char/misc driver update for 4.8-rc1.
 
 Not a lot of stuff, but it's all over the place, full details are in the
 shortlog below.  All of these have been in linux-next with no reported
 issues for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'char-misc-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big char/misc driver update for 4.8-rc1.

  Not a lot of stuff, but it's all over the place, full details are in
  the shortlog.  All of these have been in linux-next with no reported
  issues for a while"

* tag 'char-misc-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (49 commits)
  lkdtm: silence warnings about function declarations
  lkdtm: hide unused functions
  intel_th: pci: Add Kaby Lake PCH-H support
  intel_th: Fix a deadlock in modprobing
  dsp56k: prevent a harmless underflow
  chardev: add missing line break in pr_warn
  lkdtm: use struct arrays instead of enums
  lkdtm: move jprobe entry points to start of source
  lkdtm: reorganize module paramaters
  lkdtm: rename globals for clarity
  lkdtm: rename "count" to "crash_count"
  lkdtm: remove intentional off-by-one array access
  lkdtm: split remaining logic bug tests to separate file
  lkdtm: split heap corruption tests to separate file
  lkdtm: split memory permissions tests to separate file
  lkdtm: split usercopy tests to separate file
  lkdtm: drop "alloc_size" parameter
  lkdtm: add usercopy test for blocking kernel text
  extcon: adc-jack: add suspend/resume support
  extcon: add missing of_node_put after calling of_parse_phandle
  ...
2016-07-24 16:26:26 -07:00
Linus Torvalds
b403f23044 We've got ten patches this time, half of which are related to a plethora
of nasty outcomes when inodes are transitioned from the unlinked state
 to the free state. Small file systems are particularly vulnerable to these
 problems, which manifest mainly as hangs, but also as file system
 corruption. The patches have been tested for many weeks with a
 very gruelling test, so I have a high level of confidence.
 
 - Andreas Gruenbacher wrote a series of 5 patches for various lockups
   during the transition of inodes from unlinked to free. The main patch
   is titled "Fix gfs2_lookup_by_inum lock inversion" and the other 4 are
   support and cleanup patches related to that.
 - Ben Marzinski contributed 2 patches with regard to a recreatable
   problem when gfs2 tries to write a page to a file that is being
   truncated, resulting in a BUG() in gfs2_remove_from_journal.
   Note that Ben had to export vfs function __block_write_full_page to get
   this to work properly. It's been posted a long time and he talked to
   various VFS people about it, and nobody seemed to mind.
 - I contributed 3 patches. (1) The first one fixes a memory corruptor:
   a race in which one process can overwrite the gl_object pointer set by
   another process, causing kernel panic and other symptoms. (2) The second
   patch fixes another race that resulted in a false-positive BUG_ON. This
   occurred when resource group reservations were freed by one process
   while another process was trying to grab a new reservation in the same
   resource group. (3) The third patch fixes a problem with doing journal
   replay when the journals are not all the same size.

Merge tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Bob Peterson:
 "We've got ten patches this time, half of which are related to a
  plethora of nasty outcomes when inodes are transitioned from the
  unlinked state to the free state.  Small file systems are particularly
  vulnerable to these problems, which manifest mainly as hangs, but
  also as file system corruption.  The patches have been tested for
  many weeks with a very gruelling test, so I have a high level of
  confidence.

   - Andreas Gruenbacher wrote a series of five patches for various
     lockups during the transition of inodes from unlinked to free.

     The main patch is titled "Fix gfs2_lookup_by_inum lock inversion"
     and the other four are support and cleanup patches related to that.

   - Ben Marzinski contributed two patches with regard to a recreatable
     problem when gfs2 tries to write a page to a file that is being
     truncated, resulting in a BUG() in gfs2_remove_from_journal.

     Note that Ben had to export vfs function __block_write_full_page to
     get this to work properly.  It's been posted a long time and he
     talked to various VFS people about it, and nobody seemed to mind.

   - I contributed 3 patches:
       o The first one fixes a memory corruptor: a race in which one
         process can overwrite the gl_object pointer set by another
         process, causing kernel panic and other symptoms.
       o The second patch fixes another race that resulted in a
         false-positive BUG_ON.  This occurred when resource group
         reservations were freed by one process while another process
         was trying to grab a new reservation in the same resource
         group.
       o The third patch fixes a problem with doing journal replay when
         the journals are not all the same size"

* tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  GFS2: Fix gfs2_replay_incr_blk for multiple journal sizes
  GFS2: Check rs_free with rd_rsspin protection
  gfs2: writeout truncated pages
  fs: export __block_write_full_page
  gfs2: Lock holder cleanup
  gfs2: Large-filesystem fix for 32-bit systems
  gfs2: Get rid of gfs2_ilookup
  gfs2: Fix gfs2_lookup_by_inum lock inversion
  gfs2: Initialize iopen glock holder for new inodes
  GFS2: don't set rgrp gl_object until it's inserted into rgrp tree
2016-07-24 16:07:52 -07:00
Linus Torvalds
107df03203 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix memory leak in nftables, from Liping Zhang.

 2) Need to check result of vlan_insert_tag() in batman-adv otherwise we
    risk NULL skb derefs, from Sven Eckelmann.

 3) Check for dev_alloc_skb() failures in cfg80211, from Gregory
    Greenman.

 4) Handle properly when we have ppp_unregister_channel() happening in
    parallel with ppp_connect_channel(), from WANG Cong.

 5) Fix DCCP deadlock, from Eric Dumazet.

 6) Bail out properly in UDP if sk_filter() truncates the packet to be
    smaller than even the space that the protocol headers need.  From
    Michal Kubecek.

 7) Similarly for rose, dccp, and sctp, from Willem de Bruijn.

 8) Make TCP challenge ACKs less predictable, from Eric Dumazet.

 9) Fix infinite loop in bgmac_dma_tx_add() from Florian Fainelli.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
  packet: propagate sock_cmsg_send() error
  net/mlx5e: Fix del vxlan port command buffer memset
  packet: fix second argument of sock_tx_timestamp()
  net: switchdev: change ageing_time type to clock_t
  Update maintainer for EHEA driver.
  net/mlx4_en: Add resilience in low memory systems
  net/mlx4_en: Move filters cleanup to a proper location
  sctp: load transport header after sk_filter
  net/sched/sch_htb: clamp xstats tokens to fit into 32-bit int
  net: cavium: liquidio: Avoid dma_unmap_single on uninitialized ndata
  net: nb8800: Fix SKB leak in nb8800_receive()
  et131x: Fix logical vs bitwise check in et131x_tx_timeout()
  vlan: use a valid default mtu value for vlan over macsec
  net: bgmac: Fix infinite loop in bgmac_dma_tx_add()
  mlxsw: spectrum: Prevent invalid ingress buffer mapping
  mlxsw: spectrum: Prevent overwrite of DCB capability fields
  mlxsw: spectrum: Don't emit errors when PFC is disabled
  mlxsw: spectrum: Indicate support for autonegotiation
  mlxsw: spectrum: Force link training according to admin state
  r8152: add MODULE_VERSION
  ...
2016-07-23 15:44:31 +09:00
Andrey Ryabinin
3cb9185c67 radix-tree: fix radix_tree_iter_retry() for tagged iterators.
radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
leading to crash:

  RIP: radix_tree_next_slot include/linux/radix-tree.h:473
    find_get_pages_tag+0x334/0x930 mm/filemap.c:1452
  ....
  Call Trace:
    pagevec_lookup_tag+0x3a/0x80 mm/swap.c:960
    mpage_prepare_extent_to_map+0x321/0xa90 fs/ext4/inode.c:2516
    ext4_writepages+0x10be/0x2b20 fs/ext4/inode.c:2736
    do_writepages+0x97/0x100 mm/page-writeback.c:2364
    __filemap_fdatawrite_range+0x248/0x2e0 mm/filemap.c:300
    filemap_write_and_wait_range+0x121/0x1b0 mm/filemap.c:490
    ext4_sync_file+0x34d/0xdb0 fs/ext4/fsync.c:115
    vfs_fsync_range+0x10a/0x250 fs/sync.c:195
    vfs_fsync fs/sync.c:209
    do_fsync+0x42/0x70 fs/sync.c:219
    SYSC_fdatasync fs/sync.c:232
    SyS_fdatasync+0x19/0x20 fs/sync.c:230
    entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207

We must reset iterator's tags to bail out from radix_tree_next_slot()
and go to the slow-path in radix_tree_next_chunk().
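
The effect of the fix can be illustrated with a small userspace model.
This is only a sketch: struct toy_iter, toy_next_slot() and
toy_iter_retry() are hypothetical names, not the kernel's radix-tree
API, and the tag layout is simplified.  It shows that once retry clears
the tags, the fast path bails out instead of stepping from a NULL slot.

```c
#include <stddef.h>

/* Toy stand-in for the iterator state; not the kernel's struct radix_tree_iter. */
struct toy_iter {
	unsigned long index;	/* index of the current slot */
	unsigned long tags;	/* bit 0 = current slot, higher bits = following slots */
};

/*
 * Fast path: step to the next tagged slot in the current chunk.
 * Returns NULL when no tag bits remain, which tells the caller to
 * take the slow path and look up a fresh chunk.
 */
static void **toy_next_slot(void **slot, struct toy_iter *iter)
{
	for (;;) {
		iter->tags >>= 1;
		if (!iter->tags)
			return NULL;
		slot++;
		iter->index++;
		if (iter->tags & 1)
			return slot;
	}
}

/*
 * Retry: the caller restarts the lookup from a NULL slot.  Clearing
 * the tags as well is the point of the fix; with stale tags the fast
 * path above would advance a NULL slot pointer.
 */
static void **toy_iter_retry(struct toy_iter *iter)
{
	iter->tags = 0;
	return NULL;
}
```

A caller would feed the NULL returned by toy_iter_retry() back into
toy_next_slot(), get NULL again, and fall through to its chunk lookup.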

Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup")
Link: http://lkml.kernel.org/r/1468495196-10604-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-23 10:25:54 +09:00
Johannes Weiner
73f576c04b mm: memcontrol: fix cgroup creation failure after many small jobs
The memory controller has quite a bit of state that usually outlives the
cgroup and pins its CSS until said state disappears.  At the same time
it imposes a 16-bit limit on the CSS ID space to economically store IDs
in the wild.  Consequently, when we use cgroups to contain frequent but
small and short-lived jobs that leave behind some page cache, we quickly
run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
fails with -ENOSPC while there are only a few, or even no user-visible
cgroups in existence.

Although pinning CSSs past cgroup removal is common, there are only two
instances that actually need an ID after a cgroup is deleted: cache
shadow entries and swapout records.

Cache shadow entries reference the ID weakly and can deal with the CSS
having disappeared when it's looked up later.  They pose no hurdle.

Swap-out records do need to pin the css to hierarchically attribute
swapins after the cgroup has been deleted; though the only pages that
remain swapped out after offlining are tmpfs/shmem pages.  And those
references are under the user's control, so they are manageable.

This patch introduces a private 16-bit memcg ID and switches swap and
cache shadow entries over to using that.  This ID can then be recycled
after offlining when the CSS remains pinned only by objects that don't
specifically need it.

This script demonstrates the problem by faulting one cache page in a new
cgroup and deleting it again:

  set -e
  mkdir -p pages
  for x in `seq 128000`; do
    [ $((x % 1000)) -eq 0 ] && echo $x
    mkdir /cgroup/foo
    echo $$ >/cgroup/foo/cgroup.procs
    echo trex >pages/$x
    echo $$ >/cgroup/cgroup.procs
    rmdir /cgroup/foo
  done

When run on an unpatched kernel, we eventually run out of possible IDs
even though there are no visible cgroups:

  [root@ham ~]# ./cssidstress.sh
  [...]
  65000
  mkdir: cannot create directory '/cgroup/foo': No space left on device

After this patch, the IDs get released upon cgroup destruction and the
cache and css objects get released once memory reclaim kicks in.
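
The ID recycling described above can be sketched in userspace roughly
as follows.  Everything here is an assumption made for illustration:
the toy_* names are hypothetical, the ID space is shrunk to 8 entries
in place of the 16-bit space, and -1 stands in for -ENOSPC.  The point
is only that the ID goes back to the pool at offline time while the
pinned object itself lives on.

```c
#include <stdbool.h>

#define TOY_ID_MAX 8	/* stands in for the 16-bit ID space */

struct toy_idspace {
	bool used[TOY_ID_MAX + 1];	/* ID 0 is reserved as "no ID" */
};

struct toy_group {
	int id;		/* private ID, 0 once released */
	int pins;	/* references held by leftover state, e.g. page cache */
};

/* Hand out the lowest free ID, or -1 when the space is exhausted. */
static int toy_id_get(struct toy_idspace *s)
{
	int id;

	for (id = 1; id <= TOY_ID_MAX; id++) {
		if (!s->used[id]) {
			s->used[id] = true;
			return id;
		}
	}
	return -1;
}

static void toy_id_put(struct toy_idspace *s, int id)
{
	s->used[id] = false;
}

/* Create a group that immediately picks up one pin from a cache page. */
static int toy_group_create(struct toy_idspace *s, struct toy_group *g)
{
	int id = toy_id_get(s);

	if (id < 0)
		return -1;
	g->id = id;
	g->pins = 1;
	return 0;
}

/* Offline: release the ID right away; the pin keeps the object alive. */
static void toy_group_offline(struct toy_idspace *s, struct toy_group *g)
{
	toy_id_put(s, g->id);
	g->id = 0;
}
```

With this model, a create/offline loop can run indefinitely, whereas
creating groups without offlining them fails once TOY_ID_MAX IDs are
outstanding.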

[hannes@cmpxchg.org: init the IDR]
  Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: John Garcia <john.garcia@mesosphere.io>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Nikolay Borisov <kernel@kyup.com>
Cc: <stable@vger.kernel.org>	[3.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-23 10:25:54 +09:00
Vivien Didelot
eabfdda934 net: switchdev: change ageing_time type to clock_t
The switchdev value for the SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME
attribute is a clock_t, so drivers must use helpers such as
clock_t_to_jiffies() when converting it to milliseconds.

Change ageing_time type from u32 to clock_t to make it explicit.
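
As a rough illustration of why the unit matters, a clock_t value counts
USER_HZ ticks, not milliseconds.  The sketch below assumes USER_HZ is
100 and uses a hypothetical helper name; in the kernel the conversion
would go through clock_t_to_jiffies() rather than this direct
arithmetic.

```c
/* Assumption: USER_HZ is 100, i.e. one clock_t tick is 10 ms. */
#define TOY_USER_HZ 100L

/* Hypothetical helper: convert a clock_t-style tick count to milliseconds. */
static long toy_clock_t_to_msecs(long ticks)
{
	return ticks * (1000L / TOY_USER_HZ);
}
```

Under that assumption an ageing time of 30000 ticks is 300 seconds, so
treating the raw value as milliseconds would be off by a factor of 10.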

Fixes: f55ac58ae64c ("switchdev: add bridge ageing_time attribute")
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19 16:49:20 -07:00
Paolo Abeni
18d3df3eab vlan: use a valid default mtu value for vlan over macsec
macsec can't cope with mtu-sized frames which need vlan tag insertion, and
a vlan device sets its default mtu equal to the underlying dev's one.
By default, vlan over macsec devices therefore use an invalid mtu, dropping
all the large packets.
This patch adds a netif helper to check if an upper vlan device
needs mtu reduction. The helper is used during vlan device
initialization to set a valid default and during mtu updating to
forbid invalid, too big, mtu values.
The helper currently only checks if the lower dev is a macsec device;
if we get more users, we need to update only the helper (possibly
reserving an additional IFF bit).
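
The helper-based check described above might look roughly like this in
a userspace model.  The struct and function names are hypothetical and
not the ones used by the patch; the 4-byte tag length is the size of an
802.1Q vlan header, and -1 stands in for an errno value.

```c
#include <stdbool.h>

#define TOY_VLAN_HLEN 4U	/* bytes added by an 802.1Q tag */

/* Toy device: only the fields this sketch needs. */
struct toy_netdev {
	unsigned int mtu;
	bool is_macsec;
};

/* Single place that decides whether a vlan on top needs a smaller mtu. */
static bool toy_reduces_vlan_mtu(const struct toy_netdev *lower)
{
	return lower->is_macsec;
}

/* Largest mtu a vlan device stacked on 'lower' may use. */
static unsigned int toy_vlan_max_mtu(const struct toy_netdev *lower)
{
	unsigned int max_mtu = lower->mtu;

	if (toy_reduces_vlan_mtu(lower))
		max_mtu -= TOY_VLAN_HLEN;
	return max_mtu;
}

/* Reject mtu updates that exceed the limit; returns 0 on success. */
static int toy_vlan_change_mtu(const struct toy_netdev *lower,
			       struct toy_netdev *vlan, unsigned int new_mtu)
{
	if (new_mtu > toy_vlan_max_mtu(lower))
		return -1;
	vlan->mtu = new_mtu;
	return 0;
}
```

Keeping the decision in one predicate means that a second kind of lower
device with the same restriction would only require a change there.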

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-16 20:15:02 -07:00
Linus Torvalds
631517032f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
 "A few last-minute updates for the input subsystem"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: ts4800-ts - add missing of_node_put after calling of_parse_phandle
  Input: synaptics-rmi4 - use of_get_child_by_name() to fix refcount
  Revert "Input: wacom_w8001 - drop use of ABS_MT_TOOL_TYPE"
  Input: xpad - validate USB endpoint count during probe
  Input: add SW_PEN_INSERTED define
2016-07-16 07:04:12 +09:00
Ingo Molnar
38452af242 Merge branch 'x86/asm' into x86/mm, to resolve conflicts
Conflicts:
	tools/testing/selftests/x86/Makefile

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:26:04 +02:00
Linus Torvalds
fa3a9f5744 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "20 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  m32r: fix build warning about putc
  mm: workingset: printk missing log level, use pr_info()
  mm: thp: refix false positive BUG in page_move_anon_rmap()
  mm: rmap: call page_check_address() with sync enabled to avoid racy check
  mm: thp: move pmd check inside ptl for freeze_page()
  vmlinux.lds: account for destructor sections
  gcov: add support for gcc version >= 6
  mm, meminit: ensure node is online before checking whether pages are uninitialised
  mm, meminit: always return a valid node from early_pfn_to_nid
  kasan/quarantine: fix bugs on qlist_move_cache()
  uapi: export lirc.h header
  madvise_free, thp: fix madvise_free_huge_pmd return value after splitting
  Revert "scripts/gdb: add documentation example for radix tree"
  Revert "scripts/gdb: add a Radix Tree Parser"
  scripts/gdb: Perform path expansion to lx-symbol's arguments
  scripts/gdb: add constants.py to .gitignore
  scripts/gdb: rebuild constants.py on dependancy change
  scripts/gdb: silence 'nothing to do' message
  kasan: add newline to messages
  mm, compaction: prevent VM_BUG_ON when terminating freeing scanner
2016-07-15 16:00:18 +09:00
Hugh Dickins
5a49973d71 mm: thp: refix false positive BUG in page_move_anon_rmap()
The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's
worth: the syzkaller fuzzer hit it again.  It's still wrong for some THP
cases, because linear_page_index() was never intended to apply to
addresses before the start of a vma.

That's easily fixed with a signed long cast inside linear_page_index();
and Dmitry has tested such a patch, to verify the false positive.  But
why extend linear_page_index() just for this case? when the avoidance in
page_move_anon_rmap() has already grown ugly, and there's no reason for
the check at all (nothing else there is using address or index).

Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE,
remove CONFIG_DEBUG_VM PageTransHuge adjustment.

And one more thing: should the compound_head(page) be done inside or
outside page_move_anon_rmap()? It's usually pushed down to the lowest
level nowadays (and mm/memory.c shows no other explicit use of it), so I
think it's better done in page_move_anon_rmap() than by caller.

Fixes: 0798d3c022dc ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()")
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1607120444540.12528@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Naoya Horiguchi
33f4751e99 mm: thp: move pmd check inside ptl for freeze_page()
I found a race condition triggering VM_BUG_ON() in freeze_page(), when
running a testcase with 3 processes:
  - process 1: keep writing thp,
  - process 2: keep clearing soft-dirty bits from virtual address of process 1,
  - process 3: call migratepages for process 1.

The kernel message is like this:

  kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci virtio_ring virtio
  CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: ffff880037320000 ti: ffff88007cdd0000 task.ti: ffff88007cdd0000
  RIP: 0010:[<ffffffff811f8e06>]  [<ffffffff811f8e06>] split_huge_page_to_list+0x496/0x590
  RSP: 0018:ffff88007cdd3b70  EFLAGS: 00010202
  RAX: 0000000000000001 RBX: ffff88007c7b88c0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000700000200 RDI: ffffea0003188000
  RBP: ffff88007cdd3bb8 R08: 0000000000000001 R09: 00003ffffffff000
  R10: ffff880000000000 R11: ffffc000001fffff R12: ffffea0003188000
  R13: ffffea0003188000 R14: 0000000000000000 R15: 0400000000000080
  FS:  00007f8ec241d740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f8ec1f3ed20 CR3: 000000003707b000 CR4: 00000000000006f0
  Call Trace:
    ? list_del+0xd/0x30
    queue_pages_pte_range+0x4d1/0x590
    __walk_page_range+0x204/0x4e0
    walk_page_range+0x71/0xf0
    queue_pages_range+0x75/0x90
    ? queue_pages_hugetlb+0x190/0x190
    ? new_node_page+0xc0/0xc0
    ? change_prot_numa+0x40/0x40
    migrate_to_node+0x71/0xd0
    do_migrate_pages+0x1c3/0x210
    SyS_migrate_pages+0x261/0x290
    entry_SYSCALL_64_fastpath+0x1a/0xa4
  Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
  RIP   split_huge_page_to_list+0x496/0x590

I'm not sure of the full scenario of the reproduction, but my debugging
showed that split_huge_pmd_address(freeze=true) returned without running
the main code of pmd splitting because pmd_present(*pmd) in the precheck
somehow returned 0.  If this happens, the subsequent try_to_unmap() fails
and returns non-zero (because page_mapcount() is still > 0), and finally
VM_BUG_ON() fires.  This patch tries to fix it by prechecking pmd state
inside ptl.

Link: http://lkml.kernel.org/r/1466990929-7452-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Dmitry Vyukov
e41f501d39 vmlinux.lds: account for destructor sections
If CONFIG_KASAN is enabled and gcc is configured with
--disable-initfini-array and/or gold linker is used, gcc emits
.ctors/.dtors and .text.startup/.text.exit sections instead of
.init_array/.fini_array.  The .dtors section is not explicitly accounted for
in the linker script and messes up the vvar/percpu layout.

We want:
  ffffffff822bfd80 D _edata
  ffffffff822c0000 D __vvar_beginning_hack
  ffffffff822c0000 A __vvar_page
  ffffffff822c0080 0000000000000098 D vsyscall_gtod_data
  ffffffff822c1000 A __init_begin
  ffffffff822c1000 D init_per_cpu__irq_stack_union
  ffffffff822c1000 A __per_cpu_load
  ffffffff822d3000 D init_per_cpu__gdt_page

We got:
  ffffffff8279a600 D _edata
  ffffffff8279b000 A __vvar_page
  ffffffff8279c000 A __init_begin
  ffffffff8279c000 D init_per_cpu__irq_stack_union
  ffffffff8279c000 A __per_cpu_load
  ffffffff8279e000 D __vvar_beginning_hack
  ffffffff8279e080 0000000000000098 D vsyscall_gtod_data
  ffffffff827ae000 D init_per_cpu__gdt_page

This happens because __vvar_page and .vvar get different addresses in
arch/x86/kernel/vmlinux.lds.S:

	. = ALIGN(PAGE_SIZE);
	__vvar_page = .;

	.vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
		/* work around gold bug 13023 */
		__vvar_beginning_hack = .;

Discard .dtors/.fini_array/.text.exit, since we don't call dtors.
Merge .text.startup into init text.

Link: http://lkml.kernel.org/r/1467386363-120030-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Mauro Carvalho Chehab
12cb22bb8a uapi: export lirc.h header
This header contains the userspace API for lirc.

This is a fixup for commit b7be755733dc ("[media] bz#75751: Move
internal header file lirc.h to uapi/").  It moved the header to the
right place, but it forgot to add it to Kbuild.  So, despite being in
uapi, it is not copied to the right place.

Fixes: b7be755733dc44c72 ("[media] bz#75751: Move internal header file lirc.h to uapi/")
Link: http://lkml.kernel.org/r/320c765d32bfc82c582e336d52ffe1026c73c644.1468439021.git.mchehab@s-opensource.com
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Cc: Alec Leamas <leamas.alec@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Dave Airlie
d2e1204f89 Merge branch 'drm-vmwgfx-fixes' of git://people.freedesktop.org/~syeh/repos_linux into drm-fixes
A bunch of vmwgfx fixes that fix a black screen issue on latest distros/hw combos.

* 'drm-vmwgfx-fixes' of git://people.freedesktop.org/~syeh/repos_linux:
  drm/vmwgfx: Fix error paths when mapping framebuffer
  drm/vmwgfx: Fix corner case screen target management
  drm/vmwgfx: Delay pinning fbdev framebuffer until after mode set
  drm/vmwgfx: Check pin count before attempting to move a buffer
  drm/ttm: Make ttm_bo_mem_compat available
  drm/vmwgfx: Add an option to change assumed FB bpp
  drm/vmwgfx: Work around mode set failure in 2D VMs
  drm/vmwgfx: Add a check to handle host message failure
2016-07-15 13:51:55 +10:00
Frederic Weisbecker
8612f17ab9 sched/cputime: Reorganize vtime native irqtime accounting headers
The vtime irqtime accounting headers are very scattered and convoluted
right now. Reorganize them such that it is obvious that only
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE uses them.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:42:35 +02:00
Frederic Weisbecker
0cfdf9a198 sched/cputime: Clean up the old vtime gen irqtime accounting completely
Vtime generic irqtime accounting has been removed but there are a few
remnants to clean up:

* The vtime_accounting_cpu_enabled() check in irq entry was only used
  by CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can safely remove it.

* Without the vtime_accounting_cpu_enabled(), we no longer need to
  have a vtime_common_account_irq_enter() indirect function.

* Move vtime_account_irq_enter() implementation under
  CONFIG_VIRT_CPU_ACCOUNTING_NATIVE which is the last user.

* The vtime_account_user() call was only used on irq entry for
  CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can remove that too.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:42:35 +02:00
Rik van Riel
b58c358405 sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
appear to currently work right.

On CPUs without nohz_full=, only tick based irq time sampling is
done, which breaks down when dealing with a nohz_idle CPU.

On firewalls and similar systems, no ticks may happen on a CPU for a
while, and the irq time spent may never get accounted properly. This
can cause issues with capacity planning and power saving, which use
the CPU statistics as inputs in decision making.

Remove the VTIME_GEN vtime irq time code, and replace it with the
IRQ_TIME_ACCOUNTING code, when selected as a config option by the user.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:42:34 +02:00
Rik van Riel
5743021831 sched/cputime: Count actually elapsed irq & softirq time
Currently, if there was any irq or softirq time during 'ticks'
jiffies, the entire period will be accounted as irq or softirq
time.

This is inaccurate if only a subset of the time was actually spent
handling irqs, and could conceivably mis-count all of the ticks during
a period as irq time, when there was some irq and some softirq time.

This can actually happen when irqtime_account_process_tick is called
from account_idle_ticks, which can pass a larger number of ticks down
all at once.

Fix this by changing irqtime_account_hi_update(), irqtime_account_si_update(),
and steal_account_process_ticks() to work with cputime_t time units, and
return the amount of time spent in each mode.

Rename steal_account_process_ticks() to steal_account_process_time(), to
reflect that time is now accounted in cputime_t, instead of ticks.

Additionally, have irqtime_account_process_tick() take into account how
much time was spent in each of steal, irq, and softirq time.

The latter could help improve the accuracy of cputime
accounting when returning from idle on a NO_HZ_IDLE CPU.

Properly accounting how much time was spent in hardirq and
softirq time will also allow the NO_HZ_FULL code to re-use
these same functions for hardirq and softirq accounting.
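
The accounting scheme described above can be sketched in userspace as
follows.  This is a simplified model under stated assumptions: the
toy_* names and the plain integer time unit are hypothetical, and steal
time is left out.  Each update accounts only the time that actually
elapsed in that mode, capped by what is left of the period.

```c
typedef unsigned long long toy_time_t;

/* Per-mode counters: time observed so far vs. time already accounted. */
struct toy_cpu {
	toy_time_t seen;
	toy_time_t accounted;
};

struct toy_split {
	toy_time_t irq;
	toy_time_t softirq;
	toy_time_t other;
};

/* Account the pending delta, but never more than maxtime; return it. */
static toy_time_t toy_account_update(struct toy_cpu *c, toy_time_t maxtime)
{
	toy_time_t delta = c->seen - c->accounted;

	if (delta > maxtime)
		delta = maxtime;
	c->accounted += delta;
	return delta;
}

/*
 * Split a period of 'ticks' ticks, each tick_len long, into irq,
 * softirq and remaining time instead of charging the whole period to
 * whichever mode happened to show any activity.
 */
static struct toy_split toy_account_ticks(struct toy_cpu *hi,
					  struct toy_cpu *si,
					  toy_time_t ticks, toy_time_t tick_len)
{
	struct toy_split r;
	toy_time_t total = ticks * tick_len;

	r.irq = toy_account_update(hi, total);
	r.softirq = toy_account_update(si, total - r.irq);
	r.other = total - r.irq - r.softirq;
	return r;
}
```

Irq time that exceeds the current period stays pending in the counter
and is picked up by a later call, so nothing is lost by the cap.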

Signed-off-by: Rik van Riel <riel@redhat.com>
[ Make nsecs_to_cputime64() actually return cputime64_t. ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1468421405-20056-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:42:34 +02:00
Ingo Molnar
cefef3a762 Merge branch 'sched/core' into timers/nohz, to avoid conflicts in upcoming patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 10:37:48 +02:00
Greg Kroah-Hartman
6c71ee3b61 Third set of IIO new device support, features and cleanups for the 4.8 cycle.

Merge tag 'iio-for-4.8c' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-next

Jonathan writes:

Third set of IIO new device support, features and cleanups for the 4.8 cycle.

New core features
- Selection of the clock source for IIO timestamps.  This is done per device
  as it makes little sense to have events in one timebase and data timestamped
  on another.  Biggest reason for this is that we currently use a clock
  source which is non-monotonic which can result in 'interesting' data sets.
  (Includes export for get_monotonic_coarse64 which Thomas Gleixner didn't mind
   in an earlier version.)
- MAINTAINERS add the git tree to the list for IIO.

New device support + a kind of indirect staging graduation.
* Broadcom iproc-static-adc
  - new driver
* mcp4531
  - support for MCP454x, MCP456x, MCP464x and MCP466x potentiometers
* mpu6050
  - support the ICM20608 6-axis motion tracking device
* st-sensors
  - support the lis3l02dq + drop the lis3l02dq driver from staging.
  The general purpose driver is missing event support, but good to get
  rid of this driver which was rather long in the tooth.

New driver features
* ak8975
  - Add vid regulator support and refactor handling in general.
  - Allow a delay after enabling regulators.
  - Runtime and system PM.
* bmg160
  - filter frequency control support.
* bmp280
  - SPI device support.
  - EOC interrupt support for the BMP085
  - power management support.
  - supply regulator support.
  - reset gpio support
  - dt bindings for reset gpio and regulators.
  - of table to support device tree registration
* max1363
  - Device tree bindings.
* mcp4531
  - Device tree bindings.
* st-pressure
  - temperature channels as part of triggered buffer (previously not due
  probably to alignment issues - see below).
  - lps22hb open drain interrupt support.
  - lps22hb temperature channel support

Cleanups and reworkings.
* numerous ADC drivers
  - ensure the iio_dev->dev.of_node is set to the parent dev.of_node so
  as to allow client bindings to find the device.
* ak8975
  - Fix incorrect handling of missing regulator
  - make sure power is down on remove.
* bmp280
  - read the calibration data only once as it doesn't change.
* isl29125
  - Use a few macros to make code a touch more readable.
* mma8452
  - fix a memory leak on error.
  - drop an unnecessary bit of return value handling.
* potentiometer kconfig
  - typo fix.
* st-pressure
  - drop some uninformative default assignments of elements of the channel
  array structure (aids readability).
* st-sensors
  - Harden interrupt handling considerably.  These are actually all using
  level interrupts, but at least two known boards have them wired to
  edge only interrupt chips.  Hence a slightly interesting bit of handling
  is needed in which we first allow for the easy option (level triggered) and
  secondly check the status registers before reenabling edge interrupts and
  fall back to a tight loop in the thread until we successfully clear the
  interrupt.  No harm is done if we never succeed in doing so.  It's an odd
  patch that has been through a lot of revisions to reach a consensus on how
  to handle what is basically broken hardware (which the previous defaults
  allowed to kind of work).
  - Fix alignment to defined storagebytes boundaries.
  - Ensure alignment of power of 2 byte boundaries.  This has always in theory
  been part of the ABI of IIO, but we missed a few that snuck in that need
  fixing.  The effect was minor as they were only followed by timestamp
  channels which were correctly aligned.
  - Add some docs to explain the gain calculations.
2016-07-14 12:05:29 +09:00