IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Since __GFP_NOACCOUNT was removed by commit 20b5c3039863 ("Revert 'gfp:
add __GFP_NOACCOUNT'"), its description is not necessary.
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In machines with 140G of memory and enterprise flash storage, we have
seen read and write bursts routinely exceed the kswapd watermarks and
cause thundering herds in direct reclaim. Unfortunately, the only way
to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
system's emergency reserves - which is entirely unrelated to the
system's latency requirements. In order to get kswapd to maintain a
250M buffer of free memory, the emergency reserves need to be set to 1G.
That is a lot of memory wasted for no good reason.
On the other hand, it's reasonable to assume that allocation bursts and
overall allocation concurrency scale with memory capacity, so it makes
sense to make kswapd aggressiveness a function of that as well.
Change the kswapd watermark scale factor from the currently fixed 25% of
the tunable emergency reserve to a tunable 0.1% of memory.
Beyond 1G of memory, this will produce bigger watermark steps than the
current formula in default settings. Ensure that the new formula never
chooses steps smaller than that, i.e. 25% of the emergency reserve.
On a 140G machine, this raises the default watermark steps - the
distance between min and low, and low and high - from 16M to 143M.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are few things about *pte_alloc*() helpers worth cleaning up:
- 'vma' argument is unused, let's drop it;
- most __pte_alloc() callers do speculative check for pmd_none(),
before taking ptl: let's introduce pte_alloc() macro which does
the check.
The only direct user of __pte_alloc left is userfaultfd, which has
different expectation about atomicity wrt pmd.
- pte_alloc_map() and pte_alloc_map_lock() are redefined using
pte_alloc().
[sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
[sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.
It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.
Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.
It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap. This metric would be very
useful in VM orchestration software to improve memory management of
different VMs under overcommit.
This patch (of 2):
Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of the
kernel.
In particular, it appears a relevant metric to report to the hypervisor
via virtio-balloon statistics interface (in a followup patch).
Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MEMORY_HOTPLUG already depends on ARCH_ENABLE_MEMORY_HOTPLUG which is
selected by the supported architectures, so the following arch depend is
unnecessary.
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With THP refcounting work, no need to mark PMDs splitting.
(ARC got missed under the sweeping arch change as THP support was likely
not present in orig baseline)
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We remove one instace of flush_tlb_range here. That was added by commit
f714f4f20e59 ("mm: numa: call MMU notifiers on THP migration"). But the
pmdp_huge_clear_flush_notify should have done the require flush for us.
Hence remove the extra flush.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Get list of VMA flags up-to-date and sort it to match VM_* definition
order.
[vbabka@suse.cz: add a note above vmaflag definitions to update the names when changing]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we have two copies of the same code which implements memory
overcommitment logic. Let's move it into mm/util.c and hence avoid
duplication. No functional changes here.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
max_map_count sysctl unrelated to scheduler. Move its bits from
include/linux/sched/sysctl.h to include/linux/mm.h.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Count how many times we put a THP in split queue. Currently, it happens
on partial unmap of a THP.
Rapidly growing value can indicate that an application behaves
unfriendly wrt THP: often fault in huge page and then unmap part of it.
This leads to unnecessary memory fragmentation and the application may
require tuning.
The event also can help with debugging kernel [mis-]behaviour.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Workingset code was recently made memcg aware, but shadow node shrinker
is still global. As a result, one small cgroup can consume all memory
available for shadow nodes, possibly hurting other cgroups by reclaiming
their shadow nodes, even though reclaim distances stored in its shadow
nodes have no effect. To avoid this, we need to make shadow node
shrinker memcg aware.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A page is activated on refault if the refault distance stored in the
corresponding shadow entry is less than the number of active file pages.
Since active file pages can't occupy more than half memory, we assume
that the maximal effective refault distance can't be greater than half
the number of present pages and size the shadow nodes lru list
appropriately. Generally speaking, this assumption is correct, but it
can result in wasting a considerable chunk of memory on stale shadow
nodes in case the portion of file pages is small, e.g. if a workload
mostly uses anonymous memory.
To sort this out, we need to compute the size of shadow nodes lru basing
not on the maximal possible, but the current size of file cache. We
could take the size of active file lru for the maximal refault distance,
but active lru is pretty unstable - it can shrink dramatically at
runtime possibly disrupting workingset detection logic.
Instead we assume that the maximal refault distance equals half the
total number of file cache pages. This will protect us against active
file lru size fluctuations while still being correct, because size of
active lru is normally maintained lower than size of inactive lru.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocation of radix_tree_node objects can be easily triggered from
userspace, so we should account them to memory cgroup. Besides, we need
them accounted for making shadow node shrinker per memcg (see
mm/workingset.c).
A tricky thing about accounting radix_tree_node objects is that they are
mostly allocated through radix_tree_preload(), so we can't just set
SLAB_ACCOUNT for radix_tree_node_cachep - that would likely result in a
lot of unrelated cgroups using objects from each other's caches.
One way to overcome this would be making radix tree preloads per memcg,
but that would probably look cumbersome and overcomplicated.
Instead, we make radix_tree_node_alloc() first try to allocate from the
cache with __GFP_ACCOUNT, no matter if the caller has preloaded or not,
and only if it fails fall back on using per cpu preloads. This should
make most allocations accounted.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As kmem accounting is now either enabled for all cgroups or disabled
system-wide, there's no point in having memcg_kmem_online() helper -
instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
shrink_slab() now does.
There are only two places left where this helper is used -
__memcg_kmem_charge() and memcg_create_kmem_cache(). The former can
only be called if memcg_kmem_enabled() returned true. Since the cgroup
it operates on is online, mem_cgroup_is_root() check will be enough.
memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
of memcg_kmem_online(), because it relies on the fact that in
memcg_offline_kmem() memcg->kmem_state is changed before
memcg_deactivate_kmem_caches() is called, but there we can just
open-code the check.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's just convenient to implement a memcg aware shrinker when you know
that shrink_control->memcg != NULL unless memcg_kmem_enabled() returns
false.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Workingset code was recently made memcg aware, but shadow node shrinker
is still global. As a result, one small cgroup can consume all memory
available for shadow nodes, possibly hurting other cgroups by reclaiming
their shadow nodes, even though reclaim distances stored in its shadow
nodes have no effect. To avoid this, we need to make shadow node
shrinker memcg aware.
The actual work is done in patch 6 of the series. Patches 1 and 2
prepare memcg/shrinker infrastructure for the change. Patch 3 is just a
collateral cleanup. Patch 4 makes radix_tree_node accounted, which is
necessary for making shadow node shrinker memcg aware. Patch 5 reduces
shadow nodes overhead in case workload mostly uses anonymous pages.
This patch:
Currently, in the legacy hierarchy kmem accounting is off for all
cgroups by default and must be enabled explicitly by writing something
to memory.kmem.limit_in_bytes. Since we don't support reclaim on
hitting kmem limit, nor do we have any plans to implement it, this is
likely to be -1, just to enable kmem accounting and limit kernel memory
consumption by the memory.limit_in_bytes along with user memory.
This user API was introduced when the implementation of kmem accounting
lacked slab shrinker support and hence was useless in practice. Things
have changed since then - slab shrinkers were made memcg aware, the
accounting overhead seems to be negligible, and a failure to charge a
kmem allocation should not have critical consequences, because we only
account those kernel objects that should be safe to fail. That's why
kmem accounting is enabled by default for all cgroups in the default
hierarchy, which will eventually replace the legacy one.
The ability to enable kmem accounting for some cgroups while keeping it
disabled for others is getting difficult to maintain. E.g. to make
shadow node shrinker memcg aware (see mm/workingset.c), we need to know
the relationship between the number of shadow nodes allocated for a
cgroup and the size of its lru list. If kmem accounting is enabled for
all cgroups there is no problem, but what should we do if kmem
accounting is enabled only for half of cgroups? We've no other choice
but use global lru stats while scanning root cgroup's shadow nodes, but
that would be wrong if kmem accounting was enabled for all cgroups
(which is the case if the unified hierarchy is used), in which case we
should use lru stats of the root cgroup's lruvec.
That being said, let's enable kmem accounting for all memory cgroups by
default. If one finds it unstable or too costly, it can always be
disabled system-wide by passing cgroup.memory=nokmem to the kernel at
boot time.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
the following functions get deinlined many times.
Examples of disassembly:
<SetPageUptodate> (43 copies, 141 calls):
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 80 0f 08 lock orb $0x8,(%rdi)
5d pop %rbp
c3 retq
<PagePrivate> (10 copies, 134 calls):
48 8b 07 mov (%rdi),%rax
55 push %rbp
48 89 e5 mov %rsp,%rbp
48 c1 e8 0b shr $0xb,%rax
83 e0 01 and $0x1,%eax
5d pop %rbp
c3 retq
This patch fixes this via s/inline/__always_inline/.
Code size decrease after the patch is ~7k:
text data bss dec hex filename
92125002 20826048 36417536 149368586 8e72f0a vmlinux
92118087 20826112 36417536 149361735 8e71447 vmlinux7_pageops_after
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds two command line keys:
-c|--cgroup path|@inode Walk only pages owned by this memory cgroup
-C|--list-cgroup Show memory cgroup inodes
[vdavydov@virtuozzo.com: opt_cgroup should be uint64_t. Fix conflicts with "tools/vm/page-types.c: support swap entry"]
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Similarly to direct reclaim/compaction, kswapd attempts to combine
reclaim and compaction to attempt making memory allocation of given
order available.
The details differ from direct reclaim e.g. in having high watermark as
a goal. The code involved in kswapd's reclaim/compaction decisions has
evolved to be quite complex.
Testing reveals that it doesn't actually work in at least one scenario,
and closer inspection suggests that it could be greatly simplified
without compromising on the goal (make high-order page available) or
efficiency (don't reclaim too much). The simplification relieas of
doing all compaction in kcompactd, which is simply woken up when high
watermarks are reached by kswapd's reclaim.
The scenario where kswapd compaction doesn't work was found with mmtests
test stress-highalloc configured to attempt order-9 allocations without
direct reclaim, just waking up kswapd. There was no compaction attempt
from kswapd during the whole test. Some added instrumentation shows
what happens:
- balance_pgdat() sets end_zone to Normal, as it's not balanced
- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
it cannot reclaim anything, so sc.nr_reclaimed is 0
- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
it merely checks if high watermarks were reached for base pages.
This is true, so no reclaim is attempted. For DMA, testorder=0
wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
- even though the pgdat_needs_compaction flag wasn't set to false, no
compaction happens due to the condition sc.nr_reclaimed >
nr_attempted being false (as 0 < 99)
- priority-- due to nr_reclaimed being 0, repeat until priority reaches
0 pgdat_balanced() is false as only the small zone DMA appears
balanced (curiously in that check, watermark appears OK and
compaction_suitable() returns COMPACT_PARTIAL, because a lower
classzone_idx is used there)
Now, even if it was decided that reclaim shouldn't be attempted on the
DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
nr_attempted=0) is also false. The condition really should use >= as
the comment suggests. Then there is a mismatch in the check for setting
pgdat_needs_compaction to false using low watermark, while the rest uses
high watermark, and who knows what other subtlety. Hopefully this
demonstrates that this is unsustainable.
Luckily we can simplify this a lot. The reclaim/compaction decisions
make sense for direct reclaim scenario, but in kswapd, our primary goal
is to reach high watermark in order-0 pages. Afterwards we can attempt
compaction just once. Unlike direct reclaim, we don't reclaim extra
pages (over the high watermark), the current code already disallows it
for good reasons.
After this patch, we simply wake up kcompactd to process the pgdat,
after we have either succeeded or failed to reach the high watermarks in
kswapd, which goes to sleep. We pass kswapd's order and classzone_idx,
so kcompactd can apply the same criteria to determine which zones are
worth compacting. Note that we use the classzone_idx from
wakeup_kswapd(), not balanced_classzone_idx which can include higher
zones that kswapd tried to balance too, but didn't consider them in
pgdat_balanced().
Since kswapd now cannot create high-order pages itself, we need to
adjust how it determines the zones to be balanced. The key element here
is adding a "highorder" parameter to zone_balanced, which, when set to
false, makes it consider only order-0 watermark instead of the desired
higher order (this was done previously by kswapd_shrink_zone(), but not
elsewhere). This false is passed for example in pgdat_balanced().
Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
kcompactd are woken up for a high-order allocation failure.
The last thing is to decide what to do with pageblock_skip bitmap
handling. Compaction maintains a pageblock_skip bitmap to record
pageblocks where isolation recently failed. This bitmap can be reset by
three ways:
1) direct compaction is restarting after going through the full deferred cycle
2) kswapd goes to sleep, and some other direct compaction has previously
finished scanning the whole zone and set zone->compact_blockskip_flush.
Note that a successful direct compaction clears this flag.
3) compaction was invoked manually via trigger in /proc
The case 2) is somewhat fuzzy to begin with, but after introducing
kcompactd we should update it. The check for direct compaction in 1),
and to set the flush flag in 2) use current_is_kswapd(), which doesn't
work for kcompactd. Thus, this patch adds bool direct_compaction to
compact_control to use in 2). For the case 1) we remove the check
completely - unlike the former kswapd compaction, kcompactd does use the
deferred compaction functionality, so flushing tied to restarting from
deferred compaction makes sense here.
Note that when kswapd goes to sleep, kcompactd is woken up, so it will
see the flushed pageblock_skip bits. This is different from when the
former kswapd compaction observed the bits and I believe it makes more
sense. Kcompactd can afford to be more thorough than a direct
compaction trying to limit allocation latency, or kswapd whose primary
goal is to reclaim.
For testing, I used stress-highalloc configured to do order-9
allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
phases 1 and 2 work as usual):
stress-highalloc
4.5-rc1+before 4.5-rc1+after
-nodirect -nodirect
Success 1 Min 1.00 ( 0.00%) 5.00 (-66.67%)
Success 1 Mean 1.40 ( 0.00%) 6.20 (-55.00%)
Success 1 Max 2.00 ( 0.00%) 7.00 (-16.67%)
Success 2 Min 1.00 ( 0.00%) 5.00 (-66.67%)
Success 2 Mean 1.80 ( 0.00%) 6.40 (-52.38%)
Success 2 Max 3.00 ( 0.00%) 7.00 (-16.67%)
Success 3 Min 34.00 ( 0.00%) 62.00 ( 1.59%)
Success 3 Mean 41.80 ( 0.00%) 63.80 ( 1.24%)
Success 3 Max 53.00 ( 0.00%) 65.00 ( 2.99%)
User 3166.67 3181.09
System 1153.37 1158.25
Elapsed 1768.53 1799.37
4.5-rc1+before 4.5-rc1+after
-nodirect -nodirect
Direct pages scanned 32938 32797
Kswapd pages scanned 2183166 2202613
Kswapd pages reclaimed 2152359 2143524
Direct pages reclaimed 32735 32545
Percentage direct scans 1% 1%
THP fault alloc 579 612
THP collapse alloc 304 316
THP splits 0 0
THP fault fallback 793 778
THP collapse fail 11 16
Compaction stalls 1013 1007
Compaction success 92 67
Compaction failures 920 939
Page migrate success 238457 721374
Page migrate failure 23021 23469
Compaction pages isolated 504695 1479924
Compaction migrate scanned 661390 8812554
Compaction free scanned 13476658 84327916
Compaction cost 262 838
After this patch we see improvements in allocation success rate
(especially for phase 3) along with increased compaction activity. The
compaction stalls (direct compaction) in the interfering kernel builds
(probably THP's) also decreased somewhat thanks to kcompactd activity,
yet THP alloc successes improved a bit.
Note that elapsed and user time isn't so useful for this benchmark,
because of the background interference being unpredictable. It's just
to quickly spot some major unexpected differences. System time is
somewhat more useful and that didn't increase.
Also (after adjusting mmtests' ftrace monitor):
Time kswapd awake 2547781 2269241
Time kcompactd awake 0 119253
Time direct compacting 939937 557649
Time kswapd compacting 0 0
Time kcompactd compacting 0 119099
The decrease of overal time spent compacting appears to not match the
increased compaction stats. I suspect the tasks get rescheduled and
since the ftrace monitor doesn't see that, the reported time is wall
time, not CPU time. But arguably direct compactors care about overall
latency anyway, whether busy compacting or waiting for CPU doesn't
matter. And that latency seems to almost halved.
It's also interesting how much time kswapd spent awake just going
through all the priorities and failing to even try compacting, over and
over.
We can also configure stress-highalloc to perform both direct
reclaim/compaction and wakeup kswapd/kcompactd, by using
GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
stress-highalloc
4.5-rc1+before 4.5-rc1+after
-direct -direct
Success 1 Min 4.00 ( 0.00%) 9.00 (-50.00%)
Success 1 Mean 8.00 ( 0.00%) 10.00 (-19.05%)
Success 1 Max 12.00 ( 0.00%) 11.00 ( 15.38%)
Success 2 Min 4.00 ( 0.00%) 9.00 (-50.00%)
Success 2 Mean 8.20 ( 0.00%) 10.00 (-16.28%)
Success 2 Max 13.00 ( 0.00%) 11.00 ( 8.33%)
Success 3 Min 75.00 ( 0.00%) 74.00 ( 1.33%)
Success 3 Mean 75.60 ( 0.00%) 75.20 ( 0.53%)
Success 3 Max 77.00 ( 0.00%) 76.00 ( 0.00%)
User 3344.73 3246.04
System 1194.24 1172.29
Elapsed 1838.04 1836.76
4.5-rc1+before 4.5-rc1+after
-direct -direct
Direct pages scanned 125146 120966
Kswapd pages scanned 2119757 2135012
Kswapd pages reclaimed 2073183 2108388
Direct pages reclaimed 124909 120577
Percentage direct scans 5% 5%
THP fault alloc 599 652
THP collapse alloc 323 354
THP splits 0 0
THP fault fallback 806 793
THP collapse fail 17 16
Compaction stalls 2457 2025
Compaction success 906 518
Compaction failures 1551 1507
Page migrate success 2031423 2360608
Page migrate failure 32845 40852
Compaction pages isolated 4129761 4802025
Compaction migrate scanned 11996712 21750613
Compaction free scanned 214970969 344372001
Compaction cost 2271 2694
In this scenario, this patch doesn't change the overall success rate as
direct compaction already tries all it can. There's however significant
reduction in direct compaction stalls (that is, the number of
allocations that went into direct compaction). The number of successes
(i.e. direct compaction stalls that ended up with successful
allocation) is reduced by the same number. This means the offload to
kcompactd is working as expected, and direct compaction is reduced
either due to detecting contention, or compaction deferred by kcompactd.
In the previous version of this patchset there was some apparent
reduction of success rate, but the changes in this version (such as
using sync compaction only), new baseline kernel, and/or averaging
results from 5 executions (my bet), made this go away.
Ftrace-based stats seem to roughly agree:
Time kswapd awake 2532984 2326824
Time kcompactd awake 0 257916
Time direct compacting 864839 735130
Time kswapd compacting 0 0
Time kcompactd compacting 0 257585
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can reuse the nid we've determined instead of repeated pfn_to_nid()
usages. Also zone_to_nid() should be a bit cheaper in general than
pfn_to_nid().
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During work on kcompactd integration I have spotted a confusing check of
balance_classzone_idx, which I believe is bogus.
The balanced_classzone_idx is filled by balance_pgdat() as the highest
zone it attempted to balance. This was introduced by commit dc83edd941f4
("mm: kswapd: use the classzone idx that kswapd was using for
sleeping_prematurely()").
The intention is that (as expressed in today's function names), the
value used for kswapd_shrink_zone() calls in balance_pgdat() is the same
as for the decisions in kswapd_try_to_sleep().
An unwanted side-effect of that commit was breaking the checks in
kswapd() whether there was another kswapd_wakeup with a tighter (=lower)
classzone_idx. Commits 215ddd6664ce ("mm: vmscan: only read
new_classzone_idx from pgdat when reclaiming successfully") and
d2ebd0f6b895 ("kswapd: avoid unnecessary rebalance after an unsuccessful
balancing") tried to fixed, but apparently introduced a bogus check that
this patch removes.
Consider zone indexes X < Y < Z, where:
- Z is the value used for the first kswapd wakeup.
- Y is returned as balanced_classzone_idx, which means zones with index higher
than Y (including Z) were found to be unreclaimable.
- X is the value used for the second kswapd wakeup
The new wakeup with value X means that kswapd is now supposed to balance
harder all zones with index <= X. But instead, due to Y < Z, it will go
sleep and won't read the new value X. This is subtly wrong.
The effect of this patch is that kswapd will react better in some
situations, where e.g. the first wakeup is for ZONE_DMA32, the second is
for ZONE_DMA, and due to unreclaimable ZONE_NORMAL. Before this patch,
kswapd would go sleep instead of reclaiming ZONE_DMA harder. I expect
these situations are very rare, and more value is in better
maintainability due to the removal of confusing and bogus check.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
whether it is enabled or not in runtime.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
whether it is enabled or not in runtime.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
whether it is enabled or not in runtime.
[akpm@linux-foundation.org: export _debug_pagealloc_enabled to modules]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can disable debug_pagealloc processing even if the code is compiled
with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
whether it is enabled or not in runtime.
[akpm@linux-foundation.org: clean up code, per Christian]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As CONFIG_DEBUG_PAGEALLOC can be enabled/disabled via kernel parameters
we can optimize some cases by checking the enablement state.
This is follow-up work for Christian's Optimize CONFIG_DEBUG_PAGEALLOC:
https://lkml.org/lkml/2016/1/27/194
Remaining work is to make sparc to be aware of this but it looks not
easy for me so I skip that in this series.
This patch (of 5):
We can disable debug_pagealloc processing even if the code is complied
with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
whether it is enabled or not in runtime.
[akpm@linux-foundation.org: update comment, per David. Adjust comment to use 80 cols]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/pid/pagemap (pte_to_pagemap_entry() internally) already reports
about swap entry, so let's make the in-kernel utility aware of it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently /proc/kpageflags returns just KPF_COMPOUND_TAIL for slab tail
pages, which is inconvenient when grasping how slab pages are
distributed (userspace always needs to check which kind of tail pages by
itself). This patch sets KPF_SLAB for such pages.
With this patch:
$ grep Slab /proc/meminfo ; tools/vm/page-types -b slab
Slab: 64880 kB
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000080 16220 63 _______S__________________________________ slab
total 16220 63
16220 pages equals to 64880 kB, so returned result is consistent with the
global counter.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
is inconvenient when grasping how free pages are distributed. This
patch sets KPF_BUDDY for such pages.
With this patch:
$ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
MemFree: 3134992 kB
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000400 779272 3044 __________B_______________________________ buddy
0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap
total 783657 3061
783657 pages is 3134628 kB (roughly consistent with the global counter,)
so it's OK.
[akpm@linux-foundation.org: update comment, per Naoya]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Show how much memory is allocated to kernel stacks.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Show how much memory is used for storing reclaimable and unreclaimable
in-kernel data structures allocated from slab caches.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, tree_{stat,events} helpers can only get one stat index at a
time, so when there are a lot of stats to be reported one has to call it
over and over again (see memory_stat_show). This is neither effective,
nor does it look good. Instead, let's make these helpers take a
snapshot of all available counters.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Slab pages are charged in two steps. First, an appropriate per memcg
cache is selected (see memcg_kmem_get_cache) basing on the current
context, then the new slab page is charged to the memory cgroup which
the selected cache was created for (see memcg_charge_slab ->
__memcg_kmem_charge_memcg). It is OK to bypass kmemcg charge at step 1,
but if step 1 succeeded and we successfully allocated a new slab page,
step 2 must be performed, otherwise we would get a per memcg kmem cache
which contains a slab that does not hold a reference to the memory
cgroup owning the cache. Since per memcg kmem caches are destroyed on
memcg css free, this could result in freeing a cache while there are
still active objects in it.
However, currently we will bypass slab page charge if the memory cgroup
owning the cache is offline (see __memcg_kmem_charge_memcg). This is
very unlikely to occur in practice, because for this to happen a process
must be migrated to a different cgroup and the old cgroup must be
removed while the process is in kmalloc somewhere between steps 1 and 2
(e.g. trying to allocate a new page). Nevertheless, it's still better
to eliminate such a possibility.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that task regardless of the score. The problem is that if
that task got stuck waiting for some state the allocation site is
holding, the OOM reaper can not move on to the next best victim.
Frankly, I don't even know why we check for exiting tasks in the OOM
killer. We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise. Remove it.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While working on a script to restore all sysctl params before a series of
tests I found that writing any value into the
/proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
causes them to call proc_watchdog_update().
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
There doesn't appear to be a reason for doing this work every time a write
occurs, so only do it when the values change.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: <stable@vger.kernel.org> [4.1.x+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1f330c327900 ("drivers/firmware/broadcom/bcm47xx_nvram.c: use
__ioread32_copy() instead of open-coding") switched to use a generic
copy function, but failed to notice that the header pointer is updated
between the two copies, resulting in bogus data being copied in the
latter one. Fix by keeping the old header pointer.
The patch fixes totally broken networking on WRT54GL router (both LAN and
WLAN interfaces fail to probe).
Fixes: 1f330c327900 ("drivers/firmware/broadcom/bcm47xx_nvram.c: use __ioread32_copy() instead of open-coding")
Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Rafal Milecki <zajec5@gmail.com>
Cc: Hauke Mehrtens <hauke@hauke-m.de>
Cc: <stable@vger.kernel.org> [4.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All architectures now need ioremap_uc(), ia64 seems defines this already
through its ioremap_nocache() and it already ensures it *only* uses UC.
This is needed since v4.3 to complete an allyesconfig compile on ia64,
there were others archs that needed this, and this one seems to have
fallen through the cracks.
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org> [4.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here is the big USB patchset for 4.6-rc1.
The normal mess is here, gadget and xhci fixes and updates, and lots of
other driver updates and cleanups as well. Full details are in the
shortlog.
All have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlbp8/EACgkQMUfUDdst+ylsyQCgnVK6ZIFVPV9VijJvBIjxS3F+
fTMAoIMQwNrRMHQOq/lhxX00AgN0B9Ch
=2EQp
-----END PGP SIGNATURE-----
Merge tag 'usb-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB updates from Greg KH:
"Here is the big USB patchset for 4.6-rc1.
The normal mess is here, gadget and xhci fixes and updates, and lots
of other driver updates and cleanups as well. Full details are in the
shortlog.
All have been in linux-next for a while with no reported issues"
* tag 'usb-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (266 commits)
USB: core: let USB device know device node
usb: devio: Add ioctl to disallow detaching kernel USB drivers.
usb: gadget: f_acm: Fix configfs attr name
usb: udc: lpc32xx: remove USB PLL and USB OTG clock management
usb: udc: lpc32xx: remove direct access to clock controller registers
usb: udc: lpc32xx: switch to clock prepare/unprepare model
usb: renesas_usbhs: gadget: fix giveback status code in usbhsg_pipe_disable()
usb: gadget: renesas_usb3: Use ARCH_RENESAS
usb: dwc2: Fix issues in dwc2_complete_non_isoc_xfer_ddma()
usb: dwc2: Add support for Lantiq ARX and XRX SoCs
usb: phy: generic: Handle late registration of gadget
usb: gadget: bdc_udc: fix race condition in bdc_udc_exit()
usb: musb: core: added missing const qualifier to musb_hdrc_platform_data::config
usb: dwc2: Move host-specific core functions into hcd.c
usb: dwc2: Move register save and restore functions
usb: dwc2: Use kmem_cache_free()
usb: dwc2: host: If using uframe scheduler, end splits better
usb: dwc2: host: Totally redo the microframe scheduler
usb: dwc2: host: Properly set even/odd frame
usb: dwc2: host: Add dwc2_hcd_get_future_frame_number() call
...
This variant has coherent cache, is equipped with interrupt distributor
and is capable of running SMP linux.
Signed-off-by: Piet Delaney <piet.delaney@gmail.com>
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Here's the big tty/serial driver pull request for 4.6-rc1.
Lots of changes in here, Peter has been on a tear again, with lots of
refactoring and bugs fixes, many thanks to the great work he has been
doing. Lots of driver updates and fixes as well, full details in the
shortlog.
All have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlbp8z8ACgkQMUfUDdst+ym1vwCgnOOCORaZyeQ4QrcxPAK5pHFn
VrMAoNHvDgNYtG+Hmzv25Lgp3HnysPin
=MLRG
-----END PGP SIGNATURE-----
Merge tag 'tty-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial updates from Greg KH:
"Here's the big tty/serial driver pull request for 4.6-rc1.
Lots of changes in here, Peter has been on a tear again, with lots of
refactoring and bugs fixes, many thanks to the great work he has been
doing. Lots of driver updates and fixes as well, full details in the
shortlog.
All have been in linux-next for a while with no reported issues"
* tag 'tty-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (220 commits)
serial: 8250: describe CONFIG_SERIAL_8250_RSA
serial: samsung: optimize UART rx fifo access routine
serial: pl011: add mark/space parity support
serial: sa1100: make sa1100_register_uart_fns a function
tty: serial: 8250: add MOXA Smartio MUE boards support
serial: 8250: convert drivers to use up_to_u8250p()
serial: 8250/mediatek: fix building with SERIAL_8250=m
serial: 8250/ingenic: fix building with SERIAL_8250=m
serial: 8250/uniphier: fix modular build
Revert "drivers/tty/serial: make 8250/8250_ingenic.c explicitly non-modular"
Revert "drivers/tty/serial: make 8250/8250_mtk.c explicitly non-modular"
serial: mvebu-uart: initial support for Armada-3700 serial port
serial: mctrl_gpio: Add missing module license
serial: ifx6x60: avoid uninitialized variable use
tty/serial: at91: fix bad offset for UART timeout register
tty/serial: at91: restore dynamic driver binding
serial: 8250: Add hardware dependency to RT288X option
TTY, devpts: document pty count limiting
tty: goldfish: support platform_device with id -1
drivers: tty: goldfish: Add device tree bindings
...
Here is the big char/misc driver update for 4.6-rc1.
The majority of the patches here is hwtracing and some new mic drivers,
but there's a lot of other driver updates as well. Full details in the
shortlog.
All have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlbp9IcACgkQMUfUDdst+ykyJgCeLTC2QNGrh51kiJglkVJ0yD36
q4MAn0NkvSX2+iv5Jq8MaX6UQoRa4Nun
=MNjR
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc updates from Greg KH:
"Here is the big char/misc driver update for 4.6-rc1.
The majority of the patches here is hwtracing and some new mic
drivers, but there's a lot of other driver updates as well. Full
details in the shortlog.
All have been in linux-next for a while with no reported issues"
* tag 'char-misc-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (238 commits)
goldfish: Fix build error of missing ioremap on UM
nvmem: mediatek: Fix later provider initialization
nvmem: imx-ocotp: Fix return value of imx_ocotp_read
nvmem: Fix dependencies for !HAS_IOMEM archs
char: genrtc: replace blacklist with whitelist
drivers/hwtracing: make coresight-etm-perf.c explicitly non-modular
drivers: char: mem: fix IS_ERROR_VALUE usage
char: xillybus: Fix internal data structure initialization
pch_phub: return -ENODATA if ROM can't be mapped
Drivers: hv: vmbus: Support kexec on ws2012 r2 and above
Drivers: hv: vmbus: Support handling messages on multiple CPUs
Drivers: hv: utils: Remove util transport handler from list if registration fails
Drivers: hv: util: Pass the channel information during the init call
Drivers: hv: vmbus: avoid unneeded compiler optimizations in vmbus_wait_for_unload()
Drivers: hv: vmbus: remove code duplication in message handling
Drivers: hv: vmbus: avoid wait_for_completion() on crash
Drivers: hv: vmbus: don't loose HVMSG_TIMER_EXPIRED messages
misc: at24: replace memory_accessor with nvmem_device_read
eeprom: 93xx46: extend driver to plug into the NVMEM framework
eeprom: at25: extend driver to plug into the NVMEM framework
...
Just a few patches this time around for the 4.6-rc1 merge window.
Largest is a new firmware driver, but there are some other updates to
the driver core in here as well, the shortlog has the details.
All have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlbp9QoACgkQMUfUDdst+ynsOQCghpfAf3CJDr4PWGCKzDJzyQG9
rZYAn2VwKsqHzAxgLXZY5fQIjxSyaLek
=Mcvl
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Just a few patches this time around for the 4.6-rc1 merge window.
Largest is a new firmware driver, but there are some other updates to
the driver core in here as well, the shortlog has the details.
All have been in linux-next for a while with no reported issues"
* tag 'driver-core-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
Revert "driver-core: platform: probe of-devices only using list of compatibles"
firmware: qemu config needs I/O ports
firmware: qemu_fw_cfg.c: fix typo FW_CFG_DATA_OFF
driver-core: platform: probe of-devices only using list of compatibles
driver-core: platform: fix typo in documentation for multi-driver helper
component: remove impossible condition
drivers: dma-coherent: simplify dma_init_coherent_memory return value
devicetree: update documentation for fw_cfg ARM bindings
firmware: create directory hierarchy for sysfs fw_cfg entries
firmware: introduce sysfs driver for QEMU's fw_cfg device
kobject: export kset_find_obj() for module use
driver core: bus: use to_subsys_private and to_device_private_bus
driver core: bus: use list_for_each_entry*
debugfs: Add stub function for debugfs_create_automount().
kernfs: make kernfs_walk_ns() use kernfs_pr_cont_buf[]
Various enablers for assignment of Intel graphics devices and future
support of vGPU devices (Alex Williamson). This includes
- Handling the vfio type1 interface as an API rather than a specific
implementation, allowing multiple type1 providers.
- Capability chains, similar to PCI device capabilities, that allow
extending ioctls. Extensions here include device specific regions
and sparse mmap descriptions. The former is used to expose non-PCI
regions for IGD, including the OpRegion (particularly the Video
BIOS Table), and read only PCI config access to the host and LPC
bridge as drivers often depend on identifying those devices.
Sparse mmaps here are used to describe the MSIx vector table,
which vfio has always protected from mmap, but never had an API to
explicitly define that protection. In future vGPU support this is
expected to allow the description of PCI BARs that may mix direct
access and emulated access within a single region.
- The ability to expose the shadow ROM as an option ROM as IGD use
cases may rely on the ROM even though the physical device does not
make use of a PCI option ROM BAR.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJW6aT1AAoJECObm247sIsiiP4P/1xf7Z08/2QWVFQzex9CLcZk
+/iJlyb/fTpPVQE+NTKPz3Qh5h6ZhSd/57s85IUqq0T6tgVPkoGx8kkyCjBaw2y1
yMezXZlQqJdZqGzQNI4OiHWvO+/vGxYKjQMfUnMlDM6dJgz4lGncGFoSouFPa3Vp
mB12hGxrlk1cfIdb+C1KbfZcEdS0WhtigQtz8flBKgOfO+hYWmUO+CClJBhVw8Z4
RNcWNAxFfLuwUPVsPb6uOLG2g65SC2vmQ9k0Tnknf1znV3PFFVjITf0aM6uChLNP
S3SgqtPX+6yOFyCuSEs8UKhhmCbeQmAyKgt5BpxV3Rw3OMP4PsVAehr82vQmSj6g
2o96pR2s8MDPBr8eG7gdRe4DQe3PonpLkpDfaghcpYqhkGEqNVeW5/GjiOzGQqD3
xMshzxJ1Iz7DOHkQRUVqOfupDB0TusJmTVKwvXe6yIYL9pjkUS/sbN9U563HYSES
JTV68TMj0VKfKwD3XKYXvGH3km1sL4i5NMlAUrsDtsMkGlXEswoGbj82Mjc8+jUo
BvWQTJb+kouJQ88VhsO2abg1UrO9E6u82iHFHy9fEObxE8KH7pvROlS93ihMT1Wv
WQNuUcltdpHMRVX0BDknaPs3YtC3/TGgm3RcU5SZPbv/ys1471ZmJxMlAAKcfITr
SuvkMTYElF5b1pigv46c
=/lJn
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
"Various enablers for assignment of Intel graphics devices and future
support of vGPU devices (Alex Williamson). This includes
- Handling the vfio type1 interface as an API rather than a specific
implementation, allowing multiple type1 providers.
- Capability chains, similar to PCI device capabilities, that allow
extending ioctls. Extensions here include device specific regions
and sparse mmap descriptions. The former is used to expose non-PCI
regions for IGD, including the OpRegion (particularly the Video
BIOS Table), and read only PCI config access to the host and LPC
bridge as drivers often depend on identifying those devices.
Sparse mmaps here are used to describe the MSIx vector table, which
vfio has always protected from mmap, but never had an API to
explicitly define that protection. In future vGPU support this is
expected to allow the description of PCI BARs that may mix direct
access and emulated access within a single region.
- The ability to expose the shadow ROM as an option ROM as IGD use
cases may rely on the ROM even though the physical device does not
make use of a PCI option ROM BAR"
* tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio:
vfio/pci: return -EFAULT if copy_to_user fails
vfio/pci: Expose shadow ROM as PCI option ROM
vfio/pci: Intel IGD host and LCP bridge config space access
vfio/pci: Intel IGD OpRegion support
vfio/pci: Enable virtual register in PCI config space
vfio/pci: Add infrastructure for additional device specific regions
vfio: Define device specific region type capability
vfio/pci: Include sparse mmap capability for MSI-X table regions
vfio: Define sparse mmap capability for regions
vfio: Add capability chain helpers
vfio: Define capability chains
vfio: If an IOMMU backend fails, keep looking
vfio/pci: Fix unsigned comparison overflow
* nokia-modem: add N950 and N9 support
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJW6ZcAAAoJENju1/PIO/qafF0P/jXGXdXTAOIQKuPqhOZIM41T
gpvDYxIYho5SCm6xbn6YxyGpUG7ENaVFSRB4iQDH8vzvscjpsTcwWFTMVs6ILV3e
akbt8zpiKe1CbOQy7hT2sKUxSli87nlVqaksOMdMZbrIy/Z8KeiAWnDeTVQX04v/
wTpeW/EnCHNk5vwFjZWwXnAy+imiP1ICmvdzWnbAK5du1KDMEQbJcKgG8hCcW5hg
8mcddaFzhtQow9RURw0xpwj0qxcXAqPPK4s4Fb5I//LKpKucYHCuiDOaufdKDR3q
Vgn7y9XU4ZY0gxFGWr4y7nuW7/C/ObXW8wHnS/JbfCvvZQ9NjOb84LBqi8nDVbl3
Hdj1MQVKu7bpJ2IX9U/D298xWDk7RqIrV0BIZYqogpWWstQUjkMJaXXlUXzeIK1X
DwKUYzxKJCJXllMkogqZMHJqtHlow/3FTqasr7v3rMMfu5JjqsRHdUDoZwc7PYQp
3qSIdbfFQs146Q+3jIOw7fonc/KDqUR7XNWaWZCUhEhIQZ62g611fEyAJDgL3UVp
Kl0vRMa1e5Zc5RBa7e8wfNF8s2LhSfH8jvwoP2224V7OgmyU2jrIzeP4vORBwWvb
unSV4ZbvOY6Q8btKOtruHqB5ZSY2s741qiNxJMVOcEQMMFNdu0/YzCKpA5aq0NsA
Jur0Z5I3vxXqFHUG9FSE
=gBkN
-----END PGP SIGNATURE-----
Merge tag 'hsi-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi
Pull HSI updates from Sebastian Reichel:
"nokia-modem: add N950 and N9 support"
* tag 'hsi-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi:
HSI: ssi-protocol: Use handshake logic from n950
HSI: nokia-modem: add n950 and n9 support
* add types for USB Type C and PD chargers
* add act8945a charger driver
* add ACPI/DT bindings for goldfish-battery
* add support for versatile reset controller
* misc. fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJW6XstAAoJENju1/PIO/qaSmMP+gJY9U4yPscEoW2UkvKJ2y81
/Bxvy8GWSnnaFFPLMHYk/n53EPJbPnewETBftv1xRHzgJ6wQlgM/5EAEU2Qwud+A
V8K3V8VStIZLIrQ1tU8r10UE5OW95lsupCAlFp9AHYrBgIiRDUHlqWtxuhwH12F2
uLZt7moLHJFZtvg3EWl4agsvoEWfqPOKwmOpda+PyIUOlZrAIu3nItvKwBKWdvr3
WuomqfknkekkvX2lSB8R4OfE+w4fkYTFak2+4ymNiaQWrLeEqLuJbICUNGMD+APb
nX8TmQ8wmYYHLZTu455ctGJIdpYfrRI/IPAsdW/vMYnQ/8GiO0dgke5dfzZU08Du
PAWb/NM2pjFfG76fqG46hQPq/Jsns9k/SM1e3Q6fCkwewqtx8P56vUML6FtomYe7
HP3oPqlNeEpcF/jMmxKfSNiF78pmRLpCkWuhhYaT4Suv8bJhoQRE7Ur00Y47hp9i
taAIDE3HNFucqTEXMudBktEKsDPRpEx2K1fZEvZbCF8sOUK4xWPQr1yuPTXrM7jm
sWl/T4s1tn9/P6esdZUTXGghBiehj71mCHsOfXBQZPifaB2s8uINEf+GHQyQcLL4
X2HiMWrv9Pyn+PLVVwWRmTpSVIFWrCHNQPQ1ofgsgWkrEqsAbdZ4O2srlCqg/4WL
irCHBm6fbGpoVRQHg3xS
=L4ho
-----END PGP SIGNATURE-----
Merge tag 'for-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply
Pull power supply and reset changes from Sebastian Reichel:
- add types for USB Type C and PD chargers
- add act8945a charger driver
- add ACPI/DT bindings for goldfish-battery
- add support for versatile reset controller
- misc fixes
* tag 'for-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: (24 commits)
power: pm2301-charger: use __maybe_unused to hide pm functions
power: ipaq-micro-battery: use __maybe_unused to hide pm functions
power_supply: 88pm860x_charger: do not pass NULL to power_supply_put
jz4740-battery: Correct voltage change check
power_supply: lp8788-charger: initialize boolean 'found'
goldfish: Enable ACPI-based enumeration for goldfish battery
power: goldfish_battery: add devicetree bindings
power: act8945a: add charger driver for ACT8945A
power: add documentation for ACT8945A's charger DT bindings
ARM: dts: n900: Rename isp1704 to isp1707 to match correct name
power_supply: bq27xxx_battery: Add of modalias and match table when CONFIG_OF is enabled
power_supply: bq2415x_charger: Add of modalias and match table when CONFIG_OF is enabled
power_supply: bq2415x_charger: Do not add acpi modalias when CONFIG_ACPI is not enabled
power_supply: isp1704_charger: Add compatible of match for nxp,isp1707
power_supply: isp1704_charger: Error messages when probe fail
power_supply: Add types for USB Type C and PD chargers
power: bq24735-charger: add 'ti,external-control' option
power: bq24735-charger: document 'ti,external-control' option
power: bq24735-charger: fix failed i2c with ac-detect
power: reset: Fix dependencies for !HAS_IOMEM archs
...