IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 7a9cdebdcc upstream.
Jann Horn points out that the vmacache_flush_all() function is not only
potentially expensive, it's buggy too. It also happens to be entirely
unnecessary, because the sequence number overflow case can be avoided by
simply making the sequence number be 64-bit. That doesn't even grow the
data structures in question, because the other adjacent fields are
already 64-bit.
So simplify the whole thing by just making the sequence number overflow
case go away entirely, which gets rid of all the complications and makes
the code faster too. Win-win.
[ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics
also just goes away entirely with this ]
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit cdbf92675f ("mm: numa: avoid waiting on freed migrated pages")
was an incomplete backport of the upstream commit. It is necessary to
always reset page_nid before attempting any early exit.
The original commit conflicted due to lack of commit 82b0f8c39a
("mm: join struct fault_env and vm_fault") in 4.9 so it wasn't a clean
application, and the change must have just gotten lost in the noise.
Signed-off-by: Chas Williams <chas3@att.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 16e536ef47 ]
/sys/../zswap/stored_pages keeps rising in a zswap test with
"zswap.max_pool_percent=0" parameter. But it should not compress or
store pages any more since there is no space in the compressed pool.
Reproduce steps:
1. Boot kernel with "zswap.enabled=1"
2. Set the max_pool_percent to 0
# echo 0 > /sys/module/zswap/parameters/max_pool_percent
3. Do memory stress test to see if some pages have been compressed
# stress --vm 1 --vm-bytes $mem_available"M" --timeout 60s
4. Watching the 'stored_pages' number increasing or not
The root cause is:
When zswap_max_pool_percent is set to 0 via kernel parameter,
zswap_is_full() will always return true due to zswap_shrink(). But if
the shinking is able to reclain a page successfully the code then
proceeds to compressing/storing another page, so the value of
stored_pages will keep changing.
To solve the issue, this patch adds a zswap_is_full() check again after
zswap_shrink() to make sure it's now under the max_pool_percent, and to
not compress/store if we reached the limit.
Link: http://lkml.kernel.org/r/20180530103936.17812-1-liwang@redhat.com
Signed-off-by: Li Wang <liwang@redhat.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1e8e18f694 ]
There is a special case that the size is "(N << KASAN_SHADOW_SCALE_SHIFT)
Pages plus X", the value of X is [1, KASAN_SHADOW_SCALE_SIZE-1]. The
operation "size >> KASAN_SHADOW_SCALE_SHIFT" will drop X, and the
roundup operation can not retrieve the missed one page. For example:
size=0x28006, PAGE_SIZE=0x1000, KASAN_SHADOW_SCALE_SHIFT=3, we will get
shadow_size=0x5000, but actually we need 6 pages.
shadow_size = round_up(size >> KASAN_SHADOW_SCALE_SHIFT, PAGE_SIZE);
This can lead to a kernel crash when kasan is enabled and the value of
mod->core_layout.size or mod->init_layout.size is like above. Because
the shadow memory of X has not been allocated and mapped.
move_module:
ptr = module_alloc(mod->core_layout.size);
...
memset(ptr, 0, mod->core_layout.size); //crashed
Unable to handle kernel paging request at virtual address ffff0fffff97b000
......
Call trace:
__asan_storeN+0x174/0x1a8
memset+0x24/0x48
layout_and_allocate+0xcd8/0x1800
load_module+0x190/0x23e8
SyS_finit_module+0x148/0x180
Link: http://lkml.kernel.org/r/1529659626-12660-1-git-send-email-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Dmitriy Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 377eeaa8e1 upstream
For the L1TF workaround its necessary to limit the swap file size to below
MAX_PA/2, so that the higher bits of the swap offset inverted never point
to valid memory.
Add a mechanism for the architecture to override the swap file size check
in swapfile.c and add a x86 specific max swapfile check function that
enforces that limit.
The check is only enabled if the CPU is vulnerable to L1TF.
In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
with 46bit PA it is 32TB. The limit is only per individual swap file, so
it's always possible to exceed these limits with multiple swap files or
partitions.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 42e4089c78 upstream
For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
table entry. This sets the high bits in the CPU's address space, thus
making sure to point to not point an unmapped entry to valid cached memory.
Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such an high mapping was mapped to unprivileged users
they could attack low memory by setting such a mapping to PROT_NONE. This
could happen through a special device driver which is not access
protected. Normal /dev/mem is of course access protected.
To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
Valid page mappings are allowed because the system is then unsafe anyways.
It's not expected that users commonly use PROT_NONE on MMIO. But to
minimize any impact this is only enforced if the mapping actually refers to
a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
the check for root.
For mmaps this is straight forward and can be handled in vm_insert_pfn and
in remap_pfn_range().
For mprotect it's a bit trickier. At the point where the actual PTEs are
accessed a lot of state has been changed and it would be difficult to undo
on an error. Since this is a uncommon case use a separate early page talk
walk pass for MMIO PROT_NONE mappings that checks for this condition
early. For non MMIO and non PROT_NONE there are no changes.
[dwmw2: Backport to 4.9]
[groeck: Backport to 4.4]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9f15bde671 upstream.
It was reported that a kernel crash happened in mem_cgroup_iter(), which
can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.
Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
......
Call trace:
mem_cgroup_iter+0x2e0/0x6d4
shrink_zone+0x8c/0x324
balance_pgdat+0x450/0x640
kswapd+0x130/0x4b8
kthread+0xe8/0xfc
ret_from_fork+0x10/0x20
mem_cgroup_iter():
......
if (css_tryget(css)) <-- crash here
break;
......
The crashing reason is that mem_cgroup_iter() uses the memcg object whose
pointer is stored in iter->position, which has been freed before and
filled with POISON_FREE(0x6b).
And the root cause of the use-after-free issue is that
invalidate_reclaim_iterators() fails to reset the value of iter->position
to NULL when the css of the memcg is released in non- hierarchical mode.
Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
Fixes: 6df38689e0 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
Signed-off-by: Jing Xia <jing.xia.mail@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <chunyan.zhang@unisoc.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7810e6781e upstream.
In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
allocations that can ignore memory policies. The zonelist is obtained
from current CPU's node. This is a problem for __GFP_THISNODE
allocations that want to allocate on a different node, e.g. because the
allocating thread has been migrated to a different CPU.
This has been observed to break SLAB in our 4.4-based kernel, because
there it relies on __GFP_THISNODE working as intended. If a slab page
is put on wrong node's list, then further list manipulations may corrupt
the list because page_to_nid() is used to determine which node's
list_lock should be locked and thus we may take a wrong lock and race.
Current SLAB implementation seems to be immune by luck thanks to commit
511e3a0588 ("mm/slab: make cache_grow() handle the page allocated on
arbitrary node") but there may be others assuming that __GFP_THISNODE
works as promised.
We can fix it by simply removing the zonelist reset completely. There
is actually no reason to reset it, because memory policies and cpusets
don't affect the zonelist choice in the first place. This was different
when commit 183f6371aa ("mm: ignore mempolicies when using
ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
own restricted zonelists.
We might consider this for 4.17 although I don't know if there's
anything currently broken.
SLAB is currently not affected, but in kernels older than 4.7 that don't
yet have 511e3a0588 ("mm/slab: make cache_grow() handle the page
allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
ones I'll have to check.
So stable backports should be more important, but will have to be
reviewed carefully, as the code went through many changes. BTW I think
that also the ac->preferred_zoneref reset is currently useless if we
don't also reset ac->nodemask from a mempolicy to NULL first (which we
probably should for the OOM victims etc?), but I would leave that for a
separate patch.
Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 183f6371aa ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 423913ad4a upstream.
Commit be83bbf806 ("mmap: introduce sane default mmap limits") was
introduced to catch problems in various ad-hoc character device drivers
doing mmap and getting the size limits wrong. In the process, it used
"known good" limits for the normal cases of mapping regular files and
block device drivers.
It turns out that the "s_maxbytes" limit was less "known good" than I
thought. In particular, /proc doesn't set it, but exposes one regular
file to mmap: /proc/vmcore. As a result, that file got limited to the
default MAX_INT s_maxbytes value.
This went unnoticed for a while, because apparently the only thing that
needs it is the s390 kernel zfcpdump, but there might be other tools
that use this too.
Vasily suggested just changing s_maxbytes for all of /proc, which isn't
wrong, but makes me nervous at this stage. So instead, just make the
new mmap limit always be MAX_LFS_FILESIZE for regular files, which won't
affect anything else. It wasn't the regular file case I was worried
about.
I'd really prefer for maxsize to have been per-inode, but that is not
how things are today.
Fixes: be83bbf806 ("mmap: introduce sane default mmap limits")
Reported-by: Vasily Gorbik <gor@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit be83bbf806 upstream.
The internal VM "mmap()" interfaces are based on the mmap target doing
everything using page indexes rather than byte offsets, because
traditionally (ie 32-bit) we had the situation that the byte offset
didn't fit in a register. So while the mmap virtual address was limited
by the word size of the architecture, the backing store was not.
So we're basically passing "pgoff" around as a page index, in order to
be able to describe backing store locations that are much bigger than
the word size (think files larger than 4GB etc).
But while this all makes a ton of sense conceptually, we've been dogged
by various drivers that don't really understand this, and internally
work with byte offsets, and then try to work with the page index by
turning it into a byte offset with "pgoff << PAGE_SHIFT".
Which obviously can overflow.
Adding the size of the mapping to it to get the byte offset of the end
of the backing store just exacerbates the problem, and if you then use
this overflow-prone value to check various limits of your device driver
mmap capability, you're just setting yourself up for problems.
The correct thing for drivers to do is to do their limit math in page
indices, the way the interface is designed. Because the generic mmap
code _does_ test that the index doesn't overflow, since that's what the
mmap code really cares about.
HOWEVER.
Finding and fixing various random drivers is a sisyphean task, so let's
just see if we can just make the core mmap() code do the limiting for
us. Realistically, the only "big" backing stores we need to care about
are regular files and block devices, both of which are known to do this
properly, and which have nice well-defined limits for how much data they
can access.
So let's special-case just those two known cases, and then limit other
random mmap users to a backing store that still fits in "unsigned long".
Realistically, that's not much of a limit at all on 64-bit, and on
32-bit architectures the only worry might be the GPU drivers, which can
have big physical address spaces.
To make it possible for drivers like that to say that they are 64-bit
clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
file flags to allow drivers to mark their file descriptors as safe in
the full 64-bit mmap address space.
[ The timing for doing this is less than optimal, and this should really
go in a merge window. But realistically, this needs wide testing more
than it needs anything else, and being main-line is the only way to do
that.
So the earlier the better, even if it's outside the proper development
cycle - Linus ]
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e92bb4dd96 ]
When page_mapping() is called and the mapping is dereferenced in
page_evicatable() through shrink_active_list(), it is possible for the
inode to be truncated and the embedded address space to be freed at the
same time. This may lead to the following race.
CPU1 CPU2
truncate(inode) shrink_active_list()
... page_evictable(page)
truncate_inode_page(mapping, page);
delete_from_page_cache(page)
spin_lock_irqsave(&mapping->tree_lock, flags);
__delete_from_page_cache(page, NULL)
page_cache_tree_delete(..)
... mapping = page_mapping(page);
page->mapping = NULL;
...
spin_unlock_irqrestore(&mapping->tree_lock, flags);
page_cache_free_page(mapping, page)
put_page(page)
if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space
mapping_unevictable(mapping)
test_bit(AS_UNEVICTABLE, &mapping->flags);
- we've dereferenced mapping which is potentially already free.
Similar race exists between swap cache freeing and page_evicatable()
too.
The address_space in inode and swap cache will be freed after a RCU
grace period. So the races are fixed via enclosing the page_mapping()
and address_space usage in rcu_read_lock/unlock(). Some comments are
added in code to make it clear what is protected by the RCU read lock.
Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 77da2ba064 ]
This patch fixes a corner case for KSM. When two pages belong or
belonged to the same transparent hugepage, and they should be merged,
KSM fails to split the page, and therefore no merging happens.
This bug can be reproduced by:
* making sure ksm is running (in case disabling ksmtuned)
* enabling transparent hugepages
* allocating a THP-aligned 1-THP-sized buffer
e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
* filling it with the same values
e.g. memset(p, 42, 1<<21)
* performing madvise to make it mergeable
e.g. madvise(p, 1<<21, MADV_MERGEABLE)
* waiting for KSM to perform a few scans
The expected outcome is that the all the pages get merged (1 shared and
the rest sharing); the actual outcome is that no pages get merged (1
unshared and the rest volatile)
The reason of this behaviour is that we increase the reference count
once for both pages we want to merge, but if they belong to the same
hugepage (or compound page), the reference counter used in both cases is
the one of the head of the compound page. This means that
split_huge_page will find a value of the reference counter too high and
will fail.
This patch solves this problem by testing if the two pages to merge
belong to the same hugepage when attempting to merge them. If so, the
hugepage is split safely. This means that the hugepage is not split if
not necessary.
Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a06ad633a3 ]
Calling swapon() on a zero length swap file on SSD can lead to a
divide-by-zero.
Although creating such files isn't possible with mkswap and they woud be
considered invalid, it would be better for the swapon code to be more
robust and handle this condition gracefully (return -EINVAL).
Especially since the fix is small and straightforward.
To help with wear leveling on SSD, the swapon syscall calculates a
random position in the swap file using modulo p->highest_bit, which is
set to maxpages - 1 in read_swap_header.
If the swap file is zero length, read_swap_header sets maxpages=1 and
last_page=0, resulting in p->highest_bit=0 and we divide-by-zero when we
modulo p->highest_bit in swapon syscall.
This can be prevented by having read_swap_header return zero if
last_page is zero.
Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
Signed-off-by: Thomas Abraham <tabraham@suse.com>
Reported-by: <Mark.Landis@Teradata.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 69d763fc6d ]
Minchan Kim asked the following question -- what locks protects
address_space destroying when race happens between inode trauncation and
__isolate_lru_page? Jan Kara clarified by describing the race as follows
CPU1 CPU2
truncate(inode) __isolate_lru_page()
...
truncate_inode_page(mapping, page);
delete_from_page_cache(page)
spin_lock_irqsave(&mapping->tree_lock, flags);
__delete_from_page_cache(page, NULL)
page_cache_tree_delete(..)
... mapping = page_mapping(page);
page->mapping = NULL;
...
spin_unlock_irqrestore(&mapping->tree_lock, flags);
page_cache_free_page(mapping, page)
put_page(page)
if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space
if (mapping && !mapping->a_ops->migratepage)
- we've dereferenced mapping which is potentially already free.
The race is theoretically possible but unlikely. Before the
delete_from_page_cache, truncate_cleanup_page is called so the page is
likely to be !PageDirty or PageWriteback which gets skipped by the only
caller that checks the mappping in __isolate_lru_page. Even if the race
occurs, a substantial amount of work has to happen during a tiny window
with no preemption but it could potentially be done using a virtual
machine to artifically slow one CPU or halt it during the critical
window.
This patch should eliminate the race with truncation by try-locking the
page before derefencing mapping and aborting if the lock was not
acquired. There was a suggestion from Huang Ying to use RCU as a
side-effect to prevent mapping being freed. However, I do not like the
solution as it's an unconventional means of preserving a mapping and
it's not a context where rcu_read_lock is obviously protecting rcu data.
Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
Fixes: c824493528 ("mm: compaction: make isolate_lru_page() filter-aware again")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 56521e7a02 ]
As Xiaojun reported the ltp of migrate_pages01 will fail on arm64 system
which has 4 nodes[0...3], all have memory and CONFIG_NODES_SHIFT=2:
migrate_pages01 0 TINFO : test_invalid_nodes
migrate_pages01 14 TFAIL : migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1
migrate_pages01 15 TFAIL : migrate_pages_common.c:55: call succeeded unexpectedly
In this case the test_invalid_nodes of migrate_pages01 will call:
SYSC_migrate_pages as:
migrate_pages(0, , {0x0000000000000001}, 64, , {0x0000000000000010}, 64) = 0
The new nodes specifies one or more node IDs that are greater than the
maximum supported node ID, however, the errno is not set to EINVAL as
expected.
As man pages of set_mempolicy[1], mbind[2], and migrate_pages[3]
mentioned, when nodemask specifies one or more node IDs that are greater
than the maximum supported node ID, the errno should set to EINVAL.
However, get_nodes only check whether the part of bits
[BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES), maxnode) is zero or not, and
remain [MAX_NUMNODES, BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES)
unchecked.
This patch is to check the bits of [MAX_NUMNODES, maxnode) in get_nodes
to let migrate_pages set the errno to EINVAL when nodemask specifies one
or more node IDs that are greater than the maximum supported node ID,
which follows the manpage's guide.
[1] http://man7.org/linux/man-pages/man2/set_mempolicy.2.html
[2] http://man7.org/linux/man-pages/man2/mbind.2.html
[3] http://man7.org/linux/man-pages/man2/migrate_pages.2.html
Link: http://lkml.kernel.org/r/1510882624-44342-3-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reported-by: Tan Xiaojun <tanxiaojun@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Chris Salls <salls@cs.ucsb.edu>
Cc: Christopher Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ab1e8d8960 upstream.
It is unsafe to do virtual to physical translations before mm_init() is
called if struct page is needed in order to determine the memory section
number (see SECTION_IN_PAGE_FLAGS). This is because only in mm_init()
we initialize struct pages for all the allocated memory when deferred
struct pages are used.
My recent fix in commit c9e97a1997 ("mm: initialize pages on demand
during boot") exposed this problem, because it greatly reduced number of
pages that are initialized before mm_init(), but the problem existed
even before my fix, as Fengguang Wu found.
Below is a more detailed explanation of the problem.
We initialize struct pages in four places:
1. Early in boot a small set of struct pages is initialized to fill the
first section, and lower zones.
2. During mm_init() we initialize "struct pages" for all the memory that
is allocated, i.e reserved in memblock.
3. Using on-demand logic when pages are allocated after mm_init call
(when memblock is finished)
4. After smp_init() when the rest free deferred pages are initialized.
The problem occurs if we try to do va to phys translation of a memory
between steps 1 and 2. Because we have not yet initialized struct pages
for all the reserved pages, it is inherently unsafe to do va to phys if
the translation itself requires access of "struct page" as in case of
this combination: CONFIG_SPARSE && !CONFIG_SPARSE_VMEMMAP
The following path exposes the problem:
start_kernel()
trap_init()
setup_cpu_entry_areas()
setup_cpu_entry_area(cpu)
get_cpu_gdt_paddr(cpu)
per_cpu_ptr_to_phys(addr)
pcpu_addr_to_page(addr)
virt_to_page(addr)
pfn_to_page(__pa(addr) >> PAGE_SHIFT)
We disable this path by not allowing NEED_PER_CPU_KM with deferred
struct pages feature.
The problems are discussed in these threads:
http://lkml.kernel.org/r/20180418135300.inazvpxjxowogyge@wfg-t540p.sh.intel.comhttp://lkml.kernel.org/r/20180419013128.iurzouiqxvcnpbvz@wfg-t540p.sh.intel.comhttp://lkml.kernel.org/r/20180426202619.2768-1-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180515175124.1770-1-pasha.tatashin@oracle.com
Fixes: 3a80a7fa79 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ebded02788 upstream.
In the generic read paths the kernel looks up a page in the page cache
and if it's up to date, it is used. If not, the page lock is acquired
to wait for IO to complete and then check the page. If multiple
processes are waiting on IO, they all serialise against the lock and
duplicate the checks. This is unnecessary.
The page lock in itself does not give any guarantees to the callers
about the page state as it can be immediately truncated or reclaimed
after the page is unlocked. It's sufficient to wait_on_page_locked and
then continue if the page is up to date on wakeup.
It is possible that a truncated but up-to-date page is returned but the
reference taken during read prevents it disappearing underneath the
caller and the data is still valid if PageUptodate.
The overall impact is small as even if processes serialise on the lock,
the lock section is tiny once the IO is complete. Profiles indicated
that unlock_page and friends are generally a tiny portion of a
read-intensive workload. An artificial test was created that had
instances of dd access a cache-cold file on an ext4 filesystem and
measure how long the read took.
paralleldd
4.4.0 4.4.0
vanilla avoidlock
Amean Elapsd-1 5.28 ( 0.00%) 5.15 ( 2.50%)
Amean Elapsd-4 5.29 ( 0.00%) 5.17 ( 2.12%)
Amean Elapsd-7 5.28 ( 0.00%) 5.18 ( 1.78%)
Amean Elapsd-12 5.20 ( 0.00%) 5.33 ( -2.50%)
Amean Elapsd-21 5.14 ( 0.00%) 5.21 ( -1.41%)
Amean Elapsd-30 5.30 ( 0.00%) 5.12 ( 3.38%)
Amean Elapsd-48 5.78 ( 0.00%) 5.42 ( 6.21%)
Amean Elapsd-79 6.78 ( 0.00%) 6.62 ( 2.46%)
Amean Elapsd-110 9.09 ( 0.00%) 8.99 ( 1.15%)
Amean Elapsd-128 10.60 ( 0.00%) 10.43 ( 1.66%)
The impact is small but intuitively, it makes sense to avoid unnecessary
calls to lock_page.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 316bda0e6c upstream.
We assume there is enough inactive page cache if the size of inactive
file lru is greater than the size of active file lru, in which case we
force-scan file lru ignoring anonymous pages. While this logic works
fine when there are plenty of page cache pages, it fails if the size of
file lru is small (several MB): in this case (lru_size >> prio) will be
0 for normal scan priorities, as a result, if inactive file lru happens
to be larger than active file lru, anonymous pages of a cgroup will
never get evicted unless the system experiences severe memory pressure,
even if there are gigabytes of unused anonymous memory there, which is
unfair in respect to other cgroups, whose workloads might be page cache
oriented.
This patch attempts to fix this by elaborating the "enough inactive page
cache" check: it makes it not only check that inactive lru size > active
lru size, but also that we will scan something from the cgroup at the
current scan priority. If these conditions do not hold, we proceed to
SCAN_FRACT as usual.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2e898e4c0a upstream.
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.
unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains. Switches occur when
enough writes are issued from a new domain.
This existing pattern is thus suspicious:
lock_page_memcg(page);
unlocked_inode_to_wb_begin(inode, &locked);
...
unlocked_inode_to_wb_end(inode, locked);
unlock_page_memcg(page);
If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock. This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().
truncate
__cancel_dirty_page
lock_page_memcg
unlocked_inode_to_wb_begin
unlocked_inode_to_wb_end
<interrupts mistakenly enabled>
<interrupt>
end_page_writeback
test_clear_page_writeback
lock_page_memcg
<deadlock>
unlock_page_memcg
Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).
If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:
cd /mnt/cgroup/memory
mkdir a b
echo 1 > a/memory.move_charge_at_immigrate
echo 1 > b/memory.move_charge_at_immigrate
(
echo $BASHPID > a/cgroup.procs
while true; do
dd if=/dev/zero of=/mnt/big bs=1M count=256
done
) &
while true; do
sync
done &
sleep 1h &
SLEEP=$!
while true; do
echo $SLEEP > a/cgroup.procs
echo $SLEEP > b/cgroup.procs
done
The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel. I suggest we should to prevent future
surprises. And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146
Wang Long said "this deadlock occurs three times in our environment"
[gthelen@google.com: v4]
Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org> [v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[natechancellor: Applied to 4.4 based on Greg's backport on lkml.org]
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit abc1be13fd upstream.
f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
Unfortunately, the page cache also uses the mapping's GFP flags for
allocating radix tree nodes. It always masked off the __GFP_HIGHMEM
flag, and masks off __GFP_ZERO in some paths, but not all. That causes
radix tree nodes to be allocated with a NULL list_head, which causes
backtraces like:
__list_del_entry+0x30/0xd0
list_lru_del+0xac/0x1ac
page_cache_tree_insert+0xd8/0x110
The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through
if they are ever used. Fix them all by using GFP_RECLAIM_MASK at the
innermost location, and remove it from earlier in the callchain.
Link: http://lkml.kernel.org/r/20180411060320.14458-2-willy@infradead.org
Fixes: 449dd6984d ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reported-by: Chris Fries <cfries@google.com>
Debugged-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c20cd45eb0 upstream.
page_cache_read has been historically using page_cache_alloc_cold to
allocate a new page. This means that mapping_gfp_mask is used as the
base for the gfp_mask. Many filesystems are setting this mask to
GFP_NOFS to prevent from fs recursion issues. page_cache_read is called
from the vm_operations_struct::fault() context during the page fault.
This context doesn't need the reclaim protection normally.
ceph and ocfs2 which call filemap_fault from their fault handlers seem
to be OK because they are not taking any fs lock before invoking generic
implementation. xfs which takes XFS_MMAPLOCK_SHARED is safe from the
reclaim recursion POV because this lock serializes truncate and punch
hole with the page faults and it doesn't get involved in the reclaim.
There is simply no reason to deliberately use a weaker allocation
context when a __GFP_FS | __GFP_IO can be used. The GFP_NOFS protection
might be even harmful. There is a push to fail GFP_NOFS allocations
rather than loop within allocator indefinitely with a very limited
reclaim ability. Once we start failing those requests the OOM killer
might be triggered prematurely because the page cache allocation failure
is propagated up the page fault path and end up in
pagefault_out_of_memory.
We cannot play with mapping_gfp_mask directly because that would be racy
wrt. parallel page faults and it might interfere with other users who
really rely on NOFS semantic from the stored gfp_mask. The mask is also
inode proper so it would even be a layering violation. What we can do
instead is to push the gfp_mask into struct vm_fault and allow fs layer
to overwrite it should the callback need to be called with a different
allocation context.
Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
because this should be safe from the page fault path normally. Why do
we care about mapping_gfp_mask at all then? Because this doesn't hold
only reclaim protection flags but it also might contain zone and
movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have
to respect those.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a9f2a846f0 upstream.
cache_reap() is initially scheduled in start_cpu_timer() via
schedule_delayed_work_on(). But then the next iterations are scheduled
via schedule_delayed_work(), i.e. using WORK_CPU_UNBOUND.
Thus since commit ef55718044 ("workqueue: schedule WORK_CPU_UNBOUND
work on wq_unbound_cpumask CPUs") there is no guarantee the future
iterations will run on the originally intended cpu, although it's still
preferred. I was able to demonstrate this with
/sys/module/workqueue/parameters/debug_force_rr_cpu. IIUC, it may also
happen due to migrating timers in nohz context. As a result, some cpu's
would be calling cache_reap() more frequently and others never.
This patch uses schedule_delayed_work_on() with the current cpu when
scheduling the next iteration.
Link: http://lkml.kernel.org/r/20180411070007.32225-1-vbabka@suse.cz
Fixes: ef55718044 ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f351574172 upstream.
Provide a function, kmemdup_nul(), that will create a NUL-terminated string
from an unterminated character array where the length is known in advance.
This is better than kstrndup() in situations where we already know the
string length as the strnlen() in kstrndup() is superfluous.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit af27d9403f upstream.
We get a warning about some slow configurations in randconfig kernels:
mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp]
The warning is reasonable by itself, but gets in the way of randconfig
build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set.
The warning was added in 2013 in commit 75980e97da ("mm: fold
page->_last_nid into page->flags where possible").
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b050e3769c upstream.
Since commit 97a16fc82a ("mm, page_alloc: only enforce watermarks for
order-0 allocations"), __zone_watermark_ok() check for high-order
allocations will shortcut per-migratetype free list checks for
ALLOC_HARDER allocations, and return true as long as there's free page
of any migratetype. The intention is that ALLOC_HARDER can allocate
from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.
However, as a side effect, the watermark check will then also return
true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the
allocation cannot actually obtain isolated pages, and might not be able
to obtain CMA pages, this can result in a false positive.
The condition should be rare and perhaps the outcome is not a fatal one.
Still, it's better if the watermark check is correct. There also
shouldn't be a performance tradeoff here.
Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
Fixes: 97a16fc82a ("mm, page_alloc: only enforce watermarks for order-0 allocations")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e048cb32f6 upstream.
The align_offset parameter is used by bitmap_find_next_zero_area_off()
to represent the offset of map's base from the previous alignment
boundary; the function ensures that the returned index, plus the
align_offset, honors the specified align_mask.
The logic introduced by commit b5be83e308 ("mm: cma: align to physical
address, not CMA region position") has the cma driver calculate the
offset to the *next* alignment boundary. In most cases, the base
alignment is greater than that specified when making allocations,
resulting in a zero offset whether we align up or down. In the example
given with the commit, the base alignment (8MB) was half the requested
alignment (16MB) so the math also happened to work since the offset is
8MB in both directions. However, when requesting allocations with an
alignment greater than twice that of the base, the returned index would
not be correctly aligned.
Also, the align_order arguments of cma_bitmap_aligned_mask() and
cma_bitmap_aligned_offset() should not be negative so the argument type
was made unsigned.
Fixes: b5be83e308 ("mm: cma: align to physical address, not CMA region position")
Link: http://lkml.kernel.org/r/20170628170742.2895-1-opendmb@gmail.com
Signed-off-by: Angus Clark <angus@angusclark.org>
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Gregory Fong <gregory.0xf0@gmail.com>
Cc: Doug Berger <opendmb@gmail.com>
Cc: Angus Clark <angus@angusclark.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Shiraz Hashim <shashim@codeaurora.org>
Cc: Jaewon Kim <jaewon31.kim@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 18365225f0 upstream.
Laurent Dufour has noticed that hwpoinsoned pages are kept charged. In
his particular case he has hit a bad_page("page still charged to
cgroup") when onlining a hwpoison page. While this looks like something
that shouldn't happen in the first place because onlining hwpages and
returning them to the page allocator makes only little sense it shows a
real problem.
hwpoison pages do not get freed usually so we do not uncharge them (at
least not since commit 0a31bc97c8 ("mm: memcontrol: rewrite uncharge
API")). Each charge pins memcg (since e8ea14cc6e ("mm: memcontrol:
take a css reference for each charged page")) as well and so the
mem_cgroup and the associated state will never go away. Fix this leak
by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
We also have to tweak uncharge_list because it cannot rely on zero ref
count for these pages.
[akpm@linux-foundation.org: coding-style fixes]
Fixes: 0a31bc97c8 ("mm: memcontrol: rewrite uncharge API")
Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 561b5e0709 upstream.
Commit 1be7107fbe ("mm: larger stack guard gap, between vmas") has
introduced a regression in some rust and Java environments which are
trying to implement their own stack guard page. They are punching a new
MAP_FIXED mapping inside the existing stack Vma.
This will confuse expand_{downwards,upwards} into thinking that the
stack expansion would in fact get us too close to an existing non-stack
vma which is a correct behavior wrt safety. It is a real regression on
the other hand.
Let's work around the problem by considering PROT_NONE mapping as a part
of the stack. This is a gros hack but overflowing to such a mapping
would trap anyway an we only can hope that usespace knows what it is
doing and handle it propely.
Fixes: 1be7107fbe ("mm: larger stack guard gap, between vmas")
Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fd5bb66cd9 upstream.
Change the zpool/compressor param callback function to release the
zswap_pools_lock spinlock before calling param_set_charp, since that
function may sleep when it calls kmalloc with GFP_KERNEL.
While this problem has existed for a while, I wasn't able to trigger it
using a tight loop changing either/both the zpool and compressor params; I
think it's very unlikely to be an issue on the stable kernels, especially
since most zswap users will change the compressor and/or zpool from sysfs
only one time each boot - or zero times, if they add the params to the
kernel boot.
Fixes: c99b42c352 ("zswap: use charp for zswap param strings")
Link: http://lkml.kernel.org/r/20170126155821.4545-1-ddstreet@ieee.org
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>