IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
This is a 1:1 conversion. The major part of this patch is converting
the test framework from userspace to kernel space and mirroring the
algorithm now used in find_swap_entry().
Signed-off-by: Matthew Wilcox <willy@infradead.org>
I found another victim of the radix tree being hard to use. Because
there was no call to radix_tree_preload(), khugepaged was allocating
radix_tree_nodes using GFP_ATOMIC.
I also converted a local_irq_save()/restore() pair to
disable()/enable().
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Both callers of __delete_from_swap_cache have the swp_entry_t already,
so pass that in to make constructing the XA_STATE easier.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Combine __add_to_swap_cache and add_to_swap_cache into one function
since there is no more need to preload.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
This is essentially xa_cmpxchg() with the locking handled above us,
and it doesn't have to handle replacing a NULL entry.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
We construct an XA_STATE and use it to delete the node with
xas_store() rather than adding a special function for this unique
use case. Includes a test that simulates this usage for the
test suite.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Instead of calling find_get_pages_range() and putting any reference,
use xas_find() to iterate over any entries in the range, skipping the
shadow/swap entries.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Slight change of strategy here; if we have trouble getting hold of a
page for whatever reason (eg a compound page is split underneath us),
don't spin to stabilise the page, just continue the iteration, like we
would if we failed to trylock the page. Since this is a speculative
optimisation, it feels like we should allow the process to take an extra
fault if it turns out to need this page instead of spending time to pin
down a page it may not need.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
The 'end' parameter of the xas_for_each iterator avoids a useless
iteration at the end of the range.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
There's no direct replacement for radix_tree_for_each_contig()
in the XArray API as it's an unusual thing to do. Instead,
open-code a loop using xas_next(). This removes the only user of
radix_tree_for_each_contig() so delete the iterator from the API and
the test suite code for it.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
The 'end' parameter of the xas_for_each iterator avoids a useless
iteration at the end of the range.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Use the XArray APIs to add and replace pages in the page cache. This
removes two uses of the radix tree preload API and is significantly
shorter code. It also removes the last user of __radix_tree_create()
outside radix-tree.c itself, so make it static.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
The page cache offers the ability to search for a miss in the previous or
next N locations. Rather than teach the XArray about the page cache's
definition of a miss, use xas_prev() and xas_next() to search the page
array. This should be more efficient as it does not have to start the
lookup from the top for each index.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
This is a direct replacement for struct radix_tree_node. A couple of
struct members have changed name, so convert those. Use a #define so
that radix tree users continue to work without change.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries. This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry). It is also a change in emphasis; exceptional entries are
intimidating and different. As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
9092c71bb724 ("mm: use sc->priority for slab shrink targets") changed the
way that the target slab pressure is calculated and made it
priority-based:
delta = freeable >> priority;
delta *= 4;
do_div(delta, shrinker->seeks);
The problem is that on a default priority (which is 12) no pressure is
applied at all, if the number of potentially reclaimable objects is less
than 4096 (1<<12).
This causes the last objects on slab caches of no longer used cgroups to
(almost) never get reclaimed. It's obviously a waste of memory.
It can be especially painful, if these stale objects are holding a
reference to a dying cgroup. Slab LRU lists are reparented on memcg
offlining, but corresponding objects are still holding a reference to the
dying cgroup. If we don't scan these objects, the dying cgroup can't go
away. Most likely, the parent cgroup hasn't any directly charged objects,
only remaining objects from dying children cgroups. So it can easily hold
a reference to hundreds of dying cgroups.
If there are no big spikes in memory pressure, and new memory cgroups are
created and destroyed periodically, this causes the number of dying
cgroups grow steadily, causing a slow-ish and hard-to-detect memory
"leak". It's not a real leak, as the memory can be eventually reclaimed,
but it could not happen in a real life at all. I've seen hosts with a
steadily climbing number of dying cgroups, which doesn't show any signs of
a decline in months, despite the host is loaded with a production
workload.
It is an obvious waste of memory, and to prevent it, let's apply a minimal
pressure even on small shrinker lists. E.g. if there are freeable
objects, let's scan at least min(freeable, scan_batch) objects.
This fix significantly improves a chance of a dying cgroup to be
reclaimed, and together with some previous patches stops the steady growth
of the dying cgroups number on some of our hosts.
Link: http://lkml.kernel.org/r/20180905230759.12236-1-guro@fb.com
Fixes: 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW5qpOgAKCRDh3BK/laaZ
PDCQAQCIKLg0aLeWOkfUO76mBjlp5srKgJfrqpFoyuozD6l2fQEAl/W2x9NOduV+
PK4sCYMT8SpI0hMrbv9P4zZ683kmaA8=
=RnZU
-----END PGP SIGNATURE-----
Merge tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"This fixes a regression in the recent file stacking update, reported
and fixed by Amir Goldstein. The fix is fairly trivial, but involves
adding a fadvise() f_op and the associated churn in the vfs. As
discussed on -fsdevel, there are other possible uses for this method,
than allowing proper stacking for overlays.
And there's one other fix for a syzkaller detected oops"
* tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix oopses in ovl_fill_super() failure paths
ovl: add ovl_fadvise()
vfs: implement readahead(2) using POSIX_FADV_WILLNEED
vfs: add the fadvise() file operation
Documentation/filesystems: update documentation of file_operations
ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
ovl: respect FIEMAP_FLAG_SYNC flag
Jann Horn points out that the vmacache_flush_all() function is not only
potentially expensive, it's buggy too. It also happens to be entirely
unnecessary, because the sequence number overflow case can be avoided by
simply making the sequence number be 64-bit. That doesn't even grow the
data structures in question, because the other adjacent fields are
already 64-bit.
So simplify the whole thing by just making the sequence number overflow
case go away entirely, which gets rid of all the complications and makes
the code faster too. Win-win.
[ Oleg Nesterov points out that the VMACACHE_FULL_FLUSHES statistics
also just goes away entirely with this ]
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAluRkywQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpm8uEAC8vBFb5tzZ2dOeRbGQ6LaPTToBmRrLtOcP
kDRnfZIw0raNStOpn1dkGLz8IOSjwOGftx9Q4pJed25vynTEq5lYmmLVUlJQ6cJ7
oNpYiCdPxJvbKz5fChGG2nHHa1RLer1d728NZtkeZU/ChPmw56EO5ORghE7zPG7K
Z/0qHYsgwS427o8pUDsymmt6I62IJGrjzqJdC0pqBy6RePQWtlwkmtd7CIgFiffY
tDnk6RSwcihnIalMMLvFXeGf6cSaZvuH4oK1QNdfojAyS8kWeA6gHtjRS8UcuuUY
t6o+hU0vki8bghoNoI40RrLgAmV91BVv1/Voo79dQvDWAigyie51HwFFkqdWzJxJ
g4MCZYpys26w/VUGBFCku0hiRIAhZFO8Sun5zbVCJpyt8hTXF0RrG3CpwmCF7Lc0
m+h8tJanEMCesfYMztTD31L1BOFhJeOgBJr4a5QURy0LbIvC0V52IKiOQ0475E8E
H10rQaRw/7Am+mZugedMUGMgYD/eN33NQoRuTWZdck/58big2SU78zGpR/GqTmy3
w9v2I8ksBTivzEayBV0G4Z5Gxu7QYA7NMsO5RS/wuGfUX8D/1QtQU9Ejh5TESbek
R3WUyhXJJ2S+DWTUlmX7TgPxYxG3sXatQbSAgFJiucxyIRdpdqfeoXmOHvPrWZEq
O3VDm0D6pw==
=qhv7
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20180906' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Small collection of fixes that should go into this release. This
contains:
- Small series that fixes a race between blkcg teardown and writeback
(Dennis Zhou)
- Fix disallowing invalid block size settings from the nbd ioctl (me)
- BFQ fix for a use-after-free on last release of a bfqg (Konstantin
Khlebnikov)
- Fix for the "don't warn for flush" fix (Mikulas)"
* tag 'for-linus-20180906' of git://git.kernel.dk/linux-block:
block: bfq: swap puts in bfqg_and_blkg_put
block: don't warn when doing fsync on read-only devices
nbd: don't allow invalid blocksize settings
blkcg: use tryget logic when associating a blkg with a bio
blkcg: delay blkg destruction until after writeback has finished
Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()"
It looks like I missed the PUD path when doing VM_MIXEDMAP removal.
This can be triggered by:
1. Boot with memmap=4G!8G
2. build ndctl with destructive flag on
3. make TESTS=device-dax check
[ +0.000675] kernel BUG at mm/huge_memory.c:824!
Applying the same change that was applied to vmf_insert_pfn_pmd() in the
original patch.
Link: http://lkml.kernel.org/r/153565957352.35524.1005746906902065126.stgit@djiang5-desk3.ch.intel.com
Fixes: e1fb4a08649 ("dax: remove VM_MIXEDMAP for fsdax and device dax")
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When scanning for movable pages, filter out Hugetlb pages if hugepage
migration is not supported. Without this we hit infinte loop in
__offline_pages() where we do
pfn = scan_movable_pages(start_pfn, end_pfn);
if (pfn) { /* We have movable pages */
ret = do_migrate_range(pfn, end_pfn);
goto repeat;
}
Fix this by checking hugepage_migration_supported both in
has_unmovable_pages which is the primary backoff mechanism for page
offlining and for consistency reasons also into scan_movable_pages
because it doesn't make any sense to return a pfn to non-migrateable
huge page.
This issue was revealed by, but not caused by 72b39cfc4d75 ("mm,
memory_hotplug: do not fail offlining too early").
Link: http://lkml.kernel.org/r/20180824063314.21981-1-aneesh.kumar@linux.ibm.com
Fixes: 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reported-by: Haren Myneni <haren@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Scooped from an email from Matthew.
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kmemleak built in to the kernel, but is disabled by default, the
debugfs file is never registered. Because of this, it is not possible
to find out if the kernel is built with kmemleak support by checking for
the presence of this file. To allow this, always register the file.
After this patch, if the file doesn't exist, kmemleak is not available
in the kernel. If writing "scan" or any other value than "clear" to
this file results in EBUSY, then kmemleak is available but is disabled
by default and can be activated via the kernel command line.
Catalin: "that's also consistent with a late disabling of kmemleak when
the debugfs entry sticks around."
Link: http://lkml.kernel.org/r/20180824131220.19176-1-vincent.whitchurch@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
notifiers") has added an ability to skip over vmas with blockable mmu
notifiers. This however didn't call tlb_finish_mmu as it should.
As a result inc_tlb_flush_pending has been called without its pairing
dec_tlb_flush_pending and all callers mm_tlb_flush_pending would flush
even though this is not really needed. This alone is not harmful and it
seems there shouldn't be any such callers for oom victims at all but
there is no real reason to skip tlb_finish_mmu on early skip either so
call it.
[mhocko@suse.com: new changelog]
Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the memcg OOM killer runs out of killable tasks, it currently
prints a WARN with no further OOM context. This has caused some user
confusion.
Warnings indicate a kernel problem. In a reported case, however, the
situation was triggered by a nonsensical memcg configuration (hard limit
set to 0). But without any VM context this wasn't obvious from the
report, and it took some back and forth on the mailing list to identify
what is actually a trivial issue.
Handle this OOM condition like we handle it in the global OOM killer:
dump the full OOM context and tell the user we ran out of tasks.
This way the user can identify misconfigurations easily by themselves
and rectify the problem - without having to go through the hassle of
running into an obscure but unsettling warning, finding the appropriate
kernel mailing list and waiting for a kernel developer to remote-analyze
that the memcg configuration caused this.
If users cannot make sense of why the OOM killer was triggered or why it
failed, they will still report it to the mailing list, we know that from
experience. So in case there is an actual kernel bug causing this,
kernel developers will very likely hear about it.
Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, blkcg destruction relies on a sequence of events:
1. Destruction starts. blkcg_css_offline() is called and blkgs
release their reference to the blkcg. This immediately destroys
the cgwbs (writeback).
2. With blkgs giving up their reference, the blkcg ref count should
become zero and eventually call blkcg_css_free() which finally
frees the blkcg.
Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
on the completion of all writeback associated with the blkcg. A count of
the number of cgwbs is maintained and once that goes to zero, blkg
destruction can follow. This should prevent premature blkg destruction
related to writeback.
The new process for blkcg cleanup is as follows:
1. Destruction starts. blkcg_css_offline() is called which offlines
writeback. Blkg destruction is delayed on the cgwb_refcnt count to
avoid punting potentially large amounts of outstanding writeback
to root while maintaining any ongoing policies. Here, the base
cgwb_refcnt is put back.
2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
and handles destruction of blkgs. This is where the css reference
held by each blkg is released.
3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
This finally frees the blkg.
It seems in the past blk-throttle didn't do the most understandable
things with taking data from a blkg while associating with current. So,
the simplification and unification of what blk-throttle is doing caused
this.
Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The implementation of readahead(2) syscall is identical to that of
fadvise64(POSIX_FADV_WILLNEED) with a few exceptions:
1. readahead(2) returns -EINVAL for !mapping->a_ops and fadvise64()
ignores the request and returns 0.
2. fadvise64() checks for integer overflow corner case
3. fadvise64() calls the optional filesystem fadvise() file operation
Unite the two implementations by calling vfs_fadvise() from readahead(2)
syscall. Check the !mapping->a_ops in readahead(2) syscall to preserve
documented syscall ABI behaviour.
Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: d1d04ef8572b ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This is going to be used by overlayfs and possibly useful
for other filesystems.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* memory_failure() gets confused by dev_pagemap backed mappings. The
recovery code has specific enabling for several possible page states
that needs new enabling to handle poison in dax mappings. Teach
memory_failure() about ZONE_DEVICE pages.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
/kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
=Ftop
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm memory-failure update from Dave Jiang:
"As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine
the size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively
accessed poison protection, mce_unmap_kpfn(), to be reversible and
otherwise allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable
for dax mappings, however the primary motivation is to allow the
system to survive userspace consumption of hardware-poison via dax.
Specifically the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks"
* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm, pmem: Restore page attributes when clearing errors
x86/memory_failure: Introduce {set, clear}_mce_nospec()
x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Collect mapping size in collect_procs()
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, dev_pagemap: Do not clear ->mapping on final put
mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
filesystem-dax: Set page->index
device-dax: Set page->index
device-dax: Enable page_mapping()
device-dax: Convert to vmf_insert_mixed and vm_fault_t
This is not normally noticeable, but repeated forks are unnecessarily
expensive because they repeatedly dirty the parent page tables during
the page table copy operation.
It's trivial to just avoid write protecting the page table entry if it
was already not writable.
This patch was inspired by
https://bugzilla.kernel.org/show_bug.cgi?id=200447
which points to an ancient "waste time re-doing fork" issue in the
presence of lots of signals.
That bug was fixed by Eric Biederman's signal handling series
culminating in commit c3ad2c3b02e9 ("signal: Don't restart fork when
signals come in"), but the unnecessary work for repeated forks is still
work just fixing, particularly since the fix is trivial.
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge yet more updates from Andrew Morton:
- the rest of MM
- various misc fixes and tweaks
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
mm: Change return type int to vm_fault_t for fault handlers
lib/fonts: convert comments to utf-8
s390: ebcdic: convert comments to UTF-8
treewide: convert ISO_8859-1 text comments to utf-8
drivers/gpu/drm/gma500/: change return type to vm_fault_t
docs/core-api: mm-api: add section about GFP flags
docs/mm: make GFP flags descriptions usable as kernel-doc
docs/core-api: split memory management API to a separate file
docs/core-api: move *{str,mem}dup* to "String Manipulation"
docs/core-api: kill trailing whitespace in kernel-api.rst
mm/util: add kernel-doc for kvfree
mm/util: make strndup_user description a kernel-doc comment
fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
treewide: correct "differenciate" and "instanciate" typos
fs/afs: use new return type vm_fault_t
drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
mm: soft-offline: close the race against page allocation
mm: fix race on soft-offlining free huge pages
namei: allow restricted O_CREAT of FIFOs and regular files
hfs: prevent crash on exit from failed search
...
Use new return type vm_fault_t for fault handler. For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno. Once all instances are converted, vm_fault_t will become a
distinct type.
Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")
The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type. As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.
The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.
vmf_error() is the newly introduce inline function in 4.17-rc6.
[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>