David Stevens ac492b9c70 mm/khugepaged: skip shmem with userfaultfd
Make sure that collapse_file respects any userfaultfds registered with
MODE_MISSING.  If userspace has any such userfaultfds registered, then for
any page which it knows to be missing, it may expect a
UFFD_EVENT_PAGEFAULT.  This means collapse_file needs to be careful when
collapsing a shmem range would result in replacing an empty page with a
THP, to avoid breaking userfaultfd.

Synchronization when checking for userfaultfds in collapse_file is tricky
because the mmap locks can't be used to prevent races with the
registration of new userfaultfds.  Instead, we provide synchronization by
ensuring that userspace cannot observe the fact that pages are missing
before we check for userfaultfds.  Although this allows registration of a
userfaultfd to race with collapse_file, it ensures that userspace cannot
observe any pages transition from missing to present after such a race
occurs.  This makes such a race indistinguishable from the collapse
occurring immediately before the userfaultfd registration.

The first step to provide this synchronization is to stop filling gaps
during the loop iterating over the target range, since the page cache lock
can be dropped during that loop.  The second step is to fill the gaps with
XA_RETRY_ENTRY after the page cache lock is acquired the final time, to
avoid races with accesses to the page cache that only take the RCU read
lock.

The fact that we don't fill holes during the initial iteration means that
collapse_file now has to handle faults occurring during the collapse. 
This is done by re-validating the number of missing pages after acquiring
the page cache lock for the final time.
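
The two steps and the re-validation can be illustrated with a toy
userspace model. This is a sketch only and not the kernel code: the slot
array stands in for the page cache, SLOT_RETRY stands in for
XA_RETRY_ENTRY, locking is omitted, and all names are hypothetical. The
order of the two failure checks is an assumption; the error values are the
ones this message describes for MADV_COLLAPSE.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for page cache slot states. */
enum slot { SLOT_EMPTY, SLOT_PAGE, SLOT_RETRY };

struct range_model {
	enum slot *slots;	/* one entry per page in the target range */
	size_t nr;		/* number of entries */
	bool uffd_missing;	/* a MODE_MISSING userfaultfd covers the range */
};

/* First pass: count the gaps but leave them unfilled, so that a fault
 * arriving while the lock is dropped still sees a missing page. */
size_t model_count_missing(const struct range_model *r)
{
	size_t i, nr_none = 0;

	assert(r && r->slots);
	for (i = 0; i < r->nr; i++)
		if (r->slots[i] == SLOT_EMPTY)
			nr_none++;
	return nr_none;
}

/* Final pass, imagined as running with the page cache lock held for the
 * last time: re-validate the gap count, then fill the gaps with a retry
 * marker so lockless readers retry instead of observing a hole. */
int model_finish_collapse(struct range_model *r, size_t nr_none_scanned)
{
	size_t i, nr_none_now = model_count_missing(r);

	if (nr_none_now != nr_none_scanned)
		return -EAGAIN;	/* a fault filled a gap during the collapse */
	if (nr_none_now > 0 && r->uffd_missing)
		return -EBUSY;	/* would hide missing pages from userfaultfd */
	for (i = 0; i < r->nr; i++)
		if (r->slots[i] == SLOT_EMPTY)
			r->slots[i] = SLOT_RETRY;
	return 0;
}
```

In the model, a caller records model_count_missing() during the scan and
passes that value to model_finish_collapse(); any slot that became
SLOT_PAGE in between makes the counts disagree and the collapse is
abandoned.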

This fix is targeted at khugepaged, but the change also applies to
MADV_COLLAPSE.  MADV_COLLAPSE on a range with a userfaultfd will now
return EBUSY if there are any missing pages (instead of succeeding on
shmem and returning EINVAL on anonymous memory).  There is also now a
window during MADV_COLLAPSE where a fault on a missing page will cause the
syscall to fail with EAGAIN.
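
A caller of MADV_COLLAPSE might handle the two error codes along these
lines. This is a sketch of one possible policy, not something the patch
prescribes: the enum, the helper names and the retry limit are
hypothetical, and EBUSY can also have other causes that this sketch does
not distinguish.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* asm-generic uapi value (Linux 6.1+) */
#endif

/* Hypothetical caller-side outcomes. */
enum collapse_action {
	COLLAPSE_DONE,	/* range collapsed */
	COLLAPSE_RETRY,	/* transient failure, e.g. a racing fault */
	COLLAPSE_SKIP,	/* e.g. missing pages under a userfaultfd */
	COLLAPSE_FAIL	/* anything else */
};

/* Map an errno value from madvise(MADV_COLLAPSE) to an action. */
enum collapse_action classify_collapse_errno(int err)
{
	switch (err) {
	case 0:
		return COLLAPSE_DONE;
	case EAGAIN:
		return COLLAPSE_RETRY;
	case EBUSY:
		return COLLAPSE_SKIP;
	default:
		return COLLAPSE_FAIL;
	}
}

/* Try to collapse a range, retrying on EAGAIN at most max_retries times.
 * Returns COLLAPSE_RETRY if the retries were exhausted. */
enum collapse_action try_collapse(void *addr, size_t len, int max_retries)
{
	int attempt;

	assert(max_retries >= 0);
	for (attempt = 0; ; attempt++) {
		int err = madvise(addr, len, MADV_COLLAPSE) == 0 ? 0 : errno;
		enum collapse_action act = classify_collapse_errno(err);

		if (act != COLLAPSE_RETRY || attempt >= max_retries)
			return act;
	}
}
```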

The fact that intermediate page cache state can no longer be observed
before the rollback of a failed collapse is also technically a
userspace-visible change (via at least SEEK_DATA and SEEK_END), but it is
exceedingly unlikely that anything relies on being able to observe that
transient state.
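
For context, this is roughly how userspace can observe whether a file has
data, which is the kind of query that could previously see the transient
state. The sketch is not part of the patch; the helper names are
hypothetical, and the fallback to an ordinary temporary file is only there
so the example also works where memfd_create(2) is unavailable.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3 /* Linux value */
#endif

/* Hypothetical helper: open an empty scratch file, shmem-backed via
 * memfd_create(2) when available, otherwise a regular temporary file.
 * Returns a descriptor, or -1 on failure. */
int open_scratch_file(void)
{
	int fd = -1;

#ifdef SYS_memfd_create
	fd = (int)syscall(SYS_memfd_create, "seek-demo", 0u);
#endif
	if (fd < 0) {
		FILE *f = tmpfile();

		if (f) {
			fd = dup(fileno(f));
			fclose(f);
		}
	}
	return fd;
}

/* Offset of the first data in the file, or (off_t)-1 with errno set to
 * ENXIO when the file contains no data at all. Note that a successful
 * lseek() also moves the file position to the returned offset. */
off_t first_data_offset(int fd)
{
	assert(fd >= 0);
	return lseek(fd, 0, SEEK_DATA);
}
```

On an empty file first_data_offset() fails with ENXIO; once a byte has
been written at offset 0 it returns 0.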

Link: https://lkml.kernel.org/r/20230404120117.2562166-4-stevensd@google.com
Signed-off-by: David Stevens <stevensd@chromium.org>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18 16:29:52 -07:00
arch xtensa: reword ARCH_FORCE_MAX_ORDER prompt and help text 2023-04-18 16:29:46 -07:00
block block: remove obsolete config BLOCK_COMPAT 2023-03-16 09:35:44 -06:00
certs Kbuild updates for v6.3 2023-02-26 11:53:25 -08:00
crypto asymmetric_keys: log on fatal failures in PE/pkcs7 2023-03-21 16:23:56 +00:00
Documentation sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-16 12:31:58 -07:00
drivers drm/ttm: remove comment referencing now-removed vmf_insert_mixed_prot() 2023-04-05 19:42:56 -07:00
fs sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-18 14:53:49 -07:00
include mm/khugepaged: skip shmem with userfaultfd 2023-04-18 16:29:52 -07:00
init init,mm: fold late call to page_ext_init() to page_alloc_init_late() 2023-04-05 19:42:54 -07:00
io_uring block-6.3-2023-03-24 2023-03-24 14:10:39 -07:00
ipc Merge branch 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:20:07 -08:00
kernel cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic" 2023-04-18 16:29:49 -07:00
lib lib/test_vmalloc.c: add vm_map_ram()/vm_unmap_ram() test case 2023-04-18 16:29:47 -07:00
LICENSES LICENSES: Add the copyleft-next-0.3.1 license 2022-11-08 15:44:01 +01:00
mm mm/khugepaged: skip shmem with userfaultfd 2023-04-18 16:29:52 -07:00
net mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
rust Rust fixes for 6.3-rc1 2023-03-03 14:51:15 -08:00
samples kmemleak-test: fix kmemleak_test.c build logic 2023-04-18 16:29:47 -07:00
scripts kasan: remove hwasan-kernel-mem-intrinsic-prefix=1 for clang-14 2023-04-18 16:29:43 -07:00
security mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
sound ALSA: hda/ca0132: fixup buffer overrun at tuning_ctl_set() 2023-03-14 17:04:53 +01:00
tools sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-18 14:53:49 -07:00
usr usr/gen_init_cpio.c: remove unnecessary -1 values from int file 2022-10-03 14:21:44 -07:00
virt KVM/riscv changes for 6.3 2023-02-15 12:33:28 -05:00
.clang-format cpumask: re-introduce constant-sized cpumask optimizations 2023-03-05 14:30:34 -08:00
.cocciconfig
.get_maintainer.ignore get_maintainer: add Alan to .get_maintainer.ignore 2022-08-20 15:17:44 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for *.dtso files 2023-02-26 15:28:23 +09:00
.gitignore kbuild: rpm-pkg: move source components to rpmbuild/SOURCES 2023-03-16 22:45:56 +09:00
.mailmap mailmap: update jtoppins' entry to reference correct email 2023-04-16 10:41:25 -07:00
.rustfmt.toml rust: add .rustfmt.toml 2022-09-28 09:02:20 +02:00
COPYING
CREDITS There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
Kbuild Kbuild updates for v6.1 2022-10-10 12:00:45 -07:00
Kconfig
MAINTAINERS MAINTAINERS: extend memblock entry to include MM initialization 2023-04-05 19:42:55 -07:00
Makefile Linux 6.3-rc4 2023-03-26 14:40:20 -07:00
README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.