Suren Baghdasaryan 2b4f3b4987 fork: lock VMAs of the parent process when forking
Patch series "Avoid memory corruption caused by per-VMA locks", v4.

A memory corruption was reported in [1], with bisection pointing to the
patch [2] that enabled per-VMA locks for x86.  Based on the reproducer
provided in [1], we suspect this is caused by the lack of VMA locking
while forking a child process.

Patch 1/2 in the series implements proper VMA locking during fork.  I
tested the fix locally using the reproducer and was unable to reproduce
the memory corruption problem.

This fix can potentially regress some fork-heavy workloads.  Kernel build
time did not show a noticeable regression on a 56-core machine, while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
showed a ~7% regression.  If such a fork-time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore the previous performance.
Further optimizations are possible if this regression proves to be
problematic.

Patch 2/2 disables per-VMA locks until the fix is tested and verified.


This patch (of 2):

When forking a child process, the parent write-protects an anonymous page
and COW-shares it with the child being forked using copy_present_pte().
The parent's TLB is flushed right before we drop the parent's mmap_lock in
dup_mmap().  If the parent takes a write fault before that TLB flush and
ends up replacing that anonymous page in do_wp_page() (because it is now
COW-shared with the child), stale writable TLB entries can be left behind,
targeting the wrong (old) page.  A similar issue happened in the past with
userfaultfd (see the flush_tlb_page() call inside do_wp_page()).

Lock the VMAs of the parent process when forking a child, which prevents
concurrent page faults during the fork operation and avoids this issue.
This fix can potentially regress some fork-heavy workloads.  Kernel build
time did not show a noticeable regression on a 56-core machine, while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop
showed a ~7% regression.  If such a fork-time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore the previous performance.
Further optimizations are possible if this regression proves to be
problematic.
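
For reference, below is a heavily condensed sketch of dup_mmap() in
kernel/fork.c with this change applied.  Everything except the locking and
TLB-flush ordering is omitted, the function is renamed to make clear it is
only an illustration, and it assumes the per-VMA write-lock helper
(vma_start_write()) that CONFIG_PER_VMA_LOCK already provides:

static int dup_mmap_sketch(struct mm_struct *mm, struct mm_struct *oldmm)
{
	struct vm_area_struct *mpnt;
	VMA_ITERATOR(old_vmi, oldmm, 0);

	mmap_write_lock(oldmm);

	for_each_vma(old_vmi, mpnt) {
		/*
		 * The fix: write-lock each parent VMA so that per-VMA-lock
		 * page faults in the parent fall back to mmap_lock and wait
		 * until fork is done with the parent's page tables.
		 */
		vma_start_write(mpnt);

		/*
		 * Copying each VMA into the child @mm happens here (omitted);
		 * copy_present_pte() write-protects the parent's anonymous
		 * PTEs while the parent's TLB may still cache them writable.
		 */
	}

	/* Stale writable TLB entries are only invalidated at this point. */
	flush_tlb_mm(oldmm);
	mmap_write_unlock(oldmm);
	return 0;
}

Because the parent's mmap_lock is held for write across the whole loop,
the VMAs stay write-locked until mmap_write_unlock(oldmm), i.e. until
after flush_tlb_mm(oldmm), which closes the window described above.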

Link: https://lkml.kernel.org/r/20230706011400.2949242-1-surenb@google.com
Link: https://lkml.kernel.org/r/20230706011400.2949242-2-surenb@google.com
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-08 09:29:29 -07:00