From 8913970c19915bbe773d97d42989cd85b7fdc098 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 18 Oct 2021 15:15:22 -0700 Subject: [PATCH 01/19] mm/userfaultfd: selftests: fix memory corruption with thp enabled In RHEL's gating selftests we've encountered memory corruption in the uffd event test even with upstream kernel: # ./userfaultfd anon 128 4 nr_pages: 32768, nr_pages_per_cpu: 32768 bounces: 3, mode: rnd racing read, userfaults: 6240 missing (6240) 14729 wp (14729) bounces: 2, mode: racing read, userfaults: 1444 missing (1444) 28877 wp (28877) bounces: 1, mode: rnd read, userfaults: 6055 missing (6055) 14699 wp (14699) bounces: 0, mode: read, userfaults: 82 missing (82) 25196 wp (25196) testing uffd-wp with pagemap (pgsize=4096): done testing uffd-wp with pagemap (pgsize=2097152): done testing events (fork, remap, remove): ERROR: nr 32427 memory corruption 0 1 (errno=0, line=963) ERROR: faulting process failed (errno=0, line=1117) It can be easily reproduced when global thp enabled, which is the default for RHEL. It's also known as a side effect of commit 0db282ba2c12 ("selftest: use mmap instead of posix_memalign to allocate memory", 2021-07-23), which is imho right itself on using mmap() to make sure the addresses will be untagged even on arm. The problem is, for each test we allocate buffers using two allocate_area() calls. We assumed these two buffers won't affect each other, however they could, because mmap() could have found that the two buffers are near each other and having the same VMA flags, so they got merged into one VMA. It won't be a big problem if thp is not enabled, but when thp is agressively enabled it means when initializing the src buffer it could accidentally setup part of the dest buffer too when there's a shared THP that overlaps the two regions. Then some of the dest buffer won't be able to be trapped by userfaultfd missing mode, then it'll cause memory corruption as described. To fix it, do release_pages() after initializing the src buffer. Since the previous two release_pages() calls are after uffd_test_ctx_clear() which will unmap all the buffers anyway (which is stronger than release pages; as unmap() also tear town pgtables), drop them as they shouldn't really be anything useful. We can mark the Fixes tag upon 0db282ba2c12 as it's reported to only happen there, however the real "Fixes" IMHO should be 8ba6e8640844, as before that commit we'll always do explicit release_pages() before registration of uffd, and 8ba6e8640844 changed that logic by adding extra unmap/map and we didn't release the pages at the right place. Meanwhile I don't have a solid glue anyway on whether posix_memalign() could always avoid triggering this bug, hence it's safer to attach this fix to commit 8ba6e8640844. Link: https://lkml.kernel.org/r/20210923232512.210092-1-peterx@redhat.com Fixes: 8ba6e8640844 ("userfaultfd/selftests: reinitialize test context in each test") Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1994931 Signed-off-by: Peter Xu Reported-by: Li Wang Tested-by: Li Wang Reviewed-by: Axel Rasmussen Cc: Andrea Arcangeli Cc: Nadav Amit Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 23 ++++++++++++++++++++--- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 10ab56c2484a..60aa1a4fc69b 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -414,9 +414,6 @@ static void uffd_test_ctx_init_ext(uint64_t *features) uffd_test_ops->allocate_area((void **)&area_src); uffd_test_ops->allocate_area((void **)&area_dst); - uffd_test_ops->release_pages(area_src); - uffd_test_ops->release_pages(area_dst); - userfaultfd_open(features); count_verify = malloc(nr_pages * sizeof(unsigned long long)); @@ -437,6 +434,26 @@ static void uffd_test_ctx_init_ext(uint64_t *features) *(area_count(area_src, nr) + 1) = 1; } + /* + * After initialization of area_src, we must explicitly release pages + * for area_dst to make sure it's fully empty. Otherwise we could have + * some area_dst pages be errornously initialized with zero pages, + * hence we could hit memory corruption later in the test. + * + * One example is when THP is globally enabled, above allocate_area() + * calls could have the two areas merged into a single VMA (as they + * will have the same VMA flags so they're mergeable). When we + * initialize the area_src above, it's possible that some part of + * area_dst could have been faulted in via one huge THP that will be + * shared between area_src and area_dst. It could cause some of the + * area_dst won't be trapped by missing userfaults. + * + * This release_pages() will guarantee even if that happened, we'll + * proactively split the thp and drop any accidentally initialized + * pages within area_dst. + */ + uffd_test_ops->release_pages(area_dst); + pipefd = malloc(sizeof(int) * nr_cpus * 2); if (!pipefd) err("pipefd"); From cb185d5f1ebf900f4ae3bf84cee212e6dd035aca Mon Sep 17 00:00:00 2001 From: Nadav Amit Date: Mon, 18 Oct 2021 15:15:25 -0700 Subject: [PATCH 02/19] userfaultfd: fix a race between writeprotect and exit_mmap() A race is possible when a process exits, its VMAs are removed by exit_mmap() and at the same time userfaultfd_writeprotect() is called. The race was detected by KASAN on a development kernel, but it appears to be possible on vanilla kernels as well. Use mmget_not_zero() to prevent the race as done in other userfaultfd operations. Link: https://lkml.kernel.org/r/20210921200247.25749-1-namit@vmware.com Fixes: 63b2d4174c4ad ("userfaultfd: wp: add the writeprotect API to userfaultfd ioctl") Signed-off-by: Nadav Amit Tested-by: Li Wang Reviewed-by: Peter Xu Cc: Andrea Arcangeli Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/userfaultfd.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 003f0d31743e..22bf14ab2d16 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1827,9 +1827,15 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx, if (mode_wp && mode_dontwake) return -EINVAL; - ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start, - uffdio_wp.range.len, mode_wp, - &ctx->mmap_changing); + if (mmget_not_zero(ctx->mm)) { + ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start, + uffdio_wp.range.len, mode_wp, + &ctx->mmap_changing); + mmput(ctx->mm); + } else { + return -ESRCH; + } + if (ret) return ret; From 295be91f7ef0027fca2f2e4788e99731aa931834 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Mon, 18 Oct 2021 15:15:29 -0700 Subject: [PATCH 03/19] mm/migrate: optimize hotplug-time demotion order updates Patch series "mm/migrate: 5.15 fixes for automatic demotion", v2. This contains two fixes for the "automatic demotion" code which was merged into 5.15: * Fix memory hotplug performance regression by watching suppressing any real action on irrelevant hotplug events. * Ensure CPU hotplug handler is registered when memory hotplug is disabled. This patch (of 2): == tl;dr == Automatic demotion opted for a simple, lazy approach to handling hotplug events. This noticeably slows down memory hotplug[1]. Optimize away updates to the demotion order when memory hotplug events should have no effect. This has no effect on CPU hotplug. There is no known problem on the CPU side and any work there will be in a separate series. == Background == Automatic demotion is a memory migration strategy to ensure that new allocations have room in faster memory tiers on tiered memory systems. The kernel maintains an array (node_demotion[]) to drive these migrations. The node_demotion[] path is calculated by starting at nodes with CPUs and then "walking" to nodes with memory. Only hotplug events which online or offline a node with memory (N_ONLINE) or CPUs (N_CPU) will actually affect the migration order. == Problem == However, the current code is lazy. It completely regenerates the migration order on *any* CPU or memory hotplug event. The logic was that these events are extremely rare and that the overhead from indiscriminate order regeneration is minimal. Part of the update logic involves a synchronize_rcu(), which is a pretty big hammer. Its overhead was large enough to be detected by some 0day tests that watch memory hotplug performance[1]. == Solution == Add a new helper (node_demotion_topo_changed()) which can differentiate between superfluous and impactful hotplug events. Skip the expensive update operation for superfluous events. == Aside: Locking == It took me a few moments to declare the locking to be safe enough for node_demotion_topo_changed() to work. It all hinges on the memory hotplug lock: During memory hotplug events, 'mem_hotplug_lock' is held for write. This ensures that two memory hotplug events can not be called simultaneously. CPU hotplug has a similar lock (cpuhp_state_mutex) which also provides mutual exclusion between CPU hotplug events. In addition, the demotion code acquire and hold the mem_hotplug_lock for read during its CPU hotplug handlers. This provides mutual exclusion between the demotion memory hotplug callbacks and the CPU hotplug callbacks. This effectively allows treating the migration target generation code to act as if it is single-threaded. 1. https://lore.kernel.org/all/20210905135932.GE15026@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/20210924161251.093CCD06@davehans-spike.ostc.intel.com Link: https://lkml.kernel.org/r/20210924161253.D7673E31@davehans-spike.ostc.intel.com Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events") Signed-off-by: Dave Hansen Reported-by: kernel test robot Reviewed-by: David Hildenbrand Cc: "Huang, Ying" Cc: Michal Hocko Cc: Wei Xu Cc: Oscar Salvador Cc: David Rientjes Cc: Dan Williams Cc: Greg Thelen Cc: Yang Shi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index a6a7743ee98f..a6311e46f605 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -3239,8 +3239,18 @@ static int migration_offline_cpu(unsigned int cpu) * set_migration_target_nodes(). */ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, - unsigned long action, void *arg) + unsigned long action, void *_arg) { + struct memory_notify *arg = _arg; + + /* + * Only update the node migration order when a node is + * changing status, like online->offline. This avoids + * the overhead of synchronize_rcu() in most cases. + */ + if (arg->status_change_nid < 0) + return notifier_from_errno(0); + switch (action) { case MEM_GOING_OFFLINE: /* From 76af6a054da4055305ddb28c5eb151b9ee4f74f9 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Mon, 18 Oct 2021 15:15:32 -0700 Subject: [PATCH 04/19] mm/migrate: add CPU hotplug to demotion #ifdef Once upon a time, the node demotion updates were driven solely by memory hotplug events. But now, there are handlers for both CPU and memory hotplug. However, the #ifdef around the code checks only memory hotplug. A system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU hotplug events. Update the #ifdef around the common code. Add memory and CPU-specific #ifdefs for their handlers. These memory/CPU #ifdefs avoid unused function warnings when their Kconfig option is off. [arnd@arndb.de: rework hotplug_memory_notifier() stub] Link: https://lkml.kernel.org/r/20211013144029.2154629-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20210924161255.E5FE8F7E@davehans-spike.ostc.intel.com Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events") Signed-off-by: Dave Hansen Signed-off-by: Arnd Bergmann Cc: "Huang, Ying" Cc: Michal Hocko Cc: Wei Xu Cc: Oscar Salvador Cc: David Rientjes Cc: Dan Williams Cc: David Hildenbrand Cc: Greg Thelen Cc: Yang Shi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memory.h | 5 ++++- mm/migrate.c | 42 +++++++++++++++++++++--------------------- mm/page_ext.c | 4 +--- mm/slab.c | 4 ++-- 4 files changed, 28 insertions(+), 27 deletions(-) diff --git a/include/linux/memory.h b/include/linux/memory.h index 7efc0a7c14c9..182c606adb06 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -160,7 +160,10 @@ int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func, #define register_hotmemory_notifier(nb) register_memory_notifier(nb) #define unregister_hotmemory_notifier(nb) unregister_memory_notifier(nb) #else -#define hotplug_memory_notifier(fn, pri) ({ 0; }) +static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri) +{ + return 0; +} /* These aren't inline functions due to a GCC bug. */ #define register_hotmemory_notifier(nb) ({ (void)(nb); 0; }) #define unregister_hotmemory_notifier(nb) ({ (void)(nb); }) diff --git a/mm/migrate.c b/mm/migrate.c index a6311e46f605..700112c0123d 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -3066,7 +3066,7 @@ void migrate_vma_finalize(struct migrate_vma *migrate) EXPORT_SYMBOL(migrate_vma_finalize); #endif /* CONFIG_DEVICE_PRIVATE */ -#if defined(CONFIG_MEMORY_HOTPLUG) +#if defined(CONFIG_HOTPLUG_CPU) /* Disable reclaim-based migration. */ static void __disable_all_migrate_targets(void) { @@ -3208,25 +3208,6 @@ static void set_migration_target_nodes(void) put_online_mems(); } -/* - * React to hotplug events that might affect the migration targets - * like events that online or offline NUMA nodes. - * - * The ordering is also currently dependent on which nodes have - * CPUs. That means we need CPU on/offline notification too. - */ -static int migration_online_cpu(unsigned int cpu) -{ - set_migration_target_nodes(); - return 0; -} - -static int migration_offline_cpu(unsigned int cpu) -{ - set_migration_target_nodes(); - return 0; -} - /* * This leaves migrate-on-reclaim transiently disabled between * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs @@ -3284,6 +3265,25 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, return notifier_from_errno(0); } +/* + * React to hotplug events that might affect the migration targets + * like events that online or offline NUMA nodes. + * + * The ordering is also currently dependent on which nodes have + * CPUs. That means we need CPU on/offline notification too. + */ +static int migration_online_cpu(unsigned int cpu) +{ + set_migration_target_nodes(); + return 0; +} + +static int migration_offline_cpu(unsigned int cpu) +{ + set_migration_target_nodes(); + return 0; +} + static int __init migrate_on_reclaim_init(void) { int ret; @@ -3303,4 +3303,4 @@ static int __init migrate_on_reclaim_init(void) return 0; } late_initcall(migrate_on_reclaim_init); -#endif /* CONFIG_MEMORY_HOTPLUG */ +#endif /* CONFIG_HOTPLUG_CPU */ diff --git a/mm/page_ext.c b/mm/page_ext.c index dfb91653d359..2a52fd9ed464 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -269,7 +269,7 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid) total_usage += table_size; return 0; } -#ifdef CONFIG_MEMORY_HOTPLUG + static void free_page_ext(void *addr) { if (is_vmalloc_addr(addr)) { @@ -374,8 +374,6 @@ static int __meminit page_ext_callback(struct notifier_block *self, return notifier_from_errno(ret); } -#endif - void __init page_ext_init(void) { unsigned long pfn; diff --git a/mm/slab.c b/mm/slab.c index d0f725637663..874b3f8fe80d 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1095,7 +1095,7 @@ static int slab_offline_cpu(unsigned int cpu) return 0; } -#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) +#if defined(CONFIG_NUMA) /* * Drains freelist for a node on each slab cache, used for memory hot-remove. * Returns -EBUSY if all objects cannot be drained so that the node is not @@ -1157,7 +1157,7 @@ static int __meminit slab_memory_callback(struct notifier_block *self, out: return notifier_from_errno(ret); } -#endif /* CONFIG_NUMA && CONFIG_MEMORY_HOTPLUG */ +#endif /* CONFIG_NUMA */ /* * swap the static kmem_cache_node with kmalloced memory From a6a0251c6fce496744121b4e08c899f45270dbcc Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Mon, 18 Oct 2021 15:15:35 -0700 Subject: [PATCH 05/19] mm/migrate: fix CPUHP state to update node demotion order The node demotion order needs to be updated during CPU hotplug. Because whether a NUMA node has CPU may influence the demotion order. The update function should be called during CPU online/offline after the node_states[N_CPU] has been updated. That is done in CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during CPU offline. But in commit 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events"), the function to update node demotion order is called in CPUHP_AP_ONLINE_DYN during CPU online/offline. This doesn't satisfy the order requirement. For example, there are 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0 and P2, P3 in S1), the demotion order is - S0 -> NUMA_NO_NODE - S1 -> NUMA_NO_NODE After P2 and P3 is offlined, because S1 has no CPU now, the demotion order should have been changed to - S0 -> S1 - S1 -> NO_NODE but it isn't changed, because the order updating callback for CPU hotplug doesn't see the new nodemask. After that, if P1 is offlined, the demotion order is changed to the expected order as above. So in this patch, we added CPUHP_AP_MM_DEMOTION_ONLINE and CPUHP_MM_DEMOTION_DEAD to be called after CPUHP_AP_ONLINE_DYN and CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and register the update function on them. Link: https://lkml.kernel.org/r/20210929060351.7293-1-ying.huang@intel.com Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events") Signed-off-by: "Huang, Ying" Cc: Thomas Gleixner Cc: Dave Hansen Cc: Yang Shi Cc: Zi Yan Cc: Michal Hocko Cc: Wei Xu Cc: Oscar Salvador Cc: David Rientjes Cc: Dan Williams Cc: David Hildenbrand Cc: Greg Thelen Cc: Keith Busch Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/cpuhotplug.h | 4 ++++ mm/migrate.c | 8 +++++--- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index 832d8a74fa59..991911048857 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -72,6 +72,8 @@ enum cpuhp_state { CPUHP_SLUB_DEAD, CPUHP_DEBUG_OBJ_DEAD, CPUHP_MM_WRITEBACK_DEAD, + /* Must be after CPUHP_MM_VMSTAT_DEAD */ + CPUHP_MM_DEMOTION_DEAD, CPUHP_MM_VMSTAT_DEAD, CPUHP_SOFTIRQ_DEAD, CPUHP_NET_MVNETA_DEAD, @@ -240,6 +242,8 @@ enum cpuhp_state { CPUHP_AP_BASE_CACHEINFO_ONLINE, CPUHP_AP_ONLINE_DYN, CPUHP_AP_ONLINE_DYN_END = CPUHP_AP_ONLINE_DYN + 30, + /* Must be after CPUHP_AP_ONLINE_DYN for node_states[N_CPU] update */ + CPUHP_AP_MM_DEMOTION_ONLINE, CPUHP_AP_X86_HPET_ONLINE, CPUHP_AP_X86_KVM_CLK_ONLINE, CPUHP_AP_DTPM_CPU_ONLINE, diff --git a/mm/migrate.c b/mm/migrate.c index 700112c0123d..1852d787e6ab 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -3288,9 +3288,8 @@ static int __init migrate_on_reclaim_init(void) { int ret; - ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "migrate on reclaim", - migration_online_cpu, - migration_offline_cpu); + ret = cpuhp_setup_state_nocalls(CPUHP_MM_DEMOTION_DEAD, "mm/demotion:offline", + NULL, migration_offline_cpu); /* * In the unlikely case that this fails, the automatic * migration targets may become suboptimal for nodes @@ -3298,6 +3297,9 @@ static int __init migrate_on_reclaim_init(void) * rare case, do not bother trying to do anything special. */ WARN_ON(ret < 0); + ret = cpuhp_setup_state(CPUHP_AP_MM_DEMOTION_ONLINE, "mm/demotion:online", + migration_online_cpu, NULL); + WARN_ON(ret < 0); hotplug_memory_notifier(migrate_on_reclaim_callback, 100); return 0; From 5314454ea3ff6fc746eaf71b9a7ceebed52888fa Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Mon, 18 Oct 2021 15:15:39 -0700 Subject: [PATCH 06/19] ocfs2: fix data corruption after conversion from inline format Commit 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()") uncovered a latent bug in ocfs2 conversion from inline inode format to a normal inode format. The code in ocfs2_convert_inline_data_to_extents() attempts to zero out the whole cluster allocated for file data by grabbing, zeroing, and dirtying all pages covering this cluster. However these pages are beyond i_size, thus writeback code generally ignores these dirty pages and no blocks were ever actually zeroed on the disk. This oversight was fixed by commit 693c241a5f6a ("ocfs2: No need to zero pages past i_size.") for standard ocfs2 write path, inline conversion path was apparently forgotten; the commit log also has a reasoning why the zeroing actually is not needed. After commit 6dbf7bb55598, things became worse as writeback code stopped invalidating buffers on pages beyond i_size and thus these pages end up with clean PageDirty bit but with buffers attached to these pages being still dirty. So when a file is converted from inline format, then writeback triggers, and then the file is grown so that these pages become valid, the invalid dirtiness state is preserved, mark_buffer_dirty() does nothing on these pages (buffers are already dirty) but page is never written back because it is clean. So data written to these pages is lost once pages are reclaimed. Simple reproducer for the problem is: xfs_io -f -c "pwrite 0 2000" -c "pwrite 2000 2000" -c "fsync" \ -c "pwrite 4000 2000" ocfs2_file After unmounting and mounting the fs again, you can observe that end of 'ocfs2_file' has lost its contents. Fix the problem by not doing the pointless zeroing during conversion from inline format similarly as in the standard write path. [akpm@linux-foundation.org: fix whitespace, per Joseph] Link: https://lkml.kernel.org/r/20210930095405.21433-1-jack@suse.cz Fixes: 6dbf7bb55598 ("fs: Don't invalidate page buffers in block_write_full_page()") Signed-off-by: Jan Kara Reviewed-by: Joseph Qi Tested-by: Joseph Qi Acked-by: Gang He Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Jun Piao Cc: "Markov, Andrey" Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ocfs2/alloc.c | 46 ++++++++++++---------------------------------- 1 file changed, 12 insertions(+), 34 deletions(-) diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index f1cc8258d34a..5d9ae17bd443 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -7045,7 +7045,7 @@ void ocfs2_set_inode_data_inline(struct inode *inode, struct ocfs2_dinode *di) int ocfs2_convert_inline_data_to_extents(struct inode *inode, struct buffer_head *di_bh) { - int ret, i, has_data, num_pages = 0; + int ret, has_data, num_pages = 0; int need_free = 0; u32 bit_off, num; handle_t *handle; @@ -7054,26 +7054,17 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode, struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data; struct ocfs2_alloc_context *data_ac = NULL; - struct page **pages = NULL; - loff_t end = osb->s_clustersize; + struct page *page = NULL; struct ocfs2_extent_tree et; int did_quota = 0; has_data = i_size_read(inode) ? 1 : 0; if (has_data) { - pages = kcalloc(ocfs2_pages_per_cluster(osb->sb), - sizeof(struct page *), GFP_NOFS); - if (pages == NULL) { - ret = -ENOMEM; - mlog_errno(ret); - return ret; - } - ret = ocfs2_reserve_clusters(osb, 1, &data_ac); if (ret) { mlog_errno(ret); - goto free_pages; + goto out; } } @@ -7093,7 +7084,8 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode, } if (has_data) { - unsigned int page_end; + unsigned int page_end = min_t(unsigned, PAGE_SIZE, + osb->s_clustersize); u64 phys; ret = dquot_alloc_space_nodirty(inode, @@ -7117,15 +7109,8 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode, */ block = phys = ocfs2_clusters_to_blocks(inode->i_sb, bit_off); - /* - * Non sparse file systems zero on extend, so no need - * to do that now. - */ - if (!ocfs2_sparse_alloc(osb) && - PAGE_SIZE < osb->s_clustersize) - end = PAGE_SIZE; - - ret = ocfs2_grab_eof_pages(inode, 0, end, pages, &num_pages); + ret = ocfs2_grab_eof_pages(inode, 0, page_end, &page, + &num_pages); if (ret) { mlog_errno(ret); need_free = 1; @@ -7136,20 +7121,15 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode, * This should populate the 1st page for us and mark * it up to date. */ - ret = ocfs2_read_inline_data(inode, pages[0], di_bh); + ret = ocfs2_read_inline_data(inode, page, di_bh); if (ret) { mlog_errno(ret); need_free = 1; goto out_unlock; } - page_end = PAGE_SIZE; - if (PAGE_SIZE > osb->s_clustersize) - page_end = osb->s_clustersize; - - for (i = 0; i < num_pages; i++) - ocfs2_map_and_dirty_page(inode, handle, 0, page_end, - pages[i], i > 0, &phys); + ocfs2_map_and_dirty_page(inode, handle, 0, page_end, page, 0, + &phys); } spin_lock(&oi->ip_lock); @@ -7180,8 +7160,8 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode, } out_unlock: - if (pages) - ocfs2_unlock_and_free_pages(pages, num_pages); + if (page) + ocfs2_unlock_and_free_pages(&page, num_pages); out_commit: if (ret < 0 && did_quota) @@ -7205,8 +7185,6 @@ out_commit: out: if (data_ac) ocfs2_free_alloc_context(data_ac); -free_pages: - kfree(pages); return ret; } From b15fa9224e6e1239414525d8d556d824701849fc Mon Sep 17 00:00:00 2001 From: Valentin Vidic Date: Mon, 18 Oct 2021 15:15:42 -0700 Subject: [PATCH 07/19] ocfs2: mount fails with buffer overflow in strlen Starting with kernel 5.11 built with CONFIG_FORTIFY_SOURCE mouting an ocfs2 filesystem with either o2cb or pcmk cluster stack fails with the trace below. Problem seems to be that strings for cluster stack and cluster name are not guaranteed to be null terminated in the disk representation, while strlcpy assumes that the source string is always null terminated. This causes a read outside of the source string triggering the buffer overflow detection. detected buffer overflow in strlen ------------[ cut here ]------------ kernel BUG at lib/string.c:1149! invalid opcode: 0000 [#1] SMP PTI CPU: 1 PID: 910 Comm: mount.ocfs2 Not tainted 5.14.0-1-amd64 #1 Debian 5.14.6-2 RIP: 0010:fortify_panic+0xf/0x11 ... Call Trace: ocfs2_initialize_super.isra.0.cold+0xc/0x18 [ocfs2] ocfs2_fill_super+0x359/0x19b0 [ocfs2] mount_bdev+0x185/0x1b0 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x454/0xa20 __x64_sys_mount+0x103/0x140 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae Link: https://lkml.kernel.org/r/20210929180654.32460-1-vvidic@valentin-vidic.from.hr Signed-off-by: Valentin Vidic Reviewed-by: Joseph Qi Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Gang He Cc: Jun Piao Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ocfs2/super.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index c86bd4e60e20..5c914ce9b3ac 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -2167,11 +2167,17 @@ static int ocfs2_initialize_super(struct super_block *sb, } if (ocfs2_clusterinfo_valid(osb)) { + /* + * ci_stack and ci_cluster in ocfs2_cluster_info may not be null + * terminated, so make sure no overflow happens here by using + * memcpy. Destination strings will always be null terminated + * because osb is allocated using kzalloc. + */ osb->osb_stackflags = OCFS2_RAW_SB(di)->s_cluster_info.ci_stackflags; - strlcpy(osb->osb_cluster_stack, + memcpy(osb->osb_cluster_stack, OCFS2_RAW_SB(di)->s_cluster_info.ci_stack, - OCFS2_STACK_LABEL_LEN + 1); + OCFS2_STACK_LABEL_LEN); if (strlen(osb->osb_cluster_stack) != OCFS2_STACK_LABEL_LEN) { mlog(ML_ERROR, "couldn't mount because of an invalid " @@ -2180,9 +2186,9 @@ static int ocfs2_initialize_super(struct super_block *sb, status = -EINVAL; goto bail; } - strlcpy(osb->osb_cluster_name, + memcpy(osb->osb_cluster_name, OCFS2_RAW_SB(di)->s_cluster_info.ci_cluster, - OCFS2_CLUSTER_NAME_LEN + 1); + OCFS2_CLUSTER_NAME_LEN); } else { /* The empty string is identical with classic tools that * don't know about s_cluster_info. */ From 5173ed72bcfcddda21ff274ee31c6472fa150f29 Mon Sep 17 00:00:00 2001 From: Peng Fan Date: Mon, 18 Oct 2021 15:15:45 -0700 Subject: [PATCH 08/19] memblock: check memory total_size mem=[X][G|M] is broken on ARM64 platform, there are cases that even type.cnt is 1, but total_size is not 0 because regions are merged into 1. So only check 'cnt' is not enough, total_size should be used, othersize bootargs 'mem=[X][G|B]' not work anymore. Link: https://lkml.kernel.org/r/20210930024437.32598-1-peng.fan@oss.nxp.com Fixes: e888fa7bb882 ("memblock: Check memory add/cap ordering") Signed-off-by: Peng Fan Reviewed-by: Mike Rapoport Cc: Geert Uytterhoeven Cc: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memblock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memblock.c b/mm/memblock.c index 5c3503c98b2f..b91df5cf54d3 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1692,7 +1692,7 @@ void __init memblock_cap_memory_range(phys_addr_t base, phys_addr_t size) if (!size) return; - if (memblock.memory.cnt <= 1) { + if (!memblock_memory->total_size) { pr_warn("%s: No memory registered yet\n", __func__); return; } From 6d2aec9e123bb9c49cb5c7fc654f25f81e688e8c Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Mon, 18 Oct 2021 15:15:49 -0700 Subject: [PATCH 09/19] mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind() syzbot reported access to unitialized memory in mbind() [1] Issue came with commit bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes") This commit added a new bit in MPOL_MODE_FLAGS, but only checked valid combination (MPOL_F_NUMA_BALANCING can only be used with MPOL_BIND) in do_set_mempolicy() This patch moves the check in sanitize_mpol_flags() so that it is also used by mbind() [1] BUG: KMSAN: uninit-value in __mpol_equal+0x567/0x590 mm/mempolicy.c:2260 __mpol_equal+0x567/0x590 mm/mempolicy.c:2260 mpol_equal include/linux/mempolicy.h:105 [inline] vma_merge+0x4a1/0x1e60 mm/mmap.c:1190 mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811 do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333 kernel_mbind mm/mempolicy.c:1483 [inline] __do_sys_mbind mm/mempolicy.c:1490 [inline] __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486 __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Uninit was created at: slab_alloc_node mm/slub.c:3221 [inline] slab_alloc mm/slub.c:3230 [inline] kmem_cache_alloc+0x751/0xff0 mm/slub.c:3235 mpol_new mm/mempolicy.c:293 [inline] do_mbind+0x912/0x15f0 mm/mempolicy.c:1289 kernel_mbind mm/mempolicy.c:1483 [inline] __do_sys_mbind mm/mempolicy.c:1490 [inline] __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486 __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae ===================================================== Kernel panic - not syncing: panic_on_kmsan set ... CPU: 0 PID: 15049 Comm: syz-executor.0 Tainted: G B 5.15.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1ff/0x28e lib/dump_stack.c:106 dump_stack+0x25/0x28 lib/dump_stack.c:113 panic+0x44f/0xdeb kernel/panic.c:232 kmsan_report+0x2ee/0x300 mm/kmsan/report.c:186 __msan_warning+0xd7/0x150 mm/kmsan/instrumentation.c:208 __mpol_equal+0x567/0x590 mm/mempolicy.c:2260 mpol_equal include/linux/mempolicy.h:105 [inline] vma_merge+0x4a1/0x1e60 mm/mmap.c:1190 mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811 do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333 kernel_mbind mm/mempolicy.c:1483 [inline] __do_sys_mbind mm/mempolicy.c:1490 [inline] __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486 __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x44/0xae Link: https://lkml.kernel.org/r/20211001215630.810592-1-eric.dumazet@gmail.com Fixes: bda420b98505 ("numa balancing: migrate on fault among multiple bound nodes") Signed-off-by: Eric Dumazet Reported-by: syzbot Acked-by: Mel Gorman Cc: "Huang, Ying" Cc: Matthew Wilcox Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 1592b081c58e..d12e0608fced 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -856,16 +856,6 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, goto out; } - if (flags & MPOL_F_NUMA_BALANCING) { - if (new && new->mode == MPOL_BIND) { - new->flags |= (MPOL_F_MOF | MPOL_F_MORON); - } else { - ret = -EINVAL; - mpol_put(new); - goto out; - } - } - ret = mpol_set_nodemask(new, nodes, scratch); if (ret) { mpol_put(new); @@ -1458,7 +1448,11 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) return -EINVAL; if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) return -EINVAL; - + if (*flags & MPOL_F_NUMA_BALANCING) { + if (*mode != MPOL_BIND) + return -EINVAL; + *flags |= (MPOL_F_MOF | MPOL_F_MORON); + } return 0; } From 2127d22509aec3a83dffb2a3c736df7ba747a7ce Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Mon, 18 Oct 2021 15:15:52 -0700 Subject: [PATCH 10/19] mm, slub: fix two bugs in slab_debug_trace_open() Patch series "Fixups for slub". This series contains various bug fixes for slub. We fix memoryleak, use-afer-free, NULL pointer dereferencing and so on in slub. More details can be found in the respective changelogs. This patch (of 5): It's possible that __seq_open_private() will return NULL. So we should check it before using lest dereferencing NULL pointer. And in error paths, we forgot to release private buffer via seq_release_private(). Memory will leak in these paths. Link: https://lkml.kernel.org/r/20210916123920.48704-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210916123920.48704-2-linmiaohe@huawei.com Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs") Signed-off-by: Miaohe Lin Reviewed-by: Vlastimil Babka Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Greg Kroah-Hartman Cc: Faiyaz Mohammed Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Kees Cook Cc: Bharata B Rao Cc: Roman Gushchin Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 3d2025f7163b..ed160b6c54f8 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -6108,10 +6108,15 @@ static int slab_debug_trace_open(struct inode *inode, struct file *filep) struct kmem_cache *s = file_inode(filep)->i_private; unsigned long *obj_map; - obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL); - if (!obj_map) + if (!t) return -ENOMEM; + obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL); + if (!obj_map) { + seq_release_private(inode, filep); + return -ENOMEM; + } + if (strcmp(filep->f_path.dentry->d_name.name, "alloc_traces") == 0) alloc = TRACK_ALLOC; else @@ -6119,6 +6124,7 @@ static int slab_debug_trace_open(struct inode *inode, struct file *filep) if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL)) { bitmap_free(obj_map); + seq_release_private(inode, filep); return -ENOMEM; } From 899447f669da76cc3605665e1a95ee877bc464cc Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Mon, 18 Oct 2021 15:15:55 -0700 Subject: [PATCH 11/19] mm, slub: fix mismatch between reconstructed freelist depth and cnt If object's reuse is delayed, it will be excluded from the reconstructed freelist. But we forgot to adjust the cnt accordingly. So there will be a mismatch between reconstructed freelist depth and cnt. This will lead to free_debug_processing() complaining about freelist count or a incorrect slub inuse count. Link: https://lkml.kernel.org/r/20210916123920.48704-3-linmiaohe@huawei.com Fixes: c3895391df38 ("kasan, slub: fix handling of kasan_slab_free hook") Signed-off-by: Miaohe Lin Reviewed-by: Vlastimil Babka Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Bharata B Rao Cc: Christoph Lameter Cc: David Rientjes Cc: Faiyaz Mohammed Cc: Greg Kroah-Hartman Cc: Joonsoo Kim Cc: Kees Cook Cc: Pekka Enberg Cc: Roman Gushchin Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index ed160b6c54f8..a56a6423d4e8 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1701,7 +1701,8 @@ static __always_inline bool slab_free_hook(struct kmem_cache *s, } static inline bool slab_free_freelist_hook(struct kmem_cache *s, - void **head, void **tail) + void **head, void **tail, + int *cnt) { void *object; @@ -1728,6 +1729,12 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s, *head = object; if (!*tail) *tail = object; + } else { + /* + * Adjust the reconstructed freelist depth + * accordingly if object's reuse is delayed. + */ + --(*cnt); } } while (object != old_tail); @@ -3480,7 +3487,7 @@ static __always_inline void slab_free(struct kmem_cache *s, struct page *page, * With KASAN enabled slab_free_freelist_hook modifies the freelist * to remove objects, whose reuse must be delayed. */ - if (slab_free_freelist_hook(s, &head, &tail)) + if (slab_free_freelist_hook(s, &head, &tail, &cnt)) do_slab_free(s, page, head, tail, cnt, addr); } From 9037c57681d25e4dcc442d940d6dbe24dd31f461 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Mon, 18 Oct 2021 15:15:59 -0700 Subject: [PATCH 12/19] mm, slub: fix potential memoryleak in kmem_cache_open() In error path, the random_seq of slub cache might be leaked. Fix this by using __kmem_cache_release() to release all the relevant resources. Link: https://lkml.kernel.org/r/20210916123920.48704-4-linmiaohe@huawei.com Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization") Signed-off-by: Miaohe Lin Reviewed-by: Vlastimil Babka Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Bharata B Rao Cc: Christoph Lameter Cc: David Rientjes Cc: Faiyaz Mohammed Cc: Greg Kroah-Hartman Cc: Joonsoo Kim Cc: Kees Cook Cc: Pekka Enberg Cc: Roman Gushchin Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index a56a6423d4e8..bf1793fb4ce5 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4210,8 +4210,8 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags) if (alloc_kmem_cache_cpus(s)) return 0; - free_kmem_cache_nodes(s); error: + __kmem_cache_release(s); return -EINVAL; } From 67823a544414def2a36c212abadb55b23bcda00c Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Mon, 18 Oct 2021 15:16:02 -0700 Subject: [PATCH 13/19] mm, slub: fix potential use-after-free in slab_debugfs_fops When sysfs_slab_add failed, we shouldn't call debugfs_slab_add() for s because s will be freed soon. And slab_debugfs_fops will use s later leading to a use-after-free. Link: https://lkml.kernel.org/r/20210916123920.48704-5-linmiaohe@huawei.com Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs") Signed-off-by: Miaohe Lin Reviewed-by: Vlastimil Babka Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Bharata B Rao Cc: Christoph Lameter Cc: David Rientjes Cc: Faiyaz Mohammed Cc: Greg Kroah-Hartman Cc: Joonsoo Kim Cc: Kees Cook Cc: Pekka Enberg Cc: Roman Gushchin Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index bf1793fb4ce5..f3df0f04a472 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4887,13 +4887,15 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags) return 0; err = sysfs_slab_add(s); - if (err) + if (err) { __kmem_cache_release(s); + return err; + } if (s->flags & SLAB_STORE_USER) debugfs_slab_add(s); - return err; + return 0; } void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller) From 3ddd60268c24bcac9d744404cc277e9dc52fe6b6 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Mon, 18 Oct 2021 15:16:06 -0700 Subject: [PATCH 14/19] mm, slub: fix incorrect memcg slab count for bulk free kmem_cache_free_bulk() will call memcg_slab_free_hook() for all objects when doing bulk free. So we shouldn't call memcg_slab_free_hook() again for bulk free to avoid incorrect memcg slab count. Link: https://lkml.kernel.org/r/20210916123920.48704-6-linmiaohe@huawei.com Fixes: d1b2cf6cb84a ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()") Signed-off-by: Miaohe Lin Reviewed-by: Vlastimil Babka Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Bharata B Rao Cc: Christoph Lameter Cc: David Rientjes Cc: Faiyaz Mohammed Cc: Greg Kroah-Hartman Cc: Joonsoo Kim Cc: Kees Cook Cc: Pekka Enberg Cc: Roman Gushchin Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slub.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index f3df0f04a472..d8f77346376d 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -3420,7 +3420,9 @@ static __always_inline void do_slab_free(struct kmem_cache *s, struct kmem_cache_cpu *c; unsigned long tid; - memcg_slab_free_hook(s, &head, 1); + /* memcg_slab_free_hook() is already called for bulk free. */ + if (!tail) + memcg_slab_free_hook(s, &head, 1); redo: /* * Determine the currently cpus per cpu slab. From b0e901280d9860a0a35055f220e8e457f300f40a Mon Sep 17 00:00:00 2001 From: Lukas Bulwahn Date: Mon, 18 Oct 2021 15:16:09 -0700 Subject: [PATCH 15/19] elfcore: correct reference to CONFIG_UML Commit 6e7b64b9dd6d ("elfcore: fix building with clang") introduces special handling for two architectures, ia64 and User Mode Linux. However, the wrong name, i.e., CONFIG_UM, for the intended Kconfig symbol for User-Mode Linux was used. Although the directory for User Mode Linux is ./arch/um; the Kconfig symbol for this architecture is called CONFIG_UML. Luckily, ./scripts/checkkconfigsymbols.py warns on non-existing configs: UM Referencing files: include/linux/elfcore.h Similar symbols: UML, NUMA Correct the name of the config to the intended one. [akpm@linux-foundation.org: fix um/x86_64, per Catalin] Link: https://lkml.kernel.org/r/20211006181119.2851441-1-catalin.marinas@arm.com Link: https://lkml.kernel.org/r/YV6pejGzLy5ppEpt@arm.com Link: https://lkml.kernel.org/r/20211006082209.417-1-lukas.bulwahn@gmail.com Fixes: 6e7b64b9dd6d ("elfcore: fix building with clang") Signed-off-by: Lukas Bulwahn Cc: Arnd Bergmann Cc: Nathan Chancellor Cc: Nick Desaulniers Cc: Catalin Marinas Cc: Barret Rhoden Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/elfcore.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/elfcore.h b/include/linux/elfcore.h index 2aaa15779d50..957ebec35aad 100644 --- a/include/linux/elfcore.h +++ b/include/linux/elfcore.h @@ -109,7 +109,7 @@ static inline int elf_core_copy_task_fpregs(struct task_struct *t, struct pt_reg #endif } -#if defined(CONFIG_UM) || defined(CONFIG_IA64) +#if (defined(CONFIG_UML) && defined(CONFIG_X86_32)) || defined(CONFIG_IA64) /* * These functions parameterize elf_core_dump in fs/binfmt_elf.c to write out * extra segments containing the gate DSO contents. Dumping its From 032146cda85566abcd1c4884d9d23e4e30a07e9a Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 18 Oct 2021 15:16:12 -0700 Subject: [PATCH 16/19] vfs: check fd has read access in kernel_read_file_from_fd() If we open a file without read access and then pass the fd to a syscall whose implementation calls kernel_read_file_from_fd(), we get a warning from __kernel_read(): if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ))) This currently affects both finit_module() and kexec_file_load(), but it could affect other syscalls in the future. Link: https://lkml.kernel.org/r/20211007220110.600005-1-willy@infradead.org Fixes: b844f0ecbc56 ("vfs: define kernel_copy_file_from_fd()") Signed-off-by: Matthew Wilcox (Oracle) Reported-by: Hao Sun Reviewed-by: Kees Cook Acked-by: Christian Brauner Cc: Al Viro Cc: Mimi Zohar Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/kernel_read_file.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c index 87aac4c72c37..1b07550485b9 100644 --- a/fs/kernel_read_file.c +++ b/fs/kernel_read_file.c @@ -178,7 +178,7 @@ int kernel_read_file_from_fd(int fd, loff_t offset, void **buf, struct fd f = fdget(fd); int ret = -EBADF; - if (!f.file) + if (!f.file || !(f.file->f_mode & FMODE_READ)) goto out; ret = kernel_read_file(f.file, offset, buf, buf_size, file_size, id); From 79f9bc5843142b649575f887dccdf1c07ad75c20 Mon Sep 17 00:00:00 2001 From: Sean Christopherson Date: Mon, 18 Oct 2021 15:16:16 -0700 Subject: [PATCH 17/19] mm/secretmem: fix NULL page->mapping dereference in page_is_secretmem() Check for a NULL page->mapping before dereferencing the mapping in page_is_secretmem(), as the page's mapping can be nullified while gup() is running, e.g. by reclaim or truncation. BUG: kernel NULL pointer dereference, address: 0000000000000068 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 6 PID: 4173897 Comm: CPU 3/KVM Tainted: G W RIP: 0010:internal_get_user_pages_fast+0x621/0x9d0 Code: <48> 81 7a 68 80 08 04 bc 0f 85 21 ff ff 8 89 c7 be RSP: 0018:ffffaa90087679b0 EFLAGS: 00010046 RAX: ffffe3f37905b900 RBX: 00007f2dd561e000 RCX: ffffe3f37905b934 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffe3f37905b900 ... CR2: 0000000000000068 CR3: 00000004c5898003 CR4: 00000000001726e0 Call Trace: get_user_pages_fast_only+0x13/0x20 hva_to_pfn+0xa9/0x3e0 try_async_pf+0xa1/0x270 direct_page_fault+0x113/0xad0 kvm_mmu_page_fault+0x69/0x680 vmx_handle_exit+0xe1/0x5d0 kvm_arch_vcpu_ioctl_run+0xd81/0x1c70 kvm_vcpu_ioctl+0x267/0x670 __x64_sys_ioctl+0x83/0xa0 do_syscall_64+0x56/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Link: https://lkml.kernel.org/r/20211007231502.3552715-1-seanjc@google.com Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Sean Christopherson Reported-by: Darrick J. Wong Reported-by: Stephen Tested-by: Darrick J. Wong Reviewed-by: David Hildenbrand Reviewed-by: Mike Rapoport Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/secretmem.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index 21c3771e6a56..988528b5da43 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -23,7 +23,7 @@ static inline bool page_is_secretmem(struct page *page) mapping = (struct address_space *) ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); - if (mapping != page->mapping) + if (!mapping || mapping != page->mapping) return false; return mapping->a_ops == &secretmem_aops; From 1ca7554d05ac038c98271f8968ed821266ecaa9c Mon Sep 17 00:00:00 2001 From: Marek Szyprowski Date: Mon, 18 Oct 2021 15:16:19 -0700 Subject: [PATCH 18/19] mm/thp: decrease nr_thps in file's mapping on THP split Decrease nr_thps counter in file's mapping to ensure that the page cache won't be dropped excessively on file write access if page has been already split. I've tried a test scenario running a big binary, kernel remaps it with THPs, then force a THP split with /sys/kernel/debug/split_huge_pages. During any further open of that binary with O_RDWR or O_WRITEONLY kernel drops page cache for it, because of non-zero thps counter. Link: https://lkml.kernel.org/r/20211012120237.2600-1-m.szyprowski@samsung.com Signed-off-by: Marek Szyprowski Fixes: 09d91cda0e82 ("mm,thp: avoid writes to file with THP in pagecache") Fixes: 06d3eff62d9d ("mm/thp: fix node page state in split_huge_page_to_list()") Acked-by: Matthew Wilcox (Oracle) Reviewed-by: Yang Shi Cc: Cc: Song Liu Cc: Rik van Riel Cc: "Kirill A . Shutemov" Cc: Johannes Weiner Cc: Hillf Danton Cc: Hugh Dickins Cc: William Kucharski Cc: Oleg Nesterov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5e9ef0fc261e..92192cb086c7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2700,12 +2700,14 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (mapping) { int nr = thp_nr_pages(head); - if (PageSwapBacked(head)) + if (PageSwapBacked(head)) { __mod_lruvec_page_state(head, NR_SHMEM_THPS, -nr); - else + } else { __mod_lruvec_page_state(head, NR_FILE_THPS, -nr); + filemap_nr_thps_dec(mapping); + } } __split_huge_page(page, list, end); From 362d5dfc483c7786bcfff27ff48e7b6f493f7c5a Mon Sep 17 00:00:00 2001 From: Andrej Shadura Date: Mon, 18 Oct 2021 15:16:22 -0700 Subject: [PATCH 19/19] mailmap: add Andrej Shadura Add a mapping for my old work email for BelDisplayTech to the personal email, and make sure the Collabora email has the correct spelling of the first name. Link: https://lkml.kernel.org/r/20210917091016.30232-1-andrew.shadura@collabora.co.uk Signed-off-by: Andrej Shadura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- .mailmap | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.mailmap b/.mailmap index 6e849110cb4e..90e614d2bf7e 100644 --- a/.mailmap +++ b/.mailmap @@ -33,6 +33,8 @@ Al Viro Andi Kleen Andi Shyti Andreas Herrmann +Andrej Shadura +Andrej Shadura Andrew Morton Andrew Murray Andrew Murray