linux/fs
Yafang Shao 681ce86235 vfs: Delete the associated dentry when deleting a file
Our applications, built on Elasticsearch[0], frequently create and
delete files.  These applications operate within containers, some with a
memory limit exceeding 100GB.  Over prolonged periods, the accumulation
of negative dentries within these containers can amount to tens of
gigabytes.

Upon container exit, directories are deleted.  However, due to the
numerous associated dentries, this process can be time-consuming.  Our
users have expressed frustration with this prolonged exit duration,
which constitutes our first issue.

Simultaneously, other processes may attempt to access the parent
directory of the Elasticsearch directories.  Since the task responsible
for deleting the dentries holds the inode lock, processes attempting
directory lookup experience significant delays.  This issue, our second
problem, is easily demonstrated:

  - Task 1 generates negative dentries:
  $ pwd
  ~/test
  $ mkdir es && cd es/ && ./create_and_delete_files.sh

  [ After generating tens of GB dentries ]

  $ cd ~/test && rm -rf es

  [ It will take a long duration to finish ]

  - Task 2 attempts to lookup the 'test/' directory
  $ pwd
  ~/test
  $ ls

  The 'ls' command in Task 2 experiences prolonged execution as Task 1
  is deleting the dentries.

We've devised a solution to address both issues by deleting associated
dentry when removing a file.  Interestingly, we've noted that a similar
patch was proposed years ago[1], although it was rejected citing the
absence of tangible issues caused by negative dentries.  Given our
current challenges, we're resubmitting the proposal.  All relevant
stakeholders from previous discussions have been included for reference.

Some alternative solutions are also under discussion[2][3], such as
shrinking child dentries outside of the parent inode lock or even
asynchronously shrinking child dentries.  However, given the
straightforward nature of the current solution, I believe this approach
is still necessary.

[ NOTE! This is a pretty fundamental change in how we deal with
  unlinking dentries, and it doesn't change the fact that you can have
  lots of negative dentries from just doing negative lookups.

  But the kernel test robot is at least initially happy with this from a
  performance angle, so I'm applying this ASAP just to get more testing
  and as a "known fix for an issue people hit in real life".

  Put another way: we should still look at the alternatives, and this
  patch may get reverted if somebody finds a performance regression on
  some other load.       - Linus ]

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://github.com/elastic/elasticsearch [0]
Link: https://patchwork.kernel.org/project/linux-fsdevel/patch/1502099673-31620-1-git-send-email-wangkai86@huawei.com [1]
Link: https://lore.kernel.org/linux-fsdevel/20240511200240.6354-2-torvalds@linux-foundation.org/ [2]
Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wjEMf8Du4UFzxuToGDnF3yLaMcrYeyNAaH1NJWa6fwcNQ@mail.gmail.com/ [3]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Wangkai <wangkai86@huawei.com>
Cc: Colin Walters <walters@verbum.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/all/202405221518.ecea2810-oliver.sang@intel.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-05-22 08:49:13 -07:00
..
9p netfs: Remove the old writeback code 2024-05-01 18:07:38 +01:00
adfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
affs affs: remove SLAB_MEM_SPREAD flag usage 2024-02-26 11:36:28 +01:00
afs vfs-6.10.netfs 2024-05-13 12:14:03 -07:00
autofs dcache stuff for this cycle 2024-01-11 20:11:35 -08:00
bcachefs bd_inode series 2024-05-21 09:51:42 -07:00
befs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
bfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
btrfs bd_inode series 2024-05-21 09:51:42 -07:00
cachefiles Assorted commits that had missed the last merge window... 2024-05-21 13:11:44 -07:00
ceph netfs: Switch to using unsigned long long rather than loff_t 2024-05-01 18:07:35 +01:00
coda mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
configfs
cramfs use ->bd_mapping instead of ->bd_inode->i_mapping 2024-05-03 02:36:51 -04:00
crypto The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
debugfs vfs: Convert debugfs to use the new mount API 2024-03-26 09:04:54 +01:00
devpts fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
dlm dlm: return -ENOMEM if ls_recover_buf fails 2024-04-23 16:08:55 -05:00
ecryptfs hardening updates for 6.10-rc1 2024-05-13 14:14:05 -07:00
efivarfs efi: Clear up misconceptions about a maximum variable name size 2024-04-13 10:33:02 +02:00
efs efs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:33 +01:00
erofs bd_inode series 2024-05-21 09:51:42 -07:00
exfat exfat: zero the reserved fields of file and stream extension dentries 2024-04-25 21:59:59 +09:00
exportfs fs: Create a generic is_dot_dotdot() utility 2024-01-23 10:58:56 -05:00
ext2 ext2: Remove LEGACY_DIRECT_IO dependency 2024-05-03 11:50:28 +02:00
ext4 bd_inode series 2024-05-21 09:51:42 -07:00
f2fs f2fs update for 6.10-rc1 2024-05-20 13:23:43 -07:00
fat fs: add kernel-doc comments to fat_parse_long() 2024-04-25 21:07:02 -07:00
freevxfs freevxfs: Convert freevxfs to the new mount API. 2024-03-26 09:04:53 +01:00
fuse virtiofs: include a newline in sysfs tag 2024-05-08 09:31:21 +02:00
gfs2 bd_inode series 2024-05-21 09:51:42 -07:00
hfs hfs: really remove hfs_writepage 2023-12-29 11:58:34 -08:00
hfsplus hfsplus: refactor copy_name to not use strncpy 2024-04-24 16:55:28 -07:00
hostfs hostfs: use d_splice_alias() calling conventions to simplify failure exits 2023-12-21 12:51:00 -05:00
hpfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
hugetlbfs The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
iomap Kbuild updates for v6.10 2024-05-18 12:39:20 -07:00
isofs isofs: Use *-y instead of *-objs in Makefile 2024-05-09 18:09:57 +02:00
jbd2 bd_inode series 2024-05-21 09:51:42 -07:00
jffs2 jffs2: prevent xattr node from overflowing the eraseblock 2024-04-17 13:21:34 +02:00
jfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
kernfs kernfs: annotate different lockdep class for of->mutex of writable files 2024-04-14 06:55:46 -04:00
lockd lockd: host: Remove unnecessary statements'host = NULL;' 2024-05-06 09:07:20 -04:00
minix minix: convert minix to use the new mount api 2024-03-26 09:04:55 +01:00
netfs cifs: Fix locking in cifs_strict_readv() 2024-05-13 17:02:05 -05:00
nfs The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
nfs_common
nfsd \n 2024-05-20 12:31:43 -07:00
nilfs2 bd_inode series 2024-05-21 09:51:42 -07:00
nls
notify Revert "fanotify: remove unneeded sub-zero check for unsigned value" 2024-05-20 12:43:58 -07:00
ntfs3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2024-05-02 12:06:25 -07:00
ocfs2 Mainly singleton patches, documented in their respective changelogs. 2024-05-19 14:02:03 -07:00
omfs
openpromfs openpromfs: finish conversion to the new mount API 2024-03-26 09:04:54 +01:00
orangefs orangefs: fix out-of-bounds fsid access 2024-05-14 17:44:14 -07:00
overlayfs Assorted commits that had missed the last merge window... 2024-05-21 13:11:44 -07:00
proc Assorted commits that had missed the last merge window... 2024-05-21 13:11:44 -07:00
pstore pstore/zone: Don't clear memory twice 2024-03-09 12:33:22 -08:00
qnx4 mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
qnx6 qnx6: convert qnx6 to use the new mount api 2024-03-26 09:04:53 +01:00
quota quota: fix to propagate error of mark_dquot_dirty() to caller 2024-04-12 14:52:29 +02:00
ramfs mm: switch mm->get_unmapped_area() to a flag 2024-04-25 20:56:25 -07:00
reiserfs getting rid of bogus set_blocksize() uses, switching it 2024-05-21 08:34:51 -07:00
romfs fs,block: yield devices early 2024-03-27 13:17:15 +01:00
smb cifs: fix data corruption in read after invalidate 2024-05-15 17:22:59 -05:00
squashfs Mainly singleton patches, documented in their respective changelogs. 2024-05-19 14:02:03 -07:00
sysfs fs: sysfs: Fix reference leak in sysfs_break_active_protection() 2024-04-11 15:16:48 +02:00
sysv sysv: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
tracefs tracing cleanups for v6.10: 2024-05-17 18:34:27 -07:00
ubifs This pull request contains updates for UBI and UBIFS: 2024-03-21 15:09:29 -07:00
udf udf: Use a folio in udf_write_end() 2024-04-23 15:37:02 +02:00
ufs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
unicode kbuild: use $(src) instead of $(srctree)/$(src) for source directory 2024-05-10 04:34:52 +09:00
vboxsf vboxsf: explicitly deny setlease attempts 2024-04-03 16:06:39 +02:00
verity fsverity: use register_sysctl_init() to avoid kmemleak warning 2024-05-03 08:30:58 -07:00
xfs getting rid of bogus set_blocksize() uses, switching it 2024-05-21 08:34:51 -07:00
zonefs zonefs: Use str_plural() to fix Coccinelle warning 2024-04-10 07:23:47 +09:00
aio.c Assorted commits that had missed the last merge window... 2024-05-21 13:11:44 -07:00
anon_inodes.c fs: Create anon_inode_getfile_fmode() 2024-04-26 10:33:05 +02:00
attr.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
backing-file.c fs: Use KMEM_CACHE instead of kmem_cache_create 2024-02-02 13:11:50 +01:00
bad_inode.c
binfmt_elf_fdpic.c binfmt_elf_fdpic: fix /proc/<pid>/auxv 2024-04-24 15:55:28 -07:00
binfmt_elf_test.c
binfmt_elf.c Mainly singleton patches, documented in their respective changelogs. 2024-05-19 14:02:03 -07:00
binfmt_flat.c
binfmt_misc.c execve updates for v6.7-rc1 2023-10-30 19:28:19 -10:00
binfmt_script.c
buffer.c bd_inode series 2024-05-21 09:51:42 -07:00
char_dev.c As usual, lots of singleton and doubleton patches all over the tree and 2023-11-02 20:53:31 -10:00
compat_binfmt_elf.c
coredump.c fs/coredump: Enable dynamic configuration of max file note size 2024-05-08 09:53:00 -07:00
d_path.c
dax.c dax: use huge_zero_folio 2024-04-25 20:56:20 -07:00
dcache.c vfs: Delete the associated dentry when deleting a file 2024-05-22 08:49:13 -07:00
direct-io.c fs/direct-io: remove redundant assignment to variable retval 2024-04-11 10:21:24 +02:00
drop_caches.c
eventfd.c eventfd: strictly check the count parameter of eventfd_write to avoid inputting illegal strings 2024-02-08 10:12:26 +01:00
eventpoll.c epoll: be better about file lifetimes 2024-05-05 14:00:48 -07:00
exec.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
fcntl.c fcntl: add F_DUPFD_QUERY fcntl() 2024-05-10 08:26:31 +02:00
fhandle.c fs: Annotate struct file_handle with __counted_by() and use struct_size() 2024-04-05 15:53:47 +02:00
file_table.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
file.c get_file_rcu(): no need to check for NULL separately 2024-04-15 16:03:24 -04:00
filesystems.c
fs_context.c
fs_parser.c __fs_parse: Correct a documentation comment 2024-02-02 13:11:50 +01:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c fs/writeback: remove unnecessary return in writeback_inodes_sb 2024-04-05 15:53:45 +02:00
fsopen.c
init.c
inode.c bcachefs updates for 6.9 2024-03-15 09:00:09 -07:00
internal.h pidfs: remove config option 2024-03-13 12:53:53 -07:00
ioctl.c fs/ioctl: Add a comment to keep the logic in sync with LSM policies 2024-05-13 06:58:35 +02:00
Kconfig - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
Kconfig.binfmt
kernel_read_file.c
libfs.c shmem: Fix shmem_rename2() 2024-04-17 13:49:44 +02:00
locks.c filelock: fix deadlock detection in POSIX locking 2024-02-20 09:53:33 +01:00
Makefile vfs-6.9.pidfd 2024-03-11 10:21:06 -07:00
mbcache.c vfs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
mnt_idmapping.c fs/mnt_idmapping.c: Return -EINVAL when no map is written 2024-02-08 10:12:37 +01:00
mount.h mounts: keep list of mounts in an rbtree 2023-11-18 14:56:16 +01:00
mpage.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
namei.c vfs-6.10.misc 2024-05-13 11:40:06 -07:00
namespace.c fs: relax mount_setattr() permission checks 2024-02-07 21:16:29 +01:00
nsfs.c pidfs: remove config option 2024-03-13 12:53:53 -07:00
open.c do_dentry_open(): kill inode argument 2024-04-15 16:03:25 -04:00
pidfs.c fs/pidfs: make 'lsof' happy with our inode changes 2024-05-21 08:08:00 -07:00
pipe.c fs/pipe: Convert to lockdep_cmp_fn 2024-02-02 13:11:49 +01:00
pnode.c mounts: keep list of mounts in an rbtree 2023-11-18 14:56:16 +01:00
pnode.h
posix_acl.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
proc_namespace.c namespace: extract show_path() helper 2023-11-18 14:56:16 +01:00
read_write.c Assorted commits that had missed the last merge window... 2024-05-21 13:11:44 -07:00
readdir.c fsnotify: optionally pass access range in file permission hooks 2023-12-12 16:20:02 +01:00
remap_range.c vfs: export remap and write check helpers 2024-04-15 14:54:13 -07:00
select.c fs/select: rework stack allocation hack for clang 2024-02-20 09:23:52 +01:00
seq_file.c seq_file: Simplify __seq_puts() 2024-05-02 16:28:20 +02:00
signalfd.c signalfd: convert to ->read_iter() 2024-04-10 16:23:04 -06:00
splice.c remove call_{read,write}_iter() functions 2024-04-15 16:03:25 -04:00
stack.c
stat.c statx: stx_subvol 2024-03-26 09:01:18 +01:00
statfs.c
super.c \n 2024-05-20 12:31:43 -07:00
sync.c
sysctls.c fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
timerfd.c timerfd: convert to ->read_iter() 2024-04-10 16:23:02 -06:00
userfaultfd.c The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
utimes.c
xattr.c evm: Move to LSM infrastructure 2024-02-15 23:43:47 -05:00