linux/fs
Ritesh Harjani 07b5b8e1ac ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
There could be a race in function ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through group's bb_prealloc_list and
remove all the PAs and add to function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that the group->bb_prealloc_list is empty and will return 0.

Consider for a case where we have less number of groups
(for e.g. just group 0),
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(where we call for ext4_mb_discard_group_preallocations()).
But that is wrong, since 2nd thread should have waited for 1st thread
to release all the PAs and should have retried for allocation.
Since 1st thread was anyway going to discard the PAs.

The algorithm using this percpu seq counter goes below:
1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
   or free these blocks i.e. while marking those blocks as used/free in
   mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
   that the bb_prealloc_list is not empty and hence proceed for discarding
   of those PAs inside ext4_mb_discard_group_preallocations().

Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.

It can be well argued that why don't just check for grp->bb_free to
see if there are any free blocks to be allocated. So here are the two
concerns which were discussed:-

1. If for some reason the blocks available in the group are not
   appropriate for allocation logic (say for e.g.
   EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
   the retry logic may result into infinte looping since grp->bb_free is
   non-zero.

2. Also before preallocation was clubbed with block allocation with the
   same ext4_lock_group() held, there were lot of races where grp->bb_free
   could not be reliably relied upon.
Due to above, this patch considers discard_pa_seq logic to determine if
we should retry for block allocation. Say if there are are n threads
trying for block allocation and none of those could allocate or discard
any of the blocks, then all of those n threads will fail the block
allocation and return -ENOSPC error. (Since the seq counter for all of
those will match as no block allocation/discard was done during that
duration).

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
..
9p 9p: read only once on O_NONBLOCK 2020-03-27 09:29:56 +00:00
adfs fs/adfs: bigdir: Fix an error code in adfs_fplus_read() 2020-01-25 11:31:59 -05:00
affs affs: fix a memory leak in affs_remount 2019-11-18 14:26:43 +01:00
afs afs: Make record checking use TASK_UNINTERRUPTIBLE when appropriate 2020-04-24 16:33:32 +01:00
autofs LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() 2020-03-13 21:08:17 -04:00
befs fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
bfs fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
btrfs btrfs: fix gcc-4.8 build warning for struct initializer 2020-04-30 12:17:49 +02:00
cachefiles cachefiles: drop direct usage of ->bmap method. 2020-02-03 08:05:56 -05:00
ceph ceph: fix potential bad pointer deref in async dirops cb's 2020-04-13 19:33:47 +02:00
cifs cifs: fix uninitialised lease_key in open_shroot() 2020-04-22 20:29:11 -05:00
coda y2038: add inode timestamp clamping 2019-09-19 09:42:37 -07:00
configfs utimes: Clamp the timestamps in notify_change() 2019-12-08 19:10:50 -05:00
cramfs cramfs: switch to use of errofc() et.al. 2020-02-07 14:48:41 -05:00
crypto fscrypt updates for 5.7 2020-03-31 12:58:36 -07:00
debugfs debugfs: remove return value of debugfs_create_u32() 2020-04-17 17:08:50 +02:00
devpts devpts_pty_kill(): don't bother with d_delete() 2019-09-03 09:30:56 -04:00
dlm dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD 2019-12-18 18:07:31 +01:00
ecryptfs eCryptfs fixes for 5.6-rc3 2020-02-17 21:08:37 -08:00
efivarfs efi: Use more granular check for availability for variable services 2020-02-23 21:59:42 +01:00
efs fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
erofs erofs: handle corrupted images whose decompressed size less than it'd be 2020-03-03 23:40:52 +08:00
exfat exfat: truncate atimes to 2s granularity 2020-04-22 20:14:06 +09:00
exportfs race in exportfs_decode_fh() 2019-11-11 09:21:59 -05:00
ext2 ext2: fix empty body warnings when -Wextra is used 2020-03-23 13:01:37 +01:00
ext4 ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling 2020-06-03 23:16:53 -04:00
f2fs f2fs-for-5.7-rc1 2020-04-07 13:48:26 -07:00
fat fat: fix uninit-memory access for partial initialized inode 2020-03-06 07:06:09 -06:00
freevxfs fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
fscache proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
fuse fuse: fix stack use after return 2020-02-13 09:16:07 +01:00
gfs2 We've got a lot of patches (39) for this merge window. Most of these patches 2020-03-31 14:16:03 -07:00
hfs hfs/hfsplus: use 64-bit inode timestamps 2019-12-18 18:07:32 +01:00
hfsplus hfsplus: fix crash and filesystem corruption when deleting files 2020-04-10 15:36:20 -07:00
hostfs hostfs: Use kasprintf() instead of fixed buffer formatting 2020-03-29 23:23:00 +02:00
hpfs fs: compat_ioctl: move FITRIM emulation into file systems 2019-10-23 17:23:46 +02:00
hugetlbfs hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race 2020-04-02 09:35:32 -07:00
iomap fibmap: Warn and return an error in case of block > INT_MAX 2020-04-30 07:57:46 -07:00
isofs y2038: add inode timestamp clamping 2019-09-19 09:42:37 -07:00
jbd2 jbd2: improve comments about freeing data buffers whose page mapping is NULL 2020-03-05 20:25:05 -05:00
jffs2 fs_parse: fold fs_parameter_desc/fs_parameter_spec 2020-02-07 14:48:37 -05:00
jfs Trivial cleanup for jfs 2020-02-05 05:28:20 +00:00
kernfs kernfs: Add option to enable user xattrs 2020-03-16 15:53:47 -04:00
lockd proc: convert everything to "struct proc_ops" 2020-02-04 03:05:26 +00:00
minix fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
nfs NFS: Fix a race in __nfs_list_for_each_server() 2020-04-30 15:08:26 -04:00
nfs_common treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
nfsd SUNRPC: Fix backchannel RPC soft lockups 2020-04-17 12:40:31 -04:00
nilfs2 fs: compat_ioctl: move FITRIM emulation into file systems 2019-10-23 17:23:46 +02:00
nls treewide: Add SPDX license identifier - Makefile/Kconfig 2019-05-21 10:50:46 +02:00
notify fanotify: Fix the checks in fanotify_fsid_equal 2020-03-30 12:40:53 +02:00
ntfs fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t 2020-03-28 13:21:08 +01:00
ocfs2 dlmfs_file_write(): fix the bogosity in handling non-zero *ppos 2020-04-23 13:45:27 -04:00
omfs fs: omfs: Initialize filesystem timestamp ranges 2019-08-30 08:11:25 -07:00
openpromfs Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
orangefs orangefs: don't mess with I_DIRTY_TIMES in orangefs_flush 2020-04-08 09:39:11 -04:00
overlayfs ovl: enable xino automatically in more cases 2020-03-27 16:51:02 +01:00
proc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-04-25 12:25:32 -07:00
pstore pstore/ram: Replace zero-length array with flexible-array member 2020-03-09 14:45:40 -07:00
qnx4 fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
qnx6 fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
quota \n 2020-01-30 15:37:41 -08:00
ramfs fs_parse: fold fs_parameter_desc/fs_parameter_spec 2020-02-07 14:48:37 -05:00
reiserfs reiserfs: clean up several indentation issues 2020-04-07 10:43:44 -07:00
romfs Merge branch 'work.mount2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-09-19 10:06:57 -07:00
squashfs Merge branch 'work.mount2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-09-19 10:06:57 -07:00
sysfs sysfs: remove redundant __compat_only_sysfs_link_entry_to_kobj fn 2020-04-05 11:34:35 -07:00
sysv fs: sysv: Initialize filesystem timestamp ranges 2019-08-30 07:27:18 -07:00
tracefs simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems 2019-12-10 22:29:58 -05:00
ubifs This pull request contains fixes for UBI and UBIFS: 2020-04-07 12:40:56 -07:00
udf change email address for Pali Rohár 2020-04-10 15:36:22 -07:00
ufs y2038: add inode timestamp clamping 2019-09-19 09:42:37 -07:00
unicode .gitignore: add SPDX License Identifier 2020-03-25 11:50:48 +01:00
vboxsf fs: Add VirtualBox guest shared folder (vboxsf) support 2020-02-08 17:34:58 -05:00
verity fs-verity: use u64_to_user_ptr() 2020-01-14 13:28:28 -08:00
xfs xfs: move inode flush to the sync workqueue 2020-04-16 09:07:42 -07:00
zonefs zonfs: Fix handling of read-only zones 2020-03-25 11:28:26 +09:00
aio.c aio: prevent potential eventfd recursion on poll 2020-02-03 17:27:47 -07:00
anon_inodes.c Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
attr.c utimes: Clamp the timestamps in notify_change() 2019-12-08 19:10:50 -05:00
bad_inode.c
binfmt_aout.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
binfmt_elf_fdpic.c y2038: elfcore: Use __kernel_old_timeval for process times 2019-11-15 14:38:29 +01:00
binfmt_elf.c fs/binfmt_elf.c: don't free interpreter's ELF pheaders on common path 2020-04-07 10:43:44 -07:00
binfmt_em86.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
binfmt_flat.c fs/binfmt_flat.c: remove set but not used variable 'inode' 2019-07-16 19:23:22 -07:00
binfmt_misc.c Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
binfmt_script.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
block_dev.c block: remove unused header 2020-04-21 09:51:10 -06:00
buffer.c block-5.7-2020-04-24 2020-04-24 12:44:19 -07:00
char_dev.c chardev: Avoid potential use-after-free in 'chrdev_open()' 2020-01-06 20:10:26 +01:00
compat_binfmt_elf.c y2038: elfcore: Use __kernel_old_timeval for process times 2019-11-15 14:38:29 +01:00
compat.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
coredump.c coredump: fix null pointer dereference on coredump 2020-04-21 11:11:56 -07:00
d_path.c [PATCH] fix d_absolute_path() interplay with fsmount() 2019-08-30 19:31:09 -04:00
dax.c dax,iomap: Add helper dax_iomap_zero() to zero a range 2020-04-02 19:15:03 -07:00
dcache.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-12-08 11:08:28 -08:00
dcookies.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
direct-io.c fs/direct-io.c: include fs/internal.h for missing prototype 2020-01-04 13:55:09 -08:00
drop_caches.c fs: avoid softlockups in s_inodes iterators 2019-12-18 00:03:01 -05:00
eventfd.c eventfd: track eventfd_signal() recursion depth 2020-02-03 17:27:38 -07:00
eventpoll.c fs/epoll: make nesting accounting safe for -rt kernel 2020-04-07 10:43:44 -07:00
exec.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-04-02 11:22:17 -07:00
fcntl.c fcntl: Distribute switch variables for initialization 2020-03-03 10:55:06 -05:00
fhandle.c fs/handle.c - fix up kerneldoc 2019-08-07 21:51:47 -04:00
file_table.c vfs: Export flush_delayed_fput for use by knfsd. 2019-08-19 11:00:39 -04:00
file.c io_uring: make sure openat/openat2 honor rlimit nofile 2020-03-20 08:47:27 -06:00
filesystems.c fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once() 2020-04-10 15:36:22 -07:00
fs_context.c add prefix to fs_context->log 2020-02-07 14:48:35 -05:00
fs_parser.c fs_parse: remove pr_notice() about each validation 2020-04-02 09:35:26 -07:00
fs_pin.c switch the remnants of releasing the mountpoint away from fs_pin 2019-07-16 22:52:37 -04:00
fs_struct.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
fs_types.c
fs-writeback.c writeback: Export inode_io_list_del() 2020-06-03 23:16:49 -04:00
fsopen.c add prefix to fs_context->log 2020-02-07 14:48:35 -05:00
inode.c futex: Fix inode life-time issue 2020-03-06 11:06:15 +01:00
internal.h writeback: Export inode_io_list_del() 2020-06-03 23:16:49 -04:00
io_uring.c io_uring: punt splice async because of inode mutex 2020-05-01 08:50:57 -06:00
io-wq.c io_uring: use io-wq manager as backup task if task is exiting 2020-04-03 11:35:57 -06:00
io-wq.h io_uring: use io-wq manager as backup task if task is exiting 2020-04-03 11:35:57 -06:00
ioctl.c fibmap: Warn and return an error in case of block > INT_MAX 2020-04-30 07:57:46 -07:00
Kconfig exfat: add Kconfig and Makefile 2020-03-05 21:00:40 -05:00
Kconfig.binfmt binfmt_flat: make support for old format binaries optional 2019-06-24 09:16:47 +10:00
libfs.c libfs: fix infoleak in simple_attr_read() 2020-03-24 13:27:16 +01:00
locks.c locks: reinstate locks_delete_block optimization 2020-03-18 13:03:38 -07:00
Makefile exfat: add Kconfig and Makefile 2020-03-05 21:00:40 -05:00
mbcache.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
mount.h switch the remnants of releasing the mountpoint away from fs_pin 2019-07-16 22:52:37 -04:00
mpage.c fs: move guard_bio_eod() after bio_set_op_attrs 2020-01-09 08:16:12 -07:00
namei.c fix a braino in legitimize_path() 2020-04-06 10:38:59 -04:00
namespace.c LOOKUP_MOUNTPOINT: fold path_mountpointat() into path_lookupat() 2020-03-13 21:08:17 -04:00
no-block.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
nsfs.c fs/nsfs.c: Added ns_match 2020-03-12 17:33:11 -07:00
open.c Merge branch 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-04-02 12:30:08 -07:00
pipe.c mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page() 2020-04-02 09:35:28 -07:00
pnode.c propagate_one(): mnt_set_mountpoint() needs mount_lock 2020-04-27 10:37:14 -04:00
pnode.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 209 2019-05-30 11:29:53 -07:00
posix_acl.c fs/posix_acl.c: fix kernel-doc warnings 2020-01-04 13:55:09 -08:00
proc_namespace.c vfs: subtype handling moved to fuse 2019-09-06 21:28:49 +02:00
read_write.c powerpc: Add back __ARCH_WANT_SYS_LLSEEK macro 2020-04-03 00:09:59 +11:00
readdir.c readdir: make user_access_begin() use the real access range 2020-01-23 10:15:28 -08:00
select.c y2038: syscalls: change remaining timeval to __kernel_old_timeval 2019-11-15 14:38:29 +01:00
seq_file.c fs/seq_file.c: seq_read(): add info message about buggy .next functions 2020-04-10 15:36:22 -07:00
signalfd.c
splice.c splice: make do_splice public 2020-03-02 14:04:31 -07:00
stack.c sched/rt, fs: Use CONFIG_PREEMPTION 2019-12-08 14:37:36 +01:00
stat.c fs: make two stat prep helpers available 2020-01-20 17:03:54 -07:00
statfs.c vfs: Fix EOVERFLOW testing in put_compat_statfs64 2019-10-03 14:21:35 -07:00
super.c Fix use after free in get_tree_bdev() 2020-04-28 14:37:40 -07:00
sync.c fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback 2019-05-14 09:47:50 -07:00
timerfd.c timerfd: Make timerfd_settime() time namespace aware 2020-01-14 12:20:53 +01:00
userfaultfd.c userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally 2020-04-07 10:43:40 -07:00
utimes.c utimes: Clamp the timestamps in notify_change() 2019-12-08 19:10:50 -05:00
xattr.c kernfs: Add removed_size out param for simple_xattr_set 2020-03-16 15:53:47 -04:00