linux/fs/ocfs2
Heming Zhao 4eb7b93e03 ocfs2: improve write IO performance when fragmentation is high
The group_search function ocfs2_cluster_group_search() should
bypass groups with insufficient space to avoid unnecessary
searches.

This patch is particularly useful when ocfs2 is handling huge
number small files, and volume fragmentation is very high.
In this case, ocfs2 is busy with looking up available la window
from //global_bitmap.

This patch introduces a new member in the Group Description (gd)
struct called 'bg_contig_free_bits', representing the max
contigous free bits in this gd. When ocfs2 allocates a new
la window from //global_bitmap, 'bg_contig_free_bits' helps
expedite the search process.

Let's image below path.

1. la state (->local_alloc_state) is set THROTTLED or DISABLED.

2. when user delete a large file and trigger
   ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state
   unconditionally.

3. a write IOs thread run and trigger the worst performance path

```
ocfs2_reserve_clusters_with_limit
 ocfs2_reserve_local_alloc_bits
  ocfs2_local_alloc_slide_window //[1]
   + ocfs2_local_alloc_reserve_for_window //[2]
   + ocfs2_local_alloc_new_window //[3]
      ocfs2_recalc_la_window
```

[1]:
will be called when la window bits used up.

[2]:
under la state is ENABLED, and this func only check global_bitmap
free bits, it will succeed in general.

[3]:
will use the default la window size to search clusters then fail.
ocfs2_recalc_la_window attempts other la window sizes.
the timing complexity is O(n^4), resulting in a significant time
cost for scanning global bitmap. This leads to a dramatic slowdown
in write I/Os (e.g., user space 'dd').

i.e.
an ocfs2 partition size: 1.45TB, cluster size: 4KB,
la window default size: 106MB.
The partition is fragmentation by creating & deleting huge mount of
small files.

before this patch, the timing of [3] should be
(the number got from real world):
- la window size change order (size: MB):
  106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
  only 0.8MB succeed, 0.8MB also triggers la window to disable.
  ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
  runs in worst case.
- group chain number: 242
  ocfs2_claim_suballoc_bits calls for-loop 242 times
- each chain has 49 block group
  ocfs2_search_chain calls while-loop 49 times
- each bg has 32256 blocks
  ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
  for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
  (32256/64) (this is not worst value) for timing calucation.

the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)

In the worst case, user space writes 1MB data will trigger 42M scanning
times.

under this patch, the timing is '7*242*49 = 83006', reduced by three
orders of magnitude.

Link: https://lkml.kernel.org/r/20240328125203.20892-2-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 21:07:03 -07:00
..
cluster ocfs2: port block device access to file 2024-02-25 12:05:26 +01:00
dlm ocfs2/dlm: use bitmap API instead of hand-writing it 2022-11-18 13:55:06 -08:00
dlmfs ocfs2: remove SLAB_MEM_SPREAD flag usage 2024-03-14 09:17:29 -07:00
acl.c ocfs2: convert to new timestamp accessors 2023-10-18 14:08:24 +02:00
acl.h fs: port ->set_acl() to pass mnt_idmap 2023-01-19 09:24:27 +01:00
alloc.c fs: convert block_write_full_page to block_write_full_folio 2023-12-29 11:58:35 -08:00
alloc.h
aops.c ocfs2: remove writepage implementation 2023-12-29 11:58:35 -08:00
aops.h
blockcheck.c
blockcheck.h
buffer_head_io.c ocfs2: fix a spelling typo in comment 2023-11-01 12:46:59 -07:00
buffer_head_io.h
dcache.c ocfs2_find_match(): there's no such thing as NULL or negative ->d_parent 2023-12-21 12:53:30 -05:00
dcache.h
dir.c __ocfs2_add_entry(), ocfs2_prepare_dir_for_insert(): namelen checks 2023-12-21 12:53:21 -05:00
dir.h
dlmglue.c ocfs2: spelling fix 2024-02-22 15:38:51 -08:00
dlmglue.h
export.c dlm: implement EXPORT_OP_ASYNC_LOCK 2023-11-16 11:59:19 -06:00
export.h
extent_map.c
extent_map.h
file.c ocfs2: enable ocfs2_listxattr for special files 2024-03-14 09:17:29 -07:00
file.h fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
filecheck.c
filecheck.h
heartbeat.c ocfs2: fix a typo in a comment 2022-07-29 18:12:36 -07:00
heartbeat.h
inode.c ocfs2: convert to new timestamp accessors 2023-10-18 14:08:24 +02:00
inode.h quota: Properly annotate i_dquot arrays with __rcu 2024-02-08 12:04:59 +01:00
ioctl.c ocfs2: update inode ctime in ocfs2_fileattr_set 2024-04-25 21:07:01 -07:00
ioctl.h fs: port ->fileattr_set() to pass mnt_idmap 2023-01-19 09:24:27 +01:00
journal.c ocfs2: annotate struct ocfs2_replay_map with __counted_by 2023-10-18 14:43:21 -07:00
journal.h ocfs2: use flexible array in 'struct ocfs2_recovery_map' 2023-08-18 10:18:57 -07:00
Kconfig fs: add CONFIG_BUFFER_HEAD 2023-08-02 09:13:09 -06:00
localalloc.c ocfs2: correctly use ocfs2_find_next_zero_bit() 2024-04-25 21:07:01 -07:00
localalloc.h
locks.c ocfs2: adapt to breakup of struct file_lock 2024-02-05 13:11:43 +01:00
locks.h
Makefile
mmap.c
mmap.h
move_extents.c ocfs2: improve write IO performance when fragmentation is high 2024-04-25 21:07:03 -07:00
move_extents.h
namei.c ocfs2: Avoid touching renamed directory if parent does not change 2023-11-25 02:53:20 -05:00
namei.h
ocfs1_fs_compat.h
ocfs2_fs.h ocfs2: improve write IO performance when fragmentation is high 2024-04-25 21:07:03 -07:00
ocfs2_ioctl.h
ocfs2_lockid.h
ocfs2_lockingver.h
ocfs2_trace.h ocfs2: remove writepage implementation 2023-12-29 11:58:35 -08:00
ocfs2.h ocfs2: always read both high and low parts of dinode link count 2022-12-11 19:30:19 -08:00
quota_global.c quota: Set nofs allocation context when acquiring dqio_sem 2024-01-23 19:21:11 +01:00
quota_local.c quota: Set nofs allocation context when acquiring dqio_sem 2024-01-23 19:21:11 +01:00
quota.h
refcounttree.c ocfs2: convert to new timestamp accessors 2023-10-18 14:08:24 +02:00
refcounttree.h
reservations.c ocfs2: correctly use ocfs2_find_next_zero_bit() 2024-04-25 21:07:01 -07:00
reservations.h
resize.c ocfs2: improve write IO performance when fragmentation is high 2024-04-25 21:07:03 -07:00
resize.h
slot_map.c ocfs2: Annotate struct ocfs2_slot_info with __counted_by 2023-10-02 09:48:52 -07:00
slot_map.h
stack_o2cb.c ocfs2: use bitmap API in fill_node_map 2022-11-18 13:55:06 -08:00
stack_user.c ocfs2: adapt to breakup of struct file_lock 2024-02-05 13:11:43 +01:00
stackglue.c fs: Remove the now superfluous sentinel elements from ctl_table array 2023-12-28 04:57:57 -08:00
stackglue.h
suballoc.c ocfs2: improve write IO performance when fragmentation is high 2024-04-25 21:07:03 -07:00
suballoc.h ocfs2: improve write IO performance when fragmentation is high 2024-04-25 21:07:03 -07:00
super.c - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min 2024-03-14 18:03:09 -07:00
super.h
symlink.c ocfs2: Convert ocfs2 to read_folio 2022-05-09 16:21:46 -04:00
symlink.h
sysfile.c
sysfile.h
uptodate.c
uptodate.h
xattr.c vfs-6.7.ctime 2023-10-30 09:47:13 -10:00
xattr.h ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata 2023-10-09 16:24:20 +02:00