3819 Commits

Author SHA1 Message Date
Kent Overstreet
5b9bc272e6 bcachefs: Kill writing old accounting to journal
More ripping out of the old disk space accounting.

Note that the new disk space accounting is incompatible with the old,
and writing out old style disk space accounting with the new code is
infeasible.

This means upgrading and downgrading past this version requires
regenerating accounting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
3afb8dbf03 bcachefs: kill bch2_fs_usage_read()
With bch2_ioctl_fs_usage(), this is now dead code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
6b39638b84 bcachefs: Convert bch2_ioctl_fs_usage() to new accounting
This converts bch2_ioctl_fs_usage() to read from the new disk
accounting, via bch2_fs_replicas_usage_read().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
72a6bb098c bcachefs: Kill bch2_fs_usage_initialize()
Deleting code for the old disk accounting scheme.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
f5095b9f85 bcachefs: dev_usage updated by new accounting
Reading disk accounting now requires an eytzinger lookup (see:
bch2_accounting_mem_read()), but the per-device counters are used
frequently enough that we'd like to still be able to read them with just
a percpu sum, as in the old code.

This patch special cases the device counters; when we update in-memory
accounting we also update the old style percpu counters if it's a deice
counter update.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
2e8d686a4a bcachefs: Coalesce accounting keys before journal replay
This fixes a performance regression in journal replay; without
colaescing accounting keys we have multiple keys at the same position,
which means journal_keys_peek_upto() has to skip past many overwritten
keys - turning journal replay into an O(n^2) algorithm.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
1d16c605cc bcachefs: Disk space accounting rewrite
Main part of the disk accounting rewrite.

This is a wholesale rewrite of the existing disk space accounting, which
relies on percepu counters that are sharded by journal buffer, and
rolled up and added to each journal write.

With the new scheme, every set of counters is a distinct key in the
accounting btree; this fixes scaling limitations of the old scheme,
where counters took up space in each journal entry and required multiple
percpu counters.

Now, in memory accounting requires a single set of percpu counters - not
multiple for each in flight journal buffer - and in the future we'll
probably also have counters that don't use in memory percpu counters,
they're not strictly required.

An accounting update is now a normal btree update, using the btree write
buffer path. At transaction commit time, we apply accounting updates to
the in memory counters, which are percpu counters indexed in an
eytzinger tree by the accounting key.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
5d9667d1d6 bcachefs: btree write buffer knows how to accumulate bch_accounting keys
Teach the btree write buffer how to accumulate accounting keys - instead
of having the newer key overwrite the older key as we do with other
updates, we need to add them together.

Also, add a flag so that write buffer flush knows when journal replay is
finished flushing accounting, and teach it to hold accounting keys until
that flag is set.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
9dec2a473b bcachefs: Accumulate accounting keys in journal replay
Until accounting keys hit the btree, they are deltas, not new versions
of the existing key; this means we have to teach journal replay to
accumulate them.

Additionally, the journal doesn't track precisely which entries have
been flushed to the btree; it only tracks a range of entries that may
possibly still need to be flushed.

That means we need to compare accounting keys against the version in the
btree and only flush updates that are newer.

There's another wrinkle with the write buffer: if the write buffer
starts flushing accounting keys before journal replay has finished
flushing accounting keys, journal replay will see the version number
from the new updates and updates from the journal will be lost.

To avoid this, journal replay has to flush accounting keys first, and
we'll be adding a flag so that write buffer flush knows to hold
accounting keys until then.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Kent Overstreet
2744e5c9eb bcachefs: KEY_TYPE_accounting
New key type for the disk space accounting rewrite.

 - Holds a variable sized array of u64s (may be more than one for
   accounting e.g. compressed and uncompressed size, or buckets and
   sectors for a given data type)

 - Updates are deltas, not new versions of the key: this means updates
   to accounting can happen via the btree write buffer, which we'll be
   teaching to accumulate deltas.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:13 -04:00
Thomas Bertschinger
929d954330 bcachefs: use new mount API
This updates bcachefs to use the new mount API:

- Update the file_system_type to use the new init_fs_context()
  function.

- Define the new fs_context_operations functions.

- No longer register bch2_mount() and bch2_remount(); these are now
  called via the new fs_context functions.

- Define a new helper type, bch2_opts_parse that includes a struct
  bch_opts and additionally a printbuf used to save options that can't
  be parsed until after the FS is opened. This enables us to parse as
  many options as possible prior to opening the filesystem while saving
  those options that need the open FS for later parsing.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
1c12d1caf8 bcachefs: Add error code to defer option parsing
This introduces a new error code, option_needs_open_fs, which is used to
indicate that an attempt was made to parse a mount option prior to
opening a filesystem, when that mount option requires an open filesystem
in order to be validated.

Returning this error results in bch2_parse_one_mount_opt() saving that
option for later parsing, after the filesystem is opened.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
9b7f0b5d3d bcachefs: add printbuf arg to bch2_parse_mount_opts()
Mount options that take the name of a device that may be part of a
filesystem, for example "metadata_target", cannot be validated until
after the filesystem has been opened. However, an attempt to parse those
options may be made prior to the filesystem being opened.

This change adds a printbuf parameter to bch2_parse_mount_opts() which
will be used to save those mount options, when they are supplied prior
to the FS being opened, so that they can be parsed later.

This functionality is not currently needed, but will be used after
bcachefs starts using the new mount API to parse mount options. This is
because using the new mount API, we will process mount options prior to
opening the FS, but the new API doesn't provide a convenient way to
"replay" mount option parsing. So we save these options ourselves to
accomplish this.

This change also splits out the code to parse a single option into
bch2_parse_one_mount_opt(), which will be useful when using the new
mount API which deals with a single mount option at a time.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
7773df19c3 bcachefs: metadata version bucket_stripe_sectors
New on disk format version for bch_alloc->stripe_sectors and
BCH_DATA_unstriped - accounting for unstriped data in stripe buckets.

Upgrade/downgrade requires regenerating alloc info - but only if erasure
coding is in use.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
2612e29142 bcachefs: BCH_DATA_unstriped
Add a new pseudo data type, to track buckets that are members of a
stripe, but have unstriped data in them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
55f7962da3 bcachefs: bch_alloc->stripe_sectors
Add a separate counter to bch_alloc_v4 for amount of striped data; this
lets us separately track striped and unstriped data in a bucket, which
lets us see when erasure coding has failed to update extents with stripe
pointers, and also find buckets to continue updating if we crash mid way
through creating a new stripe.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
c13d526d9d bcachefs: check_key_has_inode()
Consolidate duplicated checks for extents/dirents/xattrs - these keys
should all have a corresponding inode of the correct type.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
51fc436c80 bcachefs: allow passing full device path for target options
The output of mount options such as "metadata_target" in `/proc/mounts`
uses the full path to the device.

mount(8) from util-linux uses the output from `/proc/mounts` to pass
existing mount options when performing a remount, so bcachefs should
accept as input the same form that it prints as output.

Without this change:

$ mount -t bcachefs -o metadata_target=vdb /dev/vdb /mnt
$ strace mount -o remount /mnt
...
fsconfig(4, FSCONFIG_SET_STRING, "metadata_target", "/dev/vdb", 0) = -1 EINVAL (Invalid argument)
...

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
3811f48aa3 bcachefs: bch2_printbuf_strip_trailing_newline()
Add a new helper to fix inode_to_text()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
babe30fe8d bcachefs: don't expose "read_only" as a mount option
When "read_only" is exposed as a mount option, it is redundant with the
standard option "ro" and gives users multiple ways to specify that a
bcachefs filesystem should be mounted read-only. This presents the risk
of having inconsistent options specified.

This can be seen when remounting a read-only filesystem in read-write
mode, using mount(8) from util-linux. Because mount(8) parses the
existing mount options from `/proc/mounts` and applies them when
remounting, it can end up applying both "read_only" and "rw":

$ mount img -o ro /mnt
$ strace mount -o remount,rw /mnt
...
fsconfig(4, FSCONFIG_SET_FLAG, "read_only", NULL, 0) = 0
fsconfig(4, FSCONFIG_SET_FLAG, "rw", NULL, 0) = 0
...

Making "read_only" no longer a mount option means this edge case cannot
occur.

Fixes: 62719cf33c3a ("bcachefs: Fix nochanges/read_only interaction")
Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Thomas Bertschinger
03ec0927fa bcachefs: make offline fsck set read_only fs flag
A subsequent change will remove "read_only" as a mount option in favor
of the standard option "ro", meaning the userspace fsck command cannot
pass it to the fsck ioctl. Instead, in offline fsck, set "read_only"
kernel-side without trying to parse it as a mount option.

For compatibility with versions of the "bcachefs fsck" command that try
to pass the "read_only" mount opt, remove it from the mount options
string prior to parsing when it is present.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
652bc7fabc bcachefs: btree_ptr_sectors_written() now takes bkey_s_c
this is for the userspace metadata dump tool

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
9cc8eb3098 bcachefs: Check for bsets past bch_btree_ptr_v2.sectors_written
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Uros Bizjak
68573b936d bcachefs: Use try_cmpxchg() family of functions instead of cmpxchg()
Use try_cmpxchg() family of functions instead of
cmpxchg (*ptr, old, new) == old. x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg
(and related move instruction in front of cmpxchg).

Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when
cmpxchg fails. There is no need to re-read the value in the loop.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
e76a2b65b0 bcachefs: add might_sleep() annotations for fsck_err()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
546b65378d bcachefs: fix missing include
fs-common.h needs dirent.h for enum bch_rename_mode

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Youling Tang
630d565dda bcachefs: Use filemap_read() to simplify the execution flow
Using filemap_read() can reduce unnecessary code execution
for non IOCB_DIRECT paths.

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Youling Tang
da6fa380d3 bcachefs: Align the display format of btrees/inodes/keys
Before patch:
```
 #cat btrees/inodes/keys
 u64s 17 type inode_v3 0:4096:U32_MAX len 0 ver 0:   mode=40755
   flags= (16300000)
   bi_size=0
```

After patch:
```
 #cat btrees/inodes/keys
 u64s 17 type inode_v3 0:4096:U32_MAX len 0 ver 0:
   mode=40755
   flags=(16300000)
   bi_size=0
```

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Youling Tang
12e7ff1a1e bcachefs: Fix missing spaces in journal_entry_dev_usage_to_text
Fixed missing spaces displayed in journal_entry_dev_usage_to_text
while adjusting the display format to improve readability.

before:
```
 # bcachefs list_journal -a -t alloc:1:0 /dev/sdb
 ...
     dev_usage: dev=0free: buckets=233180 sectors=0 fragmented=0sb: buckets=13 sectors=6152 fragmented=504journal: buckets=1847 sectors=945664 fragmented=0btree: buckets=20 sectors=10240 fragmented=0user: buckets=1419 sectors=726513 fragmented=15cached: buckets=0 sectors=0 fragmented=0parity: buckets=0 sectors=0 fragmented=0stripe: buckets=0 sectors=0 fragmented=0need_gc_gens: buckets=0 sectors=0 fragmented=0need_discard: buckets=1 sectors=0 fragmented=0
```

after:
```
 # bcachefs list_journal -a -t alloc:1:0 /dev/sdb
 ...
     dev_usage: dev=0
       free: buckets=233180 sectors=0 fragmented=0
       sb: buckets=13 sectors=6152 fragmented=504
       journal: buckets=1847 sectors=945664 fragmented=0
       btree: buckets=20 sectors=10240 fragmented=0
       user: buckets=1419 sectors=726513 fragmented=15
       cached: buckets=0 sectors=0 fragmented=0
       parity: buckets=0 sectors=0 fragmented=0
       stripe: buckets=0 sectors=0 fragmented=0
       need_gc_gens: buckets=0 sectors=0 fragmented=0
       need_discard: buckets=1 sectors=0 fragmented=0
```
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:12 -04:00
Kent Overstreet
f369de8267 bcachefs: fix ei_update_lock lock ordering
ei_update_lock is largely vestigal and will probably be removed, but
we're not ready for that just yet.

this fixes some lockdep splats with the new lockdep support for btree
node locks; they're harmless, since we were taking ei_update_lock before
actually locking any btree nodes, but "any btree nodes locked" are now
tracked at the btree_trans level.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
cdda2126ab bcachefs: bch2_btree_reserve_cache_to_text()
Add a pretty printer so the btree reserve cache can be seen in sysfs; as
it pins open_buckets we need it for tracking down open_buckets issues.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
d06a26d24d bcachefs: sysfs trigger_freelist_wakeup
another debugging knob

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
a1e7a97f22 bcachefs: sysfs internal/trigger_journal_writes
another debugging knob - trigger the journal to do ready journal writes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
26a170aa61 bcachefs: add capacity, reserved to fs_alloc_debug_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
8a3c8303e2 bcachefs: uninline fallocate functions
better stack traces

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
52fd0f9620 bcachefs: btree ids are 64 bit bitmasks
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
3de8fd4a33 bcachefs: Print allocator stuck on timeout in fallocate path
same as in io_write.c, if we're waiting on the allocator for an
excessive amount of time, print what's going on

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-14 19:00:11 -04:00
Kent Overstreet
1841027c7d bcachefs: bch2_gc_btree() should not use btree_root_lock
btree_root_lock is for the root keys in btree_root, not the pointers to
the nodes themselves; this fixes a lock ordering issue between
btree_root_lock and btree node locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-11 20:10:55 -04:00
Kent Overstreet
f236ea4bca bcachefs: Set PF_MEMALLOC_NOFS when trans->locked
proper lock ordering is: fs_reclaim -> btree node locks

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-11 20:10:55 -04:00
Kent Overstreet
f0f3e51148 bcachefs; Use trans_unlock_long() when waiting on allocator
not using unlock_long() blocks key cache reclaim, and the allocator may
take awhile

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-11 20:10:55 -04:00
Kent Overstreet
aacd897d4d Revert "bcachefs: Mark bch_inode_info as SLAB_ACCOUNT"
This reverts commit 86d81ec5f5f05846c7c6e48ffb964b24cba2e669.

This wasn't tested with memcg enabled, it immediately hits a null ptr
deref in list_lru_add().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-11 20:01:38 -04:00
Kent Overstreet
fd80d14005 bcachefs: fix scheduling while atomic in break_cycle()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 12:59:28 -04:00
Kent Overstreet
6f692b1672 bcachefs: Fix RCU splat
Reported-by: syzbot+e74fea078710bbca6f4b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 12:46:22 -04:00
Kent Overstreet
7d7f71cd87 bcachefs: Add missing bch2_trans_begin()
this fixes a 'transaction should be locked' error in backpointers fsck

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
0f6f8f7693 bcachefs: Fix missing error check in journal_entry_btree_keys_validate()
Closes: https://syzkaller.appspot.com/bug?extid=8996d8f176cf946ef641
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
f49d2c9835 bcachefs: Warn on attempting a move with no replicas
Instead of popping an assert in bch2_write(), WARN and print out some
debugging info.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
ad8b68cd39 bcachefs: bch2_data_update_to_text()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
0f1f7324da bcachefs: Log mount failure error code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Kent Overstreet
8ed58789fc bcachefs: Fix undefined behaviour in eytzinger1_first()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00
Youling Tang
86d81ec5f5 bcachefs: Mark bch_inode_info as SLAB_ACCOUNT
After commit 230e9fc28604 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9a0 ("kmemcg:
account for certain kmem allocations to memcg")

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-07-10 09:53:39 -04:00