103 Commits

Author SHA1 Message Date
Kent Overstreet
3e1547116f bcachefs: x-macroize alloc_reserve enum
This makes an array of strings available, like our other enums.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:29 -04:00
Kent Overstreet
fa8e94faee bcachefs: Heap allocate printbufs
This patch changes printbufs dynamically allocate and reallocate a
buffer as needed. Stack usage has become a bit of a problem, and a major
cause of that has been static size string buffers on the stack.

The most involved part of this refactoring is that printbufs must now be
exited with printbuf_exit().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:25 -04:00
Kent Overstreet
5222a4607c bcachefs: BTREE_ITER_WITH_JOURNAL
This adds a new btree iterator flag, BTREE_ITER_WITH_JOURNAL, that is
automatically enabled when initializing a btree iterator before journal
replay has completed - it overlays the contents of the journal with the
btree.

This lets us delete bch2_btree_and_journal_walk() and just use the
normal btree iterator interface instead - which also lets us delete a
significant amount of duplicated code.

Note that BTREE_ITER_WITH_JOURNAL is still unoptimized in this patch -
we're redoing the binary search over keys in the journal every time we
call bch2_btree_iter_peek().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:21 -04:00
Kent Overstreet
d248ee5637 bcachefs: Add iter_flags arg to bch2_btree_delete_range()
Will be used by the new snapshot tests, to pass in
BTREE_ITER_ALL_SNAPSHOTS.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:20 -04:00
Kent Overstreet
13f914ecb9 bcachefs: Kill bch2_ec_mem_alloc()
bch2_ec_mem_alloc() was only used by GC, and there's no real need to
preallocate the stripes radix tree since we can cope fine with memory
allocation failure when we use the radix tree. This deletes a fair bit
of code, and it's also needed for the upcoming patch because
bch2_btree_iter_peek_prev() won't be working before journal replay
completes (and using it was incorrect previously, as well).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:20 -04:00
Kent Overstreet
abe19d458e bcachefs: Refactor open_bucket code
Prep work for adding a hash table of open buckets - instead of embedding
a bch_extent_ptr, we need to refer to the bucket directly so that we're
not calling sector_to_bucket() in the hash table lookup code, which has
an expensive divide.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:20 -04:00
Kent Overstreet
e409999069 bcachefs: Turn encoded_extent_max into a regular option
It'll now be handled at format time and in sysfs like other options - it
still can only be set at format time, though.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
Kent Overstreet
bf0fdb4d89 bcachefs: Don't erasure code cached ptrs
It doesn't make much sense to be erasure coding cached pointers, we
should be erasure coding one of the dirty pointers in an extent. This
patch makes sure we're passing BCH_WRITE_CACHED when we expect the new
pointer to be a cached pointer, and tweaks the write path to not
allocate from a stripe when BCH_WRITE_CACHED is set - and fixes an
assertion we were hitting in the ec path where when adding the stripe to
an extent and deleting the other pointers the pointer to the stripe
didn't exist (because dropping all dirty pointers from an extent turns
it into a KEY_TYPE_error key).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:18 -04:00
Kent Overstreet
990d42d187 bcachefs: Split out struct gc_stripe from struct stripe
We have two radix trees of stripes - one that mirrors some information
from the stripes btree in normal operation, and another that GC uses to
recalculate block usage counts.

The normal one is now only used for finding partially empty stripes in
order to reuse them - the normal stripes radix tree and the GC stripes
radix tree are used significantly differently, so this patch splits them
into separate types.

In an upcoming patch we'll be replacing c->stripes with a btree that
indexes stripes by the order we want to reuse them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:18 -04:00
Kent Overstreet
b547d005d5 bcachefs: Erasure coding fixes
When we added the stripe and stripe_redundancy fields to alloc keys, we
neglected to add them to the functions that convert back and forth with
the in-memory types.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:18 -04:00
Kent Overstreet
58e1ea4bcb bcachefs: Push c->mark_lock usage down to where it is needed
This changes the bch2_mark_key() and related paths to take mark lock
where it is needed, instead of taking it in the upper transaction commit
path - by pushing down locking we'll be able to handle fsck errors
locally instead of requiring a separate check in the btree_gc code for
replicas being marked.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:18 -04:00
Kent Overstreet
fc6c01e2ea bcachefs: Convert bucket_alloc_ret to negative error codes
Start a new header, errcode.h, for bcachefs-private error codes - more
error codes will be converted later.

This patch just converts bucket_alloc_ret so that they can be mixed with
standard error codes and passed as ERR_PTR errors - the ec.c code was
doing this already, but incorrectly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:17 -04:00
Kent Overstreet
6404dcc9c2 bcachefs: More enum strings
This patch converts more enums in the on disk format to our standard
x-macro-with-strings deal - to enable better pretty-printing.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:17 -04:00
Kent Overstreet
2debb1b875 bcachefs: BTREE_TRIGGER_INSERT now only means insert
This allows triggers to distinguish between a key entering the btree -
i.e. being called from the trans commit path - vs. being called on a key
that already exists, i.e. by GC.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:15 -04:00
Kent Overstreet
904823de49 bcachefs: Convert bch2_mark_key() to take a btree_trans *
This helps to unify the interface between bch2_mark_key() and
bch2_trans_mark_key() - and it also gives access to the journal
reservation and journal seq in the mark_key path.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:15 -04:00
Kent Overstreet
961b2d6282 bcachefs: Assorted ec fixes
- The backpointer that ec_stripe_update_ptrs() uses now needs to include
  the snapshot ID, which means we have to change where we add the
  backpointer to after getting the snapshot ID for the new extents

- ec_stripe_update_ptrs() needs to be calling bch2_trans_begin()

- improve error message in bch2_mark_stripe()

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:15 -04:00
Kent Overstreet
9a796fdb06 bcachefs: bch2_trans_exit() no longer returns errors
Now that peek_node()/next_node() are converted to return errors
directly, we don't need bch2_trans_exit() to return errors - it's
cleaner this way and wasn't used much anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:14 -04:00
Kent Overstreet
67e0dd8f0d bcachefs: btree_path
This splits btree_iter into two components: btree_iter is now the
externally visible componont, and it points to a btree_path which is now
reference counted.

This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:11 -04:00
Kent Overstreet
5f8077cca8 bcachefs: Kill BTREE_ITER_SET_POS_AFTER_COMMIT
BTREE_ITER_SET_POS_AFTER_COMMIT is used internally to automagically
advance extent btree iterators on sucessful commit.

But with the upcomnig btree_path patch it's getting more awkward to
support, and it adds overhead to core data structures that's only used
in a few places, and can be easily done by the caller instead.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:11 -04:00
Kent Overstreet
9f6bd30703 bcachefs: Reduce iter->trans usage
Disfavoured, and should go away.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:10 -04:00
Dan Robertson
33a1f84b76 bcachefs: ensure iter->should_be_locked is set
Ensure that iter->should_be_locked value is set to true before we
call bch2_trans_update in ec_stripe_update_ptrs.

Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:07 -04:00
Kent Overstreet
297d89343d bcachefs: Extensive triggers cleanups
- We no longer mark subsets of extents, they're marked like regular
   keys now - which means we can drop the offset & sectors arguments
   to trigger functions
 - Drop other arguments that are no longer needed anymore in various
   places - fs_usage
 - Drop the logic for handling extents in bch2_mark_update() that isn't
   needed anymore, to match bch2_trans_mark_update()
 - Better logic for hanlding the BTREE_ITER_CACHED_NOFILL case, where we
   don't have an old key to mark

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:07 -04:00
Kent Overstreet
bc3f8b25f3 bcachefs: Check for errors from bch2_trans_update()
Upcoming refactoring is going to change bch2_trans_update() to start
returning transaction restarts.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:05 -04:00
Kent Overstreet
ac1019d32b bcachefs: Clean up bch2_btree_and_journal_walk()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
7b7278bbaf bcachefs: Fix two btree iterator leaks
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:01 -04:00
Kent Overstreet
e0ba3b6429 bcachefs: Replace bch2_btree_iter_next() calls with bch2_btree_iter_advance
The way btree iterators work internally has been changing, particularly
with the iter->real_pos changes, and bch2_btree_iter_next() is no longer
hyper optimized - it's just advance followed by peek, so it's more
efficient to just call advance where we're not using the return value of
bch2_btree_iter_next().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:56 -04:00
Kent Overstreet
50dc0f692a bcachefs: Require all btree iterators to be freed
We keep running into occasional bugs with btree transaction iterators
overflowing - this will make those bugs more visible.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:56 -04:00
Kent Overstreet
b3b66e3044 bcachefs: Have fsck check for stripe pointers matching stripe
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00
Kent Overstreet
41f8b09edc bcachefs: Rename BTREE_ID enums for consistency with other enums
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00
Robbie Litchfield
0ef837a0cc bcachefs: Fix unnecessary read amplificaiton when allocating ec stripes
When allocating an erasure coding stripe, bcachefs will always reuse any
partial stripes before reserving a new stripe. This causes unnecessary
read amplification when preparing a stripe for writing. This patch changes
bcachefs to always reserve new stripes first, only relying on stripe reuse
when copygc needs more time to empty buckets from existing stripes.

Signed-off-by: Robbie Litchfield <blam.kiwi@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:53 -04:00
Kent Overstreet
7f4e1d5d0f bcachefs: KEY_TYPE_alloc_v2
This introduces a new version of KEY_TYPE_alloc, which uses the new
varint encoding introduced for inodes. This means we'll eventually be
able to support much larger bucket sizes (for SMR devices), and the
read/write time fields are expanded to 64 bits - which will be used in
the next patch to get rid of the periodic rescaling of those fields.

Also, for buckets that are members of erasure coded stripes, this adds
persistent fields for the index of the stripe they're members of and the
stripe redundancy. This is part of work to get rid of having to scan and
read into memory the alloc and stripes btrees at mount time.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:52 -04:00
Kent Overstreet
0093a50f27 bcachefs: Fix build in userspace
The userspace bch_err() macro doesn't use the filesystem argument. Could
also be fixed with a better macro.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:52 -04:00
Kent Overstreet
33ccd7188e bcachefs: Don't allocate stripes at POS_MIN
In the future, stripe index 0 will be a sentinal value. This patch
doesn't disallow stripes at POS_MIN yet, leaving that for when we do the
on disk format changes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:51 -04:00
Kent Overstreet
6c7585b098 bcachefs: Rework allocating buckets for stripes
Allocating buckets for existing stripes was busted, in part because the
data structures were too contorted. This reworks new stripes so that we
have an array of open buckets that matches blocks in the stripe, and
it's sparse if we're reusing an existing stripe.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:51 -04:00
Kent Overstreet
c6e658ee9f bcachefs: Preserve stripe blockcounts on existing stripes
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:51 -04:00
Kent Overstreet
6e53151b7b bcachefs: Kill stripe->dirty
This makes bch2_stripes_write() work more like bch2_alloc_write().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:51 -04:00
Kent Overstreet
ac95800629 bcachefs: Factor out bch2_ec_stripes_heap_start()
This fixes a bug where mark and sweep gc incorrectly was clearing out
the stripes heap and causing assertions to fire later - simpler to just
create the stripes heap after gc has finished.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:51 -04:00
Kent Overstreet
2a3731e34d bcachefs: Erasure coding fixes & refactoring
- Originally bch_extent_stripe_ptr didn't contain the block index,
   instead we'd have to search through the stripe pointers to figure out
   which pointer matched. When the block field was added to
   bch_extent_stripe_ptr, not all of the code was updated to use it.
   This patch fixes that, and we also now verify that field where it
   makes sense.

 - The ec_stripe_buf_init/exit() functions have been improved, and are
   now used by the bch2_ec_read_extent() (recovery read) path.

 - get_stripe_key() is now used by bch2_ec_read_extent().

 - We now have a getter and setter for checksums within a stripe, like
   we had previously for block sector counts, and ec_generate_checksums
   and ec_validate_checksums are now quite a bit smaller and cleaner.

ec.c still needs a lot of work, but this patch is slowly moving things
in the right direction.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:51 -04:00
Kent Overstreet
07a1006ae8 bcachefs: Reduce/kill BKEY_PADDED use
With various newer key types - stripe keys, inline data extents - the
old approach of calculating the maximum size of the value is becoming
more and more error prone. Better to switch to bkey_on_stack, which can
dynamically allocate if necessary to handle any size bkey.

In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:50 -04:00
Kent Overstreet
8deed5f4e5 bcachefs: Use separate new stripes for copygc and non-copygc
Allocations for copygc have to be kept separate from everything else,
so that copygc doesn't get starved.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:50 -04:00
Kent Overstreet
2c40a2403e bcachefs: Change allocations for ec stripes to blocking
We don't want writes to not get erasure coded just because the allocator
temporarily wasn't keeping up.

However, it's not guaranteed that these allocations will ever succeed,
we can currently get stuck - especially if devices are different sizes -
we still have work to do in this area.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:50 -04:00
Kent Overstreet
81d8599e19 bcachefs: Don't read existing stripes synchronously in write path
Previously, in the stripe creation path, when reusing an existing stripe
we'd read the existing stripe synchronously - ouch.

Now, we allocate two stripe bufs if we're using an existing stripe, so
that we can do the read asynchronously - and, we read the full stripe so
that we can run recovery, if necessary.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:50 -04:00
Kent Overstreet
3187aa8d57 bcachefs: Don't use BTREE_INSERT_USE_RESERVE so much
Previously, we were using BTREE_INSERT_RESERVE in a lot of places where
it no longer makes sense.

 - we now have more open_buckets than we used to, and the reserves work
   better, so we shouldn't need to use BTREE_INSERT_RESERVE just because
   we're holding open_buckets pinned anymore.

 - We have the btree key cache for updates to the alloc btree, meaning
   we no longer need the btree reserve to ensure the allocator can make
   forward progress.

This means that we should only need a reserve for btree updates to
ensure that copygc can make forward progress.

Since it's now just for copygc, we can also fold RESERVE_BTREE into
RESERVE_MOVINGGC (the allocator's freelist reserve).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:50 -04:00
Kent Overstreet
ffb7c3d370 bcachefs: Add BCH_BKEY_PTRS_MAX
This now means "the maximum number of pointers within a bkey" - and
bch_devs_list is updated to use it instead of BCH_REPLICAS_MAX, since
stripes can contain more than BCH_REPLICAS_MAX pointers.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:50 -04:00
Kent Overstreet
66bddc6c2b bcachefs: Only try to get existing stripe once in stripe create path
The stripe creation path was too state-machiney: it would always run the
full state machine until it had succesfully created a new stripe.

But if we tried to get and reuse an existing stripe after we'd already
allocated some buckets, the buckets we'd allocated might have conflicted
with the blocks in the existing stripe we need to keep - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:49 -04:00
Kent Overstreet
0fefe8d8ef bcachefs: Improve some IO error messages
it's useful to know whether an error was for a read or a write - this
also standardizes error messages a bit more.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:49 -04:00
Kent Overstreet
b735d73a00 bcachefs: Build fixes for 32bit x86
PAGE_SIZE and size_t are not unsigned longs on 32 bit, annoying...

also switch to atomic64_cmpxchg instead of cmpxchg() for
journal_seq_copy, as atomic64_cmpxchg has a fallback that uses spinlocks
for when it's not supported.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:46 -04:00
Kent Overstreet
af4d05c46b bcachefs: Account for stripe parity sectors separately
Instead of trying to charge EC parity to the data within the stripe
(which is subject to rounding errors), let's charge it to the stripe
itself. It should also make -ENOSPC issues easier to deal with if we
charge for parity blocks up front, and means we can also make more fine
grained accounting available to the user.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:45 -04:00
Kent Overstreet
8d6b6222bf bcachefs: Improvements to writing alloc info
Now that we've got transactional alloc info updates (and have for
awhile), we don't need to write it out on shutdown, and we don't need to
write it out on startup except when GC found errors - this is a big
improvement to mount/unmount performance.

This patch also fixes a few bugs where we weren't writing out alloc
info (on new filesystems, and new devices) and should have been.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:44 -04:00
Kent Overstreet
f3721e12d0 bcachefs: Perf improvements for bch_alloc_read()
On large filesystems reading in the alloc info takes a significant
amount of time. But we don't need to be calling into the fully general
bch2_mark_key() path, just open code what we need in
bch2_alloc_read_fn().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:44 -04:00