IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
bch2_trans_begin() invalidates all iterators, until they're revalidated
by calling peek() or traverse().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The error code when we fail to allocate a node in the btree node cache
doesn't make it to bch2_btree_path_traverse_all(). Instead, we need to
stash a flag in btree_trans so we know we have to take the cannibalize
lock.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The loop that traverses paths in traverse_all() needs to be a little bit
tricky, because traversing a path can cause other paths to be added (or
perhaps removed) at about the same position.
The old logic was buggy, replace it with simpler logic.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
__bch2_btree_node_lock() was implementing the wrong lock ordering for
cached vs. non cached paths - this fixes it to match the btree path sort
order as defined by __btree_path_cmp(), and also simplifies the code
some.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This consolidates some of the btree node lock path, so that when we're
blocked taking a write lock on a node it shows up in
bch2_btree_trans_to_text(), along with intent and read locks.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The old .debugcheck methods are no more and this just calls the .invalid
method, which doesn't add much since we already check that when doing
btree updates and when reading metadata in.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- Updates to non key cache iterators will now be transparently
redirected to the key cache for cached btrees.
- Except when creating new keys: then the update goes to underlying
btree
For for iterating over a cached btree to work, we need to ensure that if
a key exists in the key cache, it also exists in the btree - otherwise
the iterator code will skip past it and not check the key cache.
Otherwise, for consistency, all updates should go to the same place -
the key cache.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is the start of cache coherency with the btree key cache - this
adds a btree iterator flag that causes lookups to also check the key
cache when we're iterating over the btree (not iterating over the key
cache).
Note that we could still race with another thread creating at item in
the key cache and updating it, since we aren't holding the key cache
locked if it wasn't found. The next patch for the update path will
address this by causing the transaction to restart if the key cache is
found to be dirty.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, when doing updates and running triggers before journal
replay completes, triggers would see the incorrect key for the old key
being overwritten - this patch updates the trigger code to check the
journal keys when necessary, needed for the upcoming allocator rewrite.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We currently need to call bch2_btree_path_peek_slot() multiple times in
the transaction commit path - and some of those need to be updated to
also check the keys from journal replay, too. Let's consolidate this and
stash the key being overwritten in btree_insert_entry.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add a new helper that returns true if the given btree ID uses the btree
key cache. This enables some new cleanups, since the helper can check
the options for whether caching is enabled on a given btree.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
More prep work for getting rid of the in-memory bucket array: now that
we have BTREE_ITER_WITH_JOURNAL, the allocator code can do ntree lookups
before journal replay is finished, and there's no longer any need for it
to get allocation information from the in-memory bucket array.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
With BTREE_ITER_FILTER_SNAPSHOTS, we have to distinguish between the
path where the key was found, and the path for inserting into the
current snapshot. This adds a new field to struct btree_iter for saving
a path for the current snapshot, and plumbs it through
bch2_trans_update().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This splits bch2_btree_iter() up into two functions: an inner function
that handles BTREE_ITER_WITH_JOURNAL, BTREE_ITER_WITH_UPDATES, and
iterating acrcoss leaf nodes, and an outer one that implements
BTREE_ITER_FILTER_SNAPHSOTS.
This is prep work for remember a btree_path at our update position in
BTREE_ITER_FILTER_SNAPSHOTS mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This improves the transaction restart tracepoints - adding distinct
tracepoints for all the locations and reasons a transaction might have
been restarted, and ensures that there's a tracepoint for every
transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Symbol decoding, via %ps, isn't supported in userspace - this will also
be faster when we're using trans->fn in the fast path, as with the new
BCH_JSET_ENTRY_log journal messages.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new btree iterator flag, BTREE_ITER_WITH_JOURNAL, that is
automatically enabled when initializing a btree iterator before journal
replay has completed - it overlays the contents of the journal with the
btree.
This lets us delete bch2_btree_and_journal_walk() and just use the
normal btree iterator interface instead - which also lets us delete a
significant amount of duplicated code.
Note that BTREE_ITER_WITH_JOURNAL is still unoptimized in this patch -
we're redoing the binary search over keys in the journal every time we
call bch2_btree_iter_peek().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a flag to not mark the initial btree_path as preserve, for
paths that we expect to be cheap to reconstitute if necessary - this
solves a btree_path overflow caused by need_whiteout_for_snapshot().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This fixes some bugs when we hit an error very early in the filesystem
startup path, before most things have been initialized.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This puts the btree_transactions sysfs/debugfs file behind a separate
config option - it's highly useful, but not cheap enough to enable
permenantly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new assertion to be used by bch2_inode_update_after_write(),
which updates the VFS inode based on the update to the btree inode we
just did - we require that the btree inode still be locked when we do
that update.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- We should only be clearing should_be_locked in btree_path_set_pos() -
it's the responsiblity of the btree_path code, not the btree_iter
code.
- bch2_path_put() needs to pay attention to path->should_be_locked, to
ensure we don't drop locks we're supposed to be keeping.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The bch2_btree_path_upgrade() call was failing and tripping an assert -
path->level + 1 is in this case not necessarily exactly what we want,
fix it by upgrading exactly the locks we want.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This helps to unify the interface between bch2_mark_key() and
bch2_trans_mark_key() - and it also gives access to the journal
reservation and journal seq in the mark_key path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
But we don't need to call it from outside the btree iterator code
anymore, since it's called by bch2_trans_begin() and
bch2_btree_path_traverse().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Code that uses for_each_btree_key often wants transaction restarts to be
handled locally and not returned. Originally, we wouldn't return
transaction restarts if there was a single iterator in the transaction -
the reasoning being if there weren't other iterators being invalidated,
and the current iterator was being advanced/retraversed, there weren't
any locks or iterators we were required to preserve.
But with the btree_path conversion that approach doesn't work anymore -
even when we're using for_each_btree_key() with a single iterator there
will still be two paths in the transaction, since we now always preserve
the path at the pos the iterator was initialized at - the reason being
that on restart we often restart from the same place.
And it turns out there's now a lot of for_each_btree_key() uses that _do
not_ want transaction restarts handled locally, and should be returning
them.
This patch splits out for_each_btree_key_norestart() and
for_each_btree_key_continue_norestart(), and converts existing users as
appropriate. for_each_btree_key(), for_each_btree_key_continue(), and
for_each_btree_node() now handle transaction restarts themselves by
calling bch2_trans_begin() when necessary - and the old hack to not
return transaction restarts when there's a single path in the
transaction has been deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that peek_node()/next_node() are converted to return errors
directly, we don't need bch2_trans_exit() to return errors - it's
cleaner this way and wasn't used much anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This changes for_each_btree_node() to work like for_each_btree_key(),
and to that end bch2_btree_iter_peek_node() and next_node() also return
error ptrs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- check for getting to the end of the btree in bch2_path_verify_locks
and __btree_path_traverse_all(), this fixes an infinite loop in
__btree_path_traverse_all().
- relax requirement in bch2_btree_node_upgrade() that we must want an
intent lock, this fixes bugs with paths that point to interior nodes
(nonzero level).
- bch2_btree_node_update_key(): fix it to upgrade the path to an intent
lock, if necessary
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Btree node iterators want the interior btree_path to point to the same
pos as the returned btree node - this fixes a regression from the
introduction of btree_path, where rewriting/updating keys of btree nodes
(e.g. in bch2_dev_metadata_drop()) via btree node iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that all the existing code has been converted for snapshots, this
patch changes the code for initializing a btree iterator to require a
snapshot to be specified, and also change bkey_invalid() to allow for
non U32_MAX snapshot IDs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
For snapshots, we need to implement btree lookups that return the first
key that's an ancestor of the snapshot ID the lookup is being done in -
and filter out keys in unrelated snapshots. This patch adds the btree
iterator flag BTREE_ITER_FILTER_SNAPSHOTS which does that filtering.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Figured out the bug we were chasing, and it had nothing to do with
locking btree iterators/paths out of order.
This reverts commit ff08733dd298c969aec7c7828095458f73fd5374.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We need to take all needed intent locks when relocking an iterator:
bch2_btree_path_traverse() had a special cased, faster version of this,
but it really should be in up_until_good_node() so that set_pos() can
use it too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch significantly reduces the number of btree lookups required in
the extent update path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
New rule is: if a btree path holds any locks it should be holding
precisely the locks wanted (accoringing to path->level and
path->locks_want).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Since iter->real_pos was introduced, we no longer have to deal with
extent btree iterators that have skipped past deleted keys - this is a
real performance improvement on btree updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
btree_path_traverse_all() traverses btree iterators in sorted order, and
thus shouldn't see transaction restarts due to potential deadlocks - but
sometimes we do. This patch adds some more assertions and tracks some
more state to help track this down.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This splits btree_iter into two components: btree_iter is now the
externally visible componont, and it points to a btree_path which is now
reference counted.
This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We really only need to distinguish between btree iterators and btree key
cache iterators - this is more prep work for btree_path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This was used for an optimization that hasn't existing in quite awhile
- iter->uptodate will probably be going away as well.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
These utility functions are for managing btree node state within a
btree_trans - rename them for consistency, and drop some unneeded
arguments.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This is prep work for splitting btree_path out from btree_iter -
btree_path will not have a pointer to btree_trans.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
BTREE_ITER_SET_POS_AFTER_COMMIT is used internally to automagically
advance extent btree iterators on sucessful commit.
But with the upcomnig btree_path patch it's getting more awkward to
support, and it adds overhead to core data structures that's only used
in a few places, and can be easily done by the caller instead.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This factors out bch2_dump_trans_iters_updates() from the iter alloc
overflow path, and makes some small improvements to what it prints.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
iter->real_pos needs to match the key returned or bad things will happen
when we go to update the key at that position. When we returned a
pending update from btree_trans_peek_updates(), this wasn't necessarily
the case.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This makes the flow control in bch2_btree_iter_peek() and
bch2_btree_iter_peek_prev() a bit cleaner.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This will be used to make other operations on btree iterators within a
transaction more efficient, and enable some other improvements to how we
manage btree iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
iter->should_be_locked means that if bch2_btree_iter_relock() fails, we
need to restart the transaction.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
If there's more than one iterator in the btree_trans, it's requried to
call bch2_trans_begin() to handle transaction restarts.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Start tracking when btree transactions have been restarted - and assert
that we're always calling bch2_trans_begin() immediately after
transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Btree node merging now happens prior to transaction commit, not after,
so we don't need to pay attention to BTREE_INSERT_NOUNLOCK.
Also, foreground_maybe_merge shouldn't be calling
bch2_btree_iter_traverse_all() - this is becoming private to the btree
iterator code and should only be called by bch2_trans_begin().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
On transaction restart iterators won't be locked anymore - make sure
we're always checking for errors.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_btree_iter_traverse_all() may loop, and it needs to clear
iter->should_be_locked on every iteration.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new helper for btree_cache.c that does what we want where
the iterator is still being traverse - and also eliminates some
unnecessary transaction restarts.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We weren't correctly verifying that we had interior node intent locks -
this patch also fixes bugs uncovered by the new assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add basic kernel docs for bch2_trans_reset and bch2_trans_begin.
Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This assertion is checking that what the iterator points to is
consistent with iter->real_pos, and since it's an internal btree
ordering property it should be using bpos_cmp.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Internal btree code really wants a POS_MAX with all fields ~0; external
code more likely wants the snapshot field to be 0, because when we're
passing it to bch2_trans_get_iter() it's used for the snapshot we're
operating in, which should be 0 for most btrees that don't use
snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In !BTREE_ITER_IS_EXTENTS mode, we shouldn't be looking at k->size, i.e.
we shouldn't use bkey_start_pos().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that extent handling has been lifted to bch2_trans_update(), we
don't need to keep two different lists of updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This lifts handling of overlapping extents out of __bch2_trans_commit()
and moves it to where we first do the update - which means that
BTREE_ITER_WITH_UPDATES can now work correctly in extents mode.
Also, this patch reworks how extent triggers work: previously, on
partial extent overwrite we would pass this information to the trigger,
telling it what part of the extent was being overwritten. But, this
approach has had too many subtle corner cases - now, we only mark whole
extents, meaning on partial extent overwrite we unmark the old extent
and mark the new extent.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This codepath won't just be for extents in the future, it'll also be for
BTREE_ITER_FILTER_SNAPSHOTS mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This drops bch2_btree_iter_peek_with_updates() and replaces it with a
new flag, BTREE_ITER_WITH_UPDATES, and also reworks
bch2_btree_iter_peek_slot() to respect it too.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds the ability for btree iterators to own child iterators - to be
used by an upcoming rework of bch2_btree_iter_peek_slot(), so we can
scan forwards while maintaining our current position.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
As a rule we don't want to be holding btree locks while submitting IO -
this will improve overall filesystem latency.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When we switched to using bch2_btree_bset_insert_key() for extents it
turned out it started leaving invalid keys around - of type deleted but
nonzero size - but this is fine (if ugly) because they're never written
out.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add a field to struct btree_iter for tracking whether it should be
locked - this fixes spurious transaction restarts in
bch2_trans_relock().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch adds some new tracepoints to the btree iterator code, and
adds new fields to the existing tracepoints - primarily for the iterator
position.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
fs/bcachefs/bset.c edited prefetch macro to add clang support
fs/bcachefs/btree_iter.c bugfix: initialize iter->real_pos in bch2_btree_iter_init for later use
fs/bcachefs/io.c bugfix: eliminated undefined behavior (negative bitshift)
fs/bcachefs/buckets.c bugfix: invert sign to handle 64bit abs()
Signed-off-by: Brett Holman <bpholman5@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
By not re-fetching the next update we were going into an infinite loop.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Trying to debug an issue where after traverse_all() we shouldn't have to
traverse any iterators... yet we are
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to also set iter->uptodate to indicate it needs to be traversed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a regression from
52d86202fd bcachefs: Improve bch2_btree_iter_traverse_all()
We want to avoid mucking with other iterators in the btree transaction
in operations that are only supposed to be touching individual iterators
- that patch was a cleanup to move lock ordering handling to
bch2_btree_iter_traverse_all(). But it broke upgrading of cloned
iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a livelock with btree node splits.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
By changing it to upgrade iterators to intent locks to avoid lock
restarts we can simplify __bch2_btree_node_lock() quite a bit - this
fixes a probable bug where it could potentially drop a lock on an
unrelated error but still succeed instead of causing a transaction
restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Avoid cloning iterators if we don't have to.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The patch that changed bch2_trans_relock() to not look at iter->uptodate
also tried to add an optimization by only having it relock
btree_iter_key() iterators (iterators that are live or have been marked
as keep). But, this wasn't thought through - this pops internal iterator
assertions because on transaction restart, when we're traversing
iterators we traverse all iterators marked as linked, and having
bch2_trans_relock() skip some of those mean that it can skil the
iterator that bch2_btree_iter_traverse_one() is currently traversing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_iter_peek() wasn't properly checking for
BTREE_ITER_IS_EXTENTS when updating iter->pos.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to flush the btree key cache when it's too dirty, because
otherwise the shrinker won't be able to reclaim memory - this is done by
journal reclaim. But journal reclaim also kicks btree node writes: this
meant that btree node writes were getting kicked much too often just
because we needed to flush btree key cache keys.
This patch splits journal pins into two different lists, and teaches
journal reclaim to not flush btree node writes when it only needs to
flush key cache keys.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is an important cleanup, eliminating an unnecessary copy in the
transaction commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Eventually BTREE_ITER_NODES should be going away. This patch is to fix a
transaction iterator overflow in the btree node merge path because
BTREE_ITER_NODES iterators couldn't be reused.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we're no longer doing btree node merging post commit, we can now
delete a bunch of code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're getting away from relying on iter->uptodate - this changes
bch2_trans_relock() to more directly specify which iterators should be
relocked.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch starts treating the bpos.snapshot field like part of the key
in the btree code:
* bpos_successor() and bpos_predecessor() now include the snapshot field
* Keys in btrees that will be using snapshots (extents, inodes, dirents
and xattrs) now always have their snapshot field set to U32_MAX
The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that
determines whether we're iterating over keys in all snapshots or not -
internally, this controlls whether bkey_(successor|predecessor)
increment/decrement the snapshot field, or only the higher bits of the
key.
We add a new member to struct btree_iter, iter->snapshot: when
BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always
equal iter->snapshot, which will be 0 for btrees that don't use
snapshots, and alsways U32_MAX for btrees that will use snapshots
(until we enable snapshot creation).
This patch also introduces a new metadata version number, and compat
code for reading from/writing to older versions - this isn't a forced
upgrade (yet).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With snapshots, we're going to need to differentiate between comparisons
that should and shouldn't include the snapshot field. bpos_cmp is now
the comparison function that does include the snapshot field, used by
core btree code.
Upper level filesystem code generally does _not_ want to compare against
the snapshot field - that code wants keys to compare as equal even when
one of them is in an ancestor snapshot.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we pass BTREE_INSERT_NOUNLOCK bch2_trans_commit isn't supposed to
unlock after a successful commit, but it was calling
bch2_trans_cond_resched() - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
External (to the btree iterator code) users of bch2_btree_iter_traverse
expect that on success the iterator will be pointed at iter->pos and
have that position locked - but since we split iter->pos and
iter->real_pos, that means it has to update iter->real_pos if necessary.
Internal users don't expect it to modify iter->real_pos, so we need two
separate functions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
peek() has to update iter->real_pos - there's no need for
bch2_btree_iter_set_pos() to update it as well.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ideally we'll be getting rid of peek_with_updates(), but the callers
will need to be checked.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This just gives some internal helpers some better names.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we're no longer doing next() immediately followed by peek(), this
optimization isn't doing anything anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This means bch2_btree_iter_traverse_one() can be made more efficient.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree node iterators need to obey the regular btree node invarionts
w.r.t. iter->real_pos; once they do, bch2_btree_iter_traverse will have
less that it needs to check.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The way btree iterators work internally has been changing, particularly
with the iter->real_pos changes, and bch2_btree_iter_next() is no longer
hyper optimized - it's just advance followed by peek, so it's more
efficient to just call advance where we're not using the return value of
bch2_btree_iter_next().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We keep running into occasional bugs with btree transaction iterators
overflowing - this will make those bugs more visible.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is a bit clearer than using bch2_btree_iter_free().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Change fsck code to always put btree iterators - also, make some flow
control improvements to deal with lock restarts better, and refactor
check_extents() to not walk extents twice for counting/checking
i_sectors.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this is used in only one place now, so just inline it into the caller.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we added iter->real_pos, btree_iter_set_pos_to_(next|prev)_leaf no
longer modify iter->pos, so we don't have to save it at the start
anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, bpos_diff() did not handle borrows correctly. Minor thing
considering how it was used, but worth fixing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
New helper to clean things up a bit - also, improve iter->flags
handling.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
KEY_TYPE_discard used to be used for extent whiteouts, but when handling
over overlapping extents was lifted above the core btree code it became
unused. This patch updates various code to reflect that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This allocation is required for filesystem operations to make forward
progress, thus needs a mempool.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When snapshots arrive, we won't necessarily be able to arbitrarily split
existis - when we need to split an existing extent, we'll have to check
if the extent was overwritten in child snapshots and if so emit a
whiteout for the split in the child snapshot.
Because extents couldn't span btree nodes previously, journal replay
would sometimes have to split existing extents. That's no good anymore,
but fortunately since extent handling has already been lifted above most
of the btree code there's no real need for that rule anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to differentiate between the search position of a btree
iterator, vs. what it actually points at (what we found). This matters
for extents, where iter->pos will typically be the start of the key we
found and iter->real_pos will be the end of the key we found (which soon
won't necessarily be in the same btree node!) and it will also matter
for snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This makes bch2_btree_iter_peek_prev() and bch2_btree_iter_prev()
consistent with peek() and next(), w.r.t. iter->pos.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new common helper for advancing past the last key returned
by peek().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The only reason we were keeping this around was for
BTREE_INSERT_NOUNLOCK semantics - if bch2_btree_iter_set_pos() advances
to the next leaf node, it'll drop the lock on the node that we just
inserted to.
But we don't rely on BTREE_INSERT_NOUNLOCK semantics for the extents
btree, just the inodes btree, and if we do need it for the extents btree
in the future we can do it more cleanly by cloning the iterator - this
lets us delete some special cases in the btree iterator code, which is
complicated enough as it is.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's no good reason for these functions to not be using
bch2_btree_iter_set_pos().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_and_journal_walk() walks the btree overlaying keys from the
journal; it was introduced so that we could read in the alloc btree
prior to journal replay being done, when journalling of updates to
interior btree nodes was introduced.
But it didn't have btree node prefetching, which introduced a severe
regression with mount times, particularly on spinning rust. This patch
implements btree node prefetching for the btree + journal walk,
hopefully fixing that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With various newer key types - stripe keys, inline data extents - the
old approach of calculating the maximum size of the value is becoming
more and more error prone. Better to switch to bkey_on_stack, which can
dynamically allocate if necessary to handle any size bkey.
In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__btree_trans_get_iter() was using bch2_btree_iter_upgrade, but it
shouldn't have been because on failure bch2_btree_iter_upgrade may drop
locks in other iterators, expecting the transaction to be restarted. But
__btree_trans_get_iter can't return an error to indicate that we need to
restart thet transaction - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a btree node merger is followed by a split or compact of the parent
node, we could end up with the parent btree node iterator pointing to
the whiteout inserted by the btree node merge operation - the fix is to
ensure that interior btree node iterators always point to the first non
whiteout.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, print out more information on btree transaction iterator overflow.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Metadata corruption bugs are hard to debug if we can't see exactly what
went wrong - try to allocate a bigger buffer so we can print out
everything we have.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we now always preallocate the maximum number of iterators when we
initialize a btree transaction, getting an iterator never fails - we can
delete a fair amount of error path code.
This patch also simplifies the iterator allocation code a bit.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs-tools doesn't have a real percpu (per thread) implementation
yet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this will reduce transaction restarts, from observation of tracepoints.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this function is only used by debug code, but we'd like to always build
it so we know that it does build.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The checks for lock ordering violations weren't quite right.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Allocating our array of btree iters is a big enough allocation that it
hits the buddy allocator, and we're seeing lots of lock contention.
Sticking a single element buffer in front of it should help.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These haven't been in used since reallocing iterators has been disabled,
and saves us a lot of stack if we get rid of it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's not used much anymore, the module paramter interface is better.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The check for whether locking a btree node would deadlock was wrong - we
have to check that interior nodes are locked before descendents, but
this check was wrong when consider cached vs. non cached iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have a bug where we can get stuck with a process spinning in
transaction restarts - need more information.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a bug where bch2_trans_update() would incorrectly delete a
pending update where the new update did not actually overwrite the
existing update, because we were incorrectly using BTREE_ITER_TYPE when
sorting pending btree updates.
This affects the pending patch to use cached iterators for inode
updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We use sentinal values that aren't NULL to indicate there's a btree node
at a higher level; occasionally, this may result in
btree_iter_up_until_good_node() stopping at one of those sentinal
values.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The code that checks lock ordering was recently changed to go off of the
pos of the btree node, rather than the iterator, but the btree cache
code didn't update to handle iterators that point to cached bkeys. Oops
Also, update various debug code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces a new kind of btree iterator, cached iterators, which
point to keys cached in a hash table. The cache also acts as a write
cache - in the update path, we journal the update but defer updating the
btree until the cached entry is flushed by journal reclaim.
Cache coherency is for now up to the users to handle, which isn't ideal
but should be good enough for now.
These new iterators will be used for updating inodes and alloc info (the
alloc and stripes btrees).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Btree node lock ordering is based on the logical key. However, 'struct
btree' may be reused for a different btree node under memory pressure.
This patch uses the new six lock callback to check if a btree node is no
longer the node we wanted to lock before blocking.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__bch2_btree_node_lock() was incorrectly using iter->pos as a proxy for
btree node lock ordering, this caused an off by one error that was
triggered by bch2_btree_node_get_sibling() getting the previous node.
This refactors the code to compare against btree node keys directly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_iter_downgrade() was looping over all iterators in a
transaction; bch2_trans_downgrade() should be doing that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a bad interaction with bch2_btree_iter_set_pos_same_leaf(),
which can leave a btree node locked that is just outside iter->pos,
breaking the lock ordering checks in __bch2_btree_node_lock(). Ideally
we should get rid of this corner case, but for now fix it locally with
verbose comments.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Intented to help debug deadlocks, since we can't use lockdep to check
btree node lock ordering.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This will help with iterator overflow bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, BTREE_ID_INODES was special - inodes were indexed by the
inode field, which meant the offset field of struct bpos wasn't used,
which led to special cases in e.g. the btree iterator code.
Now, inodes in the inodes btree are indexed by the offset field.
Also: prevously min_key was special for extents btrees, min_key for
extents would equal max_key for the previous node. Now, min_key =
bkey_successor() of the previous node, same as non extent btrees.
This means we can completely get rid of
btree_type_sucessor/predecessor.
Also make some improvements to the metadata IO validate/compat code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were incorrectly not restarting the transaction when re-traversing
iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ever since the btree code was first written, handling of overwriting
existing extents - including partially overwriting and splittin existing
extents - was handled as part of the core btree insert path. The modern
transaction and iterator infrastructure didn't exist then, so that was
the only way for it to be done.
This patch moves that outside of the core btree code to a pass that runs
at transaction commit time.
This is a significant simplification to the btree code and overall
reduction in code size, but more importantly it gets us much closer to
the core btree code being completely independent of extents and is
important prep work for snapshots.
This introduces a new feature bit; the old and new extent update models
are incompatible when the filesystem needs journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce a new iterator method that provides a consistent view of the
btree plus uncommitted updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ensure that iter->pos always lies between the start and end of iter->k
(the last key returned). Also, bch2_btree_iter_set_pos() now invalidates
the key that peek() or next() returned.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More aggressively checking iterator invariants, and fixing the resulting
bugs. Also greatly simplifying iter_next() and iter_next_slot() - they
were hyper optimized before, but the optimizations were getting too
brittle.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
All iterators should be released now with bch2_trans_iter_put(), so
TRANS_RESET_ITERS shouldn't be needed anymore, and TRANS_RESET_MEM is
always used.
Also convert more code to __bch2_trans_do().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
NULL is used to mean "reach end of traversal" - we were only
initializing the leaf node in the iterator to the right sentinal value.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for changing the core btree update path to handle extents like
regular keys; we need to reduce the scope of what BTREE_ITER_IS_EXTENTS
means
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This one takes an additional argument for whether we're searching for >=
or > the search key.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This wasn't originally required, but this is the model we're moving
towards.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Rework some of the helper comparison functions for consistency
- Currently trying to refactor all the logic that's different for
extents in the btree iterator code. The main difference is that for non
extents we search for a key greater than or equal to the search key,
while for extents we search for a key strictly greater than the search
key (iter->pos).
So that logic is now handled by btree_iter_search_key(), which computes
the real search key based on iter->pos and whether or not we're
searching for a key >= or > iter->pos.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On transaction restart (-EINTR), we need to traverse all iterators.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, when doing multiple update in the same transaction commit
that overwrote each other, we relied on doing the updates in the same
order as the bch2_trans_update() calls in order to get the correct
result. But that wasn't correct for triggers; bch2_trans_mark_update()
when marking overwrites would do the wrong thing because it hadn't seen
the update that was being overwritten.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This should be private to btree_update_leaf.c, and we might end up
removing it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Clean up a bit of duplicated code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This assertion was wrong for interior nodes (and wasn't terribly useful
to begin with)
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>