421 Commits

Author SHA1 Message Date
Kent Overstreet
ee2c6ea776 bcachefs: btree_iter->ip_allocated
In debug mode, we now track where btree iterators and paths are
initialized/allocated - helpful in tracking down btree path overflows.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:49 -04:00
Kent Overstreet
6c36318cc7 bcachefs: key cache: Don't hold btree locks while using GFP_RECLAIM
This is something we need to do more widely: instead of bothering with
GFP_NOIO/GFP_NOFS, if we need to allocate memory while holding locks:

 - first attempt the allocation with GFP_NOWAIT
 - if that fails, drop btree locks with bch2_trans_unlock(), then
   retry with GFP_KERNEL.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:49 -04:00
Kent Overstreet
c82ed3047b bcachefs: Fix bch2_btree_path_traverse_all()
We need to take a ref on a path while we're traversing it: this fixes a
bug with paths getting reused while being traversed, in the key cache
fill code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:49 -04:00
Kent Overstreet
ee94c413a7 bcachefs: Delete a faulty assertion
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:48 -04:00
Kent Overstreet
9d7f2a4111 bcachefs: bch2_btree_trans_to_text(): print blocked time
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:48 -04:00
Kent Overstreet
e242b92af5 bcachefs: Fix for long running btree transactions & key cache
While a btree transaction is running, we hold a SRCU read lock on the
btree key cache that prevents btree key cache keys from being freed -
this is so that relock() operations won't access freed memory.

The downside of this is that long running btree transactions prevent
memory from being freed from the key cache. This adds a check in
bch2_trans_begin() - if the transaction has been running longer than 1
second, drop and retake the SRCU read lock and zero out pointers to
unlock key cache paths.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:48 -04:00
Kent Overstreet
08f7803159 bcachefs: bch2_trans_revalidate_updates_in_node()
When we started stashing the key being overwritten in
btree_insert_entry, this introduced a typical iterator invalidation
problem, triggered by btree node splits or resorts.

Previously, dealt with this by unconditionally re-validating those
stashed pointers in the transaction commit path. This patch gets rid of
that by doing it only when needed, in bch2_trans_node_add() or
bch2_trans_node_reinit_iter().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:48 -04:00
Kent Overstreet
321bdc73f3 bcachefs: bkey_min(), bkey_max()
Parallel to bpos_min(), bpos_max() - trivial refactoring.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:48 -04:00
Kent Overstreet
ef0732861a bcachefs: Add a missing bch2_btree_path_traverse() call
bch2_btree_iter_peek_upto() in snapshots mode may need to keep a
btree_path for the insert position, not just the position of the key
we're returning. The code was incorrectly assuming this would be in the
same btree node - we were missing a bch2_btree_path_traverse() call.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
Kent Overstreet
5c792e1b64 bcachefs: Fix a btree iter assertion pop
This fixes a (harmless) broken invariant in __bch2_btree_path_set_pos():
iterators to interior nodes should point to the first non whiteout.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
Kent Overstreet
1617d56dc9 bcachefs: Key cache now works for snapshots btrees
This switches btree_key_cache_fill() to use a btree iterator, not a
btree path, so that it can search for keys in previous snapshots.

We also add another iterator flag, BTREE_ITER_KEY_CACHE_FILL, to avoid
recursion back into the key cache.

This will allow us to re-enable the key cache for inodes in the next
patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
Kent Overstreet
087e53c255 bcachefs: Bring back BTREE_ITER_CACHED_NOFILL
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
Kent Overstreet
dcced06942 bcachefs: Kill __btree_trans_peek_key_cache()
There was no reason for this to be a separate helper - we always want
the relock call that btree_trans_peek_key_cache() did.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
Kent Overstreet
52bf51b91f bcachefs: Fix __btree_trans_peek_key_cache()
We were returning a pointer to a variable on the stack - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
Kent Overstreet
e88a75ebe8 bcachefs: New bpos_cmp(), bkey_cmp() replacements
This patch introduces
 - bpos_eq()
 - bpos_lt()
 - bpos_le()
 - bpos_gt()
 - bpos_ge()

and equivalent replacements for bkey_cmp().

Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:47 -04:00
Kent Overstreet
c96f108b05 bcachefs: Optimize bch2_trans_iter_init()
When flags & btree_id are constants, we can constant fold the entire
calculation of the actual iterator flags - and the whole thing becomes
small enough to inline.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:46 -04:00
Kent Overstreet
c9ee99ad8c bcachefs: Move some asserts behind CONFIG_BCACHEFS_DEBUG
Convert some non-critical asserts in long-stable code to debug asserts.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:46 -04:00
Kent Overstreet
0f35e0860a bcachefs: Fix return code from btree_path_traverse_one()
trans->restarted is a positive error code, not the usual negative

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:46 -04:00
Kent Overstreet
b2d1d56b1d bcachefs: Fixes for building in userspace
- Marking a non-static function as inline doesn't actually work and is
   now causing problems - drop that

 - Introduce BCACHEFS_LOG_PREFIX for when we want to prefix log messages
   with bcachefs (filesystem name)

 - Userspace doesn't have real percpu variables (maybe we can get this
   fixed someday), put an #ifdef around bch2_disk_reservation_add()
   fastpath

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:46 -04:00
Kent Overstreet
984dc67e3b bcachefs: Improve __bch2_btree_path_make_mut()
btree_path_copy() doesn't need to call
bch2_btree_path_check_sort_fast() - the newly allocated path will always
be in the correct position, post copy; also delete some redundant
branches from __bch2_btree_path_make_mut().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:45 -04:00
Kent Overstreet
0cc455b3ca bcachefs: Inlining improvements
- Don't call into bch2_encrypt_bio() when we're not encrypting
 - Pull slowpath out of trans_lock_write()
 - Make sure bc2h_trans_journal_res_get() gets inlined.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:45 -04:00
Kent Overstreet
a101957649 bcachefs: More style fixes
Fixes for various checkpatch errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:45 -04:00
Kent Overstreet
c167f9e541 bcachefs: Journal keys overlay fixes
- In the btree iterator code that overlays keys from the journal, we
   were incorrectly specifying level=0 instead of the btree_path's
   current level in a few places
 - When we didn't do journal replay, we shouldn't free the journal keys:
   this fixes cmd_list and cmd_dump, which run in norecovery mode

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:45 -04:00
Kent Overstreet
1f69368c5c bcachefs: Fix an out-of-bounds shift
roundup_pow_of_two() is undefined for 0 - oops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:45 -04:00
Kent Overstreet
c81f5836a4 bcachefs: Don't touch c->flags in bch2_trans_iter_init()
This moves the JOURNAL_REPLAY_DONE flag check from
bch2_trans_iter_init() to bch2_trans_init(), where we stash a copy in
btree_trans - gaining us a small performance improvement.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:44 -04:00
Kent Overstreet
3e3e02e6bc bcachefs: Assorted checkpatch fixes
checkpatch.pl gives lots of warnings that we don't want - suggested
ignore list:

 ASSIGN_IN_IF
 UNSPECIFIED_INT	- bcachefs coding style prefers single token type names
 NEW_TYPEDEFS		- typedefs are occasionally good
 FUNCTION_ARGUMENTS	- we prefer to look at functions in .c files
			  (hopefully with docbook documentation), not .h
			  file prototypes
 MULTISTATEMENT_MACRO_USE_DO_WHILE
			- we have _many_ x-macros and other macros where
			  we can't do this

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:44 -04:00
Kent Overstreet
307e3c1319 bcachefs: Optimize bch2_trans_init()
Now we store the transaction's fn idx in a local variable, instead of
redoing the lookup every time we call bch2_trans_init().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:44 -04:00
Kent Overstreet
29aa78f15e bcachefs: Split out __btree_path_up_until_good_node()
This breaks up btree_path_up_until_good_node() so that only the fastpath
gets inlined.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:44 -04:00
Kent Overstreet
d7e4e51370 bcachefs: Switch to local_clock() for fastpath time source
local_clock() isn't always completely accurate - e.g. on machines with
TSC drift - but ktime_get_ns() overhead is too high, unfortunately.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:44 -04:00
Kent Overstreet
dccedaaa52 bcachefs: Fix btree node prefetchig
We were forgetting to count down the number of nodes to prefetch, firing
off _way_ more than intended - whoops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:44 -04:00
Kent Overstreet
f42238b5cd bcachefs: Fix a rare path in bch2_btree_path_peek_slot()
In the drop_alloc tests, we may end up calling
bch2_btree_iter_peek_slot() on an interior level that doesn't exist.
Previously, this would hit the path->uptodate assertion in
bch2_btree_path_peek_slot(); this path first checks a NULL btree node,
which is how we know we're at the end of the btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:43 -04:00
Kent Overstreet
7dcbdbd85c bcachefs: bch2_path_put_nokeep()
The btree iterator code may allocate extra btree paths, temporarily,
that do not refer to keys being returned: we don't need to wait until
transaction restart to drop these, when they're not referenced they
should be deleted right away.

This fixes a transaction path overflow bug.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:43 -04:00
Kent Overstreet
969576ecae bcachefs: bch2_btree_iter_peek() now works with interior nodes
Needed by the next patch, which will be iterating over keys in nodes at
level 1.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:43 -04:00
Kent Overstreet
29cea6f483 bcachefs: Fix bch2_btree_path_up_until_good_node()
There was a rare bug when path->locks_want was nonzero, but not
BTREE_MAX_DEPTH, where we'd return on a valid node that wasn't locked -
oops.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:42 -04:00
Kent Overstreet
c298fd7d34 bcachefs; Mark __bch2_trans_iter_init as inline
This function is fairly small and only used in two places: one very hot,
the other cold, so it should definitely be inlined.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:42 -04:00
Kent Overstreet
3f3bc66ef0 bcachefs: Optimize btree_path_alloc()
- move slowpath code to a separate function, btree_path_overflow()
 - no need to use hweight64
 - copy nr_max_paths from btree_transaction_stats to btree_trans,
   avoiding a data dependency in the fast path

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:42 -04:00
Kent Overstreet
14d8f26ad0 bcachefs: Inline bch2_trans_kmalloc() fast path
Small performance optimization.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:42 -04:00
Kent Overstreet
a8f3542843 bcachefs: bch2_print_string_as_lines()
This adds a helper for printing a large buffer one line at a time, to
avoid the 1k printk limit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:41 -04:00
Kent Overstreet
e9174370d0 bcachefs: bch2_btree_node_relock_notrace()
Most of the node_relock_fail trace events are generated from
bch2_btree_path_verify_level(), when debugcheck_iterators is enabled -
but we're not interested in these trace events, they don't indicate that
we're in a slowpath.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:41 -04:00
Kent Overstreet
afbc719468 bcachefs: Improve bch2_btree_trans_to_text()
This is just a formatting/readability improvement.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:41 -04:00
Kent Overstreet
0d7009d7ca bcachefs: Delete old deadlock avoidance code
This deletes our old lock ordering based deadlock avoidance code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:41 -04:00
Kent Overstreet
96d994b37c bcachefs: Print deadlock cycle in debugfs
In the event that we're not finished debugging the cycle detector, this
adds a new file to debugfs that shows what the cycle detector finds, if
anything. By comparing this with btree_transactions, which shows held
locks for every btree_transaction, we'll be able to determine if it's
the cycle detector that's buggy or something else.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:41 -04:00
Kent Overstreet
33bd5d0686 bcachefs: Deadlock cycle detector
We've outgrown our own deadlock avoidance strategy.

The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.

Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.

That approach had some issues, though.
 - Sometimes we'd issue transaction restarts unnecessarily, when no
   deadlock would have actually occured. Lock ordering restarts have
   become our primary cause of transaction restarts, on some workloads
   totally 20% of actual transaction commits.

 - To avoid deadlock or livelock, we'd often have to take intent locks
   when we only wanted a read lock: with the lock ordering approach, it
   is actually illegal to hold _any_ read lock while blocking on an intent
   lock, and this has been causing us unnecessary lock contention.

 - It was getting fragile - the various lock ordering rules are not
   trivial, and we'd been seeing occasional livelock issues related to
   this machinery.

So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.

When we block taking a btree lock, after adding ourself to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.

If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we can not allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.

Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:41 -04:00
Kent Overstreet
845cffed0d bcachefs: Add a debug assert
Chasing down a strange locking bug.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:41 -04:00
Kent Overstreet
57ce827442 bcachefs: Make an assertion more informative
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:40 -04:00
Kent Overstreet
e4215d0fec bcachefs: All held locks must be in a btree path
With the new deadlock cycle detector, it's critical that all held locks
be marked in a btree_path, because that's what the cycle detector
traverses - any locks that aren't correctly marked will cause deadlocks.

This changes the btree_path to allocate some btree_paths for the new
nodes, since until the final update is done we otherwise don't have a
path referencing them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:40 -04:00
Kent Overstreet
38474c2642 bcachefs: Avoid using btree_node_lock_nopath()
With the upcoming cycle detector, we have to be careful about using
btree_node_lock_nopath - in particular, using it to take write locks can
cause deadlocks.

All held locks need to be tracked in a btree_path, so that the cycle
detector knows about them - unless we know that we cannot cause
deadlocks for other reasons: e.g. we are only taking read locks, or
we're in very early fsck (topology repair).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:40 -04:00
Kent Overstreet
4e6defd106 bcachefs: btree_bkey_cached_common->cached
Add a type descriptor to btree_bkey_cached_common - there's no reason
not to since we've got padding that was otherwise unused, and this is a
nice cleanup (and helpful in later patches).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:40 -04:00
Kent Overstreet
674cfc2624 bcachefs: Add persistent counters for all tracepoints
Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:39 -04:00
Kent Overstreet
c240c3a944 bcachefs: Print lock counts in debugs btree_transactions
Improve our debugfs output, to help in debugging deadlocks: this shows,
for every btree node we print, the current number of readers/intent
locks/write locks held.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:39 -04:00