This just removes a redundant comparison - there's more work we could do
here to remove some redundant copying.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Standard splitting out of the slow path from the fast path of a
function. We may follow this up in another patch with inlining the fast
path into btree_iter.c.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the last journal write didn't complete successfully due to a torn
write, we'll detect it as a checksum error. In that case, we should just
pretend that journal entry was never written.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Print out the journal entries we read and will replay as soon as
possible - if we get an error validating keys it's helpful to know where
it was in the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It turns out we need bch2_extent_trim_atomic() even when we're deleting
extents one at a time because it's possible for one reflink_p to
reference arbitrarily many reflink_v extents. This doesn't normally
happen, but the data move path can fragment existing extents in the
background.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
b->write_type needs to be set atomically with setting the
btree_node_need_write flag, so move it into b->flags.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Centralize format strings in bcachefs.h
- Add bch2_fmt_inum_offset() and related helpers
- Switch error messages for inodes to also print out the offset, in
bytes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Warnings ought to always have a format string/log message - makes them
considerably more useful.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, when we exited from the loop body with a break statement
_ret wouldn't have been assigned to yet, and we could spuriously return
a transaction restart error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This improves the bkey_format calculation when splitting btree nodes.
Previously, we'd use a format calculated for the original node for the
lower of the two new nodes.
This was particularly bad on sequential insertions, where we iteratively
split the last btree node, whose format has to include KEY_MAX.
Now, we calculate formats precisely for the keys the two new nodes will
contain. This also should make splitting a bit more efficient, since
we're only copying keys once (from the original node to the new node,
instead of new node, replacement node, then upper split).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This switches where we take quota reservations to be per bch_write_op
instead of per dio_write, so we can drop the quota reservation in the
same place as we call i_sectors_acct(), and only take/release
ei_quota_lock once.
In the future we'd like ei_quota_lock to not be a mutex, so that we can
avoid punting to process context before delivering write completions in
nocow mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The genradix code can handle multiple threads trying to allocate at the
same time - we don't need the genradix_ptr_alloc() call to happen under
a lock.
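As an illustration only - a hypothetical caller with invented types, not
code from this patch - the allocation can now happen before taking
whatever lock protects the entry's contents:

  #include <linux/errno.h>
  #include <linux/generic-radix-tree.h>
  #include <linux/gfp.h>
  #include <linux/mutex.h>

  struct foo { int val; };

  struct ctx {
          GENRADIX(struct foo)    foo_table;
          struct mutex            lock;
  };

  static int get_or_create_foo(struct ctx *c, size_t idx)
  {
          /* genradix_ptr_alloc() is safe to call concurrently */
          struct foo *f = genradix_ptr_alloc(&c->foo_table, idx, GFP_KERNEL);

          if (!f)
                  return -ENOMEM;

          mutex_lock(&c->lock);
          f->val = 42;            /* initialize/publish under the lock */
          mutex_unlock(&c->lock);
          return 0;
  }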
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a regression from percpu freedlists in the btree key cache
code: in a rare error path, we were immediately freeing a bkey_cached
that had been used before and should've waited for an SRCU barrier.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These were wrappers around atomic operations that verified that the
counter wasn't negative, but they're dead code - delete.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Marking a non-static function as inline doesn't actually work and is
now causing problems - drop that
- Introduce BCACHEFS_LOG_PREFIX for when we want to prefix log messages
with bcachefs (filesystem name)
- Userspace doesn't have real percpu variables (maybe we can get this
fixed someday), put an #ifdef around bch2_disk_reservation_add()
fastpath
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have a unique lock used for controlling adding to the pagecache: the
lock has two states, where both states are shared - the lock may be held
multiple times for either state - but not both states at the same time.
This is exactly what we need for nocow mode locking, so this patch pulls
it out of fs.c into its own file.
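A minimal conceptual sketch of such a two-state shared lock - names and
implementation details here are invented for illustration, not the actual
bcachefs code:

  #include <linux/spinlock.h>
  #include <linux/wait.h>

  /* v > 0: held (shared) in state A; v < 0: held (shared) in state B */
  struct two_state_shared_lock_sketch {
          spinlock_t              lock;
          long                    v;
          wait_queue_head_t       wait;
  };

  static bool two_state_trylock(struct two_state_shared_lock_sketch *l, bool state_b)
  {
          bool ret = false;

          spin_lock(&l->lock);
          if (state_b ? l->v <= 0 : l->v >= 0) {
                  l->v += state_b ? -1 : 1;
                  ret = true;
          }
          spin_unlock(&l->lock);
          return ret;
  }

  static void two_state_lock(struct two_state_shared_lock_sketch *l, bool state_b)
  {
          wait_event(l->wait, two_state_trylock(l, state_b));
  }

  static void two_state_unlock(struct two_state_shared_lock_sketch *l, bool state_b)
  {
          spin_lock(&l->lock);
          l->v -= state_b ? -1 : 1;
          spin_unlock(&l->lock);
          wake_up_all(&l->wait);
  }

The real code may well use an atomic counter instead of a spinlock; the
sketch only shows the two-states-shared, mutually-exclusive semantics.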
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_WRITE_FLUSH is a write flag that causes a journal flush. It's only
used in the direct IO path, and this will allow for some consolidation
with the regular fsync path, which will help with the upcoming nocow
mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- factor out more slowpath code into non-inline function
- use bch2_print_string_as_lines(), so our error message doesn't get
truncated
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_path_copy() doesn't need to call
bch2_btree_path_check_sort_fast() - the newly allocated path will always
be in the correct position, post copy; also delete some redundant
branches from __bch2_btree_path_make_mut().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Don't call into bch2_encrypt_bio() when we're not encrypting
- Pull slowpath out of trans_lock_write()
- Make sure bch2_trans_journal_res_get() gets inlined.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- With BCH_WRITE_SYNC, we no longer need the completion in struct
dio_write
- Pull out bch2_dio_write_copy_iov() into a separate non-inline
function, it's code that doesn't run in the common case
- Copy mapping and inode pointers into dio_write, avoiding pointer
chasing at the start of bch2_dio_write_loop()
- kthread_use_mm() is not needed in the common case; move it into
bch2_dio_write_loop_async()
- factor out various helpers from bch2_dio_write_loop() and rework
control flow for better icache utilization
Other small optimizations:
- bch2_keylist_free() is only used in one place, at the end of the
bch2_write() path - drop the reinit
- in bch2_disk_reservation_put(), check if res->sectors is nonzero
before touching c->online_reserved, since that will likely be a cache
miss
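A sketch of that last point (simplified and hedged - not necessarily the
exact function body):

  void bch2_disk_reservation_put(struct bch_fs *c, struct disk_reservation *res)
  {
          if (res->sectors) {
                  this_cpu_sub(*c->online_reserved, res->sectors);
                  res->sectors = 0;
          }
  }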
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs: More DIO write path optimization
Better code prefetching (?)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new flag for the write path, BCH_WRITE_SYNC, and switches
the O_DIRECT write path to use it when we're not running asynchronously.
It runs the btree update after the write in the original thread's
context instead of a kworker, cutting context switches in half.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This factors out a properly-documented helper for deciding when we want
to sort a btree node with MAX_BSETS bsets down to a single bset.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This replaces sysfs btree_avg_write_size with btree_write_stats, which
now breaks out statistics by the source of the btree write.
Btree writes that are too small are a source of inefficiency, and
excessive btree resort overhead - this will let us see what's causing
them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Per fstests generic/275, on -ENOSPC we're supposed to write until the
filesystem is full - i.e. do a partial write instead of failing the full
write.
This is a partial fix for the buffered write path: we'll still fail on a
page boundary.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- In the btree iterator code that overlays keys from the journal, we
were incorrectly specifying level=0 instead of the btree_path's
current level in a few places
- When we didn't do journal replay, we shouldn't free the journal keys:
this fixes cmd_list and cmd_dump, which run in norecovery mode
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Use __func__ in error messages that refer to function name, and do so
more uniformly - suggested by checkpatch.pl
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds an inlined version of bch2_bkey_cmp_packed(), and uses it in
bch2_sort_keys(), where it's part of the inner loop.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Long ago, bkey_unpack_key() was added to bset.h instead of bkey.h
because bkey.h didn't include btree_types.h, which it needs for the
compiled unpack function.
This patch finally moves it to the proper location.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This moves the JOURNAL_REPLAY_DONE flag check from
bch2_trans_iter_init() to bch2_trans_init(), where we stash a copy in
btree_trans - gaining us a small performance improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
checkpatch.pl gives lots of warnings that we don't want - suggested
ignore list:
  ASSIGN_IN_IF
  UNSPECIFIED_INT    - bcachefs coding style prefers single token type names
  NEW_TYPEDEFS       - typedefs are occasionally good
  FUNCTION_ARGUMENTS - we prefer to look at functions in .c files (hopefully
                       with docbook documentation), not .h file prototypes
  MULTISTATEMENT_MACRO_USE_DO_WHILE
                     - we have _many_ x-macros and other macros where we
                       can't do this
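For reference, one way to apply a list like this is checkpatch's standard
--ignore option, e.g. in a .checkpatch.conf file (exact placement of the
file is up to us):

  --ignore ASSIGN_IN_IF,UNSPECIFIED_INT,NEW_TYPEDEFS,FUNCTION_ARGUMENTS,MULTISTATEMENT_MACRO_USE_DO_WHILE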
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- add bch2_dev_usage_read_fast(), which doesn't return by value -
bch_dev_usage is big enough that we don't want the silent memcpy
- tweak the allocation path to only call bch2_dev_usage_read() once per
bucket allocated
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
crc.compression_type & nonce get reset inside bch2_rechecksum_bio(); we
set them back to the previously calculated values. This fixes
incompressible extents being marked as uncompressed.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This shouldn't be needed anymore, since we don't rely on the pointer
validity that this was guarding against anymore - we get a new good
reference and save it right after this function.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This separates out the slowpath of bch2_trans_update_by_path_trace()
into a new non-inlined helper.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't want to fire the bucket_alloc_fail tracepoint on transaction
restart, when we can retry immediately - only when the allocation
actually has to block.
Also, switch from sched_clock() to local_clock(), as we've been doing
elsewhere.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now we store the transaction's fn idx in a local variable, instead of
redoing the lookup every time we call bch2_trans_init().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The shrinker assumes freed key cache items are ordered by age, so that
it doesn't have to scan the full list to find items that are old enough
(according to the srcu code) to be freed.
But percpu freelists broke this ordering; this patch fixes this by
ensuring we insert items into the proper position.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A single block can't be compressed, so it's incompressible.
This stops rebalance repeatedly marking extents as uncompressed.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Sometimes the user may need to change durability after formatting to
match the current hardware setup; this option provides a quick and
flexible alternative to removing and then re-adding the device.
It is HIGHLY ADVISED TO RUN REREPLICATE after changing this value so the
system doesn't remain degraded.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Appending new nodes to the end of the list means we're more likely to
evict old entries when btree_cache_scan() is started.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- We now correctly allow soft limits to be exceeded, instead of always
returning -EDQUOT
- Disk quota grace times/warnings can now be set, not just the
systemwide defaults
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
local_clock() isn't always completely accurate - e.g. on machines with
TSC drift - but ktime_get_ns() overhead is too high, unfortunately.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- In userspace, we don't have real percpu variables; this patch
disables the percpu freelists in userspace
- add some error messages for the asserts in
bch2_fs_btree_key_cache_exit(); we've been hitting this (only in
userspace, oddly), perhaps this will help us track down the error.
- bkey_cached_reuse() should likely be taking the key cache lock, and
it's a slowpath so it doesn't hurt to do so
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were forgetting to count down the number of nodes to prefetch, firing
off _way_ more than intended - whoops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't actually allocate memory under the btree key cache lock - so
there's no recursion concerns, and the shrinker can just use
mutex_lock().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On journal read, previously we would do full journal entry validation
immediately after reading a journal entry.
However, this would lead to errors for journal entries we weren't
actually going to use, either because they were too old or too new
(newer than the most recent flush).
We've observed write tearing on journal entries newer than the newest
flush - which makes sense, prior to a flush there's no guarantees about
write persistence.
This patch defers full journal entry validation until the end of the
journal read path, when we know which journal entries we'll want to use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for the next patch, to defer journal entry validation: we now
track for each replica whether we had a good checksum.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This module provides a fast 64-bit implementation of basic statistics
functions, including mean, variance and standard deviation in both
weighted and unweighted variants; the unweighted variant has a 32-bit
limitation per sample to prevent overflow when squaring.
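To illustrate just the unweighted variant (a sketch with invented names,
not the module's actual API):

  #include <linux/math64.h>
  #include <linux/types.h>

  /* running sums; 32-bit samples keep x * x from overflowing the sum of squares */
  struct stats_sketch {
          u64     n;
          s64     sum;
          u64     sum_squares;
  };

  static void stats_update(struct stats_sketch *s, s32 x)
  {
          s->n++;
          s->sum += x;
          s->sum_squares += (u64) ((s64) x * x);
  }

  static s64 stats_mean(const struct stats_sketch *s)
  {
          return s->n ? div64_s64(s->sum, s->n) : 0;
  }

  static u64 stats_variance(const struct stats_sketch *s)
  {
          s64 mean = stats_mean(s);

          /* E[x^2] - E[x]^2; standard deviation is then int_sqrt() of this */
          return s->n ? div64_u64(s->sum_squares, s->n) - mean * mean : 0;
  }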
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When modifying a file, we may be required to drop the suid/sgid bits -
we were missing a file_modified() call to do this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
An error case was jumping to the wrong label, creating an infinite loop
- oops.
This fixes fstests generic/648.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were removing 1 more entry than we were supposed to - oops.
Also some other simplifications and cleanups, and bring back the abort
preference code in a better fashion.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We already have support for the flag's semantics: inode options are
inherited by children if they were explicitly set on the parent. This
patch just maps the FS_XFLAG_PROJINHERIT flag to the "this option was
explicitly set" bit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is the right thing to do, and conforms with our own behaviour on
rename and xfs's behaviour on hardlink.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For compliance with other quota implementations, we should be
initializing quota information with a default 1 week timelimit: this
fixes fstests generic/235.
Also, this adds to_text() functions for some quota structs - useful
debugging aids.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree nodes can be written by other threads (shrinker, journal reclaim)
with only a read lock, but brand new nodes should only be written by the
thread doing the split/interior update. bch2_btree_update_add_new_node()
sets btree node flags to indicate that this is a new node and should not
be written out by other threads, thus we need to call it before dropping
our write lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new helper, quota_reserve_range(), which takes a quota
reservation for unallocated blocks in a given file range, and uses it in
bch2_remap_file_range().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the drop_alloc tests, we may end up calling
bch2_btree_iter_peek_slot() on an interior level that doesn't exist.
Previously, this would hit the path->uptodate assertion in
bch2_btree_path_peek_slot(); now we check for a NULL btree node first,
which is how we know we're at the end of the btree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The btree iterator code may allocate extra btree paths, temporarily,
that do not refer to keys being returned: we don't need to wait until
transaction restart to drop these, when they're not referenced they
should be deleted right away.
This fixes a transaction path overflow bug.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Negating without casting to a signed integer means the value wasn't
getting sign extended properly - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, bch2_btree_update_start() would always take all intent
locks, all the way up to the root.
We've finally got data from users where this became a scalability issue
- so, this patch fixes bch2_btree_update_start() to only take the locks
we need.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have an error path plumbed through, there's no need to be
using bch2_btree_node_lock_write_nofail().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch in the series is (finally!) going to change btree splits
(and interior updates in general) to not take intent locks all the way
up to the root - instead only locking the nodes they'll need to modify.
However, this will be introducing a race since if we're not holding a
write lock on a btree node it can be written out by another thread, and
then we might not have enough space for a new bset entry.
We can handle this by retrying - we just need to introduce a new error
path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In order to avoid locking all btree nodes up to the root for btree node
splits, we're going to have to introduce a new error path into
bch2_btree_insert_node(); this means we can't have done any writes or
modified global state before that point.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We'd like to prioritize aborting transactions that have done less work -
however, it appears breaking cycles by telling other threads to abort
may still be buggy, so disable that for now.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Some lock operations can't fail; a cycle of nofail locks is impossible
to recover from. So we want to get rid of these nofail locking
operations, but as this is tricky it'll be done incrementally.
If such a cycle happens, this patch prints out which codepaths are
involved so we know what to work on next.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cached pointers are generally dropped, not moved: this led to an
assertion firing in the data update path when there were no new replicas
being written.
This patch adds a data_options field for pointers to be dropped, and
tweaks move_extent() to check if we're only dropping pointers, not
writing new ones, before kicking off a data update operation.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When errors=panic, we want to make sure we print the error before
calling bch2_inconsistent_error().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
btree_node_lock_nopath() is something we'd like to get rid of: it's
always prone to deadlocks if we're accidentally holding other locks,
because it doesn't mark the lock it's taking in a path. We'll want to
get rid of it in the future, but for now this patch works around it by
calling bch2_trans_unlock().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This changes bch2_check_for_deadlock() to print the longest chains it
finds - when we have a deadlock because the cycle detector isn't finding
something, this will let us see what it's missing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were incorrectly returning -BCH_ERR_insufficient_devices when we'd
received a different error from bch2_bucket_alloc_trans(), which
(erroneously) turns into -EROFS further up the call chain.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_btree_delete_range_trans() was using btree_trans_too_many_iters()
to avoid path overflow, but this was buggy here (and also
btree_trans_too_many_iters() is suspect in general).
btree_trans_too_many_iters() only returns true when we're close to the
maximum number of paths - within 8 - but extent insert/delete assumes
that it can use more paths than that.
Instead, we need to call bch2_trans_begin() on every loop iteration.
Since we don't want to call bch2_trans_begin() (restarting the outer
transaction) if the call was a no-op - if we had no work to do - we have
to structure things a bit oddly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This refactoring puts our various allocation path counters into a
dedicated struct - the upcoming nocow patch is going to add another
counter.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a rare bug when path->locks_want was nonzero, but not
BTREE_MAX_DEPTH, where we'd return on a valid node that wasn't locked -
oops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This used to be needed more for buffered IO, but now the block layer has
writeback throttling - we can delete this now.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It now includes more info - whether the bucket was for metadata or data
- and we also call it in the same place as the bucket_alloc_fail
tracepoint.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This function is fairly small and only used in two places: one very hot,
the other cold, so it should definitely be inlined.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This moves the slowpath of check_pos_snapshot_overwritten() to a
separate function, and inlines the fast path - helping performance on
btrees that don't use snapshots and for users that aren't using
snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, jset_validate() was formatting the initial part of an error
string for every entry it was validating - expensive.
This moves that code to journal_entry_err_msg(), which is now only
called if there's an actual error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- move slowpath code to a separate function, btree_path_overflow()
- no need to use hweight64
- copy nr_max_paths from btree_transaction_stats to btree_trans,
avoiding a data dependency in the fast path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need counters to be initialized before initializing shrinkers - the
shrinker callbacks will update those counters. This fixes a segfault in
userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've seen long error messages get truncated here, so convert to the new
bch2_print_string_as_lines().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- factor out fsck_err_get()
- if the "bcachefs (%s):" prefix has already been applied, don't
duplicate it
- convert to printbufs instead of static char arrays
- tidy up control flow a bit
- use bch2_print_string_as_lines(), to avoid messages getting truncated
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a helper for printing a large buffer one line at a time, to
avoid the 1k printk limit.
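The idea, as a rough sketch (invented name, not the actual helper's
signature):

  #include <linux/printk.h>
  #include <linux/string.h>

  static void print_string_as_lines_sketch(const char *str)
  {
          while (*str) {
                  const char *p = strchrnul(str, '\n');

                  /* one printk() per line, so no single call hits the ~1k limit */
                  printk(KERN_ERR "%.*s\n", (int) (p - str), str);
                  str = *p ? p + 1 : p;
          }
  }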
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Most of the node_relock_fail trace events are generated from
bch2_btree_path_verify_level(), when debugcheck_iterators is enabled -
but we're not interested in these trace events, they don't indicate that
we're in a slowpath.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're still seeing OOM issues caused by the btree node cache shrinker
not sufficiently freeing memory: thus, this patch changes the shrinker
to not exit if __GFP_FS was not supplied.
Instead, tweak btree node memory allocation so that we never invoke
memory reclaim while holding the btree node cache lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is a major oopsy - we should always be unlocking before calling
closure_sync(), else we'll cause a deadlock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were checking for -EAGAIN, but that's not what's returned when we
didn't pass a closure to wait with - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Before we had the deadlock cycle detector, we didn't want to be holding
read locks when taking intent locks, because blocking on an intent lock
while holding a read lock was a lock ordering violation that could
cause a deadlock.
With the cycle detector this is no longer an issue, so this code can be
deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In order for bch2_btree_node_lock_write_nofail() to never produce a
deadlock, we must ensure we're never holding read locks when using it.
Fortunately, it's only used from code paths where any read locks may be
safely dropped.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the event that we're not finished debugging the cycle detector, this
adds a new file to debugfs that shows what the cycle detector finds, if
anything. By comparing this with btree_transactions, which shows held
locks for every btree_transaction, we'll be able to determine if it's
the cycle detector that's buggy or something else.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've outgrown our own deadlock avoidance strategy.
The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.
Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.
That approach had some issues, though.
- Sometimes we'd issue transaction restarts unnecessarily, when no
deadlock would have actually occurred. Lock ordering restarts have
become our primary cause of transaction restarts, on some workloads
totalling 20% of actual transaction commits.
- To avoid deadlock or livelock, we'd often have to take intent locks
when we only wanted a read lock: with the lock ordering approach, it
is actually illegal to hold _any_ read lock while blocking on an intent
lock, and this has been causing us unnecessary lock contention.
- It was getting fragile - the various lock ordering rules are not
trivial, and we'd been seeing occasional livelock issues related to
this machinery.
So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.
When we block taking a btree lock, after adding ourself to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.
If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we can not allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.
Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.
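A self-contained sketch of the cycle check, with invented types (the real
code works on btree_trans and its held locks; this recursive form is just
for illustration):

  #include <linux/types.h>

  /* each transaction knows which transactions are blocked on locks it holds */
  struct trans_node {
          struct trans_node       **waiters;
          unsigned                nr_waiters;
          bool                    visited;        /* DFS bookkeeping, reset between checks */
  };

  static bool would_deadlock(struct trans_node *start, struct trans_node *t)
  {
          unsigned i;

          if (t->visited)
                  return false;
          t->visited = true;

          for (i = 0; i < t->nr_waiters; i++) {
                  struct trans_node *w = t->waiters[i];

                  /*
                   * 'start' already put itself on the waitlist of the lock it
                   * wants, so reaching it again closes a cycle:
                   */
                  if (w == start || would_deadlock(start, w))
                          return true;
          }

          return false;
  }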
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, if we were trying to upgrade from a read to an intent lock
but we held an additional read lock via another btree_path,
bch2_btree_node_upgrade() would always fail, in six_lock_tryupgrade().
This patch factors out the code that __bch2_btree_node_lock_write() uses
to temporarily drop extra read locks, so that six_lock_tryupgrade() can
succeed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This brings back an important optimization, to avoid touching the wait
lists an extra time, while preserving the property that a thread is on a
lock waitlist iff it is waiting - it is never removed from the waitlist
until it has the lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a lost wakeup between a read unlock in percpu mode and a write
lock. The unlock path unlocks, then executes a barrier, then checks for
waiters; correspondingly, the lock side should set the wait bit and
execute a barrier, then attempt to take the lock.
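Illustrating the pairing with an invented structure (not the six lock
internals): with full barriers on both sides, at least one side must
observe the other, so the wakeup can't be lost.

  #include <linux/atomic.h>

  struct demo_lock {
          atomic_t        readers;
          atomic_t        waiters;
  };

  /* returns true if the caller should wake up waiters */
  static bool demo_read_unlock(struct demo_lock *l)
  {
          atomic_dec(&l->readers);
          smp_mb__after_atomic();         /* publish "reader gone" before checking */
          return atomic_read(&l->waiters) != 0;
  }

  /* returns true if the lock can be taken now; otherwise sleep and retry */
  static bool demo_write_lock_check(struct demo_lock *l)
  {
          atomic_inc(&l->waiters);
          smp_mb__after_atomic();         /* publish "waiter present" before checking */
          return !atomic_read(&l->readers);
  }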
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed by the cycle detector in bcachefs - we need a way to
iterate over waitlist entries while dropping and retaking the waitlist
lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This switches to a single list of waiters, instead of separate lists for
read and intent, and switches write locks to also use the wait lists
instead of being handled differently.
Also, removal from the wait list is now done by the process waiting on
the lock, not the process doing the wakeup. This is needed for the new
deadlock cycle detector - we need tasks to stay on the waitlist until
they've successfully acquired the lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Continuing the saga of introducing private dedicated error codes for
each error path, this patch converts ENOSPC to error codes that are
subtypes of ENOSPC. We've recently had a test failure where we got
-ENOSPC where we shouldn't have, and didn't have enough information to
tell where it came from, so this patch will solve that problem.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch is going to be adding private error codes for all the
places we return -ENOSPC.
Additionally, this patch updates return paths at all module boundaries
to call bch2_err_class(), to return the standard error code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the new deadlock cycle detector, it's critical that all held locks
be marked in a btree_path, because that's what the cycle detector
traverses - any locks that aren't correctly marked will cause deadlocks.
This changes the btree split path to allocate some btree_paths for the
new nodes, since until the final update is done we otherwise don't have
a path referencing them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Centralizing the transaction restart/tracepoint in
bch2_btree_path_upgrade() lets us improve the tracepoint - now it emits
old and new locks_want.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Spotted a lockup once that appeared to be a lost wakeup. Adding a manual
trigger for lock wakeups will make it easy to tell if that's what it is
next time it occurs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have counters with longer names now, so adjust the tabstop - also,
make sure there's always a space printed between the name and the
number.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When subvolumes & snapshots were rolled out, hash_redo_key() was
disabled due to some new complications - namely, bch2_hash_set() works
at the subvolume level, and fsck does not run in a defined subvolume,
instead working at the snapshot ID level.
This patch splits out bch2_hash_set_snapshot() from bch2_hash_set(), and
makes one small tweak for fsck:
- Normally, bch2_hash_set() (and other dirent code) needs to know what
subvolume we're in, because dirents that point to other subvolumes
should only be visible in the subvolume they were created in, not
other snapshots. We can't check that in fsck, so we just assume that
all dirents are visible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This removes an optimization that didn't actually save us any memory,
due to alignment, but did make the code more complicated than it needed
to be. We were also seeing a bug where journal_seq_base wasn't getting
correctly initialized, so hopefully it'll fix that too.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Little bit of tidying up, this makes the counters a little bit clearer
as to what's happening.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the upcoming cycle detector, we have to be careful about using
btree_node_lock_nopath - in particular, using it to take write locks can
cause deadlocks.
All held locks need to be tracked in a btree_path, so that the cycle
detector knows about them - unless we know that we cannot cause
deadlocks for other reasons: e.g. we are only taking read locks, or
we're in very early fsck (topology repair).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Similar to "bcachefs: Fix usage of six lock's percpu mode", six locks
have a percpu mode, but we can't switch between percpu and non percpu
modes while a lock is in use: threads attempting to take a read lock may
race, and we'll end up with the read count permanently off.
Fixing this the "correct" way, in six_lock_pcpu_(alloc|free) would
require an RCU barrier, and we don't want to do that - instead, we have
to permanently segregate percpu and non percpu objects, including when
on freelists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ideally, all the code in btree_locking.c should be converted, but then
we'd want to convert btree_path to point to btree_key_cached_common too,
and then we'd be in for a much bigger cleanup - but a bit of incremental
cleanup will still be helpful for the next patches.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a type descriptor to btree_bkey_cached_common - there's no reason
not to since we've got padding that was otherwise unused, and this is a
nice cleanup (and helpful in later patches).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Have to be careful with bit fields - when subtracting, this was
overflowing into the write_locking bit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Taking a write lock will be able to fail, with the new cycle detector -
unless we pass it nofail, which is possible but not preferred.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the future, with the new deadlock cycle detector, we won't be using
bare six_lock_* anymore: lock wait entries will all be embedded in
btree_trans, and we will need a btree_trans context whenever locking a
btree node.
This patch plumbs a btree_trans to the few places that need it, and adds
two new locking functions
- btree_node_lock_nopath, which may fail returning a transaction
restart, and
- btree_node_lock_nopath_nofail, to be used in places where we know we
cannot deadlock (i.e. because we're holding no other locks).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
six locks are unfair: while a thread is blocked trying to take a write
lock, new read locks will fail. The new deadlock cycle detector makes
use of our existing lock tracing, so we need to tell it we're holding a
write lock before we take the lock for it to work correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we've now got time_stats for lock hold times (per btree
transaction), we don't need this anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On failure to get a journal pre-reservation because we're called from
journal reclaim we're not supposed to return a transaction restart error
- this fixes a livelock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It now prints the error name when the btree node is an error pointer;
also, don't trace failures when the btree node is
BCH_ERR_no_btree_node_up.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Don't decrease BTREE_ITER_MAX when building with CONFIG_LOCKDEP
anymore. The lockdep table sizes are configurable now, we don't need
this anymore.
- btree_trans_too_many_iters() is less conservative now. Previously it
was causing a transaction restart if we had used more than
BTREE_ITER_MAX / 2 paths, change this to BTREE_ITER_MAX - 8.
This helps with excessive transaction restarts/livelocks in the bucket
allocator path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The upcoming lock cycle detection code will need to know precisely which
locks every btree_trans is holding, including write locks - this patch
updates btree_node_locked_type to include write locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Improve our debugfs output, to help in debugging deadlocks: this shows,
for every btree node we print, the current number of readers/intent
locks/write locks held.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch
- tracks maximum bch2_trans_kmalloc() memory used in btree_transaction_stats
- makes it available in debugfs
- switches bch2_trans_init() to using that for the amount of memory to
preallocate, instead of the parameter passed in
This drastically reduces transaction restarts, and means we no longer
need to track this in the source code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
six_lock_count now counts up whether a write lock is held, and this patch
now also correctly counts six_lock->intent_lock_recurse.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, we used two different bit arrays for tracking held btree
node locks. This patch switches to an array of two bit integers, which
will let us track, in a future patch, when we hold a write lock.
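An illustrative version of the encoding (simplified names and widths, not
the exact bcachefs definitions):

  #include <linux/types.h>

  enum node_locked_type {
          NODE_UNLOCKED           = 0,
          NODE_READ_LOCKED        = 1,
          NODE_INTENT_LOCKED      = 2,
          /* the fourth value is left free for tracking write locks later */
  };

  static inline unsigned get_node_locked_type(u64 nodes_locked, unsigned level)
  {
          return (nodes_locked >> (level * 2)) & 3;
  }

  static inline u64 set_node_locked_type(u64 nodes_locked, unsigned level,
                                         enum node_locked_type type)
  {
          nodes_locked &= ~(3ULL << (level * 2));
          nodes_locked |= (u64) type << (level * 2);
          return nodes_locked;
  }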
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Held btree locks are tracked in btree_path->nodes_locked and
btree_path->nodes_intent_locked. Upcoming patches are going to change
the representation in struct btree_path, so this patch switches to
proper helpers instead of direct access to these fields.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Start to centralize some of the locking code in a new file; more locking
code will be moving here in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Device labels are represented as pointers in the member info section: we
need to get and then set the label for it to be kept correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new convention is that functions that handle transaction restarts
within an existing transaction context should return
-BCH_ERR_transaction_restart_nested when they did so, since they
invalidated the outer transaction context.
This also means bch2_btree_delete_range_trans() is changed to only call
bch2_trans_begin() after a transaction restart, not on every loop
iteration.
This is to fix a bug in fsck, in check_inode() when we truncate an inode
with BCH_INODE_I_SIZE_DIRTY set.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- fsck_inode_rm() wasn't returning BCH_ERR_transaction_restart_nested
- change bch2_trans_verify_not_restarted() to call panic() - we don't
want these errors to be missed
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
iter->k needs to be consistent with iter->pos - required for
bch2_btree_iter_(rewind|advance) to work correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When returning a key from the key cache, in BTREE_ITER_WITH_KEY_CACHE
mode, we don't want to set should_be_locked on iter->path; we're not
returning a key from that path, so we don't need to, and also since we
traversed the key cache iterator before setting should_be_locked on that
path it might be unlocked (if we unlocked, bch2_trans_relock() won't
have relocked it).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
For debugging the eytzinger search tree code, and low level bkey packing
code, it can be helpful to see things in binary: this patch improves our
helpers for doing so.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We should be calling btree_node_mem_ptr_set() before path_level_init(),
since we already touched the key that btree_node_mem_ptr_set() will
modify and path_level_init() will be doing the lookup in the child btree
node we're recursing to.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Topology repair may change btree node min/max keys: when it does so, we
need to always rebuild eytzinger search trees because nodes directly
depend on those values.
This fixes a bug found by the 'kill_btree_node' test, where we'd pop an
assertion in bch2_bset_search_linear().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
For now this is just a BUG_ON() - we may want to change this to return
an error in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This improves flush_buf() so that it always returns nonzero when we're
done reading and ready to return to userspace, and so that it returns
the value we want to return to userspace (number of bytes read, if there
wasn't an error).
In the future we'll be better abstracting this mechanism and pulling it
out of bcachefs, and using it to replace seq_file.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We were iterating starting at BCACHEFS_ROOT_INO, but snapshots start at
POS_MIN - meaning this code was never getting run.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reported-by: Olexa Bilaniuk <obilaniu@gmail.com>
Instead of counting transaction restarts, count when the transaction is
restarted: if bch2_trans_begin() was called when the transaction wasn't
restarted we need to ensure restart_count is still incremented.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Turns out this assertion was something we could legitimately hit - add a
comment describing what's going on, and handle it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We need a way to check if the machinery for handling btree_paths within
a transaction is behaving reasonably, as it often has not been - we've
had bugs with transaction path overflows caused by duplicate paths and
plenty of other things.
This patch tracks, per transaction fn, the most btree paths ever
allocated by that transaction and makes it available in debugfs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Going to be adding more things to this in the next patch.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes an assertion about unexpected transaction restarts -
bch2_btree_delete_range_trans() handles transaction restarts.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_path_put() is supposed to drop paths that aren't needed on
transaction restart, or to hold locks that we're supposed to keep until
transaction commit: but it was failing to free paths in some cases that
it should have, leading to transaction path overflows with lots of
duplicate paths.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
These were used more prior to getting rid of the in-memory bucket arrays
- they don't serve much purpose anymore, and deleting them lets us write
better assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Our types are exported to the tracepoint code, so it's not necessary to
break things out individually when passing them to tracepoints - we can
also call other functions from TP_fast_assign().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This was noticed when a test hit this error and didn't fail, because
fsck wasn't returning that it fixed errors.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
It doesn't make any sense to set should_be_locked on btree_paths that
aren't locked, and is often a bug - this patch adds assertions and fixes
some of those bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- use strlcpy(), not strncpy()
- add tracepoints for btree_path alloc and free
- give the tracepoint for key cache upgrade fail a proper name
- add a tracepoint for btree_node_upgrade_fail
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Clearing path->preserve means the path will be dropped in
bch2_trans_begin() - but on transaction restart, we're likely to need
that path again.
This fixes a livelock in the allocation path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_btree_trans_to_text() is used to print btree_transactions owned by
other threads; thus, it needs to be particularly careful. This fixes a
null ptr deref caused by racing with the owning thread changing
path->l[].b.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>