This function is fairly small and only used in two places: one very hot,
the other cold, so it should definitely be inlined.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This moves the slowpath of check_pos_snapshot_overwritten() to a
separate function, and inlines the fast path - helping performance on
btrees that don't use snapshots and for users that aren't using
snapshots.
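Roughly, the split looks like this (a sketch - the exact fast path
condition here is an assumption, not copied from the patch):

  int __check_pos_snapshot_overwritten(struct btree_trans *,
                                       enum btree_id, struct bpos);

  static inline int check_pos_snapshot_overwritten(struct btree_trans *trans,
                                                   enum btree_id btree,
                                                   struct bpos pos)
  {
          /* Fast path: btrees that don't use snapshots, and snapshots
           * with no descendants, can never be overwritten in another
           * snapshot: */
          if (!btree_type_has_snapshots(btree) ||
              !snapshot_t(trans->c, pos.snapshot)->children[0])
                  return 0;

          /* Slow path, now out of line: */
          return __check_pos_snapshot_overwritten(trans, btree, pos);
  }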
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, jset_validate() was formatting the initial part of an error
string for every entry it was validating - expensive.
This moves that code to journal_entry_err_msg(), which is now only
called if there's an actual error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- move slowpath code to a separate function, btree_path_overflow()
- no need to use hweight64
- copy nr_max_paths from btree_transaction_stats to btree_trans,
avoiding a data dependency in the fast path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need counters to be initialized before initializing shrinkers - the
shrinker callbacks will update those counters. This fixes a segfault in
userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've seen long error messages get truncated here, so convert to the new
bch2_print_string_as_lines().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- factor out fsck_err_get()
- if the "bcachefs (%s):" prefix has already been applied, don't
duplicate it
- convert to printbufs instead of static char arrays
- tidy up control flow a bit
- use bch2_print_string_as_lines(), to avoid messages getting truncated
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a helper for printing a large buffer one line at a time, to
avoid the 1k printk limit.
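A minimal sketch of the helper (simplified - the real version also has
to deal with the printk prefix and console locking):

  void bch2_print_string_as_lines(const char *prefix, const char *lines)
  {
          const char *p;

          while (*lines) {
                  p = strchrnul(lines, '\n');
                  printk("%s%.*s\n", prefix, (int) (p - lines), lines);
                  if (!*p)
                          break;
                  lines = p + 1;
          }
  }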
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Most of the node_relock_fail trace events are generated from
bch2_btree_path_verify_level(), when debugcheck_iterators is enabled -
but we're not interested in these trace events; they don't indicate that
we're in a slowpath.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're still seeing OOM issues caused by the btree node cache shrinker
not sufficiently freeing memory: thus, this patch changes the shrinker
to not exit if __GFP_FS was not supplied.
Instead, tweak btree node memory allocation so that we never invoke
memory reclaim while holding the btree node cache lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is a major oopsy - we should always be unlocking before calling
closure_sync(), else we'll cause a deadlock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were checking for -EAGAIN, but that's not what's returned when we
didn't pass a closure to wait with - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Before we had the deadlock cycle detector, we didn't want to be holding
read locks when taking intent locks, because blocking on an intent lock
while holding a read lock was a lock ordering violation that could
cause a deadlock.
With the cycle detector this is no longer an issue, so this code can be
deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In order for bch2_btree_node_lock_write_nofail() to never produce a
deadlock, we must ensure we're never holding read locks when using it.
Fortunately, it's only used from code paths where any read locks may be
safely dropped.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the event that we're not finished debugging the cycle detector, this
adds a new file to debugfs that shows what the cycle detector finds, if
anything. By comparing this with btree_transactions, which shows held
locks for every btree_transaction, we'll be able to determine if it's
the cycle detector that's buggy or something else.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've outgrown our own deadlock avoidance strategy.
The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.
Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.
That approach had some issues, though.
- Sometimes we'd issue transaction restarts unnecessarily, when no
deadlock would have actually occurred. Lock ordering restarts have
become our primary cause of transaction restarts, on some workloads
totaling 20% of actual transaction commits.
- To avoid deadlock or livelock, we'd often have to take intent locks
when we only wanted a read lock: with the lock ordering approach, it
is actually illegal to hold _any_ read lock while blocking on an intent
lock, and this has been causing us unnecessary lock contention.
- It was getting fragile - the various lock ordering rules are not
trivial, and we'd been seeing occasional livelock issues related to
this machinery.
So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.
When we block taking a btree lock, after adding ourselves to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.
If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we cannot allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.
Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.
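In pseudocode-ish C, the idea is roughly the following -
for_each_blocking_trans() is illustrative, and the real implementation
is iterative with a depth limit rather than recursive:

  static bool waits_for_cycle(struct btree_trans *trans,
                              struct btree_trans *start, unsigned depth)
  {
          struct btree_trans *next;

          if (depth && trans == start)
                  return true;    /* found a cycle back to ourselves */

          /* every transaction blocked on a lock that @trans holds: */
          for_each_blocking_trans(trans, next)
                  if (waits_for_cycle(next, start, depth + 1))
                          return true;

          return false;
  }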
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, if we were trying to upgrade from a read to an intent lock
but we held an additional read lock via another btree_path,
bch2_btree_node_upgrade() would always fail, in six_lock_tryupgrade().
This patch factors out the code that __bch2_btree_node_lock_write() uses
to temporarily drop extra read locks, so that six_lock_tryupgrade() can
succeed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This brings back an important optimization, to avoid touching the wait
lists an extra time, while preserving the property that a thread is on a
lock waitlist iff it is waiting - it is never removed from the waitlist
until it has the lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There was a lost wakeup between a read unlock in percpu mode and a write
lock. The unlock path unlocks, then executes a barrier, then checks for
waiters; correspondingly, the lock side should set the wait bit and
execute a barrier, then attempt to take the lock.
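Schematically, with assumed names, the two sides now pair up like this:

  /* read unlock, percpu mode: */
  this_cpu_dec(lock->readers);
  smp_mb();                       /* unlock, then check for waiters */
  if (waitlist_nonempty(lock))
          wake_up_waiters(lock);

  /* write lock: */
  set_bit(WAITING_FOR_WRITE, &lock->state);
  smp_mb();                       /* set wait bit, then try the lock */
  if (!try_take_write_lock(lock))
          sleep_until_woken();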
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is needed by the cycle detector in bcachefs - we need a way to
iterate over waitlist entries while dropping and retaking the waitlist
lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This switches to a single list of waiters, instead of separate lists for
read and intent, and switches write locks to also use the wait lists
instead of being handled differently.
Also, removal from the wait list is now done by the process waiting on
the lock, not the process doing the wakeup. This is needed for the new
deadlock cycle detector - we need tasks to stay on the waitlist until
they've successfully acquired the lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Continuing the saga of introducing private dedicated error codes for
each error path, this patch converts ENOSPC to error codes that are
subtypes of ENOSPC. We've recently had a test failure where we got
-ENOSPC where we shouldn't have, and didn't have enough information to
tell where it came from, so this patch will solve that problem.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The next patch is going to be adding private error codes for all the
places we return -ENOSPC.
Additionally, this patch updates return paths at all module boundaries
to call bch2_err_class(), to return the standard error code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the new deadlock cycle detector, it's critical that all held locks
be marked in a btree_path, because that's what the cycle detector
traverses - any locks that aren't correctly marked will cause deadlocks.
This changes the update path to allocate btree_paths for the new
nodes, since until the final update is done we otherwise don't have a
path referencing them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Centralizing the transaction restart/tracepoint in
bch2_btree_path_upgrade() lets us improve the tracepoint - now it emits
old and new locks_want.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Spotted a lockup once that appeared to be a lost wakeup. Adding a manual
trigger for lock wakeups will make it easy to tell if that's what it is
next time it occurs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have counters with longer names now, so adjust the tabstop - also,
make sure there's always a space printed between the name and the
number.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When subvolumes & snapshots were rolled out, hash_redo_key() was
disabled due to some new complications - namely, bch2_hash_set() works
at the subvolume level, and fsck does not run in a defined subvolume,
instead working at the snapshot ID level.
This patch splits out bch2_hash_set_snapshot() from bch2_hash_set(), and
makes one small tweak for fsck:
- Normally, bch2_hash_set() (and other dirent code) needs to know what
subvolume we're in, because dirents that point to other subvolumes
should only be visible in the subvolume they were created in, not
other snapshots. We can't check that in fsck, so we just assume that
all dirents are visible.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This removes an optimization that didn't actually save us any memory,
due to alignment, but did make the code more complicated than it needed
to be. We were also seeing a bug where journal_seq_base wasn't getting
correctly initialized, so hopefully it'll fix that too.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Little bit of tidying up, this makes the counters a little bit clearer
as to what's happening.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the upcoming cycle detector, we have to be careful about using
btree_node_lock_nopath - in particular, using it to take write locks can
cause deadlocks.
All held locks need to be tracked in a btree_path, so that the cycle
detector knows about them - unless we know that we cannot cause
deadlocks for other reasons: e.g. we are only taking read locks, or
we're in very early fsck (topology repair).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Similar to "bcachefs: Fix usage of six lock's percpu mode", six locks
have a percpu mode, but we can't switch between percpu and non percpu
modes while a lock is in use: threads attempting to take a read lock may
race, and we'll end up with the read count permanently off.
Fixing this the "correct" way, in six_lock_pcpu_(alloc|free) would
require an RCU barrier, and we don't want to do that - instead, we have
to permanently segregate percpu and non-percpu objects, including when
on freelists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ideally, all the code in btree_locking.c should be converted, but then
we'd want to convert btree_path to point to btree_key_cached_common too,
and then we'd be in for a much bigger cleanup - but a bit of incremental
cleanup will still be helpful for the next patches.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a type descriptor to btree_bkey_cached_common - there's no reason
not to since we've got padding that was otherwise unused, and this is a
nice cleanup (and helpful in later patches).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Have to be careful with bit fields - when subtracting, this was
overflowing into the write_locking bit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Taking a write lock will be able to fail, with the new cycle detector -
unless we pass it nofail, which is possible but not preferred.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the future, with the new deadlock cycle detector, we won't be using
bare six_lock_* anymore: lock wait entries will all be embedded in
btree_trans, and we will need a btree_trans context whenever locking a
btree node.
This patch plumbs a btree_trans to the few places that need it, and adds
two new locking functions
- btree_node_lock_nopath, which may fail returning a transaction
restart, and
- btree_node_lock_nopath_nofail, to be used in places where we know we
cannot deadlock (i.e. because we're holding no other locks).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
six locks are unfair: while a thread is blocked trying to take a write
lock, new read locks will fail. The new deadlock cycle detector makes
use of our existing lock tracing, so we need to tell it we're holding a
write lock before we take the lock for it to work correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we've now got time_stats for lock hold times (per btree
transaction), we don't need this anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On failure to get a journal pre-reservation because we're called from
journal reclaim we're not supposed to return a transaction restart error
- this fixes a livelock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It now prints the error name when the btree node is an error pointer;
also, don't trace failures when the btree node is
BCH_ERR_no_btree_node_up.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Don't decrease BTREE_ITER_MAX when building with CONFIG_LOCKDEP
anymore. The lockdep table sizes are configurable now, we don't need
this anymore.
- btree_trans_too_many_iters() is less conservative now. Previously it
caused a transaction restart if we had used more than
BTREE_ITER_MAX / 2 paths; change this to BTREE_ITER_MAX - 8, as in the
sketch below.
This helps with excessive transaction restarts/livelocks in the bucket
allocator path.
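A sketch of the new check, assuming paths are tracked in a 64-bit
bitmap as elsewhere in the iterator code:

  static inline bool btree_trans_too_many_iters(struct btree_trans *trans)
  {
          return hweight64(trans->paths_allocated) > BTREE_ITER_MAX - 8;
  }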
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The upcoming lock cycle detection code will need to know precisely which
locks every btree_trans is holding, including write locks - this patch
updates btree_node_locked_type to include write locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Improve our debugfs output, to help in debugging deadlocks: this shows,
for every btree node we print, the current number of readers/intent
locks/write locks held.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch
- tracks maximum bch2_trans_kmalloc() memory used in btree_transaction_stats
- makes it available in debugfs
- switches bch2_trans_init() to using that for the amount of memory to
preallocate, instead of the parameter passed in
This drastically reduces transaction restarts, and means we no longer
need to track this in the source code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
six_lock_count now counts whether a write lock is held, and this patch
now also correctly counts six_lock->intent_lock_recurse.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, we used two different bit arrays for tracking held btree
node locks. This patch switches to an array of two bit integers, which
will let us track, in a future patch, when we hold a write lock.
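A sketch of the new representation (helper names assumed): two bits
per btree level, storing the six lock type plus one so that zero means
unlocked:

  static inline int btree_node_locked_type(struct btree_path *path,
                                           unsigned level)
  {
          /* -1 = unlocked, else SIX_LOCK_read/intent (later, write): */
          return ((path->nodes_locked >> (level << 1)) & 3) - 1;
  }

  static inline void mark_btree_node_locked(struct btree_path *path,
                                            unsigned level, int type)
  {
          path->nodes_locked &= ~(3U << (level << 1));
          path->nodes_locked |= (type + 1) << (level << 1);
  }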
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Held btree locks are tracked in btree_path->nodes_locked and
btree_path->nodes_intent_locked. Upcoming patches are going to change
the representation in struct btree_path, so this patch switches to
proper helpers instead of direct access to these fields.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Start to centralize some of the locking code in a new file; more locking
code will be moving here in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Device labels are represented as pointers in the member info section: we
need to get and then set the label for it to be kept correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new convention is that functions that handle transaction restarts
within an existing transaction context should return
-BCH_ERR_transaction_restart_nested when they did so, since they
invalidated the outer transaction context.
This also means bch2_btree_delete_range_trans() is changed to only call
bch2_trans_begin() after a transaction restart, not on every loop
iteration.
This is to fix a bug in fsck, in check_inode() when we truncate an inode
with BCH_INODE_I_SIZE_DIRTY set.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- fsck_inode_rm() wasn't returning BCH_ERR_transaction_restart_nested
- change bch2_trans_verify_not_restarted() to call panic() - we don't
want these errors to be missed
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
iter->k needs to be consistent with iter->pos - required for
bch2_btree_iter_(rewind|advance) to work correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When returning a key from the key cache, in BTREE_ITER_WITH_KEY_CACHE
mode, we don't want to set should_be_locked on iter->path; we're not
returning a key from that path, so we don't need to, and also since we
traversed the key cache iterator before setting should_be_locked on that
path it might be unlocked (if we unlocked, bch2_trans_relock() won't
have relocked it).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
For debugging the eytzinger search tree code, and low level bkey packing
code, it can be helpful to see things in binary: this patch improves our
helpers for doing so.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We should be calling btree_node_mem_ptr_set() before path_level_init(),
since we already touched the key that btree_node_mem_ptr_set() will
modify and path_level_init() will be doing the lookup in the child btree
node we're recursing to.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Topology repair may change btree node min/max keys: when it does so, we
need to always rebuild eytzinger search trees because nodes directly
depend on those values.
This fixes a bug found by the 'kill_btree_node' test, where we'd pop an
assertion in bch2_bset_search_linear().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
For now this is just a BUG_ON() - we may want to change this to return
an error in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This improves flush_buf() so that it always returns nonzero when we're
done reading and ready to return to userspace, and so that it returns
the value we want to return to userspace (number of bytes read, if there
wasn't an error).
In the future we'll be better abstracting this mechanism and pulling it
out of bcachefs, and using it to replace seq_file.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We were iterating starting at BCACHEFS_ROOT_INO, but snapshots start at
POS_MIN - meaning this code was never getting run.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reported-by: Olexa Bilaniuk <obilaniu@gmail.com>
Instead of counting transaction restarts, count when the transaction is
restarted: if bch2_trans_begin() was called when the transaction wasn't
restarted we need to ensure restart_count is still incremented.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Turns out this assertion was something we could legitimately hit - add a
comment describing what's going on, and handle it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We need a way to check if the machinery for handling btree_paths within
a transaction is behaving reasonably, as it often has not been - we've
had bugs with transaction path overflows caused by duplicate paths and
plenty of other things.
This patch tracks, per transaction fn, the maximum number of btree paths
allocated by that transaction and makes it available in debugfs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Going to be adding more things to this in the next patch.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes an assertion about unexpected transaction restarts -
bch2_delete_range_trans() handles transaction restarts.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_path_put() is supposed to drop paths that aren't needed on
transaction restart, or to hold locks that we're supposed to keep until
transaction commit: but it was failing to free paths in some cases that
it should have, leading to transaction path overflows with lots of
duplicate paths.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
These were used more prior to getting rid of the in-memory bucket arrays
- they don't serve much purpose anymore, and deleting them lets us write
better assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Our types are exported to the tracepoint code, so it's not necessary to
break things out individually when passing them to tracepoints - we can
also call other functions from TP_fast_assign().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This was noticed when a test hit this error and didn't fail, because
fsck wasn't returning that it fixed errors.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
It doesn't make any sense to set should_be_locked on btree_paths that
aren't locked, and is often a bug - this patch adds assertions and fixes
some of those bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- use strlcpy(), not strncpy()
- add tracepoints for btree_path alloc and free
- give the tracepoint for key cache upgrade fail a proper name
- add a tracepoint for btree_node_upgrade_fail
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Clearing path->preserve means the path will be dropped in
bch2_trans_begin() - but on transaction restart, we're likely to need
that path again.
This fixes a livelock in the allocation path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_btree_trans_to_text() is used to print btree_transactions owned by
other threads; thus, it needs to be particularly careful. This fixes a
null ptr deref caused by racing with the owning thread changing
path->l[].b.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This uses the new trans->restart count to make sure we always correctly
return -BCH_ERR_transaction_restart_nested when we restart a nested
transaction - eliminating some other hacks and preparing for new
assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Sometimes we see IO errors due to O_DIRECT alignment issues - having an
option to use buffered IO will be helpful.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Currently seeing a very rare and difficult to explain btree_path
inconsistency - this patch adds assertions to the only place that seems
to be missing them.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Nested btree transactions require special care, and an upcoming patch is
going to add assertions to that effect. We don't want to be using them
unnecessarily, so this patch switches bch2_bucket_trans_early() to not
handle transaction restarts.
This patch also adds a cursor so that on transaction restart we can
continue scanning from where the previous search for an empty bucket
left off.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_count_inode_sectors() uses for_each_btree_key() internally, which
handles lock restarts - the lockrestart_do() in check_i_sectors() is
redundant, and buggy here since the count that
bch2_count_inode_sectors() returns was interpreted as an error by
lockrestart_do().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- Convert to for_each_btree_key2(), for_each_btree_key_commit(),
for_each_btree_key_reverse()
- No more bare bch2_btree_iter_peek(); we're now fault-injecting lock
restarts, so we always need a lockrestart_do() or equivalent.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new macro, like for_each_btree_key2(), but for iterating in
reverse order.
Also, change for_each_btree_key2() to properly check the return value of
bch2_btree_iter_advance().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In CONFIG_BCACHEFS_DEBUG mode, we'll now randomly issue transaction
restarts - with a decaying probability based on the number of restarts
we've already had, to ensure that transactions eventually make forward
progress. This should help shake out some bugs.
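Roughly - names and the exact decay curve here are assumptions:

  static bool should_inject_restart(struct btree_trans *trans)
  {
          /* Probability halves with every injected restart, so a
           * transaction can't livelock on injected restarts: */
          u32 shift = min_t(u32, trans->injected_restarts + 4, 31);

          if (get_random_u32() & ((1U << shift) - 1))
                  return false;

          trans->injected_restarts++;
          return true;
  }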
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have error codes, with subtypes, we can switch to our own
error code for transaction restarts - and even better, a distinct error
code for each transaction restart reason: clearer code and better
debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In bch2_bucket_alloc_trans(), we're iterating over buckets - but not
directly with an iterator, since we're iterating over the freespace
btree.
This means that we need to clear iter->path->preserve, otherwise we'll
end up retaining a btree_path for every alloc key we touched - which is
not what we want here.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Instead of overloading standard error codes (EINTR/EAGAIN), and defining
short lists of error codes in multiple places that potentially end up
overlapping & conflicting, we're now going to have one master list of
error codes.
Error codes are defined with an x-macro: thus we also have
bch2_err_str() now.
Also, error codes have a class field. Now, instead of checking for
errors with ==, code should use bch2_err_matches(), which returns true
if the error is equal to or a sub-error of the error class.
This means we can define unique errors for every source location where
an error is generated, which will help improve our error messages.
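The shape of the x-macro (entries here are illustrative):

  #define BCH_ERRCODES()                                          \
          x(0,                    ENOSPC)                         \
          x(ENOSPC,               open_buckets_empty)             \
          x(ENOSPC,               freelist_empty)                 \
          x(0,                    transaction_restart)

  enum bch_errcode {
          BCH_ERR_START = 2048,
  #define x(class, err) BCH_ERR_##err,
          BCH_ERRCODES()
  #undef x
          BCH_ERR_MAX
  };

and callers check class membership instead of comparing with ==:

  if (bch2_err_matches(ret, ENOSPC))
          goto retry_allocation;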
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We can rebuild alloc info if these btree roots are missing - no need to
bail out and say the filesystem is unrecoverable.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Like bch2_copygc_wait_amount, should_invalidate_buckets() needs to try
to ensure that there are always more buckets free than the largest
reserve.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
With the upcoming patches to add assertions for incorrect nested
transaction restart handling, this code is now bogus. Switch it to
for_each_btree_key_norestart() so that transaction restarts are only
handled in one place.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This will help us improve nested transactions - we need to add
assertions that whenever an inner transaction handles a restart, it
still returns -EINTR to the outer transaction.
This also adds nested_lockrestart_do() and nested_commit_do() which use
the new counters to correctly return -EINTR when the transaction was
restarted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new helper, bch2_trans_run(), that runs a function with a
btree_transaction context but without handling transaction restarts.
We're adding checks for nested transaction restart handling: when an
inner transaction handles a transaction restart it will still have to
return it to the outer transaction, or else assertions will be popped in
the outer transaction.
But some places don't need restart handling at the outer scope, so this
helper does what they need.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This converts bch2_gc_stripes_done() and bch2_gc_reflink_done() to the
new for_each_btree_key_commit() macro.
The new for_each_btree_key2() and for_each_btree_key_commit() macros
handle transaction retries, allowing us to avoid nested transactions -
which we want to avoid since they're tricky to do completely correctly
and upcoming assertions are going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new for_each_btree_key2() macro handles transaction retries,
allowing us to avoid nested transactions - which we want to avoid since
they're tricky to do completely correctly and upcoming assertions are
going to be checking for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have an obvious wakeup race if we do the wakeup _before_ updating
the counters the thing doing the waiting is reading.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We now record the length of time btree locks are held and expose this in debugfs.
Enabled via CONFIG_BCACHEFS_LOCK_TIME_STATS.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The printbuf indentation feature doesn't yet work with '\n' and '\t',
so we've replaced all instances of '\n' with prt_newline().
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need the caller name and a place to store our results; btree_trans provides both.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We try to ensure we never hold btree locks for too long - bcachefs tries
to be soft realtime. This adds a check when restarting a transaction,
where a transaction restart is cheap - if we've been holding locks for
too long, drop and retake them.
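A sketch of the check, as it might appear in bch2_trans_begin() -
field and constant names are assumed:

  if (!trans->restarted &&
      (need_resched() ||
       local_clock() - trans->last_begin_time > MAX_LOCK_HOLD_NS)) {
          /* restart is cheap here - give other threads a chance: */
          bch2_trans_unlock(trans);
          cond_resched();
          bch2_trans_relock(trans);
  }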
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This introduces two new macros for iterating through the btree, with
transaction restart handling
- for_each_btree_key2()
- for_each_btree_key_commit()
Every iteration is now in an implicit transaction, and - as with
lockrestart_do() and commit_do() - returning -EINTR will cause the
transaction to be restarted, at the same key.
This patch converts a bunch of code that was open coding this to these
new macros, saving a substantial amount of code.
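Roughly (simplified - at this point in the series transaction restarts
are still signalled with -EINTR):

  #define for_each_btree_key2(_trans, _iter, _btree_id,           \
                              _start, _flags, _k, _do)            \
  ({                                                              \
          int _ret = 0;                                           \
                                                                  \
          bch2_trans_iter_init((_trans), &(_iter), (_btree_id),   \
                               (_start), (_flags));               \
                                                                  \
          while (1) {                                             \
                  bch2_trans_begin(_trans);                       \
                  (_k) = bch2_btree_iter_peek(&(_iter));          \
                  if (!(_k).k)                                    \
                          break;                                  \
                                                                  \
                  _ret = bkey_err(_k) ?: (_do);                   \
                  if (_ret == -EINTR)                             \
                          continue;  /* restart at the same key */\
                  if (_ret ||                                     \
                      !bch2_btree_iter_advance(&(_iter)))         \
                          break;                                  \
          }                                                       \
                                                                  \
          bch2_trans_iter_exit((_trans), &(_iter));               \
          _ret;                                                   \
  })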
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When we find an extent past an inode's i_size, we need to do the
deletion in the inode's snapshot (which will emit a whiteout if
necessary); and we also need to note that we now have a key at that
position and snapshot, so that we don't go into an infinite loop.
Also, switch to walking inodes in reverse order, oldest snapshot to
newest, so that we emit the fewest whiteouts possible.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Fsck now checks for keys in different snapshot IDs that are now
redundant due to other snapshots being deleted - it needs to for its own
algorithms to not get confused.
When it detects this it should re-run the post snapshot deletion cleanup
- this patch does that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- Bunch of refactoring, and move some code out of
bch2_snapshots_start() and into bch2_snapshots_check(), for consistency
with the rest of fsck
- Interior snapshot nodes no longer point to a subvolume; this is so we
don't end up with dangling subvol references when deleting or require
scanning the full snapshots btree.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This makes the snapshots_seen data structure fsck private and improves
it; we now also track the equivalence class for each snapshot id we've
seen, which means we can detect when snapshot deletion hasn't finished
or run correctly (which will otherwise confuse fsck).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
fsck doesn't want to run while we're cleaning up deleted snapshots - if
that work needs to be done, we want it to have finished before fsck
runs, otherwise fsck will get confused when it finds multiple keys in
the same snapshot ID equivalence class (i.e. the mechanism that
snapshot deletion uses for cleaning up redundant keys).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We should never see an inode marked as unlinked that's a subvolume root
(or a directory) in fsck, but even if we do it's not correct for fsck to
delete the subvolume: subvolumes are owned by dirents, and if we find a
dangling subvolume (not marked as unlinked) we want fsck to reattach it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
snapshots_seen is becoming private to fsck, and snapshot_id_list is
actually what the data update path needs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Snapshots being deleted won't in general have a corresponding subvolume:
this fixes a spurious fsck error where we'd complain about a snapshot
pointing to a missing subvolume - but the subvolume had been deleted,
and the snapshot was pending deletion as well.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Better/more descriptive naming, and prep for adding
nested_lockrestart_do() and nested_commit_do().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
There's no need to print fsck errors for errors that are expected, and
the user has already opted to repair.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
These messages log the updates we're doing in bch2_check_fix_ptrs(),
which is useful when debugging but not usually needed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
There's no point reading an extent in order to move it if the write is
going to fail because we're shutting down. This patch changes the move
path so that moving_io now owns a ref on c->writes - as a bonus,
rebalance and copygc will now notice that we're shutting down and exit
quicker.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- add bch2_moving_ctxt_(init|exit)
- split out __bch2_evacuate_bucket() which takes an existing
moving_ctxt, this will be used for improving copygc performance by
pipelining across multiple buckets
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
move_ratelimit() now takes a bool that specifies whether we want to
wait for copygc to finish.
When copygc is running, we're probably low on free buckets; instead of
consuming the remaining buckets, we want to wait for copygc to finish.
This should help with performance and runaway bucket fragmentation.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch significantly cleans up and simplifies the data_update
interface. Instead of only being able to specify a single pointer by
device to rewrite, we're now able to specify any or all of the pointers
in the original extent to be rewritten, as a bitmask.
data_cmd is no more: the various pred functions now just return true if
the extent should be moved/updated. All the data_update path does is
rewrite existing replicas, or add new ones.
This fixes a bug with background compression on replicated
filesystems, where rebalance -> data_update would incorrectly drop the
wrong old replica, and keep trying to recompress an extent pointer,
each time failing to drop the right replica. Oops.
Now, the data update path doesn't look at the io options to decide which
pointers to keep and which to drop - it only goes off of the
data_update_options passed to it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_check_alloc_key() was failing to check buckets that didn't have
alloc keys yet (because they'd never been used) - they still need to be
added to the freespace btree.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- In check_alloc_key(), previously we were re-initializing iterators
for the need_discard and freespace btrees for every alloc key we
checked. But this was causing us to redo lookups into the journal
keys every time, since those lookups are cached in struct btree_iter.
This initializes the iterators in bch2_check_alloc_info and passes
them into check_alloc_key().
- Make the looping more consistent/efficient in bch2_check_alloc_info()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This runs before we go rw for journal replay, but after we're allowed to
go rw. It might be time to consider killing BTREE_INSERT_LAZY_RW,
though.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- invalidate_one_bucket() now returns 1 when we don't have any buckets
on this device to invalidate, ensuring we don't spin
- the tracepoint invocation is moved to after the transaction commit,
and we now include the number of cached sectors in the tracepoint
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
If a btree node is unreadable, it's the topology repair that fixes that
and it's kicked off by btree_gc, so btree_gc needs to touch every node
and verify that they can be read.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
__dev_available() now calculates available buckets correctly. Previously
it would almost always return 0 when we have cached data.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we were at the end of the node, when breaking out of the loop we'd
pop the assertion on line 446 when cur wasn't NULL.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
-o verbose is very useful, and we're starting to use it more for runtime
debug statements - making it possible to enable at runtime is a
no-brainer.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This improves the "copygc requested to run but no buckets found" to show
the device that requires copygc to be run on - we'll definitely need to
improve this more.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This is the start of reorganizing the data IO paths. The plan is to also
break apart io.c into data_read.c and data_write.c, and migrate_write
will be renamed to the data_update path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, dev_buckets_available() only counted buckets that are
eligible to be allocated right now - i.e. buckets that don't have cached
data, or need discard, or need gc gens, etc.
But most users of this function want to know how many buckets are
eligible to be allocated from without moving data around - copygc,
allocator striping, which means we should be including cached data
buckets etc.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Originally, the btree key cache code would always allocate new entries
by reusing from the recently-freed list, if that list wasn't empty. But
that behaviour was dropped, for lock contention reasons.
But it seems that entries stranded on the freed list have been
contributing to some of our oom issues, because long running btree
transactions will prevent them from being freed.
This patch re-adds allocating from the freed list, but it also adds
percpu buffers to solve the lock contention issues - and the new percpu
freed lists will improve the evict paths, too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new option, move_bytes_in_flight, for configuring the amount
of IO in flight by copygc/rebalance - users with many devices in their
filesystem will want to increase this.
In the future we should be smarter about this, but this is an easy
improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We have a hardcoded maximum on number of pointers in an extent that's
used by some other data structures - notably bch_devs_list - but we
weren't actually checking for it. Oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
If we fail to queue the work item because it's already in process, we
need to drop the ref we just took.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
If we're trying to get a ref and the refcount has been killed, it means
we're doing an emergency shutdown - we always want tryget_live().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We're seeing checksum errors in the bch2_rechecksum_bio() path - give it
a better error message to help track this down.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When inserting a key type that's not valid for a given btree, we should
print out which btree we were inserting into.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With backpointers, alloc keys have gotten bigger, so we're needing more
memory here.
We're probably going to need to go with something more sophisticated
than a bump allocator, but - let's see if we can avoid doing that just
yet.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Like the previous patch for bucket invalidates, add another counter for
a core allocator path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
b->written wasn't being reset to 0 in the btree node read retry path,
causing decrypting & validation of previously read bsets to not be
re-run - ouch.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Like bch2_do_discards(), we should check if this needs to be done when
going rw.
Also, add some sysfs code for debugging bucket invalidation.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Printbufs recently switched to using string_get_size() for printing
integers in human readable units. This updates __bch2_strtoh() to parse
numbers printed by string_get_size() - we now have to handle floating
point numbers, and new unit suffixes.
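A sketch of the new parsing, simplified - the real parser also
validates input and checks for overflow:

  static int parse_human_readable(const char *cp, u64 *res)
  {
          u64 v = 0, frac = 0, frac_div = 1;
          unsigned shift = 0;

          while (isdigit(*cp))
                  v = v * 10 + (*cp++ - '0');

          if (*cp == '.') {
                  cp++;
                  while (isdigit(*cp)) {
                          frac     = frac * 10 + (*cp++ - '0');
                          frac_div *= 10;
                  }
          }

          switch (toupper(*cp)) {
          case 'T': shift = 40; break;
          case 'G': shift = 30; break;
          case 'M': shift = 20; break;
          case 'K': shift = 10; break;
          }

          *res = (v << shift) + (frac << shift) / frac_div;
          return 0;
  }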
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_dev_freespace_init() was using __bch2_trans_do() incorrectly, and
calling bch2_bucket_do_index() with a stale alloc key.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This converts bcachefs to the modern printbuf interface/implementation,
synced with the version to be submitted upstream.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were forgetting to clear the read_in_flight flag - oops. This also
fixes it to not call bch2_fatal_error() before topology repair has had a
chance to do its thing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We had a bug where btree_and_journal_iter would return the same key
twice - after deleting it (perhaps because it was present in both the
btree and the journal?)
This reworks btree_and_journal_iter to track the current position, much
like btree_paths, which makes the logic considerably simpler and more
robust.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
cmd_list_journal wasn't correctly listing the most recent journal
entries as blacklisted - because in the recovery path when just reading
the journal, we were failing to add those to the blacklist table.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Lately we've been doing a lot of debugging by looking at the journal to
see what was changed, and by what code path. This patch adds a new
journal entry type for recording overwrites, so that we don't have to
search backwards through the journal to see what was being overwritten
in order to work out what the triggers were supposed to be doing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This takes copying the payload out of bch2_journal_add_entry(), which
means we can use it for journal_transaction_name() - also prep work for
journalling overwrites.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When do_encrypt() was passed a vmalloc address and the buffer spanned
more than a single page, we were encrypting/decrypting completely
different pages than the ones intended.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Right now, we print an error message on btree node read error, and we
print that we're retrying, but we don't explicitly say if the retry
succeeded - this makes things a little clearer.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, on every btree_iter_peek() operation we were searching the
journal keys, doing a full binary search - which was slow.
This patch fixes that by saving our position in the journal keys, so
that we only do a full binary search when moving our position backwards
or a large jump forwards.
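The rough idea, with assumed names - *idx caches where the previous
search ended:

  static size_t journal_key_search(struct journal_keys *keys,
                                   enum btree_id id, unsigned level,
                                   struct bpos pos, size_t *idx)
  {
          if (*idx > keys->nr ||
              (*idx &&
               journal_key_cmp_pos(&keys->d[*idx - 1], id, level, pos) >= 0))
                  /* moved backwards, or cache invalid: binary search */
                  *idx = journal_key_binary_search(keys, id, level, pos);
          else
                  /* small jump forwards: cheap linear scan */
                  while (*idx < keys->nr &&
                         journal_key_cmp_pos(&keys->d[*idx], id, level, pos) < 0)
                          (*idx)++;

          return *idx;
  }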
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
flush_dcache_page() is not a noop on arm, but we were using
virt_to_page() instead of vmalloc_to_page() for an address on the kernel
stack - vmalloc memory, leading to an oops in flush_dcache_page().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The only difference between key_type_logon and key_type_user is that
key_type_logon keys can't be read by userspace.
However, userspace has actually been adding keys to both the logon and
user keychains, because userspace fsck requires the keychain interface -
so we might as well just use user and drop the logon keychain.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- Drop old unneeded parameter for whether we're in initial GC - which
was from when btree updates had to be done differently before we
went RW.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Per Dave Chinner and the xfs folks, .writepage is no longer needed, and
it's better not to define it if .writepages is the intended path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Rust FFI lacks support for unnamed structs and unions. The space
saved in bch_option is not enough to be significant.
Signed-off-by: Brett Holman <bholman.devel@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is pretty expensive, and we've tested sufficiently with it now that
it doesn't need to be on by default.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When merging extents, we have to check that we won't overflow size
fields in any CRC entries - but the check for this was wrong, because in
the loop it was in we weren't keeping a pointer to the (packed, encoded)
CRC field.
Fix this by moving it to its own loop.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Bkeys have gotten a lot bigger since this code was written and now are
often formatted across multiple lines - while the reason a bkey is
invalid will still be short and fit on a single line. This patch prints
the error before the bkey, making it a bit more readable.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
journal_iters_fix() was incorrectly rewinding iterators past keys they
had already returned, leading to those keys being double counted in the
bch2_gc() path - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
btree updates before going RW are expensive if they're in random order,
since they use the list of keys for journal replay to insert, which is
just a gap buffer.
This patch improves the bucket invalidate path so that if
bch2_check_lrus() hasn't finished it only prints warnings instead of
doing an emergency shutdown, which means we can now set BCH_FS_MAY_GO_RW
before bch2_check_lrus().
Also, the filesystem state bits are reorganized a bit.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new superblock field for persisting counters
and adds a sysfs interface in counters/ exposing these counters.
The superblock field is ignored by older versions letting us avoid
an on disk version bump.
Each sysfs file outputs a counter that tracks since filesystem
creation and a counter for the current mount session.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Delete some obsolete tracepoints, organize alloc tracepoints better,
make a few tracepoints more consistent.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Can't take btree node locks while holding btree_reserve_cache_lock - it
would be nice if we could check this with lockdep.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We're seeing occasional firings of the assertion in the key cache
shutdown code that nr_dirty == 0, which means we must sometimes be doing
transaction commits after we've gone read only.
Cleanups & changes:
- BCH_FS_ALLOC_CLEAN renamed to BCH_FS_CLEAN_SHUTDOWN
- new helper bch2_btree_interior_updates_flush(), which returns true if
it had to wait
- bch2_btree_flush_writes() now also returns true if there were btree
writes in flight
- __bch2_fs_read_only now checks if btree writes were in flight in the
shutdown loop: btree write completion does a transaction update, to
update the pointer in the parent node
- assert that !BCH_FS_CLEAN_SHUTDOWN in __bch2_trans_commit
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
hash_check_key() was incorrectly handling transaction restarts - switch
it to for_each_btree_key_norestart().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
If we allocate a buffer that's a bit bigger than necessary the
transaction commit path will be much less likely to have to reallocate -
which requires a transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds bch2_btree_iter_peek_all_levels(), which returns keys from
every level of the btree - interior nodes included - in monotonically
increasing order, soon to be used by the backpointers check & repair
code.
- BTREE_ITER_ALL_LEVELS can now be passed to for_each_btree_key() to
iterate thusly, much like BTREE_ITER_SLOTS
- The existing algorithm in bch2_btree_iter_advance() doesn't work with
peek_all_levels(): we have to defer the actual advancing until the
next time we call peek, where we have the btree path traversed and
uptodate. So, we add an advanced bit to btree_iter; when
BTREE_ITER_ALL_LEVELS is set bch2_btree_iter_advanced() just marks
the iterator as advanced.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds two new helpers to btree_iter.c for changing the level of a
path up or down - to be used by the new
bch2_btree_iter_peek_all_levels().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The new backpointers code will be using bch2_btree_iter_peek_slot() on
interior nodes - this patch updates peek_slot() to make that work.
- Pass the correct level to bch2_journal_keys_peek_slot()
- We should only set BTREE_ITER_CACHED or BTREE_ITER_WITH_KEY_CACHE
when using bch2_trans_iter_init(), not bch2_trans_node_iter_init()
- Update assertions
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, btree_update_interior.c passed keys to bch2_trans_mark_*
that hadn't been fully initialized - they didn't have the key field
filled out, just the value.
With backpointers, we need to make sure keys are fully initialized
before marking them - because the backpointer points back to the
original key.
This patch tweaks the interior update paths to fix this.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
For backpointers, we'll need the full key location - that means btree_id
and btree level. This patch plumbs it through.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This improves some of our warnings and assertions - they imply possible
filesystem inconsistencies, so they should be calling
bch2_fs_inconsistent().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This option was useful when the replicas mechanism was new and still being
debugged, but hasn't been used in ages - let's delete it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
A user reported an error where we hit an assertion due to deleting a key
in an internal snapshot node, when deleting a dirent that points to a
nonexisting inode.
We try to avoid doing updates to keys for internal snapshot nodes, but
upon inspection of the places where we remove dirents in fsck it appears
BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE is correct for all of them: either
the target dirent doesn't exist, or it's a directory with multiple
dirents pointing to it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In journal replay, we weren't immediately dropping journal pins when we
start doing updates that weren't from journal replay - leading to
journal reclaim getting stuck.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds error logging to a bunch of functions in fsck.c - in fsck,
redundant error messages are probably better than not enough.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
If the dirent an inode points to doesn't exist, we shouldn't be
returning an error - just 0/false.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When we detect a filesystem inconsistency, we should include the
relevant keys in the error message. This patch adds a parameter to pass
the key with the lru entry to bch2_lru_delete(), so that it can be
printed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When many journal replay keys have been overwritten,
bch2_journal_keys_peek() was taking excessively long to scan before it
found a key to return.
Fix this by introducing bch2_journal_keys_peek_upto() which takes a
parameter for the end of the range we want, so that we can terminate the
search much sooner, and replace all uses of bch2_journal_keys_peek()
with peek_upto() or peek_slot().
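A sketch of the bounded lookup (simplified):

  struct bkey_i *bch2_journal_keys_peek_upto(struct bch_fs *c,
                                             enum btree_id btree_id,
                                             unsigned level,
                                             struct bpos pos,
                                             struct bpos end_pos)
  {
          struct journal_keys *keys = &c->journal_keys;
          size_t idx = journal_key_search(keys, btree_id, level, pos);

          while (idx < keys->nr &&
                 keys->d[idx].btree_id == btree_id &&
                 keys->d[idx].level == level &&
                 bpos_cmp(keys->d[idx].k->k.p, end_pos) <= 0) {
                  if (!keys->d[idx].overwritten)
                          return keys->d[idx].k;
                  idx++;          /* skip overwritten keys */
          }

          return NULL;
  }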
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Error messages should always print out the full key when available -
this gives us a starting point when looking through the journal to debug
what went wrong.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
It's an error if a bucket is in state BCH_DATA_cached but not on the LRU
btree - i.e. io_time[READ] == 0 - so make sure it's set before adding
it.
Also, make some of the LRU code a bit clearer and more direct.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch updates bch2_open_buckets_to_text() to include the device and
bucket the open_bucket owns.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In journal_entry_add(), we were repeatedly scanning the journal entries
radix tree for old entries that could be freed, with O(n^2)
behaviour. This patch tweaks things to remember the previous last_seq,
so we don't have to scan for entries to free from the start.
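The shape of the fix, as a toy userspace model (not the actual radix
tree code):

  #include <stdint.h>
  #include <stdlib.h>

  struct entries {
          void     **e;           /* one slot per sequence number */
          uint64_t nr;
          uint64_t pruned_upto;   /* every seq below this is already freed */
  };

  static void prune_old(struct entries *j, uint64_t last_seq)
  {
          /* resume from the previous high-water mark instead of seq 0, so
           * total work across all calls is O(n) rather than O(n^2) */
          uint64_t seq;

          for (seq = j->pruned_upto; seq < last_seq && seq < j->nr; seq++) {
                  free(j->e[seq]);
                  j->e[seq] = NULL;
          }

          if (last_seq > j->pruned_upto)
                  j->pruned_upto = last_seq;
  }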
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We start doing allocations before the GC thread is created, which means
we need to check for that to avoid a null ptr deref.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We now pass a rw argument to .key_invalid methods so they can trigger
assertions for updates but not on existing keys. We shouldn't trigger
these extra assertions in journal replay - this patch changes the
transaction commit path accordingly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- We weren't clearing the LRU btree
- bch2_alloc_read() runs before bch2_check_alloc_key() deletes alloc
keys for devices/buckets that don't exist, so it needs to check for
that
- bch2_check_lrus() needs to check that buckets exist
- improve some error messages
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
With backpointers this doesn't work anymore - backpointers always need
to be updated to point to the new extent position.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We need to ensure that work structs in bch_fs always get initialized -
otherwise an error in filesystem initialization can pop a warning in the
workqueue code when we try to cancel a work struct that wasn't
initialized.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, the journal read path used a linked list for storing the
journal entries we read from disk. But there's been a bug that's been
causing journal_flush_delay to incorrectly be set to 0, leading to far
more journal entries than is normal being written out, which then means
filesystems are no longer able to start due to the O(n^2) behaviour of
inserting into/searching that linked list.
Fix this by switching to a radix tree.
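A hedged sketch of the new shape, using the kernel's genradix API (field
and function names here are illustrative, not the real ones):

  #include <linux/generic-radix-tree.h>
  #include <linux/gfp.h>

  struct journal_entries {
          GENRADIX(struct journal_replay *) data;    /* indexed by seq */
          u64 first_seq;                             /* seq at index 0 */
  };

  static struct journal_replay **journal_entry_slot(struct journal_entries *j,
                                                    u64 seq)
  {
          /* cheap indexed walk, allocating intermediate nodes as needed -
           * no linear list scan to find the insert position */
          return genradix_ptr_alloc(&j->data, seq - j->first_seq, GFP_KERNEL);
  }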
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When there weren't any keys in the journal there's no need to allocate
the buffer - and doing so anyways caused a spurious -ENOMEM.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, we were missing accounting for buckets in need_gc_gens and
need_discard states. This matters because buckets in those states need
other btree operations done before they can be used, so they can't be
counted when checking the current number of free buckets against the
allocation watermark.
Also, we weren't directly counting free buckets at all. Now, data type 0
== BCH_DATA_free, and free buckets are counted; this means we can get
rid of the separate (poorly defined) count of unavailable buckets.
This is a new on-disk format version, with upgrade and fsck required for
the accounting changes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We're currently debugging an issue with discards not getting run; this
patch adds a manual trigger so we can then watch the tracepoint while it
runs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- We were failing to start topology repair, because we hadn't set the
superblock flag indicating it needed to run
- set_node_min() forgot to update the btree node's key
- bch2_gc_alloc_reset() didn't reset data type, leading to inserting an
invalid key that was empty but had nonzero data type
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In the future printbufs will be mempool-ified, so we shouldn't be using
more than one at a time if we don't have to.
This also fixes an extra trailing newline.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We've been seeing this error in fsck and we weren't able to track down
where it came from - but now that .key_invalid methods take a rw
argument, we can safely check for this.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In check_extents() and check_dirents(), we're working towards only
handling transaction restarts in one place, at the top level - but we're
not there yet. check_i_sectors() and check_subdir_count() handle
transaction restarts locally, which means the iterator for the
dirent/extent is left unlocked (should_be_locked == 0), leading to
asserts popping when we go to do updates.
This patch hacks around this for now, until we can delete the offending
code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds a new parameter to .key_invalid() methods for whether the key
is being read or written; the idea being that methods can do more
aggressive checks when a key is newly created and being written, when we
wouldn't want to delete the key because of those checks.
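To illustrate, with made-up types - this is not the real method
signature, just the pattern:

  enum key_rw { KEY_READ, KEY_WRITE };

  struct my_key { unsigned size; unsigned deprecated_field; };

  /* returns an error string, or NULL if the key is valid */
  static const char *my_key_invalid(const struct my_key *k, enum key_rw rw)
  {
          if (!k->size)
                  return "zero size";             /* invalid in any context */

          /* strict check: reject when creating new keys, but tolerate
           * (rather than delete) existing keys on disk that trip it */
          if (rw == KEY_WRITE && k->deprecated_field)
                  return "deprecated field set";

          return NULL;
  }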
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- Move checks for whether the device & bucket are valid from the
.key_invalid method to bch2_check_alloc_key(). This is because
.key_invalid() is called on keys that may no longer exist (post
journal replay), which is a problem when removing/resizing devices.
- We weren't checking the need_discard btree to ensure that every set
bucket has a corresponding alloc key. This refactors the code for
checking the freespace btree, so that it now checks both.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Btree updates before we go RW work by inserting into the array of keys
that journal replay will insert - but inserting into a flat array is
O(n), meaning if btree_gc needs to update many alloc keys, we're O(n^2).
Fortunately, the updates btree_gc does happen in sequential order,
which means a gap buffer works nicely here - this patch implements a gap
buffer for journal keys.
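A minimal sketch of the gap buffer idea (userspace, simplified down to
bare u64 keys): sequential inserts leave the gap where it already is, so
each insert is amortized O(1).

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  struct gap_buf {
          uint64_t *d;        /* backing array */
          size_t   gap_pos;   /* where the gap starts */
          size_t   gap_len;   /* free slots in the gap */
  };

  static void gap_move(struct gap_buf *b, size_t pos)
  {
          if (pos < b->gap_pos)
                  memmove(b->d + pos + b->gap_len, b->d + pos,
                          (b->gap_pos - pos) * sizeof(*b->d));
          else if (pos > b->gap_pos)
                  memmove(b->d + b->gap_pos, b->d + b->gap_pos + b->gap_len,
                          (pos - b->gap_pos) * sizeof(*b->d));
          b->gap_pos = pos;
  }

  static int gap_insert(struct gap_buf *b, size_t pos, uint64_t k)
  {
          if (!b->gap_len)
                  return -1;      /* caller reallocs and opens a new gap */

          gap_move(b, pos);       /* nearly free when inserts are sequential */
          b->d[b->gap_pos++] = k;
          b->gap_len--;
          return 0;
  }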
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This behavior dates from the early, early days of bcache, and upon
further delving appears to not make any sense. The shrinker only works
in terms of 'objects' of unknown size; normalizing to pages only had the
effect of changing the batch size, which we could do directly if we
wanted to - and we probably don't. Normalizing to pages meant our batch size was
very small, which seems to have been keeping us from doing as much
shrinking as we should be under heavy memory pressure; this patch
appears to alleviate some OOMs we've been seeing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
mark_stripe_bucket() was busted; it was using @new uninitialized.
Also, clean up all the gc mark functions, and convert them to the same
style.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This neatly avoids bugs where we fail partway through initializing a new
filesystem, if we just don't write out partly-initialized state.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
With printbufs, it's now easy to build up multi-line log messages and
emit them with one call, which is good because it prevents multiple
multi-line log messages from getting interspersed in the log buffer;
this patch also improves the formatting and converts it to latest style.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This switches struct bucket to using a lock, instead of cmpxchg. And now
that the protected members no longer need to fit into a u64, we can
expand the sector counts to 32 bits.
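Roughly, the before and after (kernel-style sketch; field names and
widths are approximate, and the real structs carry more state):

  #include <linux/spinlock.h>
  #include <linux/types.h>

  struct bucket_old {                /* everything packed into one u64... */
          u8      gen;
          u8      data_type;
          u16     dirty_sectors;
          u16     cached_sectors;
  };                                 /* ...so it can be updated via cmpxchg() */

  struct bucket_new {
          spinlock_t lock;           /* serializes updates instead of cmpxchg */
          u8      gen;
          u8      data_type;
          u32     dirty_sectors;     /* no longer squeezed into 16 bits */
          u32     cached_sectors;
  };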
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
All code using the in-memory bucket array, excluding GC, has now been
converted to use the alloc btree directly - so we can finally delete it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is one of the last steps in getting rid of the main in-memory
bucket array.
This changes bch2_dev_usage_update() to take bkey_alloc_unpacked instead
of bucket_mark, and for the places where we are in fact working with
bucket_mark and don't have bkey_alloc_unpacked, we add a wrapper that
takes bucket_mark and converts to bkey_alloc_unpacked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the old allocator code, preparing an existing empty bucket was part
of the same code path that invalidated buckets containing cached data.
In the new allocator code this is no longer the case: the main allocator
path finds empty buckets (via the new freespace btree), and can't
allocate buckets that contain cached data.
We now need a separate code path to invalidate buckets containing cached
data when we're low on empty buckets, which this patch implements. When
the number of free buckets drops too low, the new invalidate path runs:
it uses the LRU btree to pick cached data buckets to invalidate until
we're back above our watermark.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the old allocator code, buckets would be discarded just prior to
being used - this made sense in bcache where we were discarding buckets
just after invalidating the cached data they contain, but in a
filesystem where we typically have more free space we want to be
discarding buckets when they become empty.
This patch implements the new behaviour - it checks the need_discard
btree for buckets awaiting discards, and then clears the appropriate
bit in the alloc btree, which moves the buckets to the freespace btree.
Additionally, discards are now enabled by default.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have new persistent data structures for the allocator, this
patch converts the allocator to use them.
Now, foreground bucket allocation uses the freespace btree to find
buckets to allocate, instead of popping buckets off the freelist.
The background allocator threads are no longer needed and are deleted,
as well as the allocator freelists. Now we only need background tasks
for invalidating buckets containing cached data (when we are low on
empty buckets), and for issuing discards.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds two new btrees for the upcoming allocator rewrite: an extents
btree of free buckets, and a btree for buckets awaiting discards.
We also add a new trigger for alloc keys to keep the new btrees up to
date, and a compatibility path to initialize them on existing
filesystems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces a new alloc key which doesn't use varints. Soon we'll be
adding backpointers and storing them in alloc keys, which means our
pack/unpack workflow for alloc keys won't really work - we'll need to be
mutating alloc keys in place.
Instead of bch2_alloc_unpack(), we now have bch2_alloc_to_v4() that
converts older types of alloc keys to v4 if needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This implements new persistent LRUs, to be used for buckets containing
cached data, as well as stripes ordered by time when a block became
empty.
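Conceptually - this is not the on-disk format, just the shape of the
mapping - an LRU btree maps (lru_id, time) to a bucket, so the least
recently used entry is simply the first key in an lru_id's range, and
updating an entry's position is a delete at the old time plus an insert
at the new time; no in-memory LRU list is required:

  #include <stdint.h>

  struct lru_key {
          uint64_t lru_id;    /* which LRU, e.g. one per device */
          uint64_t time;      /* when the entry was last used/emptied */
  };

  struct lru_val {
          uint64_t bucket;    /* what the entry refers to */
  };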
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add a new superblock field which represents journal buckets as ranges:
also move code for the superblock journal fields to journal_sb.c.
This also reworks the code for resizing the journal to write the new
superblock before using the new journal buckets, and thus be a bit
safer.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In the write path, after the write to the block device(s) complete we
have to punt to process context to do the btree update.
Instead of using the work item embedded in op->cl, this patch switches
to a per write-point work item. This helps with two different issues:
- lock contention: btree updates to the same writepoint will (usually)
be updating the same alloc keys
- context switch overhead: when we're bottlenecked on btree updates,
having a thread (running out of a work item) checking the write point
for completed ops is cheaper than queueing up a new work item and
waking up a kworker.
In an arbitrary benchmark, 4k random writes with fio running inside a
VM, this patch resulted in a 10% improvement in total iops.
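The structural change, sketched with illustrative fields (not the real
write-point struct):

  #include <linux/llist.h>
  #include <linux/workqueue.h>

  struct write_point_sketch {
          struct llist_head  completed;          /* ops awaiting btree update */
          struct work_struct index_update_work;  /* one per write point */
  };

  static void index_update_fn(struct work_struct *work)
  {
          struct write_point_sketch *wp =
                  container_of(work, struct write_point_sketch,
                               index_update_work);
          struct llist_node *n = llist_del_all(&wp->completed);

          /* drain every completed op in one wakeup, one context switch */
          for (; n; n = n->next)
                  /* do the btree update for the op embedding n */;
  }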
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This simplifies the logic in bch2_btree_update_start() a bit, handling
the unlock/block logic more locally.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since journal reclaim -> btree key cache flushing may require the
allocation of new btree nodes, it has an implicit dependency on copygc
in order to make forward progress - so we should avoid blocking copygc
unless the journal is really close to full.
This introduces watermarks to replace our single MAY_GET_UNRESERVED bit
in the journal, and adds a watermark for copygc and plumbs it through.
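A hedged sketch of what a watermark check looks like - names and
thresholds here are illustrative, not the real journal accounting:

  enum journal_watermark {
          JOURNAL_WATERMARK_any,       /* normal foreground writes */
          JOURNAL_WATERMARK_copygc,    /* may dip further into reserve */
          JOURNAL_WATERMARK_reserved,  /* journal reclaim itself */
  };

  static bool journal_res_available(unsigned space_remaining,
                                    enum journal_watermark w)
  {
          static const unsigned min_space[] = {
                  [JOURNAL_WATERMARK_any]      = 512,  /* made-up numbers */
                  [JOURNAL_WATERMARK_copygc]   = 128,
                  [JOURNAL_WATERMARK_reserved] = 0,
          };

          /* copygc keeps making progress until the journal is nearly
           * full, while ordinary writes back off much sooner */
          return space_remaining > min_space[w];
  }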
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We don't actually want copygc allocations to be nowait - an allocation
for copygc might fail and then later succeed due to a bucket needing to
wait on journal commit, or to be discarded.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When bch2_journal_pin_set() is updating an existing pin, we shouldn't
call bch2_journal_reclaim_fast() after dropping the old pin and before
adding the new pin - that could reclaim the entry we're trying to pin.
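A toy model of the ordering fix (one refcount per entry - nothing like
the real data structures, but it shows why the order matters):

  #include <stdint.h>

  #define MAX_SEQ 64

  struct journal_model {
          unsigned pin_count[MAX_SEQ];
          uint64_t front;    /* oldest entry not yet reclaimed */
  };

  static void reclaim_fast(struct journal_model *j)
  {
          while (j->front < MAX_SEQ && !j->pin_count[j->front])
                  j->front++;    /* advance past fully-unpinned entries */
  }

  static void pin_move(struct journal_model *j, uint64_t old_seq,
                       uint64_t new_seq)
  {
          j->pin_count[new_seq]++;   /* pin the new entry first */
          j->pin_count[old_seq]--;   /* now the old pin can go */
          reclaim_fast(j);           /* reclaim only after both steps -
                                      * between the two, new_seq would
                                      * have been unprotected */
  }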
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
For backpointers, we'll need to delete old backpointers before adding
new backpointers - otherwise we'll run into spurious duplicate
backpointer errors.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
For backpointers, we need to switch the order triggers are run in: we
need to run triggers for deletions/overwrites before triggers for
inserts.
To avoid breaking the reflink triggers, this patch moves deleting of
indirect extents with refcount=0 to their triggers, instead of doing it
when we update those keys.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Print bucket:offset when the filesystem is online; this makes debugging
easier when correlating with alloc updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add a new helper for logging messages to the journal - a new debugging
tool, an alternative to trace_printk().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This fixes a bug where __bch2_btree_node_update_key() wasn't clearing
should_be_locked, leading to bch2_btree_path_traverse() always failing -
all callers of btree_path_make_mut() want should_be_locked cleared, so
do it there.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_btree_iter_next_node() was mucking with other btree_path state
without setting path->uptodate to be consistent with the fact that the
path is very much no longer uptodate - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_journal_space_available() -> bch2_journal_halt() self-deadlocks on
the journal lock; work around this by dropping/retaking the journal lock before
we call bch2_fatal_error().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
When deleting an entry from a heap that was at entry h->used - 1, we'd
end up calling heap_sift() on an entry outside the heap - the entry we
just removed - which would end up re-adding it to the heap and deleting
something we didn't want to delete. Oops...
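The fixed delete, sketched as a minimal userspace min-heap:

  #include <stddef.h>

  struct heap { int *data; size_t used; };

  static void heap_swap(struct heap *h, size_t a, size_t b)
  {
          int t = h->data[a]; h->data[a] = h->data[b]; h->data[b] = t;
  }

  static void heap_sift_down(struct heap *h, size_t i)
  {
          for (;;) {
                  size_t l = 2 * i + 1, r = l + 1, min = i;

                  if (l < h->used && h->data[l] < h->data[min]) min = l;
                  if (r < h->used && h->data[r] < h->data[min]) min = r;
                  if (min == i) break;
                  heap_swap(h, i, min);
                  i = min;
          }
  }

  static void heap_sift_up(struct heap *h, size_t i)
  {
          while (i && h->data[(i - 1) / 2] > h->data[i]) {
                  heap_swap(h, i, (i - 1) / 2);
                  i = (i - 1) / 2;
          }
  }

  static void heap_del(struct heap *h, size_t i)
  {
          h->data[i] = h->data[--h->used];  /* last element fills the hole */

          /* The fix: only sift if the hole is still inside the heap.
           * When i == h->used we just removed the last entry; sifting at
           * that index would operate on - and effectively re-add - the
           * element we just deleted. */
          if (i < h->used) {
                  heap_sift_down(h, i);
                  heap_sift_up(h, i);
          }
  }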
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We've been seeing a very strange bug where journal flush & reclaim delay
end up getting inexplicably zeroed in the superblock. We're now
validating all the options in bch2_validate_super(), and 0 is no longer
a valid value for those options, but we need to be careful not to
prevent people's filesystems from mounting because of the new
validation.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Something funny is going on with the new code for restoring the journal
write point, and it's hard to reproduce.
We do want to debug this because resuming writing to the journal in the
wrong spot could be something serious. For now, replace the assertion
with an error message and revert to old behaviour when it happens.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We're seeing a very strange bug where journal_flush_delay sometimes gets
set to 0 in the superblock. Together with the preceding patch, this
should help us track it down.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This moves validation of superblock options to bch2_sb_validate(), so
they'll be checked in the write path as well.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that we've got strings for metadata versions, this changes
bch2_sb_to_text() and our mount log message to use them.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, we'd go into an infinite loop when attempting to cache a
bkey in the key cache larger than 128 u64s - since we were only using a
u8 for the size field, it'd get rounded up to 256 then truncated to 0.
Oops.
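A toy demonstration of the truncation (not the key cache code itself):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          unsigned need = 130;       /* key size in u64s, just over 128 */
          unsigned rounded = 1;

          while (rounded < need)
                  rounded <<= 1;     /* round up: 130 -> 256 */

          uint8_t stored = (uint8_t)rounded;   /* u8 size field: 256 -> 0 */

          printf("need %u, allocated %u\n", need, stored);
          /* stored (0) < need (130), so the caller tries again and gets
           * the same result - the infinite loop fixed here by widening
           * the size field */
          return 0;
  }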
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>