IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
clang had a few more warnings about enum conversion, and also didn't
like the opts.c initializer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This refactoring centralizes defining per-btree properties.
bch2_key_types_allowed was also about to overflow a u32, so expand that
to a u64.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds bch2_run_explicit_recovery_pass(), for rewinding recovery and
explicitly running a specific recovery pass - this is a more general
replacement for how we were running topology repair before.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This introduces bch2_run_explicit_recovery_pass() and uses it for when
fsck detects that we need to re-run dead snaphots cleanup, and makes
dead snapshot cleanup more like a normal recovery pass.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The write buffer mechanism journals keys twice in certain
situations. A key is always journaled on write buffer insertion, and
is potentially journaled again if a write buffer flush falls into
either of the slow btree insert paths. This has shown to cause
journal recovery ordering problems in the event of an untimely
crash.
For example, consider if a key is inserted into index 0 of a write
buffer, the active write buffer switches to index 1, the key is
deleted in index 1, and then index 0 is flushed. If the original key
is rejournaled in the btree update from the index 0 flush, the (now
deleted) key is journaled in a seq buffer ahead of the latest
version of key (which was journaled when the key was deleted in
index 1). If the fs crashes while this is still observable in the
log, recovery sees the key from the btree update after the delete
key from the write buffer insert, which is the incorrect order. This
problem is occasionally reproduced by generic/388 and generally
manifests as one or more backpointer entry inconsistencies.
To avoid this problem, never rejournal write buffered key updates to
the associated btree. Instead, use prejournaled key updates to pass
the journal seq of the write buffer insert down to the btree insert,
which updates the btree leaf pin to reflect the seq of the key.
Note that tracking the seq is required instead of just using
NOJOURNAL here because otherwise we lose protection of the write
buffer pin when the buffer is flushed, which means the key can fall
off the tail of the on-disk journal before the btree leaf is flushed
and lead to similar recovery inconsistencies.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Introduce support for prejournaled key updates. This allows a
transaction to commit an update for a key that already exists (and
is pinned) in the journal. This is required for btree write buffer
updates as the current scheme of journaling both on write buffer
insertion and write buffer (slow path) flush is unsafe in certain
crash recovery scenarios.
Create a small trans update wrapper to pass along the seq where the
key resides into the btree_insert_entry. From there, trans commit
passes the seq into the btree insert path where it is used to manage
the journal pin for the associated btree leaf.
Note that this patch only introduces the underlying mechanism and
otherwise includes no functional changes.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There is only one other caller so eliminate some boilerplate.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is in preparation to support prejournaled keys. We want the
ability to optionally pass a seq stored in the btree update rather
than the seq of the committing transaction.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Brian has been playing with bcachefs for several months now and has
offerred to commit time to patch review.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We commonly use no_data_io mode when debugging filesystem metadata
dumps, where data checksum/compression errors are expected and
unimportant - this patch suppresses these.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We currently don't track whether snapshot cleanup still needs to finish
(aside from running a full fsck), so it shouldn't be a fsck error yet -
fsck -n after fsck has succesfully completed shouldn't error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Delete a redundant bch2_snapshot_is_ancestor() check, and convert some
assertions to debug assertions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Make the overlapping extent check/repair code more self contained.
This is prep work for hopefully reducing key_visible_in_snapshot() usage
here as well, and also includes a nice performance optimization to not
check ref_visible2() unless the extents potentially overlap.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This changes the main part of check_extents(), that checks the extent
against the corresponding inode, to not use key_visible_in_snapshot().
key_visible_in_snapshot() has to iterate over the list of ancestor
overwrites repeatedly calling bch2_snapshot_is_ancestor(), so this is a
significant performance improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More prep work for reducing key_visible_in_snapshot() usage - this
rearranges how KEY_TYPE_whitout keys are handled, so that they can be
marked off in inode_warker->inode->seen_this_pos.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We only want to synthesize an inode for the current snapshot ID for non
whiteouts - this refactoring lets us call walk_inode() earlier and clean
up some control flow.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Minor refactoring/dead code deletion, prep work for reworking
check_extent() to avoid key_visible_in_snapshot().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This improves the repair path for overlapping extents - we now verify
that we find in the btree the overlapping extents that the algorithm
detected, and fail the fsck run with a more useful error if it doesn't
match.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Prep work for changing check_extent() to avoid
key_visible_in_snapshot() - this adds the state to track whether an
inode has seen an extent at this pos.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Further optimization for bch2_snapshot_is_ancestor(). We add a small
inline bitmap to snapshot_t, which indicates which of the next 128
snapshot IDs are ancestors of the current id - eliminating the last few
iterations of the loop in bch2_snapshot_is_ancestor().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Mark these caches as reclaimable, so that available memory is correctly
reported when there is a lot of cached inodes.
Note that more work is needed - you should add __GFP_RECLAIMABLE to some
of the kmalloc calls, so that they are allocated from the "kmalloc-rcl-*"
caches.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This allows including a compression level when specifying a compression
type, e.g.
compression=zstd:15
Values from 1 through 15 indicate compression levels, 0 or unspecified
indicates the default.
For LZ4, values 3-15 specify that the HC algorithm should be used.
Note that for compatibility, extents themselves only include the
compression type, not the compression level. This means that specifying
the same compression algorithm but different compression levels for the
compression and background_compression options will have no effect.
XXX: perhaps we could add a warning for this
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Before, it was parsed as a bool but internally it was really an enum:
this lets us pass in all the possible values.
But we special case the option parsing: no supplied value is parsed as
FSCK_FIX_yes, to match the previous behaviour.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This switches the generic radix tree for the in-memory table of snapshot
nodes to a simple rcu array. This means we have to add new locking to
deal with reallocations, but is faster than traversing the radix tree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In userspace, we want to be able to switch to buffered IO when we're
dealing with an image on a filesystem/device that doesn't support the
blocksize the filesystem was formatted with.
This plumbs through !opts.direct_io -> FMODE_BUFFERED, which will be
supported by the shim version of blkdev_get_by_path() in -tools, and it
adds a fallback to disable direct IO and retry for userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, fallocate would only check the state of the extents btree
when determining if we need to create a reservation.
But the page cache might already have dirty data or a disk reservation.
This changes __bchfs_fallocate() to call bch2_seek_pagecache_hole() to
check for this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With "bcachefs: Snapshot depth, skiplist fields", we now can't run data
move operations until after bch2_check_snapshots() is complete.
Ideally we'd have the copygc (and rebalance) threads wait until
c->curr_recovery_pass has advanced, but the waitlist handling is tricky
- so for now, move starting copygc back to read_write_late().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fixes
./include/linux/stddef.h:8:14: error: positional initialization of field in ‘struct’ declared with ‘designated_init’ attribute [-Werror=designated-init]
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This extents KEY_TYPE_snapshot to include some new fields:
- depth, to indicate depth of this particular node from the root
- skip[3], skiplist entries for quickly walking back up to the root
These are to improve bch2_snapshot_is_ancestor(), making it O(ln(n))
instead of O(n) in the snapshot tree depth.
Skiplist nodes are picked at random from the set of ancestor nodes, not
some fixed fraction.
This introduces bcachefs_metadata_version 1.1, snapshot_skiplists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we've got forward compatibility sorted out, we should be doing
more frequent version upgrades in the future.
To avoid having to run a full fsck for every version upgrade, this
improves the BCH_METADATA_VERSIONS() table to explicitly specify a
bitmask of recovery passes to run when upgrading to or past a given
version.
This means we can also delete PASS_UPGRADE().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>