IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
fs-io.c is too big - time for some reorganization
- fs-dio.c: direct io
- fs-pagecache.c: pagecache data structures (bch_folio), utility code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Awhile back, we changed bkey_format generation to ensure that the packed
representation could never represent fields larger than the unpacked
representation.
This was to ensure that bkey_packed_successor() always gave a sensible
result, but in the current code bkey_packed_successor() is only used in
a debug assertion - not for anything important.
This kills the requirement that we've gotten rid of those weird bkey
formats, and instead changes the assertion to check if we're dealing
with an old weird bkey format.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With "bcachefs: Snapshot depth, skiplist fields", we now can't run data
move operations until after bch2_check_snapshots() is complete.
Ideally we'd have the copygc (and rebalance) threads wait until
c->curr_recovery_pass has advanced, but the waitlist handling is tricky
- so for now, move starting copygc back to read_write_late().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have distinct error codes for different memory allocation
failures, the early init log messages are no longer needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to allow filesystems with metadata from newer versions to be
mountable and usable by older versions.
This patch enables us to roll out new btrees without a new major version
number; we can now handle btree roots for unknown btree types.
The unknown btree roots will be retained, and fsck (including
backpointers) will check them, the same as other btree types.
We add a dynamic array for the extra, unknown btree roots, in addition
to the fixed size btree root array, and add new helpers for looking up
btree roots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A crash immediately after device removal can result in an
unmountable filesystem due to recovery failure. The following
command reliably reproduces on a multi-device fs:
bcachefs device remove <dev> && xfs_io -xc shutdown <mnt>
The post-crash mount fails with an error similar to the following,
reported by fsck:
invalid journal entry dev_usage at offset 7994/8034 seq 12: bad dev, fixing
This refers to a device usage entry in the journal that refers to
the index of the just removed device. Recovery considers this an
invalid entry and fails to proceed.
Device usage entries are added to journal buffer writes via
bch_journal_write() -> bch2_journal_super_entries_add_common(),
which means any journal buffer write has content that refers to
member devices at the time of the journal write.
The device remove sequence already removes metadata references to
the device being removed. It then flushes any pins that refer to the
device, clears replica entries, removes the in-memory device object
and lastly updates the superblock to reflect that the device is no
longer present. The problem is that any journal writes that occur
during this sequence will include a dev usage entry so long as the
device is present. To avoid this problem, we can flush the journal
once more after the device entry is removed from the in-core
structures, but before the superblock is updated to fully remove the
device on-disk.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a null ptr deref in bch2_free_pending_node_rewrites() when
the list head wasn't initialized.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
As with previous conversions, replace -ENOENT uses with more informative
private error codes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a device has keys in the bucket_gens btree associated with its
buckets and is removed from a bcachefs volume, fsck will complain
about the presence of keys associated with an invalid device index.
A repair removes the associated keys and restores correctness.
Update bch2_dev_remove_alloc() to remove device related keys at
device removal time to avoid the problem.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Like in the recovery, and device add, we have to check if devices don't
have the freespace btree initialized - this was missed in the device hot
add path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A workqueue resource deadlock has been observed when running fsck
on a filesystem with a full/stuck journal. fsck is not currently
able to repair the fs due to fairly rapid emergency shutdown, but
rather than exit gracefully the fsck process hangs during the
shutdown sequence. Fortunately this is easily recoverable from
userspace, but the root cause involves code shared between the
kernel and userspace and so should be addressed.
The deadlock scenario involves the main task in the bch2_fs_stop()
-> bch2_fs_read_only() path waiting on write references to drain
with the fs state lock held. A bch2_read_only_work() workqueue task
is scheduled on the system_long_wq, blocked on the state lock.
Finally, various other write ref holding workqueue tasks are
scheduled to run on the same workqueue and must complete in order to
release references that the initial task is waiting on.
To avoid this problem, we can split the dependent workqueue tasks
across different workqueues. It's a bit of a waste to create a
dedicated wq for the read-only worker, but there are several tasks
throughout the fs that follow the pattern of acquiring a write
reference and then scheduling to the system wq. Use a local wq
for such tasks to break the subtle dependency between these and the
read-only worker.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a bug in bch2_evict_subvolume_inodes(): d_mark_dontcache()
doesn't handle the case where i_count is already 0, we need to grab and
put the inode in order for it to be dropped.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This implements a new shutdown path for erasure coding, which is needed
for the upcoming BCH_WRITE_WAIT_FOR_EC write path.
The process is:
- Cancel new stripes being built up
- Close out/cancel open buckets on write points or the partial list
that are for stripes
- Shutdown rebalance/copygc
- Then wait for in flight new stripes to finish
With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill
up before they complete; the new ec shutdown path is needed for shutting
down copygc/rebalance without deadlocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This also adds bch2_write_op_to_text(): now we can see outstand moves,
useful for debugging shutdown with the upcoming BCH_WRITE_WAIT_FOR_EC
and likely for other things in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have much more efficient updates to the LRU btree, this
patch adds a new LRU that indexes buckets by fragmentation.
This means copygc no longer has to scan every bucket to find buckets
that need to be evacuated.
Changes:
- A new field in bch_alloc_v4, fragmentation_lru - this corresponds to
the bucket's position in the fragmentation LRU. We add a new field
for this instead of calculating it as needed because we may make the
fragmentation LRU optional; this field indicates whether a bucket is
on the fragmentation LRU.
Also, zoned devices will introduce variable bucket sizes; explicitly
recording the LRU position will be safer for them.
- A new copygc path for using the fragmentation LRU instead of
scanning every bucket and building up an in-memory heap.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This improves the nocow lock table so that hash table entries have
multiple locks, and locks specify which bucket they're for - i.e. we can
now resolve hash collisions.
This is important because the allocator has to skip buckets that are
locked in the nocow lock table, and previously hash collisions would
cause it to spuriously skip unlocked buckets.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With the new backpointer based copygc we don't need an explicit copygc
reserve, we're always evacuating buckets one at a time - so this is no
longer needed, and in fact removing it fixes a deadlock in
bch2_dev_allocator_remove().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds backpointers: we now have a reverse index from device
and offset on that device (specifically, offset within a bucket) back to
btree nodes and (non cached) data extents.
The first 40 backpointers within a bucket are stored in the alloc key;
after that backpointers spill over to the next backpointers btree. This
is to help avoid performance regressions from additional btree updates
on large streaming workloads.
This patch adds all the code for creating, checking and repairing
backpointers. The next patch in the series is going to use backpointers
for copygc - finally getting rid of the need to scan all extents to do
copygc.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new method of doing btree updates - a straight write buffer,
implemented as a flat fixed size array.
This is only useful when we don't need to read from the btree in order
to do the update, and when reading is infrequent - perfect for the LRU
btree.
This will make LRU btree updates fast enough that we'll be able to use
it for persistently indexing buckets by fragmentation, which will be a
massive boost to copygc performance.
Changes:
- A new btree_insert_type enum, for btree_insert_entries. Specifies
btree, btree key cache, or btree write buffer.
- bch2_trans_update_buffered(): updates via the btree write buffer
don't need a btree path, so we need a new update path.
- Transaction commit path changes:
The update to the btree write buffer both mutates global, and can
fail if there isn't currently room. Therefore we do all write buffer
updates in the transaction all at once, and also if it fails we have
to revert filesystem usage counter changes.
If there isn't room we flush the write buffer in the transaction
commit error path and retry.
- A new persistent option, for specifying the number of entries in the
write buffer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In the distant past, it wasn't possible to start copygc until after
journal replay had finished. Now, the btree iterator code overlays keys
from the journal, so there's no reason not to start it earlier - and it
solves a rare deadlock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a debug mode where we split up the c->writes refcount into
distinct refcounts for every codepath that takes a reference, and adds
sysfs code to print the value of each ref.
This will make it easier to debug shutdown hangs due to refcount leaks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We shouldn't be overloading standard error codes now that we have
provisions for bcachefs-specific errorcodes: this patch converts super.c
and super-io.c to per error site errcodes, with a bit of cleanup.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
On startup, we need to ensure the first journal entry written is a flush
write: after a clean shutdown we generally don't read the journal, which
means we might be overwriting whatever was there previously, and there
must always be at least one flush entry in the journal or recovery will
fail.
Found by fstests generic/388.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
checkpatch.pl gives lots of warnings that we don't want - suggested
ignore list:
ASSIGN_IN_IF
UNSPECIFIED_INT - bcachefs coding style prefers single token type names
NEW_TYPEDEFS - typedefs are occasionally good
FUNCTION_ARGUMENTS - we prefer to look at functions in .c files
(hopefully with docbook documentation), not .h
file prototypes
MULTISTATEMENT_MACRO_USE_DO_WHILE
- we have _many_ x-macros and other macros where
we can't do this
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need counters to be initialized before initializing shrinkers - the
shrinker callbacks will update those counters. This fixes a segfault in
userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Continuing the saga of introducing private dedicated error codes for
each error path, this patch converts ENOSPC to error codes that are
subtypes of ENOSPC. We've recently had a test failure where we got
-ENOSPC where we shouldn't have, and didn't have enough information to
tell where it came from, so this patch will solve that problem.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Device labels are represented as pointers in the member info section: we
need to get and then set the label for it to be kept correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Like bch2_do_discards(), we should check if this needs to be done when
going rw.
Also, add some sysfs code for debugging bucket invalidation.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This converts bcachefs to the modern printbuf interface/implementation,
synced with the version to be submitted upstream.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new superblock field for persisting counters
and adds a sysfs interface in counters/ exposing these counters.
The superblock field is ignored by older versions letting us avoid
an on disk version bump.
Each sysfs file outputs a counter that tracks since filesystem
creation and a counter for the current mount session.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We're seeing occasional firings of the assertion in the key cache
shutdown code that nr_dirty == 0, which means we must sometimes be doing
transaction commits after we've gone read only.
Cleanups & changes:
- BCH_FS_ALLOC_CLEAN renamed to BCH_FS_CLEAN_SHUTDOWN
- new helper bch2_btree_interior_updates_flush(), which returns true if
it had to wait
- bch2_btree_flush_writes() now also returns true if there were btree
writes in flight
- __bch2_fs_read_only now checks if btree writes were in flight in the
shutdown loop: btree write completion does a transaction update, to
update the pointer in the parent node
- assert that !BCH_FS_CLEAN_SHUTDOWN in __bch2_trans_commit
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- We weren't clearing the LRU btree
- bch2_alloc_read() runs before bch2_check_alloc_key() deletes alloc
keys for devices/buckets that don't exists, so it needs to check for
that
- bch2_check_lrus() needs to check that buckets exists
- improve some error messages
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We need to ensure that work structs in bch_fs always get initialized -
otherwise an error in filesystem initialization can pop a warning in the
workqueue code when we try to cancel a work struct that wasn't
initialized.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Previously, the journal read path used a linked list for storing the
journal entries we read from disk. But there's been a bug that's been
causing journal_flush_delay to incorrectly be set to 0, leading to far
more journal entries than is normal being written out, which then means
filesystems are no longer able to start due to the O(n^2) behaviour of
inserting into/searching that linked list.
Fix this by switching to a radix tree.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>