linux

iv/linux

Author	SHA1	Message	Date
Kent Overstreet	3e52c22255	bcachefs: Add journal_seq to inode & alloc keys Add fields to inode & alloc keys that record the journal sequence number when they were most recently modified. For alloc keys, this is needed to know what journal sequence number we have to flush before the bucket can be reused. Currently this is tracked in memory, but we'll be getting rid of the in memory bucket array. For inodes, this is needed for fsync when the inode has been evicted from the vfs cache. Currently we use a bloom filter per outstanding journal buf - but that mechanism has been broken since we added the ability to not issue a flush/fua for every journal write. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:16 -04:00
Kent Overstreet	904823de49	bcachefs: Convert bch2_mark_key() to take a btree_trans * This helps to unify the interface between bch2_mark_key() and bch2_trans_mark_key() - and it also gives access to the journal reservation and journal seq in the mark_key path. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:15 -04:00
Kent Overstreet	8325cd1ed4	bcachefs: Don't do upgrades in nochanges mode nochanges mode is often used for getting data off of otherwise nonrecoverable filesystems, which is often because of errors hit during fsck. Don't force version upgrade & fsck in nochanges mode, so that it's more likely to mount. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:15 -04:00
Kent Overstreet	4db650277d	bcachefs: Subvol dirents are now only visible in parent subvol This changes the on disk format for dirents that point to subvols so that they also record the subvolid of the parent subvol, so that we can filter them out in other subvolumes. This also updates the dirent code to do that filtering, and in particular tweaks the rename code - we need to ensure that there's only ever one dirent (counting multiplicities in different snapshots) that point to a subvolume. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:14 -04:00
Kent Overstreet	bfe88863cf	bcachefs: New on disk format to fix reflink_p pointers We had a bug where reflink_p pointers weren't being initialized to 0, and when we started using the second word, things broke badly. This patch revs the on disk format version and adds cleanup code to zero out the second word of reflink_p pointers before we start using it. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:14 -04:00
Kent Overstreet	0476fa948e	bcachefs: Rev the on disk format version for snapshots This will cause the compat code to be run that creates entries in the subvolumes and snapshots btrees. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:13 -04:00
Kent Overstreet	42d237320e	bcachefs: Snapshot creation, deletion This is the final patch in the patch series implementing snapshots. This patch implements two new ioctls that work like creation and deletion of directories, but fancier. - BCH_IOCTL_SUBVOLUME_CREATE, for creating new subvolumes and snaphots - BCH_IOCTL_SUBVOLUME_DESTROY, for deleting subvolumes and snapshots Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:13 -04:00
Kent Overstreet	6fed42bb77	bcachefs: Plumb through subvolume id To implement snapshots, we need every filesystem btree operation (every btree operation without a subvolume) to start by looking up the subvolume and getting the current snapshot ID, with bch2_subvolume_get_snapshot() - then, that snapshot ID is used for doing btree lookups in BTREE_ITER_FILTER_SNAPSHOTS mode. This patch adds those bch2_subvolume_get_snapshot() calls, and also switches to passing around a subvol_inum instead of just an inode number. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:12 -04:00
Kent Overstreet	14b393ee76	bcachefs: Subvolumes, snapshots This patch adds subvolume.c - support for the subvolumes and snapshots btrees and related data types and on disk data structures. The next patches will start hooking up this new code to existing code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:12 -04:00
Kent Overstreet	67e0dd8f0d	bcachefs: btree_path This splits btree_iter into two components: btree_iter is now the externally visible componont, and it points to a btree_path which is now reference counted. This means we no longer have to clone iterators up front if they might be mutated - btree_path can be shared by multiple iterators, and cloned if an iterator would mutate a shared btree_path. This will help us use iterators more efficiently, as well as slimming down the main long lived state in btree_trans, and significantly cleans up the logic for iterator lifetimes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:11 -04:00
Kent Overstreet	78cf784eaa	bcachefs: Further reduce iter->trans usage This is prep work for splitting btree_path out from btree_iter - btree_path will not have a pointer to btree_trans. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:11 -04:00
Brett Holman	8dd6ed9451	bcachefs: add progress stats to sysfs This adds progress stats to sysfs for copygc, rebalance, recovery, and the cmd_job ioctls. Signed-off-by: Brett Holman <bholman.devel@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:10 -04:00
Kent Overstreet	877da05ffb	bcachefs: Zero out mem_ptr field in btree ptr keys from journal replay This fixes a bad ptr deref on recovery from unclean shutdown in bch2_btree_node_get_noiter(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:10 -04:00
Kent Overstreet	9f1833cadd	bcachefs: Update btree ptrs after every write This closes a significant hole (and last known hole) in our ability to verify metadata. Previously, since btree nodes are log structured, we couldn't detect lost btree writes that weren't the first write to a given node. Additionally, this seems to have lead to some significant metadata corruption on multi device filesystems with metadata replication: since a write may have made it to one device and not another, if we read that btree node back from the replica that did have that write and started appending after that point, the other replica would have a gap in the bset entries and reading from that replica wouldn't find the rest of the bsets. But, since updates to interior btree nodes are now journalled, we can close this hole by updating pointers to btree nodes after every write with the currently written number of sectors, without negatively affecting performance. This means we will always detect lost or corrupt metadata - it also means that our btree is now a curious hybrid of COW and non COW btrees, with all the benefits of both (excluding complexity). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:08 -04:00
Kent Overstreet	8c3f6da9fc	bcachefs: Improve iter->should_be_locked Adding iter->should_be_locked introduced a regression where it ended up not being set on the iterator passed to bch2_btree_update_start(), which is definitely not what we want. This patch requires it to be set when calling bch2_trans_update(), and adds various fixups to make that happen. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:06 -04:00
Kent Overstreet	90d22a660a	bcachefs: Fix overflow in journal_replay_entry_early If filesystem on disk was used by a version with a larger BCH_DATA_NR thas the currently running version, we don't want this to cause a buffer overrun. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:06 -04:00
Kent Overstreet	c0ebe3e48c	bcachefs: Assorted endianness fixes Found by sparse Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>	2023-10-22 17:09:05 -04:00
Kent Overstreet	3a402c8dab	bcachefs: Fix some refcounting bugs We really need debug mode assertions that ca->ref and ca->io_ref are used correctly. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:03 -04:00
Kent Overstreet	ac1019d32b	bcachefs: Clean up bch2_btree_and_journal_walk() Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:02 -04:00
Kent Overstreet	aae15aafcd	bcachefs: New and improved topology repair code This splits out btree topology repair into a separate pass, and makes some improvements: - When we have to pick which of two overlapping nodes to drop keys from, we use the btree node header sequence number to preserve the newer node - the gc code has been changed so that it doesn't bail out if we're continuing/ignoring on fsck error - this way the dump tool can skip running the repair pass but still walk all reachable metadata - add a new superblock flag indicating when a filesystem is known to have btree topology issues, and the topology repair pass should be run - changing the start/end of a node might mean keys in that node have to be deleted: this patch handles that better by splitting it out into a separate function and running it explicitly in the topology repair code, previously those keys were only being dropped when the btree node was read in. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:02 -04:00
Kent Overstreet	4932e07ea0	bcachefs: Fix key cache assertion Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:02 -04:00
Kent Overstreet	d62ab355d7	bcachefs: Fix bch2_trans_mark_dev_sb() Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:00 -04:00
Kent Overstreet	423300e8fe	bcachefs: BCH_BEATURE_atomic_nlink is obsolete Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:00 -04:00
Kent Overstreet	8a85b20cd7	bcachefs: Inode backpointers are now required This lets us simplify fsck quite a bit, which we need for making fsck snapshot aware. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:59 -04:00
Kent Overstreet	3a14d58e7b	bcachefs: Drop bch2_fsck_inode_nlink() We've had BCH_FEATURE_atomic_nlink for quite some time, we can drop this now. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:59 -04:00
Kent Overstreet	e751c01a8e	bcachefs: Start using bpos.snapshot field This patch starts treating the bpos.snapshot field like part of the key in the btree code: * bpos_successor() and bpos_predecessor() now include the snapshot field * Keys in btrees that will be using snapshots (extents, inodes, dirents and xattrs) now always have their snapshot field set to U32_MAX The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that determines whether we're iterating over keys in all snapshots or not - internally, this controlls whether bkey_(successor\|predecessor) increment/decrement the snapshot field, or only the higher bits of the key. We add a new member to struct btree_iter, iter->snapshot: when BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always equal iter->snapshot, which will be 0 for btrees that don't use snapshots, and alsways U32_MAX for btrees that will use snapshots (until we enable snapshot creation). This patch also introduces a new metadata version number, and compat code for reading from/writing to older versions - this isn't a forced upgrade (yet). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:57 -04:00
Kent Overstreet	4cf91b0270	bcachefs: Split out bpos_cmp() and bkey_cmp() With snapshots, we're going to need to differentiate between comparisons that should and shouldn't include the snapshot field. bpos_cmp is now the comparison function that does include the snapshot field, used by core btree code. Upper level filesystem code generally does _not_ want to compare against the snapshot field - that code wants keys to compare as equal even when one of them is in an ancestor snapshot. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:57 -04:00
Kent Overstreet	73590619ec	bcachefs: Don't unconditially version_upgrade in initialize This is mkfs's job. Also, clean up the handling of feature bits some. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:56 -04:00
Kent Overstreet	c7bb769c81	bcachefs: __bch2_trans_get_iter() refactoring, BTREE_ITER_NOT_EXTENTS Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:55 -04:00
Kent Overstreet	7d6f07edc2	bcachefs: Fix compat code for superblock The bkey compat code wasn't being run for btree roots in the superblock clean section - this patch fixes it to use the journal entry validate code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:55 -04:00
Kent Overstreet	41f8b09edc	bcachefs: Rename BTREE_ID enums for consistency with other enums Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:55 -04:00
Kent Overstreet	f2785955bb	bcachefs: Kill support for !BTREE_NODE_NEW_EXTENT_OVERWRITE() bcachefs has been aggressively migrating filesystems and btree nodes to the new format for quite some time - this shouldn't affect anyone anymore, and lets us delete a _lot_ of code. Also, it frees up KEY_TYPE_discard for a new whiteout key type for snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:55 -04:00
Kent Overstreet	41e3778636	bcachefs: Bring back metadata only gc This is useful for the filesystem dump debugging tool - when we're hitting bugs we want to skip as much of the recovery process as possible, and the dump tool only needs to know where metadata lives. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:54 -04:00
Kent Overstreet	19dd3172b0	bcachefs: Use x-macros for compat feature bits This is to generate strings for them, so that we can print them out. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:54 -04:00
Kent Overstreet	e01dacf76c	bcachefs: Fix bkey format generation for 32 bit fields Having a packed format that can represent a field larger than the unpacked type breaks bkey_packed_successor() assertions - we need to fix this to start using the snapshot filed. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:54 -04:00
Kent Overstreet	a4805d6672	bcachefs: Scan for old btree nodes if necessary on mount We dropped support for !BTREE_NODE_NEW_EXTENT_OVERWRITE but it turned out there were people who still had filesystems with btree nodes in that format in the wild. This adds a new compat feature that indicates we've scanned for and rewritten nodes in the old format, and does that scan at mount time if the option isn't set. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:54 -04:00
Kent Overstreet	dab9ef0d27	bcachefs: Add error message for some allocation failures Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:53 -04:00
Kent Overstreet	8042b5b715	bcachefs: Extents may now cross btree node boundaries When snapshots arrive, we won't necessarily be able to arbitrarily split existis - when we need to split an existing extent, we'll have to check if the extent was overwritten in child snapshots and if so emit a whiteout for the split in the child snapshot. Because extents couldn't span btree nodes previously, journal replay would sometimes have to split existing extents. That's no good anymore, but fortunately since extent handling has already been lifted above most of the btree code there's no real need for that rule anymore. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:53 -04:00
Kent Overstreet	5d428c7c64	bcachefs: Run fsck if BCH_FEATURE_alloc_v2 isn't set We're using BCH_FEATURE_alloc_v2 to also gate journalling updates to dev usage - we don't have the code for reconstructing this from buckets anymore, so we need to run fsck if it's not set. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:53 -04:00
Kent Overstreet	180fb49dea	bcachefs: Journal updates to dev usage This eliminates the need to scan every bucket to regenerate dev_usage at mount time. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:52 -04:00
Kent Overstreet	2abe542087	bcachefs: Persist 64 bit io clocks Originally, bcachefs - going back to bcache - stored, for each bucket, a 16 bit counter corresponding to how long it had been since the bucket was read from. But, this required periodically rescaling counters on every bucket to avoid wraparound. That wasn't an issue in bcache, where we'd perodically rewrite the per bucket metadata all at once, but in bcachefs we're trying to avoid having to walk every single bucket. This patch switches to persisting 64 bit io clocks, corresponding to the 64 bit bucket timestaps introduced in the previous patch with KEY_TYPE_alloc_v2. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:52 -04:00
Kent Overstreet	a0b73c1c53	bcachefs: Add (partial) support for fixing btree topology When we walk the btrees during recovery, part of that is checking that btree topology is correct: for every interior btree node, its child nodes should exactly span the range the parent node covers. Previously, we had checks for this, but not repair code. Now that we have the ability to do btree updates during initial GC, this patch adds that repair code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:52 -04:00
Kent Overstreet	5b593ee172	bcachefs: Add support for doing btree updates prior to journal replay Some errors may need to be fixed in order for GC to successfully run - walk and mark all metadata. But we can't start the allocators and do normal btree updates until after GC has completed, and allocation information is known to be consistent, so we need a different method of doing btree updates. Fortunately, we already have code for walking the btree while overlaying keys from the journal to be replayed. This patch adds an update path that adds keys to the list of keys to be replayed by journal replay, and also fixes up iterators. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:52 -04:00
Kent Overstreet	079663d8ed	bcachefs: Kill metadata only gc This was useful before we had transactional updates to interior btree nodes - but now, it's just extra unneeded complexity. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:52 -04:00
Kent Overstreet	ac95800629	bcachefs: Factor out bch2_ec_stripes_heap_start() This fixes a bug where mark and sweep gc incorrectly was clearing out the stripes heap and causing assertions to fire later - simpler to just create the stripes heap after gc has finished. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:51 -04:00
Kent Overstreet	edfbba58e3	bcachefs: Add btree node prefetching to bch2_btree_and_journal_walk() bch2_btree_and_journal_walk() walks the btree overlaying keys from the journal; it was introduced so that we could read in the alloc btree prior to journal replay being done, when journalling of updates to interior btree nodes was introduced. But it didn't have btree node prefetching, which introduced a severe regression with mount times, particularly on spinning rust. This patch implements btree node prefetching for the btree + journal walk, hopefully fixing that. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:51 -04:00
Kent Overstreet	4291a3317f	bcachefs: bch2_alloc_write() should be writing for all devices Alloc info isn't stored on a particular device, it makes no sense to only be writing it out for rw members - this was causing fsck to not fix alloc info errors, oops. Also, make sure we write out alloc info in other repair paths. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:50 -04:00
Kent Overstreet	07a1006ae8	bcachefs: Reduce/kill BKEY_PADDED use With various newer key types - stripe keys, inline data extents - the old approach of calculating the maximum size of the value is becoming more and more error prone. Better to switch to bkey_on_stack, which can dynamically allocate if necessary to handle any size bkey. In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:50 -04:00
Kent Overstreet	719fe7fb55	bcachefs: Update transactional triggers interface to pass old & new keys This is needed to fix a bug where we're overflowing iterators within a btree transaction, because we're updating the stripes btree (to update block counts) and the stripes btree trigger is unnecessarily updating the alloc btree - it doesn't need to update the alloc btree when the pointers within a stripe aren't changing. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:49 -04:00
Kent Overstreet	adbcada43f	bcachefs: Don't require flush/fua on every journal write This patch adds a flag to journal entries which, if set, indicates that they weren't done as flush/fua writes. - non flush/fua journal writes don't update last_seq (i.e. they don't free up space in the journal), thus the journal free space calculations now check whether nonflush journal writes are currently allowed (i.e. are we low on free space, or would doing a flush write free up a lot of space in the journal) - write_delay_ms, the user configurable option for when open journal entries are automatically written, is now interpreted as the max delay between flush journal writes (default 1 second). - bch2_journal_flush_seq_async is changed to ensure a flush write >= the requested sequence number has happened - journal read/replay must now ignore, and blacklist, any journal entries newer than the most recent flush entry in the journal. Also, the way the read_entire_journal option is handled has been improved; struct journal_replay now has an entry, 'ignore', for entries that were read but should not be used. - assorted refactoring and improvements related to journal read in journal_io.c and recovery.c Previously, we'd have to issue a flush/fua write every time we accumulated a full journal entry - typically the bucket size. Now we need to issue them much less frequently: when an fsync is requested, or it's been more than write_delay_ms since the last flush, or when we need to free up space in the journal. This is a significant performance improvement on many write heavy workloads. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:08:49 -04:00

1 2 3 4

152 Commits