linux

iv/linux

Author	SHA1	Message	Date
Kent Overstreet	6bd68ec266	bcachefs: Heap allocate btree_trans We're using more stack than we'd like in a number of functions, and btree_trans is the biggest object that we stack allocate. But we have to do a heap allocatation to initialize it anyways, so there's no real downside to heap allocating the entire thing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:13 -04:00
Kent Overstreet	da52576080	bcachefs: Fix btree write buffer with snapshots btrees Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:11 -04:00
Brian Foster	60a5b89800	bcachefs: use prejournaled key updates for write buffer flushes The write buffer mechanism journals keys twice in certain situations. A key is always journaled on write buffer insertion, and is potentially journaled again if a write buffer flush falls into either of the slow btree insert paths. This has shown to cause journal recovery ordering problems in the event of an untimely crash. For example, consider if a key is inserted into index 0 of a write buffer, the active write buffer switches to index 1, the key is deleted in index 1, and then index 0 is flushed. If the original key is rejournaled in the btree update from the index 0 flush, the (now deleted) key is journaled in a seq buffer ahead of the latest version of key (which was journaled when the key was deleted in index 1). If the fs crashes while this is still observable in the log, recovery sees the key from the btree update after the delete key from the write buffer insert, which is the incorrect order. This problem is occasionally reproduced by generic/388 and generally manifests as one or more backpointer entry inconsistencies. To avoid this problem, never rejournal write buffered key updates to the associated btree. Instead, use prejournaled key updates to pass the journal seq of the write buffer insert down to the btree insert, which updates the btree leaf pin to reflect the seq of the key. Note that tracking the seq is required instead of just using NOJOURNAL here because otherwise we lose protection of the write buffer pin when the buffer is flushed, which means the key can fall off the tail of the on-disk journal before the btree leaf is flushed and lead to similar recovery inconsistencies. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:08 -04:00
Kent Overstreet	d82978ca15	bcachefs: Add a race_fault() for write buffer slowpath We haven't hooked up dynamic fault injection quite yet, but we will soon Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:07 -04:00
Kent Overstreet	f33c58fc46	bcachefs: Kill BTREE_INSERT_USE_RESERVE Now that we have journal watermarks and alloc watermarks unified, BTREE_INSERT_USE_RESERVE is redundant and can be deleted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:05 -04:00
Kent Overstreet	ec14fc6010	bcachefs: Kill JOURNAL_WATERMARK This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards specifying watermarks once in the transaction commit path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:05 -04:00
Kent Overstreet	8e5b1115f1	bcachefs: Write buffer flush needs BTREE_INSERT_NOCHECK_RW btree write buffer flush is only invoked from contexts that already hold a write ref, and checking if we're still RW could cause us to fail to completely flush the write buffer when shutting down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:10:04 -04:00
Brian Foster	873555f04d	bcachefs: more aggressive fast path write buffer key flushing The btree write buffer flush code is prone to causing journal deadlock due to inefficient use and release of reservation space. Reservation is not pre-reserved for write buffered keys (as is done for key cache keys, for example), because the write buffer flush side uses a fast path that attempts insertion without need for any reservation at all. The write buffer flush attempts to deal with this by inserting keys using the BTREE_INSERT_JOURNAL_RECLAIM flag to return an error on journal reservations that require blocking. Upon first error, it falls back to a slow path that inserts in journal order and supports moving the associated journal pin forward. The problem is that under pathological conditions (i.e. smaller log, larger write buffer and journal reservation pressure), we've seen instances where the fast path fails fairly quickly without having completed many insertions, and then the slow path is unable to push the journal pin forward enough to free up the space it needs to completely flush the buffer. This problem is occasionally reproduced by fstest generic/333. To avoid this problem, update the fast path algorithm to skip key inserts that fail due to inability to acquire needed journal reservation without immediately breaking out of the loop. Instead, insert as many keys as possible, zap the sequence numbers to mark them as processed, and then fall back to the slow path to process the remaining set in journal order. This reduces the amount of journal reservation that might be required to flush the entire buffer and increases the odds that the slow path is able to move the journal pin forward and free up space as keys are processed. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:58 -04:00
Kent Overstreet	65d48e3525	bcachefs: Private error codes: ENOMEM This adds private error codes for most (but not all) of our ENOMEM uses, which makes it easier to track down assorted allocation failures. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:57 -04:00
Kent Overstreet	747ded6ddf	bcachefs: Fix for shared paths in write buffer flush It's possible for bch2_write_buffer_flush_one() to end up with a shared path, if called from a context that already has a btree iterator pointing to a key being flushed. We have to be careful when that happens, since we can't clone a path that holds write locks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:54 -04:00
Daniel Hill	8ffa11a2c5	bcachefs: let __bch2_btree_insert() pass in flags This patch is prep work for the following patch. Signed-off-by: Daniel Hill <daniel@gluo.nz> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:52 -04:00
Kent Overstreet	920e69bc3d	bcachefs: Btree write buffer This adds a new method of doing btree updates - a straight write buffer, implemented as a flat fixed size array. This is only useful when we don't need to read from the btree in order to do the update, and when reading is infrequent - perfect for the LRU btree. This will make LRU btree updates fast enough that we'll be able to use it for persistently indexing buckets by fragmentation, which will be a massive boost to copygc performance. Changes: - A new btree_insert_type enum, for btree_insert_entries. Specifies btree, btree key cache, or btree write buffer. - bch2_trans_update_buffered(): updates via the btree write buffer don't need a btree path, so we need a new update path. - Transaction commit path changes: The update to the btree write buffer both mutates global, and can fail if there isn't currently room. Therefore we do all write buffer updates in the transaction all at once, and also if it fails we have to revert filesystem usage counter changes. If there isn't room we flush the write buffer in the transaction commit error path and retry. - A new persistent option, for specifying the number of entries in the write buffer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2023-10-22 17:09:50 -04:00

12 Commits