IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
This patch changes journal_entry_open() to initialize the new journal
entry, not __journal_entry_close().
This also means that journal_cur_seq() refers to the sequence number of
the last journal entry when we don't have an open journal entry, not the
next one.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This change is prep work for moving some work from
__journal_entry_close() to journal_entry_open(): without this change,
journal_entry_open() doesn't know if it's going to be able to open a new
journal entry until the cmpxchg loop, meaning it can't create the new
journal pin entry and update other global state because those have to be
done prior to the cmpxchg opening the new journal entry.
Fortunately, we don't call bch2_journal_halt() from interrupt context.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This replaces the journal flag JOURNAL_NEED_WRITE with per-journal buf
state - more explicit, and solving a race in the old code that would
lead to entries being opened and written unnecessarily.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
In sysfs, files can only output at most PAGE_SIZE. This is a problem for
debug info that needs to list an arbitrary number of times, and because
of this limit some of our debug info has been terser and harder to read
than we'd like.
This patch moves info about journal pins and cached btree nodes to
debugfs, and greatly expands and improves the output we return.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch changes printbufs dynamically allocate and reallocate a
buffer as needed. Stack usage has become a bit of a problem, and a major
cause of that has been static size string buffers on the stack.
The most involved part of this refactoring is that printbufs must now be
exited with printbuf_exit().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When key cache pins were put onto their own list, we neglected to update
bch2_journal_pins_to_text() to print them.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
The journal can get stuck if we need to get a journal reservation for
something we have a pre-reservation for, but aren't able to reclaim
space, or if the pin fifo is full - it's impractical to resize the pin
fifo at runtime.
Previously, we reserved 8 entries in the pin fifo for pre-reservations,
but that seems small - we're seeing the journal occasionally get stuck.
Let's reserve a quarter of it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add bch2_journal_noflush_seq(), for telling the journal that entries
before a given sequence number should not be flushes - to be used by an
upcoming allocator optimization.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Prep work for adding a hash table of open buckets - instead of embedding
a bch_extent_ptr, we need to refer to the bucket directly so that we're
not calling sector_to_bucket() in the hash table lookup code, which has
an expensive divide.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This fixes a rare bug when mounting & unmounting RO - flushing a clean
filesystem that never went RO should be a no op.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This patch ensures that the journal entry written gets written as flush
entry, which is important for the shutdown path - the last entry written
needs to be a flush entry.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that bch2_bucket_alloc_new_fs() isn't looking at bucket marks to
decide what buckets are eligible to allocate, we can clean up the
filesystem initialization and device add paths. Previously, we had to
use ancient code to mark superblock/journal buckets in the in memory
bucket marks as we allocated them, and then zero that out and re-do that
marking using the newer transational bucket mark paths. Now, we can
simply delete the in-memory bucket marking.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This adds more latency/event measurements and breaks some apart into
more events. Journal writes are broken apart into flush writes and
noflush writes, btree compactions are broken out from btree splits,
btree mergers are added, as well as btree_interior_updates - foreground
and total.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Switch to one line of output per pr_buf() call - longer lines but quite
a bit more readable.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This converts journal_write_delay, journal_flush_disabled, and
journal_reclaim_delay to normal filesystems options, and also adds them
to the superblock.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- bch2_journal_halt() was unconditionally overwriting j->err_seq, the
sequence number that we failed to write
- journal_write_done was updating seq_ondisk and flushed_seq_ondisk even
for writes that errored, which broke the way bch2_journal_flush_seq_async()
locklessly checked for completions.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
It's definitely indicative of a bug if we request to flush a journal
sequence number that hasn't happened yet, but it's more useful if we
warn and print out the relevant sequence numbers instead of just dying.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This was used for recording which inodes have been modified by in flight
journal writes, but was broken and has been superceded.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Also, clean up workqueue usage - we shouldn't be using system
workqueues, pretty much everything we do needs to be on our own
WQ_MEM_RECLAIM workqueues.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We weren't holding mark_lock correctly - it's needed for the new_fs
path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
jset->last_seq is in the region that's encrypted - on journal write
completion, we were using it and getting garbage. This patch shadows it
to fix.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the journal reclaim thread makes it to the timeout without ever
initializing j->last_flushed, we could end up sleeping for a very long
time.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We need to flush the btree key cache when it's too dirty, because
otherwise the shrinker won't be able to reclaim memory - this is done by
journal reclaim. But journal reclaim also kicks btree node writes: this
meant that btree node writes were getting kicked much too often just
because we needed to flush btree key cache keys.
This patch splits journal pins into two different lists, and teaches
journal reclaim to not flush btree node writes when it only needs to
flush key cache keys.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
JOURNAL_RES_GET_RESERVED should only be used for updatse that need to be
done to free up space in the journal. In particular, when we're flushing
keys from the key cache, if we're flushing them out of order we
shouldn't be using it, since we're using up our remaining space in the
journal without dropping a pin that will let us make forward progress.
With this patch, BTREE_INSERT_JOURNAL_RECLAIM without
BTREE_INSERT_JOURNAL_RESERVED may return -EAGAIN - we can't wait on
journal reclaim if we're already in journal reclaim.
This means we need to propagate these errors up to journal reclaim,
indicating that flushing a journal pin should be retried in the future.
This is prep work for a patch to change the way journal reclaim works,
to split out flushing key cache keys because the btree key cache is too
dirty from journal reclaim because we need space in the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The default was 1/256th of the device and capped at 512MB, which is
fairly tiny these days.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Especially in userspace, we sometime run into resource exhaustion issues
with starting up threads after mark and sweep/fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Also, make the wait in bch2_journal_flush_seq() interruptible, not just
killable.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes some arithmetic bugs in "bcachefs: Journal updates to dev
usage" - additionally, it cleans things up by switching everything that
goes in every journal entry to the journal_entry_res mechanism.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Originally, bcachefs - going back to bcache - stored, for each bucket, a
16 bit counter corresponding to how long it had been since the bucket
was read from. But, this required periodically rescaling counters on
every bucket to avoid wraparound. That wasn't an issue in bcache, where
we'd perodically rewrite the per bucket metadata all at once, but in
bcachefs we're trying to avoid having to walk every single bucket.
This patch switches to persisting 64 bit io clocks, corresponding to the
64 bit bucket timestaps introduced in the previous patch with
KEY_TYPE_alloc_v2.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's an outstanding bug with journal entries being missing in journal
replay. This patch adds code to print out where the journal entries were
physically located that were around the entry(ies) being missing, which
should make debugging easier.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
More work towards getting rid of the in memory struct bucket: this path
adds code for marking superblock and journal buckets via the btree, and
uses it in the device add and journal resize paths.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If journal replay hasn't finished, the journal can't be empty - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
All writes prior to a journal write need to be flushed before the
journal write itself happens. On single device filesystems, it suffices
to mark the write with REQ_PREFLUSH|REQ_FUA, but on multi device
filesystems we need to issue flushes to every device - and wait for them
to complete - before issuing the journal writes. Previously, we were
issuing flushes to every device, but we weren't waiting for them to
complete before issuing the journal writes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is because we had a bug where we were writing out journal entries
with garbage last_seq, and not catching it.
Also, completely ignore jset->last_seq when JSET_NO_FLUSH is true,
because of aforementioned bug, but change the write path to set last_seq
to 0 when JSET_NO_FLUSH is true.
Minor other cleanups and comments.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With various newer key types - stripe keys, inline data extents - the
old approach of calculating the maximum size of the value is becoming
more and more error prone. Better to switch to bkey_on_stack, which can
dynamically allocate if necessary to handle any size bkey.
In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, we were using BTREE_INSERT_RESERVE in a lot of places where
it no longer makes sense.
- we now have more open_buckets than we used to, and the reserves work
better, so we shouldn't need to use BTREE_INSERT_RESERVE just because
we're holding open_buckets pinned anymore.
- We have the btree key cache for updates to the alloc btree, meaning
we no longer need the btree reserve to ensure the allocator can make
forward progress.
This means that we should only need a reserve for btree updates to
ensure that copygc can make forward progress.
Since it's now just for copygc, we can also fold RESERVE_BTREE into
RESERVE_MOVINGGC (the allocator's freelist reserve).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Try to always keep 1/8th of the journal free, on top of
pre-reservations
- Move the check for whether the journal is stuck to
bch2_journal_space_available, and make it only fire when there aren't
any journal writes in flight (that might free up space by updating
last_seq)
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds a flag to journal entries which, if set, indicates that
they weren't done as flush/fua writes.
- non flush/fua journal writes don't update last_seq (i.e. they don't
free up space in the journal), thus the journal free space
calculations now check whether nonflush journal writes are currently
allowed (i.e. are we low on free space, or would doing a flush write
free up a lot of space in the journal)
- write_delay_ms, the user configurable option for when open journal
entries are automatically written, is now interpreted as the max
delay between flush journal writes (default 1 second).
- bch2_journal_flush_seq_async is changed to ensure a flush write >=
the requested sequence number has happened
- journal read/replay must now ignore, and blacklist, any journal
entries newer than the most recent flush entry in the journal. Also,
the way the read_entire_journal option is handled has been improved;
struct journal_replay now has an entry, 'ignore', for entries that
were read but should not be used.
- assorted refactoring and improvements related to journal read in
journal_io.c and recovery.c
Previously, we'd have to issue a flush/fua write every time we
accumulated a full journal entry - typically the bucket size. Now we
need to issue them much less frequently: when an fsync is requested, or
it's been more than write_delay_ms since the last flush, or when we need
to free up space in the journal. This is a significant performance
improvement on many write heavy workloads.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch increases the maximum journal buffers in flight from 2 to 4 -
this will be particularly helpful when in the future we stop requiring
flush+fua for every journal write.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The error check was inverted - leading fsyncs to get stuck and hang,
oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Avoid taking the journal lock if we don't have to.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_bucket_alloc() requires rcu_read_lock() to be held.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>