IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
fs-io.c is too big - time for some reorganization
- fs-dio.c: direct io
- fs-pagecache.c: pagecache data structures (bch_folio), utility code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've observed significant lock thrashing on fstests generic/083 in
fallocate, due to dropping and retaking btree locks when checking the
pagecache for data.
This adds a nonblocking mode to bch2_clamp_data_hole(), where we only
use folio_trylock(), and can thus be used safely while btree locks are
held - thus we only have to drop btree locks as a fallback, on actual
lock contention.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, fallocate would only check the state of the extents btree
when determining if we need to create a reservation.
But the page cache might already have dirty data or a disk reservation.
This changes __bchfs_fallocate() to call bch2_seek_pagecache_hole() to
check for this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we have distinct error codes for different memory allocation
failures, the early init log messages are no longer needed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- endianness fixes
- mark some things static
- fix a few __percpu annotations
- fix silent enum conversions
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
GFP_NOFS doesn't ever make sense. If we're allocatingc memory it should
be GFP_NOWAIT if btree locks are held, GFP_KERNEL otherwise.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We've been using __GFP_NOFAIL for allocating struct bch_folio, our
private per-folio state.
However, that struct is variable size - it holds state for each sector
in the folio, and folios can be quite large now, which means it's
possible for bch_folio to be larger than PAGE_SIZE now.
__GFP_NOFAIL allocations are undesirable in normal circumstances, but
particularly so at >= PAGE_SIZE, and warnings are emitted for that.
So, this patch adds proper error paths and eliminates most uses of
__GFP_NOFAIL. Also, do some more cleanup of gfp flags w.r.t. btree node
locks: we can use GFP_KERNEL, but only if we're not holding btree locks,
and if we are holding btree locks we should be using GFP_NOWAIT.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Now that we can reliably designate and find the master subvolume out of
a tree of snapshots, we can finally make quotas work with snapshots:
That is - quotas will now _ignore_ snapshot subvolumes, and only be in
effect for the master (non snapshot) subvolume.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Create a small helper to translate from file offset to the
associated bch_folio_sector index in the underlying bch_folio. The
helper assumes the file offset is covered by the passed folio.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Some of the folio_end_*() helpers are prone to overflow of signed
64-bit types because the mapping is only limited by the max value of
loff_t and the associated helpers return the start offset of the
next folio. Therefore, a folio_end_pos() of the max allowable folio in a
mapping returns a value that overflows loff_t.
This makes it hard to rely on such values when doing folio
processing across a range of a file, as bcachefs attempts to do with
the recent folio changes. For example, generic/564 causes problems
in the buffered write path when testing writes at max boundary
conditions.
The current understanding is that the pagecache historically limited
the mapping to one less page to avoid this problem and this was
dropped with some of the folio conversions, but may be reinstated to
properly address the problem. In the meantime, update the internal
folio_end_*() helpers in bcachefs to return a u64, and all of the
associated code to use or cast to u64 to avoid overflow problems.
This allows generic/564 to pass and can be reverted back to using
loff_t if at any point the pagecache subsystem can guarantee these
boundary conditions will not overflow.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The buffered write path batches folio creations in the file mapping
based on the requested size of the write. Under low free space
conditions, it is possible to add a bunch of folios to the mapping
and then return a short write or -ENOSPC due to lack of space. If
this occurs on an extending write, the file size is updated based on
the amount of data successfully written to the file. If folios were
added beyond the final i_size, they may hang around until reclaimed,
truncated or encountered unexpectedly by another operation.
For example, generic/083 reproduces a sequence of events where a
short write leaves around one or more post-EOF folios on an inode, a
subsequent zero range request extends beyond i_size and overlaps
with an aforementioned folio, and __bch2_truncate_folio() happens
across it and complains.
Update __bch2_buffered_write() to keep track of the start offset of
the last folio added to the mapping for a prospective write. After
i_size is updated, check whether this offset starts beyond EOF. If
so, truncate pagecache beyond the latest EOF to clean up any folios
that don't reside at least partially within EOF upon completion of
the write.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
generic/083 occasionally reproduces a panic caused by an overflow
when accessing the bch_folio_sector array of the folio being
processed by __bch2_truncate_folio(). The immediate cause of the
overflow is that the folio offset is beyond i_size, and therefore
the sector index calculation underflows on subtraction of the folio
offset.
One cause of this is mainly observed on nocow mounts. When nocow is
enabled, fallocate performs physical block allocation (as opposed to
block reservation in cow mode), which range_has_data() then
interprets as valid data that requires partial zeroing on truncate.
Therefore, if a post-eof zero range request lands across post-eof
preallocated blocks, __bch2_truncate_folio() may actually create a
post-eof folio in order to perform zeroing. To avoid this problem,
update range_has_data() to filter out unwritten blocks from folio
creation and partial zeroing.
Even though we should never create folios beyond EOF like this, the
mere existence of such folios is not necessarily a fatal error. Fix
up the truncate code to warn about this condition and not overflow
the sector array and possibly crash the system. The addition of this
warning without the corresponding unwritten extent fix has shown
that various other fstests are able to reproduce this problem fairly
frequently, but often in ways that doesn't necessarily result in a
kernel panic or a change in user observable behavior, and therefore
the problem goes undetected.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With large folios, it's now incidentally possible to end up with a
clean, uptodate folio in the page cache that doesn't have a bch_folio
attached, if a folio has to be split.
This patch fixes __bch2_truncate_folio() to check for this; other code
paths appear to handle it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- X-macro-ize the bch_folio_sector_state enum: this means we can easily
generate strings, which is helpful for debugging.
- Add helpers for state transitions: folio_sector_dirty(),
folio_sector_undirty(), folio_sector_reserve()
- Add folio_sector_set(), a single helper for changing folio sector
state just so that we have a single place to instrument when we're
debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This converts fs-io.c to pass folios, not pages. We're not handling
large folios yet, there's no functional changes in this patch - just a
lot of churn doing the initial type conversions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Seeing an odd bug with page/folio state not being properly initialized,
this is to help track it down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds support for nocow mode, where we do writes in-place when
possible. Patch components:
- New boolean filesystem and inode option, nocow: note that when nocow
is enabled, data checksumming and compression are implicitly disabled
- To prevent in-place writes from racing with data moves
(data_update.c) or bucket reuse (i.e. a bucket being reused and
re-allocated while a nocow write is in flight, we have a new locking
mechanism.
Buckets can be locked for either data update or data move, using a
fixed size hash table of two_state_shared locks. We don't have any
chaining, meaning updates and moves to different buckets that hash to
the same lock will wait unnecessarily - we'll want to watch for this
becoming an issue.
- The allocator path also needs to check for in-place writes in flight
to a given bucket before giving it out: thus we add another counter
to bucket_alloc_state so we can track this.
- Fsync now may need to issue cache flushes to block devices instead of
flushing the journal. We add a device bitmask to bch_inode_info,
ei_devs_need_flush, which tracks devices that need to have flushes
issued - note that this will lead to unnecessary flushes when other
codepaths have already issued flushes, we may want to replace this with
a sequence number.
- New nocow write path: look up extents, and if they're writable write
to them - otherwise fall back to the normal COW write path.
XXX: switch to sequence numbers instead of bitmask for devs needing
journal flush
XXX: ei_quota_lock being a mutex means bch2_nocow_write_done() needs to
run in process context - see if we can improve this
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- bch2_extent_merge checks unwritten bit
- read path returns 0s for unwritten extents without actually reading
- reflink path skips over unwritten extents
- bch2_bkey_ptrs_invalid() checks for extents with both written and
unwritten extents, and non-normal extents (stripes, btree ptrs) with
unwritten ptrs
- fiemap checks for unwritten extents and returns
FIEMAP_EXTENT_UNWRITTEN
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This factors out part of __bchfs_fallocate() in fs-io.c into an new,
lower level io.c helper, which creates a single extent reservation.
This is prep work for nocow support - the new helper will shortly gain
the ability to create unwritten extents.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a debug mode where we split up the c->writes refcount into
distinct refcounts for every codepath that takes a reference, and adds
sysfs code to print the value of each ref.
This will make it easier to debug shutdown hangs due to refcount leaks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's important that in BTREE_ITER_FILTER_SNAPSHOTS mode we always use
peek_upto() and provide an end for the interval we're searching for -
otherwise, when we hit the end of the inode the next inode be in a
different subvolume and not have any keys in the current snapshot, and
we'd iterate over arbitrarily many keys before returning one.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This improves io_opts() and makes it a non-inline function - it's big
enough that it probably shouldn't be.
Also, bch_io_opts no longer needs fields for whether options are
defined, so we can slim it down a bit.
We'd like to stop passing around the full bch_io_opts, but that'll be
tricky because of bch2_rebalance_add_key().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch introduces
- bpos_eq()
- bpos_lt()
- bpos_le()
- bpos_gt()
- bpos_ge()
and equivalent replacements for bkey_cmp().
Looking at the generated assembly these could probably be improved
further, but we already see a significant code size improvement with
this patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Centralize format strings in bcachefs.h
- Add bch2_fmt_inum_offset() and related helpers
- Switch error messages for inodes to also print out the offset, in
bytes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Warnings ought to always have a format string/log message - makes them
considerably more useful.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This switches where we take quota reservations to be per bch_wirte_op
instead of per dio_write, so we can drop the quota reservation in the
same place as we call i_sectors_acct(), and only take/release
ei_quota_lock once.
In the future we'd like ei_quota_lock to not be a mutex, so that we can
avoid punting to process context before deliving write completions in
nocow mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We have a unique lock used for controlling adding to the pagecache: the
lock has two states, where both states are shared - the lock may be held
multiple times for either state - but not both states at the same time.
This is exactly what we need for nocow mode locking, so this patch pulls
it out of fs.c into its own file.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_WRITE_FLUSH is a write flag that causes a journal flush. It's only
used in the direct IO path, and this will allow for some consolidation
with the regular fsync path, which will help with the upcoming nocow
mode.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- With BCH_WRITE_SYNC, we no longer need the completion in struct
dio_write
- Pull out bch2_dio_write_copy_iov() into a separate non-inline
function, it's code that doesn't run in the common case
- Copy mapping and inode pointers into dio_write, avoiding pointer
chasing at the start of bch2_dio_write_loop()
- kthread_use_mm() is not needed in the common case; move it into
bch2_dio_write_loop_async()
- factor out various helpers from bch2_dio_write_loop() and rework
control flow for better icache utilization
Other small optimizations:
- bch2_keylist_free() is only used in one place, at the end of the
bch2_write() path - drop the reinit
- in bch2_disk_reservation_put(), check if res->sectors is nonzero
before touching c->online_reserved, since that will likely be a cache
miss
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bcachefs: More DIO write path optimization
Better code prefetching (?)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This adds a new flag for the write path, BCH_WRITE_SYNC, and switches
the O_DIRECT write path to use it when we're not running asynchronously.
It runs the btree update after the write in the original thread's
context instead of a kworker, cutting context switches in half.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Per fstests generic/275, on -ENOSPC we're supposed write until the
filesystem is full - i.e. do a partial write instead of failing the full
write.
This is a partial fix for the buffered write path: we'll still fail on a
page boundary.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
checkpatch.pl gives lots of warnings that we don't want - suggested
ignore list:
ASSIGN_IN_IF
UNSPECIFIED_INT - bcachefs coding style prefers single token type names
NEW_TYPEDEFS - typedefs are occasionally good
FUNCTION_ARGUMENTS - we prefer to look at functions in .c files
(hopefully with docbook documentation), not .h
file prototypes
MULTISTATEMENT_MACRO_USE_DO_WHILE
- we have _many_ x-macros and other macros where
we can't do this
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- We now correctly allow soft limits to be exceeded, instead of always
returning -EDQUOT
- Disk quota grate times/warnings can now be set, not just the
systemwide defaults
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When modifying a file, we may be required to drop the suid/sgid bits -
we were missing a file_modified() call to do this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>