If IORING_SETUP_SQPOLL is enabled, sqes are handled either in sq thread
task context or in io worker task context. If the current task context
is the sq thread, we don't need to check whether we should wake up the
sq thread.
io_iopoll_req_issued() calls wq_has_sleeper(), which contains an
smp_mb() memory barrier. Before this patch, perf shows obvious overhead:
Samples: 481K of event 'cycles', Event count (approx.): 299807382878
Overhead  Command  Shared Object     Symbol
  3.69%   :9630    [kernel.vmlinux]  [k] io_issue_sqe
With this patch, perf shows:
Samples: 482K of event 'cycles', Event count (approx.): 299929547283
Overhead  Command  Shared Object     Symbol
  0.70%   :4015    [kernel.vmlinux]  [k] io_issue_sqe
It shows some obvious improvements.
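A minimal sketch of the idea (field and helper names are assumptions
following the code of that era, not the literal diff):

    static void io_iopoll_req_issued(struct io_kiocb *req, bool in_sq_thread)
    {
        struct io_ring_ctx *ctx = req->ctx;

        list_add_tail(&req->inflight_entry, &ctx->iopoll_list);

        /*
         * wq_has_sleeper() implies an smp_mb(); skip it entirely when
         * we are already running in the sq thread, since there is
         * nobody else to wake up.
         */
        if (!in_sq_thread && (ctx->flags & IORING_SETUP_SQPOLL) &&
            wq_has_sleeper(&ctx->sqo_wait))
            wake_up(&ctx->sqo_wait);
    }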
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both IOPOLL and sqes handling need to acquire uring_lock, combine
them together, then we just need to acquire uring_lock once.
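Roughly, in io_sq_thread() (a sketch, not the literal diff; function
signatures follow the code of that era):

    mutex_lock(&ctx->uring_lock);
    if (!list_empty(&ctx->iopoll_list))
        io_do_iopoll(ctx, &nr_events, 0);    /* reap IOPOLL completions */
    ret = io_submit_sqes(ctx, to_submit);    /* submit under the same lock */
    mutex_unlock(&ctx->uring_lock);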
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A static checker reports the following warning:
fs/io_uring.c:6939 io_sq_thread()
error: uninitialized symbol 'timeout'.
This is a false positive, but let's just initialize 'timeout' to make
sure we don't trip over this.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are some issues with the current io_sq_thread() implementation:
1. The prepare_to_wait() usage in __io_sq_thread() is odd. If multiple
ctxs share one poll thread, one ctx will put the poll thread in
TASK_INTERRUPTIBLE, but if other ctxs have work to do, we don't need to
change the task's state at all. Only when none of the ctxs has work to
do should we do so.
2. We use a round-robin strategy to make multiple ctxs share one poll
thread, but there are various conditions in __io_sq_thread(), which
seems complicated and may affect the round-robin strategy.
To address these issues, take the following actions (a sketch of the
resulting loop follows the list):
1. If multiple ctxs share one poll thread, call prepare_to_wait() and
schedule() to put the poll thread to sleep only when none of the ctxs
has work to do.
2. To make the round-robin strategy more straightforward, simplify
__io_sq_thread() a bit: it just does one pass of io poll and sqe
submission, without checking various conditions.
3. When multiple ctxs share one poll thread, choose the biggest
sq_thread_idle among these ctxs as the timeout condition, and update it
when a ctx is attached or detached.
4. There is no need to check EBUSY specially: if io_submit_sqes()
returns EBUSY, IORING_SQ_CQ_OVERFLOW should be set, and the helper in
liburing should notice the cq overflow and enter the kernel to flush
work.
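A rough sketch of the reworked io_sq_thread() loop (names approximate
the code of that era; __io_sq_thread_once() is a hypothetical stand-in
for the simplified one-pass helper):

    while (!kthread_should_stop()) {
        bool needs_sched = true;

        /* one io poll + sqe submission pass per ctx, round-robin */
        list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
            __io_sq_thread_once(ctx);
            if (io_sqring_entries(ctx) ||
                !list_empty(&ctx->iopoll_list))
                needs_sched = false;
        }

        /*
         * Sleep only when every ctx has been idle longer than the
         * biggest sq_thread_idle among them.
         */
        if (needs_sched && time_after(jiffies, timeout)) {
            prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
            schedule();
            finish_wait(&sqd->wait, &wait);
            timeout = jiffies + sqd->sq_thread_idle;
        }
    }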
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of iterating over each request and cancelling it individually in
io_uring_cancel_files(), try to cancel all matching requests and use
->inflight_list only to check whether anything is left.
In many cases it should be faster, and we can reuse a lot of code from
task cancellation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make io_poll_remove_all() and io_kill_timeouts() match against files as
well. A preparation patch, effectively unused for now.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_cancel_files() guarantees to cancel all matching requests, so
there is no need to call it in a loop. Move it up in the callchain into
io_uring_cancel_task_requests().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_uring_cancel_files() cancels all requests that match files,
regardless of task. There is no real need for that; cancel only requests
of the specified task. That also handles the SQPOLL case, as it already
changes the task to the SQPOLL one.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add io_match_task() that matches both task and files.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If IORING_SETUP_SQPOLL is set all requests belong to the corresponding
SQPOLL task, so skip task checking in that case and always match.
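Together with io_match_task() from the previous patch, the matcher ends
up along these lines (a sketch close to the mainline result; the
per-request files check is simplified here):

    static bool io_match_task(struct io_kiocb *head,
                              struct task_struct *task,
                              struct files_struct *files)
    {
        struct io_kiocb *req;

        /*
         * For IORING_SETUP_SQPOLL, callers pass task == NULL since all
         * requests belong to the SQPOLL task: always match.
         */
        if (task && head->task != task)
            return false;
        if (!files)
            return true;

        io_for_each_link(req, head) {
            if (req->work.files == files)    /* simplified check */
                return true;
        }
        return false;
    }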
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Inline io_import_iovec() and leave only its former __io_import_iovec(),
renamed to the original name. That makes it more obvious what is reused
in io_read/write().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_size and iov_count in io_read() and io_write() hold the same value,
kill the last one.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is the only code that relies on import_iovec() returning
iter.count on success; removing that reliance allows a better interface
to import_iovec().
Signed-off-by: David Laight <david.laight@aculab.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, users who want to get woken while waiting for events have to
submit a timeout command first. That is not safe for applications that
split SQ and CQ handling between two threads, such as mysql: the users
would have to synchronize the two threads explicitly to protect the SQ,
which impacts performance.
This patch adds timeout support to the existing io_uring_enter(). To
avoid overloading its arguments, it introduces a new parameter structure
which contains a sigmask and a timeout.
I have tested a workload with one thread submitting nop requests while
the other reaps the cqes with a timeout. It shows a 1.8~2x speedup when
the iodepth is 16.
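For example, a userspace waiter looks roughly like this (a sketch;
ring_fd is assumed to be an io_uring fd, and the names follow the final
mainline uapi, IORING_ENTER_EXT_ARG and struct io_uring_getevents_arg;
this patch's intermediate flag name was SIG_IS_DATA):

    /* wait for one CQE or a 100ms timeout, whichever comes first */
    struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = 100000000 };
    struct io_uring_getevents_arg arg = {
        .sigmask    = 0,            /* no signal mask */
        .sigmask_sz = _NSIG / 8,
        .ts         = (__u64)(uintptr_t)&ts,
    };
    int ret = syscall(__NR_io_uring_enter, ring_fd, 0, 1,
                      IORING_ENTER_GETEVENTS | IORING_ENTER_EXT_ARG,
                      &arg, sizeof(arg));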
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
[axboe: various cleanups/fixes, and name change to SIG_IS_DATA]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We unconditionally call blk_start_plug() when starting the IO
submission, but we should really only do that if we have more than 1
request to submit AND we're potentially dealing with block based
storage underneath. For any other type of request, it's just a waste of
time to do so.
Add a ->plug bit to io_op_def and set it for read/write requests. We
could make this more precise and check the file itself as well, but it
doesn't matter that much and would quickly become more expensive.
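Sketched against the era's io_op_def table and submit state (hedged):

    /* in io_op_def: a new bit marking ops that benefit from plugging */
    struct io_op_def {
        /* ...existing bits... */
        unsigned    plug : 1;
    };

    /* in the submission path: start the plug lazily, at most once */
    if (!state->plug_started && io_op_defs[req->opcode].plug) {
        blk_start_plug(&state->plug);
        state->plug_started = true;
    }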
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We've got 8 spare bytes in the 2nd cacheline; put ->fixed_file_refs
there, so the inline execution path mostly doesn't touch the 3rd
cacheline for fixed-file requests either.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A singly linked list for keeping linked requests is enough, because we
almost always operate on the head and traverse forward, with the
exception of linked timeouts going 1 hop backwards.
Replace ->link_list with a handmade singly linked list. Also kill
REQ_F_LINK_HEAD in favour of checking a newly added ->list for NULL
directly.
That saves 8B in io_kiocb, avoids the heavier list fixup, makes better
use of the cache by not touching the previous request (i.e. the last
request of the link) on each list modification, optimises cache use
further in the following patch, and actually makes traversal easier,
removing some lines in the end. Also, keeping the invariant in ->list
instead of having REQ_F_LINK_HEAD is less error-prone.
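The field and its traversal helper end up looking like this (a sketch
of the mainline result, where the pointer was ultimately named ->link):

    struct io_kiocb {
        /* ...other members... */
        struct io_kiocb    *link;    /* next request, NULL at the tail */
    };

    /* forward-only traversal is all the hot paths need */
    #define io_for_each_link(pos, head) \
        for (pos = (head); pos; pos = pos->link)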
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for converting to singly linked lists for chaining
requests, make linked timeouts save the requests that they're
responsible for, and not rely on the doubly linked list for back
referencing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Explicitly save not only a link's head in io_submit_sqe[s]() but the
tail as well. That's in preparation for keeping linked requests in a
singly linked list.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't use a single struct for polls and poll remove requests, they have
totally different layouts.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same.
This behaves like do_unlinkat(), which also takes a filename struct and
puts it when it is done. Converting callers is then trivial.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that SQPOLL supports non-registered files and grabs the file table,
we can relax the restriction on open/close/accept/connect and allow
them on a ring that is setup with IORING_SETUP_SQPOLL.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The restriction of needing fixed files for SQPOLL is problematic, and
prevents/inhibits several valid use cases. With the referenced
files_struct that we have now, it's trivially supportable.
Treat ->files like we do the mm for the SQPOLL thread - grab a reference
to it (and assign it), and drop it when we're done.
This feature is exposed as IORING_FEAT_SQPOLL_NONFIXED.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since btrfs scrub utilizes its own infrastructure to submit read/write,
scrub is independent from all other routines.
This brings one very neat feature: it allows us to read 4K of data into
offset 0 of a 64K page. The same applies to the writeback routine.
This makes scrub on subpage sector sizes much easier to implement, and
thanks to previous commits, which changed the implementation to always
scrub based on sector size, scrub can now handle subpage filesystems
without any problem.
This patch will just remove the restriction on
(sectorsize != PAGE_SIZE), to make scrub finally work on subpage
filesystems.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs scrub is more flexible than the buffered data write path, as we
can read unaligned subpage data into page offset 0.
This ability makes subpage support much easier: we just need to check
each scrub_page::page_len and ensure we only calculate the hash for
[0, page_len) of a page.
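For data blocks that amounts to hashing only the valid part of each
page, roughly (a sketch; names hedged):

    struct scrub_page *spage = sblock->pagev[0];
    SHASH_DESC_ON_STACK(shash, fs_info->csum_shash);
    u8 csum[BTRFS_CSUM_SIZE];

    shash->tfm = fs_info->csum_shash;
    /* hash only [0, page_len), not the whole PAGE_SIZE */
    crypto_shash_digest(shash, page_address(spage->page),
                        spage->page_len, csum);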
There is a small thing to notice: for the subpage case, we still do
sector-by-sector scrub. This means we will submit a read bio for each
sector to scrub, resulting in the same number of read bios as on 4K
page systems.
This behavior can be considered a good thing if we want everything to
be the same as on 4K page systems. But it also means we're wasting the
possibility of submitting larger bios using the 64K page size. That is
another problem to consider in the future.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support subpage tree block scrub, scrub_checksum_tree_block() only
needs to learn 2 new tricks:
- Follow sector size
  Now scrub_page only represents one sector; we need to follow that
  properly.
- Run checksum on all sectors
  Since scrub_page only represents one sector, we need to run the
  checksum on all sectors, not only the first (nodesize >> PAGE_SHIFT)
  pages.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For scrub_pages() and scrub_pages_for_parity(), we currently allocate
one scrub_page structure per page.
This is fine if we only read/write one sector at a time. But for cases
like scrubbing RAID56, we need to read/write the full stripe, which is
64K for now.
For subpage, we would submit the read in just one page, which is
normally a good thing, but the RAID56 case only expects to see one
sector, not the full stripe, in its endio function.
This could lead to wrong parity checksums for RAID56 on subpage.
To make the existing code work well for the subpage case, take a
shortcut by always allocating a full page for one sector.
This should provide the basis to make RAID56 work for the subpage case.
The cost is pretty obvious: for one RAID56 stripe we now always need 16
pages. For the subpage situation (64K page size, 4K sector size), this
means we need a full megabyte to scrub just one RAID56 stripe.
And for data scrub, each 4K sector will also need one 64K page.
This is mostly just a workaround; the proper fix is a much larger
project: use scrub_block to replace scrub_page, and allow scrub_block
to handle multiple pages, csums, and a csum_bitmap, to avoid allocating
one page for each sector.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs on-disk format chose u64 for almost everything, but there are
other restrictions that won't let us use more than u32 for things like
extent length (the maximum length is 128MiB for non-hole extents) or
stripe length (we have a device number limit).
This means that if we don't have extra handling to convert u64 to u32,
we will always have questionable operations like
"u32 = u64 >> sectorsize_bits" in the code.
This patch will try to address the problem by reducing the width for the
following members/parameters:
- scrub_parity::stripe_len
- @len of scrub_pages()
- @extent_len of scrub_remap_extent()
- @len of scrub_parity_mark_sectors_error()
- @len of scrub_parity_mark_sectors_data()
- @len of scrub_extent()
- @len of scrub_pages_for_parity()
- @len of scrub_extent_for_parity()
Members extracted from on-disk structures, like map->stripe_len, are
kept as is, since changing them would require an on-disk format change.
There will be cases like "u32 = u64 - u64" or "u32 = u64"; for such
call sites an extra ASSERT() is added, to be extra safe in debug
builds.
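The narrowing call sites then follow a pattern like this (sketch):

    u64 extent_start, extent_end;
    u32 extent_len;

    /* on-disk limits guarantee the fit; assert it in debug builds */
    ASSERT(extent_end - extent_start <= U32_MAX);
    extent_len = extent_end - extent_start;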
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
    for (cur_disk_bytenr = orig_disk_bytenr;
         cur_disk_bytenr < orig_disk_bytenr + orig_len;
         cur_disk_bytenr += count * sectorsize) {
        /* Lookup csum tree */
        count = search_csum_tree(fs_info, path, cur_disk_bytenr,
                                 search_len, csum_dst);
        if (!count) {
            /* Csum hole handling */
        }
    }
- Use a single variable as the source to calculate all other offsets
  Instead of several variables of different types, use only one main
  variable, cur_disk_bytenr, which represents the current disk bytenr.
  All involved values can be calculated from it, and all those
  variables are only visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust
than it used to be, especially with respect to the file offset lookup:
the file_offset lookup is now only related to the data reloc inode;
otherwise we don't need to bother with file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_lookup_bio_sums() is only called for read bios,
while btrfs_find_ordered_sum() searches ordered extent sums, which only
exist on the write path.
This means that to read a page we either:
- Submit a read bio if it's not uptodate
  In that case we only need to search the csum tree for checksums.
- Find the page already uptodate
  It can have been marked uptodate by a previous read, or by being
  dirtied, as we always mark a page uptodate when dirtying it.
  In that case we don't need to submit a read bio at all, and thus
  don't need to search for any checksums.
Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums().
And since btrfs_lookup_bio_sums() is the only caller of
btrfs_find_ordered_sum(), also remove its implementation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support sectorsize < PAGE_SIZE case, we need to take extra care of
extent buffer accessors.
Since sectorsize is smaller than PAGE_SIZE, one page can contain
multiple tree blocks, so we must use eb->start to determine the real
offset to read/write in the extent buffer accessors.
This patch introduces two helpers to do this:
- get_eb_page_index()
This is to calculate the index to access extent_buffer::pages.
It's just a simple wrapper around "start >> PAGE_SHIFT".
For sectorsize == PAGE_SIZE case, nothing is changed.
For sectorsize < PAGE_SIZE case, we always get index as 0, and
the existing page shift also works.
- get_eb_offset_in_page()
This is to calculate the offset to access extent_buffer::pages.
This needs to take extent_buffer::start into consideration.
For sectorsize == PAGE_SIZE case, extent_buffer::start is always
aligned to PAGE_SIZE, thus adding extent_buffer::start to
offset_in_page() won't change the result.
For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
us the correct offset to access.
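The two helpers are small; shown here close to their mainline form
(treat as a sketch):

    static inline size_t get_eb_offset_in_page(const struct extent_buffer *eb,
                                               unsigned long offset)
    {
        /*
         * eb->start is PAGE_SIZE aligned when sectorsize == PAGE_SIZE,
         * so adding it only changes the result in the subpage case.
         */
        return offset_in_page(offset + eb->start);
    }

    static inline unsigned long get_eb_page_index(unsigned long offset)
    {
        /* subpage tree blocks never cross a page, so this is 0 there */
        return offset >> PAGE_SHIFT;
    }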
This patch will touch the following parts to cover all extent buffer
accessors:
- BTRFS_SETGET_HEADER_FUNCS()
- read_extent_buffer()
- read_extent_buffer_to_user()
- memcmp_extent_buffer()
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer_full()
- copy_extent_buffer()
- memcpy_extent_buffer()
- memmove_extent_buffer()
- btrfs_get_token_##bits()
- btrfs_get_##bits()
- btrfs_set_token_##bits()
- btrfs_set_##bits()
- generic_bin_search()
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage-sized extent buffers, we have ensured that no extent buffer
crosses a page boundary, so we only ever need one page for any extent
buffer.
Update num_extent_pages() to handle this case; it now returns 1 for
subpage-sized extent buffers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As a preparation for subpage sector size support (allowing filesystems
with a sector size smaller than the page size to be mounted): if the
sector size is smaller than the page size, we don't allow a tree block
to be read if it crosses a 64K(*) boundary.
The 64K is selected because:
- we are only going to support 64K page size for subpage for now
- 64K is also the maximum supported node size
This ensures that tree blocks are always contained in one page for a
system with 64K page size, which greatly simplifies the handling.
Otherwise we would have to do complex multi-page handling of tree
blocks. Currently there is no way to create such tree blocks.
In the kernel we have avoided such tree block allocation even on 4K
page size systems, as it can lead to problems with RAID56 stripe
scrubbing.
btrfs-progs has had its chunk allocator fixed for convert since 2016,
and has extra checks to enforce the same behavior as the kernel.
Just add such graceful checks in case of an ancient filesystem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs only supports 64K as the maximum node size, so on a 4K page
system we have at most 16 pages for one extent buffer.
A system using a 64K page size really has just one page.
Since we always use 16 pages for extent_buffer::pages, systems using
64K pages waste memory on 15 page pointers that will never be used.
Calculate the array size based on page size and the node size maximum.
- for systems using 4K page size, it will stay 16 pages
- for systems using 64K page size, it will be 1 page
Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
avoid circular inclusion of ctree.h.
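The sizing boils down to one define (a sketch of the resulting
definition):

    /* 16 on 4K page systems, 1 on 64K page systems */
    #define INLINE_EXTENT_BUFFER_PAGES \
        (BTRFS_MAX_METADATA_BLOCKSIZE / PAGE_SIZE)

    struct extent_buffer {
        /* ...other members... */
        struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
    };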
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btree_write_cache_pages() we have a btree page submission routine
buried deeply in a nested loop.
This patch extracts that code into a helper function, submit_eb_page(),
which does the same work.
Since submit_eb_page() can now return >0 for successful extent buffer
submission, remove the "ASSERT(ret <= 0);" line.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_verify_data_csum() just passes the whole page to
check_data_csum(), which is fine since we only support sectorsize ==
PAGE_SIZE.
To support subpage, we need to properly honor per-sector checksum
verification, just like what we did in the dio read path.
This patch does the csum verification in a for loop, starting with
pg_off == start - page_offset(page) and increasing by sectorsize on
each iteration.
For the sectorsize == PAGE_SIZE case, pg_off will always be 0 and we
loop only once.
For the subpage case, we iterate over each sector, and if we find any
error, we return it.
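The loop shape is roughly (a sketch; icsum is the per-sector csum index
that the next patch replaces with bio_offset):

    u32 sectorsize = fs_info->sectorsize;
    u32 pg_off;

    for (pg_off = start - page_offset(page);
         pg_off < start - page_offset(page) + len;
         pg_off += sectorsize) {
        int ret = check_data_csum(inode, io_bio, icsum, page, pg_off);

        if (ret < 0)
            return ret;    /* subpage: fail on the first bad sector */
        icsum++;
    }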
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter icsum of check_data_csum() is a little hard to
understand, and so is phy_offset of btrfs_verify_data_csum().
Both parameters are values calculated for the csum lookup.
Instead of such calculated values, just pass bio_offset and let the
final and only user, check_data_csum(), calculate whatever it needs.
While at it, also make the bio_offset parameter and some related
variables u32 (unsigned int): bio size is limited by bi_size, which is
an unsigned int, and there are extra size limit checks during various
bio operations, so we are ensured that bio_offset won't overflow u32.
Thus for all involved functions, not only rename the parameter from
@phy_offset to @bio_offset, but also reduce its width to u32, so we
won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter bio_offset of extent_submit_bio_start_t is very
confusing. If it were really a bio_offset (an offset into the bio), it
should be u32; but in fact it's only utilized by dio read, and that
member is used as a file offset, which must be u64.
Rename it to dio_file_offset, since the only user uses it as a file
offset, and add a comment about who is using it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A lock dependency loop exists between the root tree lock, the extent tree
lock, and the free space tree lock.
The root tree lock depends on the free space tree lock because
btrfs_create_tree holds the new tree's lock while adding it to the root
tree.
The extent tree lock depends on the root tree lock because during
umount, we write out space cache v1, which writes inodes in the root
tree, which results in holding the root tree lock while doing a lookup
in the extent tree.
Finally, the free space tree depends on the extent tree because
populate_free_space_tree holds a locked path in the extent tree and then
does a lookup in the free space tree to add the new item.
The simplest of the three to break is the one during tree creation: we
unlock the leaf before inserting the tree node into the root tree, which
fixes the lockdep warning.
[30.480136] ======================================================
[30.480830] WARNING: possible circular locking dependency detected
[30.481457] 5.9.0-rc8+ #76 Not tainted
[30.481897] ------------------------------------------------------
[30.482500] mount/520 is trying to acquire lock:
[30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.484054]
but task is already holding lock:
[30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.485581]
which lock already depends on the new lock.
[30.486397]
the existing dependency chain (in reverse order) is:
[30.487205]
-> #2 (btrfs-extent-01#2){++++}-{3:3}:
[30.487825] down_read_nested+0x43/0x150
[30.488306] __btrfs_tree_read_lock+0x39/0x180
[30.488868] __btrfs_read_lock_root_node+0x3a/0x50
[30.489477] btrfs_search_slot+0x464/0x9b0
[30.490009] check_committed_ref+0x59/0x1d0
[30.490603] btrfs_cross_ref_exist+0x65/0xb0
[30.491108] run_delalloc_nocow+0x405/0x930
[30.491651] btrfs_run_delalloc_range+0x60/0x6b0
[30.492203] writepage_delalloc+0xd4/0x150
[30.492688] __extent_writepage+0x18d/0x3a0
[30.493199] extent_write_cache_pages+0x2af/0x450
[30.493743] extent_writepages+0x34/0x70
[30.494231] do_writepages+0x31/0xd0
[30.494642] __filemap_fdatawrite_range+0xad/0xe0
[30.495194] btrfs_fdatawrite_range+0x1b/0x50
[30.495677] __btrfs_write_out_cache+0x40d/0x460
[30.496227] btrfs_write_out_cache+0x8b/0x110
[30.496716] btrfs_start_dirty_block_groups+0x211/0x4e0
[30.497317] btrfs_commit_transaction+0xc0/0xba0
[30.497861] sync_filesystem+0x71/0x90
[30.498303] btrfs_remount+0x81/0x433
[30.498767] reconfigure_super+0x9f/0x210
[30.499261] path_mount+0x9d1/0xa30
[30.499722] do_mount+0x55/0x70
[30.500158] __x64_sys_mount+0xc4/0xe0
[30.500616] do_syscall_64+0x33/0x40
[30.501091] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.501629]
-> #1 (btrfs-root-00){++++}-{3:3}:
[30.502241] down_read_nested+0x43/0x150
[30.502727] __btrfs_tree_read_lock+0x39/0x180
[30.503291] __btrfs_read_lock_root_node+0x3a/0x50
[30.503903] btrfs_search_slot+0x464/0x9b0
[30.504405] btrfs_insert_empty_items+0x60/0xa0
[30.504973] btrfs_insert_item+0x60/0xd0
[30.505412] btrfs_create_tree+0x1b6/0x210
[30.505913] btrfs_create_free_space_tree+0x54/0x110
[30.506460] btrfs_mount_rw+0x15d/0x20f
[30.506937] btrfs_remount+0x356/0x433
[30.507369] reconfigure_super+0x9f/0x210
[30.507868] path_mount+0x9d1/0xa30
[30.508264] do_mount+0x55/0x70
[30.508668] __x64_sys_mount+0xc4/0xe0
[30.509186] do_syscall_64+0x33/0x40
[30.509652] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.510271]
-> #0 (btrfs-free-space-00){++++}-{3:3}:
[30.510972] __lock_acquire+0x11ad/0x1b60
[30.511432] lock_acquire+0xa2/0x360
[30.511917] down_read_nested+0x43/0x150
[30.512383] __btrfs_tree_read_lock+0x39/0x180
[30.512947] __btrfs_read_lock_root_node+0x3a/0x50
[30.513455] btrfs_search_slot+0x464/0x9b0
[30.513947] search_free_space_info+0x45/0x90
[30.514465] __add_to_free_space_tree+0x92/0x39d
[30.515010] btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
[30.515639] btrfs_mount_rw+0x15d/0x20f
[30.516142] btrfs_remount+0x356/0x433
[30.516538] reconfigure_super+0x9f/0x210
[30.517065] path_mount+0x9d1/0xa30
[30.517438] do_mount+0x55/0x70
[30.517824] __x64_sys_mount+0xc4/0xe0
[30.518293] do_syscall_64+0x33/0x40
[30.518776] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.519335]
other info that might help us debug this:
[30.520210] Chain exists of:
btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2
[30.521407] Possible unsafe locking scenario:
[30.522037] CPU0 CPU1
[30.522456] ---- ----
[30.522941] lock(btrfs-extent-01#2);
[30.523311] lock(btrfs-root-00);
[30.523952] lock(btrfs-extent-01#2);
[30.524620] lock(btrfs-free-space-00);
[30.525068]
*** DEADLOCK ***
[30.525669] 5 locks held by mount/520:
[30.526116] #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30
[30.527056] #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0
[30.527960] #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d
[30.529118] #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
[30.530113] #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100
[30.531124]
stack backtrace:
[30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76
[30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014
[30.533215] Call Trace:
[30.533452] dump_stack+0x8d/0xc0
[30.533797] check_noncircular+0x13c/0x150
[30.534233] __lock_acquire+0x11ad/0x1b60
[30.534667] lock_acquire+0xa2/0x360
[30.535063] ? __btrfs_tree_read_lock+0x39/0x180
[30.535525] down_read_nested+0x43/0x150
[30.535939] ? __btrfs_tree_read_lock+0x39/0x180
[30.536400] __btrfs_tree_read_lock+0x39/0x180
[30.536862] __btrfs_read_lock_root_node+0x3a/0x50
[30.537304] btrfs_search_slot+0x464/0x9b0
[30.537713] ? trace_hardirqs_on+0x1c/0xf0
[30.538148] search_free_space_info+0x45/0x90
[30.538572] __add_to_free_space_tree+0x92/0x39d
[30.539071] ? printk+0x48/0x4a
[30.539367] btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
[30.539972] btrfs_mount_rw+0x15d/0x20f
[30.540350] btrfs_remount+0x356/0x433
[30.540773] ? shrink_dcache_sb+0xd9/0x100
[30.541203] reconfigure_super+0x9f/0x210
[30.541642] path_mount+0x9d1/0xa30
[30.542040] do_mount+0x55/0x70
[30.542366] __x64_sys_mount+0xc4/0xe0
[30.542822] do_syscall_64+0x33/0x40
[30.543197] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[30.543691] RIP: 0033:0x7f109f7ab93a
[30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a
[30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0
[30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0
[30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000
[30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
If we are not using space cache v1, we should not create the free space
object or free space inodes. This comes up when we delete the existing
free space objects/inodes when migrating to v2, only to see them get
recreated for every dirtied block group.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When the filesystem transitions from space cache v1 to v2 or to
nospace_cache, it removes the old cached data, but does not remove
the FREE_SPACE items nor the free space inodes they point to. This
doesn't cause any issues besides being a bit inefficient, since these
items no longer do anything useful.
To fix it, when we are mounting, and plan to disable the space cache,
destroy each block group's free space item and free space inode.
The code to remove the items is lifted from the existing use case of
removing the block group, with a light adaptation to handle whether or
not we have already looked up the free space inode.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
If the remount is ro->ro, rw->ro, or rw->rw, we will not create or
clear the free space tree. This can be surprising, so print a warning
to dmesg to make the failure more visible. It is also important to
ensure that the space cache options (SPACE_CACHE, FREE_SPACE_TREE) are
consistent, so ensure those are set to properly match the current on
disk state (which won't be changing).
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To make the contents of /proc/mounts better match the actual state of
the filesystem, base the display of the space cache mount options off
the contents of the super block rather than the last mount options
passed in. Since there are many scenarios where the mount will ignore a
space cache option, simply showing the passed in option is misleading.
For example, if we mount with -o remount,space_cache=v2 on a read-write
file system without an existing free space tree, we won't build a free
space tree, but /proc/mounts will read space_cache=v2 (until we mount
again and it goes away).
cache_generation is set iff space_cache=v1, FREE_SPACE_TREE is set iff
space_cache=v2, and if neither is the case, we print nospace_cache.
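The resulting show-options logic is essentially (a sketch; the helper
names follow btrfs conventions):

    if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
        seq_puts(seq, ",space_cache=v2");
    else if (btrfs_super_cache_generation(info->super_copy))
        seq_puts(seq, ",space_cache");
    else
        seq_puts(seq, ",nospace_cache");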
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When mounting, btrfs uses the cache_generation in the super block to
determine if space cache v1 is in use. However, by mounting with
nospace_cache or space_cache=v2, it is possible to disable space cache
v1, which does not result in un-setting cache_generation back to 0.
In order to base some logic, like mount option printing in /proc/mounts,
on the current state of the space cache rather than just the values of
the mount option, keep the value of cache_generation consistent with the
status of space cache v1.
We ensure that cache_generation > 0 iff the file system is using
space_cache v1. This requires committing a transaction on any mount
which changes whether we are using v1. (v1->nospace_cache, v1->v2,
nospace_cache->v1, v2->v1).
Since the mechanism for writing out the cache generation is transaction
commit, but we want some finer grained control over when we un-set it,
we can't just rely on the SPACE_CACHE mount option, and introduce an
fs_info flag that mount can use when it wants to unset the generation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
A user might want to revert to v1 or nospace_cache on a root
filesystem, and much like turning on the free space tree, that can only
be done by remounting from ro->rw. Support clearing the free space tree
on such remounts by moving it into the shared remount logic.
Since the CLEAR_CACHE option sticks around across remounts, this change
would result in clearing the tree on every remount, forever, which is
not desirable. To fix that, add CLEAR_CACHE to the oneshot options we
clear at mount end, which has the additional bonus of not cluttering
the /proc/mounts output with clear_cache.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some options only apply during mount time and are cleared at the end
of mount. For now, the example is USEBACKUPROOT, but CLEAR_CACHE also
fits the bill, and this is a preparation patch for also clearing that
option.
One subtlety is that the current code only resets USEBACKUPROOT on rw
mounts, but the option is meaningfully "consumed" by a ro mount, so it
feels appropriate to clear in that case as well. A subsequent read-write
remount would not go through open_ctree, which is the only place that
checks the option, so the change should be benign.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a user attempts to remount a btrfs filesystem with
'mount -o remount,space_cache=v2', that operation silently succeeds.
Unfortunately, this is misleading, because the remount does not create
the free space tree. /proc/mounts will incorrectly show space_cache=v2,
but on the next mount, the file system will revert to the old
space_cache.
For now, we handle only the easier case, where the existing mount is
read-only and the new mount is read-write. In that case, we can create
the free space tree without contending with the block groups changing
as we go.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>