/* SPDX-License-Identifier: GPL-2.0 */

#ifndef BTRFS_SUBPAGE_H
#define BTRFS_SUBPAGE_H

#include <linux/spinlock.h>

/*
 * Extra info for subpage bitmap.
 *
 * For subpage we pack all uptodate/dirty/writeback/ordered bitmaps into
 * one larger bitmap.
 *
 * This structure records how they are organized in the bitmap:
 *
 * /- uptodate_offset   /- dirty_offset /- ordered_offset
 * |                    |               |
 * v                    v               v
 * |u|u|u|u|........|u|u|d|d|.......|d|d|o|o|.......|o|o|
 * |<- bitmap_nr_bits ->|
 * |<----------------- total_nr_bits ------------------>|
 */
struct btrfs_subpage_info {
	/* Number of bits for each bitmap */
	unsigned int bitmap_nr_bits;

	/* Total number of bits for the whole bitmap */
	unsigned int total_nr_bits;

	/*
	 * *_offset indicates where each bitmap starts; the length is always
	 * @bitmap_nr_bits, which is calculated from PAGE_SIZE / sectorsize.
	 */
	unsigned int uptodate_offset;
	unsigned int dirty_offset;
	unsigned int writeback_offset;
	unsigned int ordered_offset;
	unsigned int checked_offset;
};
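
The packed layout described above can be illustrated with a small user-space sketch. It assumes the five bitmaps are laid out back to back in the order the offsets are declared, each `PAGE_SIZE / sectorsize` bits long; `sketch_subpage_info` and `sketch_init_subpage_info()` are hypothetical names for illustration, not the kernel's btrfs_init_subpage_info().

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-in for struct btrfs_subpage_info (hypothetical name). */
struct sketch_subpage_info {
	unsigned int bitmap_nr_bits;
	unsigned int total_nr_bits;
	unsigned int uptodate_offset;
	unsigned int dirty_offset;
	unsigned int writeback_offset;
	unsigned int ordered_offset;
	unsigned int checked_offset;
};

/*
 * Assumption: bitmaps are packed in declaration order, one bit per sector.
 * page_size is passed explicitly because PAGE_SIZE is a kernel constant.
 */
static void sketch_init_subpage_info(struct sketch_subpage_info *info,
				     uint32_t page_size, uint32_t sectorsize)
{
	unsigned int nr_bits;
	unsigned int cur = 0;

	assert(sectorsize != 0 && page_size % sectorsize == 0);
	nr_bits = page_size / sectorsize;

	info->bitmap_nr_bits = nr_bits;

	info->uptodate_offset = cur;
	cur += nr_bits;
	info->dirty_offset = cur;
	cur += nr_bits;
	info->writeback_offset = cur;
	cur += nr_bits;
	info->ordered_offset = cur;
	cur += nr_bits;
	info->checked_offset = cur;
	cur += nr_bits;

	/* Everything fits into one larger bitmap of this many bits. */
	info->total_nr_bits = cur;
}
```

With a 64K page and 4K sectors, each bitmap takes 16 bits under these assumptions, so a single flexible array of unsigned long can hold all of them.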

/*
 * Structure to trace status of each sector inside a page, attached to
 * page::private for both data and metadata inodes.
 */
struct btrfs_subpage {
	/* Common members for both data and metadata pages */
	spinlock_t lock;
btrfs: subpage: fix a rare race between metadata endio and eb freeing
[BUG]
There is a very rare ASSERT() triggering during full fstests run for
subpage rw support.
No other reproducer so far.
The ASSERT() gets triggered for metadata read in
btrfs_page_set_uptodate() inside end_page_read().
[CAUSE]
There is still a small race window for metadata only, the race could
happen like this:
T1 | T2
------------------------------------+-----------------------------
end_bio_extent_readpage() |
|- btrfs_validate_metadata_buffer() |
| |- free_extent_buffer() |
| Still have 2 refs |
|- end_page_read() |
|- if (unlikely(PagePrivate()) |
| The page still has Private |
| | free_extent_buffer()
| | | Only one ref 1, will be
| | | released
| | |- detach_extent_buffer_page()
| | |- btrfs_detach_subpage()
|- btrfs_set_page_uptodate() |
The page no longer has Private|
>>> ASSERT() triggered <<< |
This race window is super small, thus pretty hard to hit, even with so
many runs of fstests.
But the race window is still there, we have to go another way to solve
it other than relying on random PagePrivate() check.
Data path is not affected, as it will lock the page before reading,
while unlocking the page after the last read has finished, thus no race
window.
[FIX]
This patch will fix the bug by repurposing btrfs_subpage::readers.
Now btrfs_subpage::readers will be a member shared by both metadata and
data.
For metadata path, we don't do the page unlock as metadata only relies
on extent locking.
At the same time, teach page_range_has_eb() to take
btrfs_subpage::readers into consideration.
So that even if the last eb of a page gets freed, page::private won't be
detached as long as there still are pending end_page_read() calls.
This eliminates the race window. It will slightly increase the metadata
memory usage, as the page may not be released as frequently as usual,
but that should not be a big deal.
The code got introduced in ("btrfs: submit read time repair only for
each corrupted sector"), but the fix is in a separate patch to keep the
problem description and the crash is rare so it should not hurt
bisectability.
Signed-off-by: Qu Wegruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
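
The fix described in this commit message can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: `sketch_subpage`, `sketch_start_reader()`, `sketch_end_reader()` and `sketch_try_detach()` are hypothetical names, C11 atomics stand in for the kernel's atomic_t, and the private_lock and extent buffer lookup are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parts of btrfs_subpage involved here. */
struct sketch_subpage {
	atomic_int readers;	/* pending end_page_read() calls */
	bool private_attached;	/* models page::private being set */
};

static void sketch_start_reader(struct sketch_subpage *sp)
{
	atomic_fetch_add(&sp->readers, 1);
}

/* Models end_page_read(): page::private must still be attached here. */
static void sketch_end_reader(struct sketch_subpage *sp)
{
	int old;

	assert(sp->private_attached);
	old = atomic_fetch_sub(&sp->readers, 1);
	assert(old > 0);
}

/*
 * Models the detach path after the fix: a pending reader counts as a
 * user of the page, so page::private is kept. Returns true if detached.
 */
static bool sketch_try_detach(struct sketch_subpage *sp)
{
	if (atomic_load(&sp->readers) > 0)
		return false;
	sp->private_attached = false;
	return true;
}
```

Replaying the interleaving from the table above against this model, the detach attempt from T2 is refused while T1 still has a pending read, so the ASSERT() condition cannot occur in the sketch.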
	/*
	 * Both data and metadata need to track how many readers there are for
	 * the page.
	 * Data relies on @readers to unlock the page when the last reader
	 * finishes.
	 * Metadata doesn't need the page unlock, but it needs to prevent
	 * page::private from being cleared before the last end_page_read().
	 */
	atomic_t readers;
	union {
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
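
A single-threaded model of the eb_refs idea from this commit message might look like the sketch below. All `sketch_*` names are hypothetical, a plain bool replaces the radix tree lookup, and the private_lock that protects eb_refs in the kernel is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state; not the kernel's data structures. */
struct sketch_page_state {
	int eb_refs;		/* ebs being allocated for this page */
	bool eb_in_tree;	/* stand-in for the radix tree lookup */
	bool private_attached;	/* models page::private being set */
};

static void sketch_inc_eb_refs(struct sketch_page_state *p)
{
	p->eb_refs++;
}

static void sketch_dec_eb_refs(struct sketch_page_state *p)
{
	assert(p->eb_refs > 0);
	p->eb_refs--;
}

/* With the fix, a non-zero eb_refs also counts as "page has an eb". */
static bool sketch_page_range_has_eb(const struct sketch_page_state *p)
{
	return p->eb_refs > 0 || p->eb_in_tree;
}

/* Models detach_extent_buffer_page() for a mapped eb. */
static void sketch_detach_extent_buffer_page(struct sketch_page_state *p)
{
	if (!sketch_page_range_has_eb(p))
		p->private_attached = false;
}
```

In the interleaving shown above, T1 would bump eb_refs before T2 checks the page range, so T2 leaves page::private alone even though eb A is not yet in the tree.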
		/*
		 * Structures only used by metadata.
		 *
		 * @eb_refs should only be operated under private_lock, as it
		 * manages whether the subpage can be detached.
		 */
		atomic_t eb_refs;

		/* Structures only used by data */
		atomic_t writers;
	};
	unsigned long bitmaps[];
};

enum btrfs_subpage_type {
	BTRFS_SUBPAGE_METADATA,
	BTRFS_SUBPAGE_DATA,
};

bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info, struct address_space *mapping);
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page sizes come along, especially 16K, this limitation becomes
restrictive.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that this patch does not yet enable support for other page
sizes.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
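
The rule in this commit message can be written down as a pair of predicates. This is a sketch, not the body of btrfs_is_subpage(): the function names and the explicit page_size parameter are assumptions, and the data-side check simply restates the usual definition of subpage (sectorsize smaller than the page size).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Data pages need subpage tracking when a page holds several sectors. */
static bool sketch_data_uses_subpage(uint32_t page_size, uint32_t sectorsize)
{
	return sectorsize < page_size;
}

/*
 * Metadata only takes the subpage routine when a tree block is smaller
 * than a page; nodesize >= page size reuses the non-subpage routine.
 */
static bool sketch_metadata_uses_subpage(uint32_t page_size,
					 uint32_t sectorsize,
					 uint32_t nodesize)
{
	if (sectorsize >= page_size)
		return false;
	return nodesize < page_size;
}
```

For 4K sectors with a 16K page and 16K nodesize, the metadata predicate is false, which matches the statement that no extra memory is needed for metadata pages in that configuration.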

void btrfs_init_subpage_info(struct btrfs_subpage_info *subpage_info, u32 sectorsize);
int btrfs_attach_subpage(const struct btrfs_fs_info *fs_info,
			 struct folio *folio, enum btrfs_subpage_type type);
void btrfs_detach_subpage(const struct btrfs_fs_info *fs_info, struct folio *folio);

/* Allocate additional data where page represents more than one sector */
struct btrfs_subpage *btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info,
					  enum btrfs_subpage_type type);
void btrfs_free_subpage(struct btrfs_subpage *subpage);

void btrfs_folio_inc_eb_refs(const struct btrfs_fs_info *fs_info, struct folio *folio);
void btrfs_folio_dec_eb_refs(const struct btrfs_fs_info *fs_info, struct folio *folio);
btrfs: integrate page status update for data read path into begin/end_page_read
In btrfs data page read path, the page status update are handled in two
different locations:
  btrfs_do_read_page()
  {
	while (cur <= end) {
		/* No need to read from disk */
		if (HOLE/PREALLOC/INLINE) {
			memset();
			set_extent_uptodate();
			continue;
		}
		/* Read from disk */
		ret = submit_extent_page(end_bio_extent_readpage);
	}
  }

  end_bio_extent_readpage()
  {
	endio_readpage_uptodate_page_status();
  }
This is fine for sectorsize == PAGE_SIZE case, as for above loop we
should only hit one branch and then exit.
But for subpage, there is more work to be done in page status update:
- Page Unlock condition
Unlike regular page size == sectorsize case, we can no longer just
unlock a page.
Only the last reader of the page can unlock the page.
This means, we can unlock the page either in the while() loop, or in
the endio function.
- Page uptodate condition
Since we have multiple sectors to read for a page, we can only mark
the full page uptodate if all sectors are uptodate.
To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:
- begin_page_read()
For regular case, it does nothing.
For subpage case, it updates the reader counters so that later
end_page_read() can know who is the last one to unlock the page.
- end_page_read()
This is just endio_readpage_uptodate_page_status() renamed.
The original name is a little too long and too specific for endio.
The new thing added is the condition for page unlock.
Now for subpage data, we unlock the page if we're the last reader.
This does not only provide the basis for subpage data read, but also
hide the special handling of page read from the main read loop.
Also, since we're changing how the page lock is handled, there are two
existing error paths where we need to manually unlock the page before
calling begin_page_read().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
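
The two conditions described above (only the last reader unlocks the page, and the page is uptodate only when every sector is) can be sketched in user space as below. The `sketch_*` names, the fixed 16 sectors per page, and the plain integer counter are assumptions for illustration; the kernel uses atomic counters and the packed subpage bitmaps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SECTORS_PER_PAGE	16u	/* e.g. 64K page, 4K sectors */
#define SKETCH_ALL_SECTORS	((1u << SKETCH_SECTORS_PER_PAGE) - 1)

struct sketch_page {
	unsigned int readers;		/* sectors still being read */
	uint32_t uptodate_bitmap;	/* one bit per sector */
	bool locked;
	bool uptodate;
};

/* Account for every sector of the page before any read completes. */
static void sketch_begin_page_read(struct sketch_page *page)
{
	assert(page->locked);
	page->readers += SKETCH_SECTORS_PER_PAGE;
}

/*
 * Finish reading @nr sectors starting at sector @first. Called both for
 * ranges handled inline (holes etc.) and from the endio path.
 */
static void sketch_end_page_read(struct sketch_page *page, unsigned int first,
				 unsigned int nr, bool ok)
{
	assert(nr > 0 && first + nr <= SKETCH_SECTORS_PER_PAGE);
	assert(page->readers >= nr);

	if (ok)
		page->uptodate_bitmap |= ((1u << nr) - 1) << first;

	/* The full page is uptodate only if all sectors are. */
	if (page->uptodate_bitmap == SKETCH_ALL_SECTORS)
		page->uptodate = true;

	/* Only the last reader unlocks the page. */
	page->readers -= nr;
	if (page->readers == 0)
		page->locked = false;
}
```

Because the counter is raised for the whole page up front, it does not matter whether the final sector is completed in the read loop or in the endio function: whichever call drops the counter to zero performs the unlock.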

void btrfs_subpage_start_reader(const struct btrfs_fs_info *fs_info,
				struct folio *folio, u64 start, u32 len);
void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
			      struct folio *folio, u64 start, u32 len);

void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
				struct folio *folio, u64 start, u32 len);
bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
				       struct folio *folio, u64 start, u32 len);
int btrfs_folio_start_writer_lock(const struct btrfs_fs_info *fs_info,
				  struct folio *folio, u64 start, u32 len);
void btrfs_folio_end_writer_lock(const struct btrfs_fs_info *fs_info,
				 struct folio *folio, u64 start, u32 len);

/*
 * Template for subpage related operations.
 *
 * btrfs_subpage_*() are for call sites where the folio has subpage attached and
 * the range is ensured to be inside the folio's single page.
 *
 * btrfs_folio_*() are for call sites where the folio can be either a subpage
 * folio or a regular folio. The functions handle both cases, but the range
 * still needs to be inside one single page.
 *
 * btrfs_folio_clamp_*() are similar to btrfs_folio_*(), except the range doesn't
 * need to be inside the page. Those functions will truncate the range
 * automatically.
 */
#define DECLARE_BTRFS_SUBPAGE_OPS(name)					\
void btrfs_subpage_set_##name(const struct btrfs_fs_info *fs_info,	\
		struct folio *folio, u64 start, u32 len);		\
void btrfs_subpage_clear_##name(const struct btrfs_fs_info *fs_info,	\
		struct folio *folio, u64 start, u32 len);		\
bool btrfs_subpage_test_##name(const struct btrfs_fs_info *fs_info,	\
		struct folio *folio, u64 start, u32 len);		\
void btrfs_folio_set_##name(const struct btrfs_fs_info *fs_info,	\
		struct folio *folio, u64 start, u32 len);		\
void btrfs_folio_clear_##name(const struct btrfs_fs_info *fs_info,	\
		struct folio *folio, u64 start, u32 len);		\
bool btrfs_folio_test_##name(const struct btrfs_fs_info *fs_info,	\
		struct folio *folio, u64 start, u32 len);		\
void btrfs_folio_clamp_set_##name(const struct btrfs_fs_info *fs_info,	\
		struct folio *folio, u64 start, u32 len);		\
void btrfs_folio_clamp_clear_##name(const struct btrfs_fs_info *fs_info,	\
		struct folio *folio, u64 start, u32 len);		\
bool btrfs_folio_clamp_test_##name(const struct btrfs_fs_info *fs_info,	\
		struct folio *folio, u64 start, u32 len);

DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
DECLARE_BTRFS_SUBPAGE_OPS(dirty);
DECLARE_BTRFS_SUBPAGE_OPS(writeback);
DECLARE_BTRFS_SUBPAGE_OPS(ordered);
DECLARE_BTRFS_SUBPAGE_OPS(checked);
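
To show how a template macro like this and the "clamp" behaviour fit together, here is a user-space sketch. It assumes a 64K page with 4K sectors and sector-aligned ranges; `sketch_clamp_range()`, `sketch_mask()`, `sketch_bitmaps` and `SKETCH_DEFINE_CLAMP_OPS()` are hypothetical names, not the helpers behind the declarations above.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	65536u	/* assumed 64K page */
#define SKETCH_SECTORSIZE	4096u	/* assumed 4K sector */

struct sketch_bitmaps {
	uint32_t uptodate;
	uint32_t dirty;
};

/*
 * Truncate [*start, *start + *len) to the page beginning at page_start.
 * A range that does not intersect the page ends up with *len == 0.
 */
static void sketch_clamp_range(uint64_t page_start, uint64_t *start,
			       uint32_t *len)
{
	uint64_t orig_end = *start + *len;
	uint64_t page_end = page_start + SKETCH_PAGE_SIZE;
	uint64_t new_start = *start > page_start ? *start : page_start;
	uint64_t new_end = orig_end < page_end ? orig_end : page_end;

	*start = new_start;
	*len = new_end > new_start ? (uint32_t)(new_end - new_start) : 0;
}

/* Bit mask for an already clamped, sector-aligned range inside the page. */
static uint32_t sketch_mask(uint64_t page_start, uint64_t start, uint32_t len)
{
	unsigned int first = (unsigned int)((start - page_start) / SKETCH_SECTORSIZE);
	unsigned int nr = len / SKETCH_SECTORSIZE;

	if (nr == 0)
		return 0;
	return ((1u << nr) - 1) << first;
}

/* Token pasting generates one clamp_set helper per bitmap name. */
#define SKETCH_DEFINE_CLAMP_OPS(name)					\
static void sketch_clamp_set_##name(struct sketch_bitmaps *b,		\
		uint64_t page_start, uint64_t start, uint32_t len)	\
{									\
	sketch_clamp_range(page_start, &start, &len);			\
	b->name |= sketch_mask(page_start, start, len);			\
}

SKETCH_DEFINE_CLAMP_OPS(uptodate)
SKETCH_DEFINE_CLAMP_OPS(dirty)
```

The design point is that callers of the clamp variants can pass a range that spans several pages and call the helper once per page; each call only touches the bits that belong to that page.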
bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
					struct folio *folio, u64 start, u32 len);

void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio);
void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
			       struct folio *folio, u64 start, u32 len);
void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
				      struct folio *folio, u64 start, u32 len);
btrfs: subpage: fix writeback which does not have ordered extent
[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:
kernel BUG at fs/btrfs/file-item.c:667!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
Hardware name: Radxa ROCK Pi 4B (DT)
Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
Call trace:
btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
btrfs_submit_bio_start+0x20/0x30 [btrfs]
run_one_async_start+0x28/0x44 [btrfs]
btrfs_work_helper+0x128/0x1b4 [btrfs]
process_one_work+0x22c/0x430
worker_thread+0x70/0x3a0
kthread+0x13c/0x140
ret_from_fork+0x10/0x30
[CAUSE]
Above BUG_ON() means there is some bio range which doesn't have ordered
extent, which indeed is worth a BUG_ON().
Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.
This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.
In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.
But there is loophole, if we hit a range which is beyond i_size, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty.
This means, if we hit above branch, the subpage dirty bits are still
there, if other range of the page get dirtied and we need to writeback
that page again, we will submit bio for the old range, leaving a wild
bio range which doesn't have ordered extent.
[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().
Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
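
The invariant this fix establishes can be sketched as a loop that clears the subpage dirty bit for every sector it visits, whether or not a bio is submitted for it, and then asserts that nothing is left dirty. The names, the 16 sectors per page, and the sector-granular i_size are assumptions for illustration; this is not the body of __extent_writepage_io().

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SECTORS_PER_PAGE	16u	/* e.g. 64K page, 4K sectors */

/*
 * Walk the dirty sectors of one page. Sectors at or beyond
 * @sectors_in_isize lie past i_size and are not submitted, but their
 * dirty bits are cleared all the same. Returns the number of sectors
 * that would be submitted for writeback.
 */
static unsigned int sketch_writepage_io(uint32_t *dirty_bitmap,
					unsigned int sectors_in_isize)
{
	unsigned int submitted = 0;

	for (unsigned int i = 0; i < SKETCH_SECTORS_PER_PAGE; i++) {
		if (!(*dirty_bitmap & (1u << i)))
			continue;

		/* Always clear the subpage dirty bit, even beyond i_size. */
		*dirty_bitmap &= ~(1u << i);

		if (i >= sectors_in_isize)
			continue;	/* nothing to submit for this sector */
		submitted++;
	}

	/* Plays the role of the new assert: no stale dirty bits remain. */
	assert(*dirty_bitmap == 0);
	return submitted;
}
```

If the clear were skipped for sectors past i_size, a later writeback of the same page would see those stale bits and submit a range that has no ordered extent, which is the failure described in the [CAUSE] section.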

#endif