// SPDX-License-Identifier: GPL-2.0

#include <linux/slab.h>
#include "messages.h"
#include "ctree.h"
#include "subpage.h"
#include "btrfs_inode.h"

/*
 * Subpage (sectorsize < PAGE_SIZE) support overview:
 *
 * Limitations:
 *
 * - Only support 64K page size for now
 *   This is to make metadata handling easier, as a 64K page ensures
 *   any nodesize fits inside one page, thus we don't need to handle
 *   cases where a tree block crosses several pages.
 *
 * - Only metadata read-write for now
 *   The data read-write part is in development.
 *
 * - Metadata can't cross 64K page boundary
 *   btrfs-progs and kernel have done that for a while, thus only ancient
 *   filesystems could have such a problem.  For such a case, do a graceful
 *   rejection.
 *
 * Special behavior:
 *
 * - Metadata
 *   Metadata read is fully supported.
 *   This means reading one tree block only triggers the read for the
 *   needed range; other unrelated ranges in the same page are not touched.
 *
 *   Metadata write support is partial.
 *   The writeback is still for the full page, but we will only submit
 *   the dirty extent buffers in the page.
 *
 *   This means, if we have a metadata page like this:
 *
 *   Page offset
 *   0         16K         32K         48K        64K
 *   |/////////|           |///////////|
 *        \- Tree block A        \- Tree block B
 *
 *   Even if we just want to writeback tree block A, we will also writeback
 *   tree block B if it's also dirty.
 *
 *   This may cause extra metadata writeback, which results in more COW.
 *
 * Implementation:
 *
 * - Common
 *   Both metadata and data will use a new structure, btrfs_subpage, to
 *   record the status of each sector inside a page.  This provides the extra
 *   granularity needed.
 *
 * - Metadata
 *   Since we have multiple tree blocks inside one page, we can't rely on page
 *   locking anymore, or we will have greatly reduced concurrency or even
 *   deadlocks (hold one tree lock while trying to lock another tree lock in
 *   the same page).
 *
 *   Thus for metadata locking, subpage support relies on io_tree locking only.
 *   This means a slightly higher tree locking latency.
 */
bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info, struct address_space *mapping)
{
	if (fs_info->sectorsize >= PAGE_SIZE)
		return false;

	/*
	 * Only data pages (either through DIO or compression) can have no
	 * mapping. And if page->mapping->host is a data inode, it's subpage,
	 * as we have already ruled out the sectorsize >= PAGE_SIZE case.
	 */
	if (!mapping || !mapping->host || is_data_inode(BTRFS_I(mapping->host)))
		return true;

	/*
	 * Now the only remaining case is metadata, which we only go subpage
	 * routine if nodesize < PAGE_SIZE.
	 */
	if (fs_info->nodesize < PAGE_SIZE)
		return true;
	return false;
}
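
/*
 * Illustrative userspace sketch, not part of the kernel build: the decision
 * made by btrfs_is_subpage() above, reduced to plain integers.  The name
 * demo_is_subpage() and its parameters are hypothetical; page_size stands in
 * for PAGE_SIZE, and is_metadata is assumed to be true only for a folio that
 * has a mapping whose host is not a data inode.
 */

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the three checks in btrfs_is_subpage(). */
static bool demo_is_subpage(uint32_t sectorsize, uint32_t nodesize,
			    uint32_t page_size, bool is_metadata)
{
	/* Assumption of this sketch: page sizes are non-zero powers of two. */
	assert(page_size && !(page_size & (page_size - 1)));

	/* Sectors as large as a page never need per-sector tracking. */
	if (sectorsize >= page_size)
		return false;

	/* Unmapped folios and data folios always take the subpage path. */
	if (!is_metadata)
		return true;

	/* Metadata is subpage only when a tree block is smaller than a page. */
	return nodesize < page_size;
}
```

/*
 * Under these assumptions, 4K sectors with 16K nodes on a 64K page are
 * subpage for both data and metadata, while 4K sectors with 64K nodes on a
 * 64K page are subpage for data only.
 */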

void btrfs_init_subpage_info(struct btrfs_subpage_info *subpage_info, u32 sectorsize)
{
	unsigned int cur = 0;
	unsigned int nr_bits;

	ASSERT(IS_ALIGNED(PAGE_SIZE, sectorsize));

	nr_bits = PAGE_SIZE / sectorsize;
	subpage_info->bitmap_nr_bits = nr_bits;

	subpage_info->uptodate_offset = cur;
	cur += nr_bits;

	subpage_info->dirty_offset = cur;
	cur += nr_bits;

	subpage_info->writeback_offset = cur;
	cur += nr_bits;

	subpage_info->ordered_offset = cur;
	cur += nr_bits;

	subpage_info->checked_offset = cur;
	cur += nr_bits;

	subpage_info->locked_offset = cur;
	cur += nr_bits;

	subpage_info->total_nr_bits = cur;
}
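
/*
 * Illustrative userspace sketch, not part of the kernel build: the bitmap
 * layout produced by btrfs_init_subpage_info() above, with the page size
 * passed in explicitly.  struct demo_subpage_info and
 * demo_init_subpage_info() are hypothetical names; only the field names are
 * taken from the function above.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the offsets filled in by the kernel helper. */
struct demo_subpage_info {
	unsigned int bitmap_nr_bits;
	unsigned int uptodate_offset;
	unsigned int dirty_offset;
	unsigned int writeback_offset;
	unsigned int ordered_offset;
	unsigned int checked_offset;
	unsigned int locked_offset;
	unsigned int total_nr_bits;
};

static void demo_init_subpage_info(struct demo_subpage_info *info,
				   uint32_t page_size, uint32_t sectorsize)
{
	unsigned int cur = 0;
	unsigned int nr_bits;

	/* Same precondition as the kernel ASSERT(): whole sectors per page. */
	assert(sectorsize && page_size % sectorsize == 0);

	/* One bit per sector, one sub-bitmap per tracked state. */
	nr_bits = page_size / sectorsize;
	info->bitmap_nr_bits = nr_bits;

	info->uptodate_offset = cur;
	cur += nr_bits;
	info->dirty_offset = cur;
	cur += nr_bits;
	info->writeback_offset = cur;
	cur += nr_bits;
	info->ordered_offset = cur;
	cur += nr_bits;
	info->checked_offset = cur;
	cur += nr_bits;
	info->locked_offset = cur;
	cur += nr_bits;

	/* Total size of the packed bitmap, in bits. */
	info->total_nr_bits = cur;
}
```

/*
 * With a 64K page and 4K sectors each sub-bitmap takes 16 bits, so the six
 * sub-bitmaps pack into 96 bits in total.
 */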

int btrfs_attach_subpage(const struct btrfs_fs_info *fs_info,
			 struct folio *folio, enum btrfs_subpage_type type)
{
	struct btrfs_subpage *subpage;

	/*
	 * We have cases like a dummy extent buffer page, which is not mapped
	 * and doesn't need to be locked.
	 */
	if (folio->mapping)
		ASSERT(folio_test_locked(folio));

	/* Either not subpage, or the folio already has private attached. */
	if (!btrfs_is_subpage(fs_info, folio->mapping) || folio_test_private(folio))
		return 0;

	subpage = btrfs_alloc_subpage(fs_info, type);
	if (IS_ERR(subpage))
		return PTR_ERR(subpage);

	folio_attach_private(folio, subpage);
	return 0;
}

void btrfs_detach_subpage(const struct btrfs_fs_info *fs_info, struct folio *folio)
{
	struct btrfs_subpage *subpage;

	/* Either not subpage, or the folio has no private attached. */
	if (!btrfs_is_subpage(fs_info, folio->mapping) || !folio_test_private(folio))
		return;

	subpage = folio_detach_private(folio);
	ASSERT(subpage);
	btrfs_free_subpage(subpage);
}

struct btrfs_subpage *btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info,
					  enum btrfs_subpage_type type)
{
	struct btrfs_subpage *ret;
	unsigned int real_size;

	ASSERT(fs_info->sectorsize < PAGE_SIZE);

	real_size = struct_size(ret, bitmaps,
			BITS_TO_LONGS(fs_info->subpage_info->total_nr_bits));
	ret = kzalloc(real_size, GFP_NOFS);
	if (!ret)
		return ERR_PTR(-ENOMEM);

	spin_lock_init(&ret->lock);
	if (type == BTRFS_SUBPAGE_METADATA) {
		atomic_set(&ret->eb_refs, 0);
	} else {
		atomic_set(&ret->readers, 0);
		atomic_set(&ret->writers, 0);
	}
	return ret;
}

void btrfs_free_subpage(struct btrfs_subpage *subpage)
{
	kfree(subpage);
}

/*
 * Increase the eb_refs of current subpage.
 *
 * This is important for eb allocation, to prevent race with last eb freeing
 * of the same page.
 * With the eb_refs increased before the eb inserted into radix tree,
 * detach_extent_buffer_page() won't detach the folio private while we're still
 * allocating the extent buffer.
 */
void btrfs_folio_inc_eb_refs(const struct btrfs_fs_info *fs_info, struct folio *folio)
{
	struct btrfs_subpage *subpage;

	if (!btrfs_is_subpage(fs_info, folio->mapping))
		return;

	ASSERT(folio_test_private(folio) && folio->mapping);
	lockdep_assert_held(&folio->mapping->i_private_lock);

	subpage = folio_get_private(folio);
	atomic_inc(&subpage->eb_refs);
}

void btrfs_folio_dec_eb_refs(const struct btrfs_fs_info *fs_info, struct folio *folio)
{
	struct btrfs_subpage *subpage;

	if (!btrfs_is_subpage(fs_info, folio->mapping))
		return;

	ASSERT(folio_test_private(folio) && folio->mapping);
for-6.8-tag
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
2024-01-10 20:27:40 +03:00
	lockdep_assert_held(&folio->mapping->i_private_lock);
2021-01-26 11:33:50 +03:00
2023-11-17 06:54:14 +03:00
	subpage = folio_get_private(folio);
2021-01-26 11:33:50 +03:00
	ASSERT(atomic_read(&subpage->eb_refs));
	atomic_dec(&subpage->eb_refs);
}
2021-01-26 11:33:52 +03:00
btrfs: integrate page status update for data read path into begin/end_page_read
In btrfs data page read path, the page status update are handled in two
different locations:
btrfs_do_read_page()
{
while (cur <= end) {
/* No need to read from disk */
if (HOLE/PREALLOC/INLINE){
memset();
set_extent_uptodate();
continue;
}
/* Read from disk */
ret = submit_extent_page(end_bio_extent_readpage);
}
end_bio_extent_readpage()
{
endio_readpage_uptodate_page_status();
}
This is fine for sectorsize == PAGE_SIZE case, as for above loop we
should only hit one branch and then exit.
But for subpage, there is more work to be done in page status update:
- Page Unlock condition
Unlike regular page size == sectorsize case, we can no longer just
unlock a page.
Only the last reader of the page can unlock the page.
This means, we can unlock the page either in the while() loop, or in
the endio function.
- Page uptodate condition
Since we have multiple sectors to read for a page, we can only mark
the full page uptodate if all sectors are uptodate.
To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:
- begin_page_read()
For regular case, it does nothing.
For subpage case, it updates the reader counters so that later
end_page_read() can know who is the last one to unlock the page.
- end_page_read()
This is just endio_readpage_uptodate_page_status() renamed.
The original name is a little too long and too specific for endio.
The new thing added is the condition for page unlock.
Now for subpage data, we unlock the page if we're the last reader.
This does not only provide the basis for subpage data read, but also
hide the special handling of page read from the main read loop.
Also, since we're changing how the page lock is handled, there are two
existing error paths where we need to manually unlock the page before
calling begin_page_read().
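The "last reader unlocks" rule can be illustrated with a small user-space
sketch. The struct and the demo_*() helpers are hypothetical and only model
the counting; they are not the kernel's begin_page_read()/end_page_read().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-page state: pending sector reads plus the page lock. */
struct demo_read_state {
	atomic_int readers;	/* sectors whose read has not finished yet */
	bool locked;		/* models the page lock */
};

/* Account for nr_sectors sectors that are about to be read. */
static void demo_begin_read(struct demo_read_state *st, int nr_sectors)
{
	atomic_fetch_add(&st->readers, nr_sectors);
}

/*
 * Finish nr_sectors sectors. Only the caller that drops the counter to
 * zero unlocks the page. Returns true if this call unlocked it.
 */
static bool demo_end_read(struct demo_read_state *st, int nr_sectors)
{
	int before = atomic_fetch_sub(&st->readers, nr_sectors);

	assert(before >= nr_sectors);
	if (before == nr_sectors) {
		st->locked = false;
		return true;
	}
	return false;
}
```

Because the decrement and the "was I last" check are a single atomic
operation, it does not matter whether the final sector completes in the
read loop (hole, prealloc, inline) or in the endio path.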
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-02 05:28:36 +03:00
static void btrfs_subpage_assert(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
				 struct folio *folio, u64 start, u32 len)
2021-01-26 11:33:52 +03:00
{
2023-12-12 05:28:37 +03:00
	/* For subpage support, the folio must be single page. */
	ASSERT(folio_order(folio) == 0);
2023-11-17 06:54:14 +03:00
2021-01-26 11:33:52 +03:00
	/* Basic checks */
2023-11-17 06:54:14 +03:00
	ASSERT(folio_test_private(folio) && folio_get_private(folio));
2021-01-26 11:33:52 +03:00
	ASSERT(IS_ALIGNED(start, fs_info->sectorsize) &&
	       IS_ALIGNED(len, fs_info->sectorsize));
	/*
	 * The range check only works for mapped page, we can still have
	 * unmapped page like dummy extent buffer pages.
	 */
2023-12-12 05:28:37 +03:00
	if (folio->mapping)
		ASSERT(folio_pos(folio) <= start &&
		       start + len <= folio_pos(folio) + PAGE_SIZE);
2021-02-02 05:28:36 +03:00
}
2024-02-17 09:29:49 +03:00
#define subpage_calc_start_bit(fs_info, folio, name, start, len) \
({ \
2024-05-20 20:49:17 +03:00
	unsigned int __start_bit; \
2024-02-17 09:29:49 +03:00
	\
	btrfs_subpage_assert(fs_info, folio, start, len); \
2024-05-20 20:49:17 +03:00
	__start_bit = offset_in_page(start) >> fs_info->sectorsize_bits; \
	__start_bit += fs_info->subpage_info->name##_offset; \
	__start_bit; \
2024-02-17 09:29:49 +03:00
})
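As a worked example of the arithmetic in the macro above, here is a small
user-space sketch. It assumes a 64K page and takes the per-bitmap offset as
a plain parameter; DEMO_PAGE_SIZE, demo_calc_start_bit() and the offset
values used with it are illustrative assumptions, not values taken from the
kernel.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 65536u	/* assumed page size (64K) */

/*
 * Convert a byte offset into a bit index: take the offset inside the page,
 * shift it down by the sector size, then add the offset of the bitmap
 * (uptodate, locked, ...) inside the shared bitmap array.
 */
static unsigned int demo_calc_start_bit(uint64_t start,
					unsigned int sectorsize_bits,
					unsigned int bitmap_offset)
{
	unsigned int in_page = (unsigned int)(start & (DEMO_PAGE_SIZE - 1));

	return (in_page >> sectorsize_bits) + bitmap_offset;
}
```

With 4K sectors (sectorsize_bits == 12), a range starting 0x3000 bytes into
its page maps to sector 3 of that page; a hypothetical bitmap offset of 32
would then give bit 35.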
2021-02-02 05:28:36 +03:00
void btrfs_subpage_start_reader(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
				struct folio *folio, u64 start, u32 len)
2021-02-02 05:28:36 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2024-02-17 09:29:49 +03:00
	const int start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len);
2021-02-02 05:28:36 +03:00
	const int nbits = len >> fs_info->sectorsize_bits;
2024-02-17 09:29:49 +03:00
	unsigned long flags;
2021-02-02 05:28:36 +03:00
2023-12-12 05:28:37 +03:00
	btrfs_subpage_assert(fs_info, folio, start, len);
2021-02-02 05:28:36 +03:00
2024-02-17 09:29:49 +03:00
	spin_lock_irqsave(&subpage->lock, flags);
	/*
	 * Even though it's just for reading the page, no one should have
	 * locked the subpage range.
	 */
	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
	bitmap_set(subpage->bitmaps, start_bit, nbits);
btrfs: subpage: fix a rare race between metadata endio and eb freeing
[BUG]
There is a very rare ASSERT() triggering during full fstests run for
subpage rw support.
No other reproducer so far.
The ASSERT() gets triggered for metadata read in
btrfs_page_set_uptodate() inside end_page_read().
[CAUSE]
There is still a small race window for metadata only, the race could
happen like this:
T1 | T2
------------------------------------+-----------------------------
end_bio_extent_readpage() |
|- btrfs_validate_metadata_buffer() |
| |- free_extent_buffer() |
| Still have 2 refs |
|- end_page_read() |
|- if (unlikely(PagePrivate()) |
| The page still has Private |
| | free_extent_buffer()
| | | Only one ref 1, will be
| | | released
| | |- detach_extent_buffer_page()
| | |- btrfs_detach_subpage()
|- btrfs_set_page_uptodate() |
The page no longer has Private|
>>> ASSERT() triggered <<< |
This race window is super small, thus pretty hard to hit, even with so
many runs of fstests.
But the race window is still there, we have to go another way to solve
it other than relying on random PagePrivate() check.
Data path is not affected, as it will lock the page before reading,
while unlocking the page after the last read has finished, thus no race
window.
[FIX]
This patch will fix the bug by repurposing btrfs_subpage::readers.
Now btrfs_subpage::readers will be a member shared by both metadata and
data.
For metadata path, we don't do the page unlock as metadata only relies
on extent locking.
At the same time, teach page_range_has_eb() to take
btrfs_subpage::readers into consideration.
So that even if the last eb of a page gets freed, page::private won't be
detached as long as there still are pending end_page_read() calls.
By this we eliminate the race window, this will slightly increase the
metadata memory usage, as the page may not be released as frequently as
usual. But it should not be a big deal.
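A minimal sketch of the extended check, assuming a simplified structure
with just the two counters. struct demo_inuse_subpage and
demo_page_still_in_use() are hypothetical names; the real
page_range_has_eb() is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the two counters that keep page private alive. */
struct demo_inuse_subpage {
	atomic_int eb_refs;	/* extent buffers being attached to the page */
	atomic_int readers;	/* sectors with a pending end_page_read() */
};

/*
 * Page private may only be detached when neither an extent buffer nor a
 * pending read still depends on it.
 */
static bool demo_page_still_in_use(struct demo_inuse_subpage *sp)
{
	return atomic_load(&sp->eb_refs) > 0 ||
	       atomic_load(&sp->readers) > 0;
}
```

In this model, freeing the last eb while a read is still pending leaves
readers non-zero, so the check reports the page as in use and the detach is
skipped until the read side finishes.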
The code got introduced in ("btrfs: submit read time repair only for
each corrupted sector"), but the fix is in a separate patch to keep the
problem description and the crash is rare so it should not hurt
bisectability.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-07 12:02:58 +03:00
	atomic_add(nbits, &subpage->readers);
2024-02-17 09:29:49 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
2021-02-02 05:28:36 +03:00
}
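The bookkeeping done by the function above (check that the range is
unlocked, mark it locked, count the sectors as readers) can be modelled in
user space roughly as follows. This is a single-threaded sketch under
stated assumptions: the locked bitmap is one 64-bit word, the spinlock is
omitted, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one word for the locked bitmap plus a reader count. */
struct demo_lock_subpage {
	uint64_t locked_bitmap;
	int readers;
};

/* Build a mask of nbits consecutive bits starting at start_bit. */
static uint64_t demo_range_mask(unsigned int start_bit, unsigned int nbits)
{
	assert(nbits > 0 && nbits < 64 && start_bit + nbits <= 64);
	return ((UINT64_C(1) << nbits) - 1) << start_bit;
}

/* Mark the sector range as locked for reading and account for it. */
static void demo_start_reader(struct demo_lock_subpage *sp,
			      unsigned int start_bit, unsigned int nbits)
{
	uint64_t mask = demo_range_mask(start_bit, nbits);

	/* Nobody may already hold any sector of this range. */
	assert((sp->locked_bitmap & mask) == 0);
	sp->locked_bitmap |= mask;
	sp->readers += (int)nbits;
}
```

Starting a read of two sectors at bit 3 sets bits 3 and 4 and raises the
reader count by two; starting a second read over an overlapping range would
trip the assertion, mirroring the ASSERT() in the kernel function.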
void btrfs_subpage_end_reader(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
			      struct folio *folio, u64 start, u32 len)
2021-02-02 05:28:36 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2024-02-17 09:29:49 +03:00
	const int start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len);
2021-02-02 05:28:36 +03:00
	const int nbits = len >> fs_info->sectorsize_bits;
2024-02-17 09:29:49 +03:00
	unsigned long flags;
2021-06-07 12:02:58 +03:00
	bool is_data;
	bool last;
2021-02-02 05:28:36 +03:00
2023-12-12 05:28:37 +03:00
	btrfs_subpage_assert(fs_info, folio, start, len);
2022-06-15 16:03:11 +03:00
	is_data = is_data_inode(BTRFS_I(folio->mapping->host));
2024-02-17 09:29:49 +03:00
	spin_lock_irqsave(&subpage->lock, flags);
	/* The range should have already been locked. */
	ASSERT(bitmap_test_range_all_set(subpage->bitmaps, start_bit, nbits));
2021-02-02 05:28:36 +03:00
	ASSERT(atomic_read(&subpage->readers) >= nbits);
2024-02-17 09:29:49 +03:00
	bitmap_clear(subpage->bitmaps, start_bit, nbits);
2021-06-07 12:02:58 +03:00
	last = atomic_sub_and_test(nbits, &subpage->readers);
	/*
	 * For data we need to unlock the page if the last read has finished.
	 *
	 * And please don't replace @last with atomic_sub_and_test() call
	 * inside if () condition.
	 * As we want the atomic_sub_and_test() to be always executed.
	 */
	if (is_data && last)
2023-12-12 05:28:37 +03:00
		folio_unlock(folio);
2024-02-17 09:29:49 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
2021-02-02 05:28:36 +03:00
}
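The reader accounting above can be illustrated outside the kernel. The following is a minimal userspace sketch, assuming C11 atomics in place of the kernel atomic_t API; struct subpage_model, model_start_reader() and model_end_reader() are hypothetical names used only for illustration, and a plain bool stands in for the folio lock. It shows why the subtraction is done unconditionally and only the unlock depends on is_data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct btrfs_subpage; only the reader count is modelled. */
struct subpage_model {
	atomic_int readers;
	bool locked;		/* models the folio lock */
};

/* Account for @nbits sectors that are about to be read. */
static void model_start_reader(struct subpage_model *sp, int nbits)
{
	atomic_fetch_add(&sp->readers, nbits);
}

/*
 * Drop @nbits sectors from the reader count.
 *
 * The subtraction always runs, for data and metadata alike; only the
 * unlock is conditional on @is_data. Returns true if this call released
 * the last reader.
 */
static bool model_end_reader(struct subpage_model *sp, int nbits, bool is_data)
{
	bool last;

	assert(atomic_load(&sp->readers) >= nbits);
	/* atomic_fetch_sub() returns the old value: old == nbits means we reached zero. */
	last = (atomic_fetch_sub(&sp->readers, nbits) == nbits);
	if (is_data && last)
		sp->locked = false;
	return last;
}
```

If the subtraction were placed inside the if () condition after is_data, short-circuit evaluation would skip it for metadata and the counter would never drop, which is what the comment in the function warns about.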
2023-12-12 05:28:37 +03:00
static void btrfs_subpage_clamp_range(struct folio *folio, u64 *start, u32 *len)
2021-05-31 11:50:44 +03:00
{
	u64 orig_start = *start;
	u32 orig_len = *len;
2023-12-12 05:28:37 +03:00
	*start = max_t(u64, folio_pos(folio), orig_start);
2021-09-27 10:21:49 +03:00
	/*
	 * For certain call sites like btrfs_drop_pages(), we may have pages
	 * beyond the target range. In that case, just set @len to 0, subpage
	 * helpers can handle @len == 0 without any problem.
	 */
2023-12-12 05:28:37 +03:00
	if (folio_pos(folio) >= orig_start + orig_len)
2021-09-27 10:21:49 +03:00
		*len = 0;
	else
2023-12-12 05:28:37 +03:00
		*len = min_t(u64, folio_pos(folio) + PAGE_SIZE,
2021-09-27 10:21:49 +03:00
			     orig_start + orig_len) - *start;
2021-05-31 11:50:44 +03:00
}
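The clamping arithmetic is easier to follow with concrete numbers. Below is a minimal userspace sketch of the same computation, assuming a 64K page size; MODEL_PAGE_SIZE and model_clamp_range() are hypothetical names, and the folio position is passed in directly instead of being read from a struct folio.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 65536ULL	/* assumption: 64K pages, as on a typical subpage setup */

/*
 * Clamp the byte range [*start, *start + *len) to the page that begins at
 * @folio_pos. A range that ends at or before the page start yields *len == 0.
 */
static void model_clamp_range(uint64_t folio_pos, uint64_t *start, uint32_t *len)
{
	uint64_t orig_start = *start;
	uint64_t orig_end = orig_start + *len;
	uint64_t folio_end = folio_pos + MODEL_PAGE_SIZE;

	*start = folio_pos > orig_start ? folio_pos : orig_start;
	if (folio_pos >= orig_end)
		*len = 0;
	else
		*len = (uint32_t)((folio_end < orig_end ? folio_end : orig_end) - *start);
}
```

For example, a 16K range starting at 60K, clamped to the page at 64K, becomes a 12K range starting at 64K.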
2024-02-17 09:29:48 +03:00
static void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
				       struct folio *folio, u64 start, u32 len)
2021-05-31 11:50:44 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2024-02-17 09:29:50 +03:00
	const int start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len);
2021-05-31 11:50:44 +03:00
	const int nbits = (len >> fs_info->sectorsize_bits);
2024-02-17 09:29:50 +03:00
	unsigned long flags;
2021-05-31 11:50:44 +03:00
	int ret;
2023-12-12 05:28:37 +03:00
	btrfs_subpage_assert(fs_info, folio, start, len);
2021-05-31 11:50:44 +03:00
2024-02-17 09:29:50 +03:00
	spin_lock_irqsave(&subpage->lock, flags);
2021-05-31 11:50:44 +03:00
	ASSERT(atomic_read(&subpage->readers) == 0);
2024-02-17 09:29:50 +03:00
	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
	bitmap_set(subpage->bitmaps, start_bit, nbits);
2021-05-31 11:50:44 +03:00
	ret = atomic_add_return(nbits, &subpage->writers);
	ASSERT(ret == nbits);
2024-02-17 09:29:50 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
2021-05-31 11:50:44 +03:00
}
2024-02-17 09:29:48 +03:00
static bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
					      struct folio *folio, u64 start, u32 len)
2021-05-31 11:50:44 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2024-02-17 09:29:50 +03:00
	const int start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len);
2021-05-31 11:50:44 +03:00
	const int nbits = (len >> fs_info->sectorsize_bits);
2024-02-17 09:29:50 +03:00
	unsigned long flags;
	bool last;
2021-05-31 11:50:44 +03:00
2023-12-12 05:28:37 +03:00
	btrfs_subpage_assert(fs_info, folio, start, len);
2021-05-31 11:50:44 +03:00
2024-02-17 09:29:50 +03:00
	spin_lock_irqsave(&subpage->lock, flags);
2021-09-27 10:22:06 +03:00
	/*
	 * We have call sites passing @locked_page into
	 * extent_clear_unlock_delalloc() for compression path.
	 *
	 * This @locked_page is locked by plain lock_page(), thus its
	 * subpage::writers is 0. Handle them in a special way.
	 */
2024-02-17 09:29:50 +03:00
	if (atomic_read(&subpage->writers) == 0) {
		spin_unlock_irqrestore(&subpage->lock, flags);
2021-09-27 10:22:06 +03:00
		return true;
2024-02-17 09:29:50 +03:00
	}
2021-09-27 10:22:06 +03:00
2021-05-31 11:50:44 +03:00
	ASSERT(atomic_read(&subpage->writers) >= nbits);
2024-02-17 09:29:50 +03:00
	/* The target range should have been locked. */
	ASSERT(bitmap_test_range_all_set(subpage->bitmaps, start_bit, nbits));
	bitmap_clear(subpage->bitmaps, start_bit, nbits);
	last = atomic_sub_and_test(nbits, &subpage->writers);
	spin_unlock_irqrestore(&subpage->lock, flags);
	return last;
2021-05-31 11:50:44 +03:00
}
/*
2023-12-12 05:28:37 +03:00
 * Lock a folio for delalloc page writeback.
2021-05-31 11:50:44 +03:00
 *
 * Return -EAGAIN if the page is not properly initialized.
 * Return 0 with the page locked, and writer counter updated.
 *
 * Even with 0 returned, the page still needs an extra check to make sure
 * it's really the correct page, as the caller is using
2022-08-24 03:40:20 +03:00
 * filemap_get_folios_contig(), which can race with page invalidating.
2021-05-31 11:50:44 +03:00
 */
2023-12-12 05:28:37 +03:00
int btrfs_folio_start_writer_lock(const struct btrfs_fs_info *fs_info,
				  struct folio *folio, u64 start, u32 len)
2021-05-31 11:50:44 +03:00
{
2023-12-12 05:28:37 +03:00
	if (unlikely(!fs_info) || !btrfs_is_subpage(fs_info, folio->mapping)) {
		folio_lock(folio);
2021-05-31 11:50:44 +03:00
		return 0;
	}
2023-12-12 05:28:37 +03:00
	folio_lock(folio);
2023-11-17 06:54:14 +03:00
	if (!folio_test_private(folio) || !folio_get_private(folio)) {
2023-12-12 05:28:37 +03:00
		folio_unlock(folio);
2021-05-31 11:50:44 +03:00
		return -EAGAIN;
	}
2023-12-12 05:28:37 +03:00
	btrfs_subpage_clamp_range(folio, &start, &len);
	btrfs_subpage_start_writer(fs_info, folio, start, len);
2021-05-31 11:50:44 +03:00
	return 0;
}
2023-12-12 05:28:37 +03:00
void btrfs_folio_end_writer_lock(const struct btrfs_fs_info *fs_info,
				 struct folio *folio, u64 start, u32 len)
2021-05-31 11:50:44 +03:00
{
2023-12-12 05:28:37 +03:00
	if (unlikely(!fs_info) || !btrfs_is_subpage(fs_info, folio->mapping)) {
		folio_unlock(folio);
		return;
	}
	btrfs_subpage_clamp_range(folio, &start, &len);
	if (btrfs_subpage_end_and_test_writer(fs_info, folio, start, len))
		folio_unlock(folio);
2021-05-31 11:50:44 +03:00
}
2021-08-17 12:38:52 +03:00
#define subpage_test_bitmap_all_set(fs_info, subpage, name) \
	bitmap_test_range_all_set(subpage->bitmaps, \
			fs_info->subpage_info->name##_offset, \
			fs_info->subpage_info->bitmap_nr_bits)
#define subpage_test_bitmap_all_zero(fs_info, subpage, name) \
	bitmap_test_range_all_zero(subpage->bitmaps, \
			fs_info->subpage_info->name##_offset, \
			fs_info->subpage_info->bitmap_nr_bits)
2021-01-26 11:33:52 +03:00
void btrfs_subpage_set_uptodate(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
				struct folio *folio, u64 start, u32 len)
2021-01-26 11:33:52 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
2021-08-17 12:38:52 +03:00
							uptodate, start, len);
2021-01-26 11:33:52 +03:00
	unsigned long flags;
	spin_lock_irqsave(&subpage->lock, flags);
2021-08-17 12:38:52 +03:00
	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
	if (subpage_test_bitmap_all_set(fs_info, subpage, uptodate))
2023-12-12 05:28:37 +03:00
		folio_mark_uptodate(folio);
2021-01-26 11:33:52 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
}
void btrfs_subpage_clear_uptodate(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
				  struct folio *folio, u64 start, u32 len)
2021-01-26 11:33:52 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
2021-08-17 12:38:52 +03:00
							uptodate, start, len);
2021-01-26 11:33:52 +03:00
	unsigned long flags;
	spin_lock_irqsave(&subpage->lock, flags);
2021-08-17 12:38:52 +03:00
	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
2023-12-12 05:28:37 +03:00
	folio_clear_uptodate(folio);
2021-01-26 11:33:52 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
}
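The set and clear helpers above are asymmetric: the folio is marked uptodate only once every sector bit is set, while clearing any range drops the folio flag at once. A minimal userspace sketch of that rule follows, assuming 16 sectors per page (64K page, 4K sectors) held in a single 16-bit word; struct uptodate_model and the model_* functions are hypothetical names, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SECTORS_PER_PAGE 16	/* assumption: 64K page, 4K sectors */

struct uptodate_model {
	uint16_t bitmap;	/* one bit per sector */
	bool folio_uptodate;	/* models the whole-folio uptodate flag */
};

/* Mask with @nbits ones starting at bit @start_bit. */
static uint16_t model_range_mask(unsigned int start_bit, unsigned int nbits)
{
	assert(start_bit + nbits <= MODEL_SECTORS_PER_PAGE);
	/* Shift in 32-bit arithmetic so that nbits == 16 is well defined. */
	return (uint16_t)(((1u << nbits) - 1u) << start_bit);
}

/* Set the sector bits; the folio becomes uptodate only when all bits are set. */
static void model_set_uptodate(struct uptodate_model *m,
			       unsigned int start_bit, unsigned int nbits)
{
	m->bitmap |= model_range_mask(start_bit, nbits);
	if (m->bitmap == UINT16_MAX)
		m->folio_uptodate = true;
}

/* Clear the sector bits; any cleared sector makes the folio not uptodate. */
static void model_clear_uptodate(struct uptodate_model *m,
				 unsigned int start_bit, unsigned int nbits)
{
	m->bitmap &= (uint16_t)~model_range_mask(start_bit, nbits);
	m->folio_uptodate = false;
}
```

The same all-set test is used by the checked helpers, while the dirty, writeback and ordered helpers use the opposite all-zero test on the clear side.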
2021-03-25 10:14:37 +03:00
void btrfs_subpage_set_dirty(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
			     struct folio *folio, u64 start, u32 len)
2021-03-25 10:14:37 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
2021-08-17 12:38:52 +03:00
							dirty, start, len);
2021-03-25 10:14:37 +03:00
	unsigned long flags;
	spin_lock_irqsave(&subpage->lock, flags);
2021-08-17 12:38:52 +03:00
	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
2021-03-25 10:14:37 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
2023-12-12 05:28:37 +03:00
	folio_mark_dirty(folio);
2021-03-25 10:14:37 +03:00
}
/*
 * Extra clear_and_test function for subpage dirty bitmap.
 *
 * Return true if we're the last bits in the dirty_bitmap and clear the
 * dirty_bitmap.
 * Return false otherwise.
 *
 * NOTE: Callers should manually clear page dirty for true case, as we have
 * extra handling for tree blocks.
 */
bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
					struct folio *folio, u64 start, u32 len)
2021-03-25 10:14:37 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
2021-08-17 12:38:52 +03:00
							dirty, start, len);
2021-03-25 10:14:37 +03:00
	unsigned long flags;
	bool last = false;
	spin_lock_irqsave(&subpage->lock, flags);
2021-08-17 12:38:52 +03:00
	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
	if (subpage_test_bitmap_all_zero(fs_info, subpage, dirty))
2021-03-25 10:14:37 +03:00
		last = true;
	spin_unlock_irqrestore(&subpage->lock, flags);
	return last;
}
void btrfs_subpage_clear_dirty(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
			       struct folio *folio, u64 start, u32 len)
2021-03-25 10:14:37 +03:00
{
	bool last;
2023-12-12 05:28:37 +03:00
	last = btrfs_subpage_clear_and_test_dirty(fs_info, folio, start, len);
2021-03-25 10:14:37 +03:00
	if (last)
2023-12-12 05:28:37 +03:00
		folio_clear_dirty_for_io(folio);
2021-03-25 10:14:37 +03:00
}
2021-03-25 10:14:38 +03:00
void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
				 struct folio *folio, u64 start, u32 len)
2021-03-25 10:14:38 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
2021-08-17 12:38:52 +03:00
							writeback, start, len);
2021-03-25 10:14:38 +03:00
	unsigned long flags;
	spin_lock_irqsave(&subpage->lock, flags);
2021-08-17 12:38:52 +03:00
	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
2024-01-11 01:14:21 +03:00
	if (!folio_test_writeback(folio))
		folio_start_writeback(folio);
2021-03-25 10:14:38 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
}
void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
				   struct folio *folio, u64 start, u32 len)
2021-03-25 10:14:38 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
2021-08-17 12:38:52 +03:00
							writeback, start, len);
2021-03-25 10:14:38 +03:00
	unsigned long flags;
	spin_lock_irqsave(&subpage->lock, flags);
2021-08-17 12:38:52 +03:00
	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
	if (subpage_test_bitmap_all_zero(fs_info, subpage, writeback)) {
2023-12-12 05:28:37 +03:00
		ASSERT(folio_test_writeback(folio));
		folio_end_writeback(folio);
2021-07-26 09:35:03 +03:00
	}
2021-03-25 10:14:38 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
}
2021-05-31 11:50:45 +03:00
void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
			       struct folio *folio, u64 start, u32 len)
2021-05-31 11:50:45 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
2021-08-17 12:38:52 +03:00
							ordered, start, len);
2021-05-31 11:50:45 +03:00
	unsigned long flags;
	spin_lock_irqsave(&subpage->lock, flags);
2021-08-17 12:38:52 +03:00
	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
2023-12-12 05:28:37 +03:00
	folio_set_ordered(folio);
2021-05-31 11:50:45 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
}
void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
				 struct folio *folio, u64 start, u32 len)
2021-05-31 11:50:45 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
2021-08-17 12:38:52 +03:00
							ordered, start, len);
2021-05-31 11:50:45 +03:00
	unsigned long flags;
	spin_lock_irqsave(&subpage->lock, flags);
2021-08-17 12:38:52 +03:00
	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
	if (subpage_test_bitmap_all_zero(fs_info, subpage, ordered))
2023-12-12 05:28:37 +03:00
		folio_clear_ordered(folio);
2021-05-31 11:50:45 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
}
2021-09-27 10:21:49 +03:00
void btrfs_subpage_set_checked(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
			       struct folio *folio, u64 start, u32 len)
2021-09-27 10:21:49 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
2021-09-27 10:21:49 +03:00
							checked, start, len);
	unsigned long flags;
	spin_lock_irqsave(&subpage->lock, flags);
	bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
	if (subpage_test_bitmap_all_set(fs_info, subpage, checked))
2023-12-12 05:28:37 +03:00
		folio_set_checked(folio);
2021-09-27 10:21:49 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
}
void btrfs_subpage_clear_checked(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
				 struct folio *folio, u64 start, u32 len)
2021-09-27 10:21:49 +03:00
{
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
2021-09-27 10:21:49 +03:00
							checked, start, len);
	unsigned long flags;
	spin_lock_irqsave(&subpage->lock, flags);
	bitmap_clear(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
2023-12-12 05:28:37 +03:00
	folio_clear_checked(folio);
2021-09-27 10:21:49 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
}
2021-01-26 11:33:52 +03:00
/*
 * Unlike set/clear which is dependent on each page status, for test all bits
 * are tested in the same way.
 */
#define IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(name) \
bool btrfs_subpage_test_##name(const struct btrfs_fs_info *fs_info, \
2023-12-12 05:28:37 +03:00
			       struct folio *folio, u64 start, u32 len) \
2021-01-26 11:33:52 +03:00
{ \
2023-11-17 06:54:14 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio); \
2023-12-12 05:28:37 +03:00
	unsigned int start_bit = subpage_calc_start_bit(fs_info, folio, \
2021-08-17 12:38:52 +03:00
							name, start, len); \
2021-01-26 11:33:52 +03:00
	unsigned long flags; \
	bool ret; \
 \
	spin_lock_irqsave(&subpage->lock, flags); \
2021-08-17 12:38:52 +03:00
	ret = bitmap_test_range_all_set(subpage->bitmaps, start_bit, \
					len >> fs_info->sectorsize_bits); \
2021-01-26 11:33:52 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags); \
	return ret; \
}
IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
2021-03-25 10:14:37 +03:00
IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
2021-03-25 10:14:38 +03:00
IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(writeback);
2021-05-31 11:50:45 +03:00
IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(ordered);
2021-09-27 10:21:49 +03:00
IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(checked);
2021-01-26 11:33:52 +03:00
/*
* Note that , in selftests ( extent - io - tests ) , we can have empty fs_info passed
* in . We only test sectorsize = = PAGE_SIZE cases so far , thus we can fall
* back to regular sectorsize branch .
*/
2023-12-12 05:28:37 +03:00
# define IMPLEMENT_BTRFS_PAGE_OPS(name, folio_set_func, \
folio_clear_func , folio_test_func ) \
void btrfs_folio_set_ # # name ( const struct btrfs_fs_info * fs_info , \
struct folio * folio , u64 start , u32 len ) \
2021-01-26 11:33:52 +03:00
{ \
2023-12-07 02:09:28 +03:00
if ( unlikely ( ! fs_info ) | | \
2023-12-12 05:28:37 +03:00
! btrfs_is_subpage ( fs_info , folio - > mapping ) ) { \
folio_set_func ( folio ) ; \
2021-01-26 11:33:52 +03:00
return ; \
} \
2023-12-12 05:28:37 +03:00
btrfs_subpage_set_ # # name ( fs_info , folio , start , len ) ; \
2021-01-26 11:33:52 +03:00
} \
2023-12-12 05:28:37 +03:00
void btrfs_folio_clear_ # # name ( const struct btrfs_fs_info * fs_info , \
				     struct folio *folio, u64 start, u32 len) \
2021-01-26 11:33:52 +03:00
{									\
2023-12-07 02:09:28 +03:00
	if (unlikely(!fs_info) ||					\
2023-12-12 05:28:37 +03:00
	    !btrfs_is_subpage(fs_info, folio->mapping)) {		\
		folio_clear_func(folio);				\
2021-01-26 11:33:52 +03:00
		return;							\
	}								\
2023-12-12 05:28:37 +03:00
	btrfs_subpage_clear_##name(fs_info, folio, start, len);		\
2021-01-26 11:33:52 +03:00
}									\
2023-12-12 05:28:37 +03:00
bool btrfs_folio_test_##name(const struct btrfs_fs_info *fs_info,	\
			     struct folio *folio, u64 start, u32 len)	\
2021-01-26 11:33:52 +03:00
{									\
2023-12-07 02:09:28 +03:00
	if (unlikely(!fs_info) ||					\
2023-12-12 05:28:37 +03:00
	    !btrfs_is_subpage(fs_info, folio->mapping))			\
		return folio_test_func(folio);				\
	return btrfs_subpage_test_##name(fs_info, folio, start, len);	\
2021-05-31 11:50:39 +03:00
}									\
2023-12-12 05:28:37 +03:00
void btrfs_folio_clamp_set_##name(const struct btrfs_fs_info *fs_info,	\
				  struct folio *folio, u64 start, u32 len) \
2021-05-31 11:50:39 +03:00
{									\
2023-12-07 02:09:28 +03:00
	if (unlikely(!fs_info) ||					\
2023-12-12 05:28:37 +03:00
	    !btrfs_is_subpage(fs_info, folio->mapping)) {		\
		folio_set_func(folio);					\
2021-05-31 11:50:39 +03:00
		return;							\
	}								\
2023-12-12 05:28:37 +03:00
	btrfs_subpage_clamp_range(folio, &start, &len);			\
	btrfs_subpage_set_##name(fs_info, folio, start, len);		\
2021-05-31 11:50:39 +03:00
}									\
2023-12-12 05:28:37 +03:00
void btrfs_folio_clamp_clear_##name(const struct btrfs_fs_info *fs_info, \
				    struct folio *folio, u64 start, u32 len) \
2021-05-31 11:50:39 +03:00
{									\
2023-12-07 02:09:28 +03:00
	if (unlikely(!fs_info) ||					\
2023-12-12 05:28:37 +03:00
	    !btrfs_is_subpage(fs_info, folio->mapping)) {		\
		folio_clear_func(folio);				\
2021-05-31 11:50:39 +03:00
		return;							\
	}								\
2023-12-12 05:28:37 +03:00
	btrfs_subpage_clamp_range(folio, &start, &len);			\
	btrfs_subpage_clear_##name(fs_info, folio, start, len);		\
2021-05-31 11:50:39 +03:00
}									\
2023-12-12 05:28:37 +03:00
bool btrfs_folio_clamp_test_##name(const struct btrfs_fs_info *fs_info,	\
				   struct folio *folio, u64 start, u32 len) \
2021-05-31 11:50:39 +03:00
{									\
2023-12-07 02:09:28 +03:00
	if (unlikely(!fs_info) ||					\
2023-12-12 05:28:37 +03:00
	    !btrfs_is_subpage(fs_info, folio->mapping))			\
		return folio_test_func(folio);				\
	btrfs_subpage_clamp_range(folio, &start, &len);			\
	return btrfs_subpage_test_##name(fs_info, folio, start, len);	\
}
IMPLEMENT_BTRFS_PAGE_OPS(uptodate, folio_mark_uptodate, folio_clear_uptodate,
			 folio_test_uptodate);
IMPLEMENT_BTRFS_PAGE_OPS(dirty, folio_mark_dirty, folio_clear_dirty_for_io,
			 folio_test_dirty);
IMPLEMENT_BTRFS_PAGE_OPS(writeback, folio_start_writeback, folio_end_writeback,
			 folio_test_writeback);
IMPLEMENT_BTRFS_PAGE_OPS(ordered, folio_set_ordered, folio_clear_ordered,
			 folio_test_ordered);
IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
			 folio_test_checked);
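The IMPLEMENT_BTRFS_PAGE_OPS() invocations above rely on preprocessor token pasting (##) to stamp out one family of helpers per flag name. A minimal userspace sketch of that pattern follows. Every demo_* name, the flag values, and the single-word struct are hypothetical and only illustrate the macro technique, not the kernel's folio or subpage data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: a single flag word. */
struct demo_folio {
	unsigned int flags;
};

/* Hypothetical flag bits, chosen only for this sketch. */
enum {
	DEMO_UPTODATE = 1 << 0,
	DEMO_DIRTY = 1 << 1,
};

/* One invocation emits set/clear/test helpers whose names embed @name. */
#define DEMO_IMPLEMENT_OPS(name, bit)					\
void demo_folio_set_##name(struct demo_folio *f)			\
{									\
	f->flags |= (unsigned int)(bit);				\
}									\
void demo_folio_clear_##name(struct demo_folio *f)			\
{									\
	f->flags &= ~(unsigned int)(bit);				\
}									\
bool demo_folio_test_##name(const struct demo_folio *f)			\
{									\
	return (f->flags & (unsigned int)(bit)) != 0;			\
}

DEMO_IMPLEMENT_OPS(uptodate, DEMO_UPTODATE)
DEMO_IMPLEMENT_OPS(dirty, DEMO_DIRTY)

/* Usage: each flag gets its own independently named helper family. */
void demo_usage(void)
{
	struct demo_folio f = { 0 };

	demo_folio_set_dirty(&f);
	assert(demo_folio_test_dirty(&f));
	assert(!demo_folio_test_uptodate(&f));
}
```

The design choice the sketch mirrors is that callers never see the macro, only the generated demo_folio_set_dirty()-style functions, so each flag can be grepped for by its full helper name.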
btrfs: subpage: fix writeback which does not have ordered extent
[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:
kernel BUG at fs/btrfs/file-item.c:667!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
Hardware name: Radxa ROCK Pi 4B (DT)
Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
Call trace:
btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
btrfs_submit_bio_start+0x20/0x30 [btrfs]
run_one_async_start+0x28/0x44 [btrfs]
btrfs_work_helper+0x128/0x1b4 [btrfs]
process_one_work+0x22c/0x430
worker_thread+0x70/0x3a0
kthread+0x13c/0x140
ret_from_fork+0x10/0x30
[CAUSE]
Above BUG_ON() means there is some bio range which doesn't have ordered
extent, which indeed is worth a BUG_ON().
Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.
This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.
In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.
But there is a loophole: if we hit a range which is beyond i_size, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty bits.
This means, if we hit the above branch, the subpage dirty bits are still
there. If another range of the page gets dirtied and we need to write back
that page again, we will submit a bio for the old range, leaving a wild
bio range which doesn't have an ordered extent.
[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().
Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-26 09:34:58 +03:00
/*
 * Make sure not only the page dirty bit is cleared, but also subpage dirty bit
 * is cleared.
 */
btrfs: make __extent_writepage_io() to write specified range only
Function __extent_writepage_io() is designed to find all dirty ranges of
a page, and add the dirty ranges to the bio_ctrl for submission.
It requires all the dirtied ranges to be covered by an ordered extent.
It gets called in two locations, but one call site is not subpage aware:
- __extent_writepage()
It gets called when writepage_delalloc() returned 0, which means
writepage_delalloc() has handled delalloc for all subpage sectors
inside the page.
So this call site is OK.
- extent_write_locked_range()
This call site is utilized by zoned support, and in this case, we may
only run delalloc range for a subset of the page, like this: (64K page
size)
0 16K 32K 48K 64K
|/////| |///////| |
In the above case, if extent_write_locked_range() is only triggered for
range [0, 16K), __extent_writepage_io() would still try to submit
the dirty range of [32K, 48K), then it would not find any ordered
extent for it and triggers various ASSERT()s.
Fix this problem by:
- Introducing @start and @len parameters to specify the range
For the first call site, we just pass the whole page, and the behavior
is not touched, since run_delalloc_range() for the page should have
created all ordered extents for the page.
For the second call site, we avoid touching anything beyond the
range, thus avoiding the dirty range which is not yet covered by any
delalloc range.
- Making btrfs_folio_assert_not_dirty() subpage aware
The only caller is inside __extent_writepage_io(), and since that
caller now accepts a subpage range, we should also check the subpage
range other than the whole page.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-16 07:03:41 +03:00
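The range-limited check described in the commit message above can be sketched in userspace as follows. It uses the 64K-page layout from the diagram and assumes 4K sectors and page-relative offsets. The demo_* names and the 32-bit bitmap are hypothetical simplifications, not the kernel's bitmap layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry for illustration: 64K page, 4K sectors, 16 bits used. */
#define DEMO_SECTORSIZE_BITS 12u
#define DEMO_NR_BITS 16u

/* Hypothetical stand-in for the subpage dirty bitmap. */
struct demo_subpage {
	uint32_t dirty;
};

/* Mask of the sectors covered by the page-relative range [start, start + len). */
uint32_t demo_range_mask(uint32_t start, uint32_t len)
{
	unsigned int start_bit = start >> DEMO_SECTORSIZE_BITS;
	unsigned int nbits = len >> DEMO_SECTORSIZE_BITS;

	assert(start_bit + nbits <= DEMO_NR_BITS);
	return ((UINT32_C(1) << nbits) - 1) << start_bit;
}

void demo_set_dirty(struct demo_subpage *sp, uint32_t start, uint32_t len)
{
	sp->dirty |= demo_range_mask(start, len);
}

void demo_clear_dirty(struct demo_subpage *sp, uint32_t start, uint32_t len)
{
	sp->dirty &= ~demo_range_mask(start, len);
}

/* True when no sector inside the given range is still marked dirty. */
bool demo_range_not_dirty(const struct demo_subpage *sp, uint32_t start,
			  uint32_t len)
{
	return (sp->dirty & demo_range_mask(start, len)) == 0;
}
```

With [0, 16K) and [32K, 48K) dirty, clearing only [0, 16K) makes demo_range_not_dirty() pass for that range while a whole-page check would still fail, which is why the assert has to take the same range as the writeback call.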
void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
				  struct folio *folio, u64 start, u32 len)
{
	struct btrfs_subpage *subpage;
	unsigned int start_bit;
	unsigned int nbits;
	unsigned long flags;
	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
		return;
	if (!btrfs_is_subpage(fs_info, folio->mapping)) {
		ASSERT(!folio_test_dirty(folio));
		return;
}
	start_bit = subpage_calc_start_bit(fs_info, folio, dirty, start, len);
	nbits = len >> fs_info->sectorsize_bits;
	subpage = folio_get_private(folio);
	ASSERT(subpage);
	spin_lock_irqsave(&subpage->lock, flags);
	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
	spin_unlock_irqrestore(&subpage->lock, flags);
}
2021-09-27 10:22:05 +03:00
/*
 * Handle different locked pages with different page sizes:
 *
 * - Page locked by plain lock_page()
 *   It should not have any subpage::writers count.
 *   Can be unlocked by unlock_page().
 *   This is the most common locked page for __extent_writepage() called
2022-06-21 10:49:44 +03:00
 *   inside extent_write_cache_pages().
2021-09-27 10:22:05 +03:00
 *   Rarer cases include the @locked_page from extent_write_locked_range().
 *
 * - Page locked by lock_delalloc_pages()
 *   There is only one caller, all pages except @locked_page for
 *   extent_write_locked_range().
 *   In this case, we have to call subpage helper to handle the case.
 */
2023-12-12 05:28:37 +03:00
void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
			       struct folio *folio, u64 start, u32 len)
2021-09-27 10:22:05 +03:00
{
	struct btrfs_subpage *subpage;
2023-12-12 05:28:37 +03:00
	ASSERT(folio_test_locked(folio));
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page sizes come along, especially 16K, this restriction is too
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that this patch will not yet enable other page size support.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 08:22:09 +03:00
	/* For non-subpage case, we just unlock the page */
2023-12-12 05:28:37 +03:00
	if (!btrfs_is_subpage(fs_info, folio->mapping)) {
		folio_unlock(folio);
		return;
	}
2021-09-27 10:22:05 +03:00
2023-11-17 06:54:14 +03:00
	ASSERT(folio_test_private(folio) && folio_get_private(folio));
	subpage = folio_get_private(folio);
2021-09-27 10:22:05 +03:00
	/*
	 * For subpage case, there are two types of locked page. With or
	 * without writers number.
	 *
	 * Since we own the page lock, no one else could touch subpage::writers
	 * and we are safe to do several atomic operations without spinlock.
	 */
2023-12-12 05:28:37 +03:00
	if (atomic_read(&subpage->writers) == 0) {
2021-09-27 10:22:05 +03:00
		/* No writers, locked by plain lock_page() */
2023-12-12 05:28:37 +03:00
		folio_unlock(folio);
		return;
	}
2021-09-27 10:22:05 +03:00
	/* Have writers, use proper subpage helper to end it */
2023-12-12 05:28:37 +03:00
	btrfs_folio_end_writer_lock(fs_info, folio, start, len);
2021-09-27 10:22:05 +03:00
}
2023-05-26 15:30:53 +03:00
2024-02-19 05:43:24 +03:00
/*
 * This is for folio already locked by plain lock_page()/folio_lock(), which
 * doesn't have any subpage awareness.
 *
 * This populates the involved subpage ranges so that subpage helpers can
 * properly unlock them.
 */
void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
				 struct folio *folio, u64 start, u32 len)
{
	struct btrfs_subpage *subpage;
	unsigned long flags;
	unsigned int start_bit;
	unsigned int nbits;
	int ret;
	ASSERT(folio_test_locked(folio));
	if (unlikely(!fs_info) || !btrfs_is_subpage(fs_info, folio->mapping))
		return;
	subpage = folio_get_private(folio);
	start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len);
	nbits = len >> fs_info->sectorsize_bits;
	spin_lock_irqsave(&subpage->lock, flags);
	/* Target range should not yet be locked. */
	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
	bitmap_set(subpage->bitmaps, start_bit, nbits);
	ret = atomic_add_return(nbits, &subpage->writers);
	ASSERT(ret <= fs_info->subpage_info->bitmap_nr_bits);
	spin_unlock_irqrestore(&subpage->lock, flags);
}
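The bookkeeping done by the function above (mark a range in the locked bitmap, then bump the writer count by the number of sectors) can be sketched in single-threaded userspace C. The demo_* names, the 4-sector geometry, and the plain counter in place of an atomic and a spinlock are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry for illustration: 16K page, 4K sectors, 4 bits used. */
#define DEMO_NR_BITS 4u

/* Hypothetical stand-in for the locked bitmap and the writers counter. */
struct demo_subpage_lock {
	uint32_t locked;	/* one bit per sector */
	unsigned int writers;	/* number of sectors currently locked */
};

/* Mark @nbits sectors starting at @start_bit as writer locked. */
void demo_set_writer_lock(struct demo_subpage_lock *sp,
			  unsigned int start_bit, unsigned int nbits)
{
	uint32_t mask;

	assert(start_bit + nbits <= DEMO_NR_BITS);
	mask = ((UINT32_C(1) << nbits) - 1) << start_bit;

	/* The target range must not already be locked. */
	assert((sp->locked & mask) == 0);
	sp->locked |= mask;

	/* The counter can never exceed the number of sectors in the page. */
	sp->writers += nbits;
	assert(sp->writers <= DEMO_NR_BITS);
}
```

Locking sector 0 and then sector 2 leaves the bitmap at 0x5 with a writer count of 2, so a later unlock path can tell both which ranges are held and when the last one is released.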
/*
 * Find any subpage writer locked range inside @folio, starting at file offset
 * @search_start. The caller should ensure the folio is locked.
 *
 * Return true and update @found_start_ret and @found_len_ret to the first
 * writer locked range.
 * Return false if there is no writer locked range.
 */
bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
				      struct folio *folio, u64 search_start,
				      u64 *found_start_ret, u32 *found_len_ret)
{
	struct btrfs_subpage_info *subpage_info = fs_info->subpage_info;
	struct btrfs_subpage *subpage = folio_get_private(folio);
	const unsigned int len = PAGE_SIZE - offset_in_page(search_start);
	const unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
						locked, search_start, len);
	const unsigned int locked_bitmap_start = subpage_info->locked_offset;
	const unsigned int locked_bitmap_end = locked_bitmap_start +
					       subpage_info->bitmap_nr_bits;
	unsigned long flags;
	int first_zero;
	int first_set;
	bool found = false;
	ASSERT(folio_test_locked(folio));
	spin_lock_irqsave(&subpage->lock, flags);
	first_set = find_next_bit(subpage->bitmaps, locked_bitmap_end, start_bit);
	if (first_set >= locked_bitmap_end)
		goto out;
	found = true;
	*found_start_ret = folio_pos(folio) +
		((first_set - locked_bitmap_start) << fs_info->sectorsize_bits);
	/*
	 * Since @first_set is ensured to be smaller than locked_bitmap_end
	 * here, @found_start_ret should be inside the folio.
	 */
	ASSERT(*found_start_ret < folio_pos(folio) + PAGE_SIZE);
	first_zero = find_next_zero_bit(subpage->bitmaps, locked_bitmap_end, first_set);
	*found_len_ret = (first_zero - first_set) << fs_info->sectorsize_bits;
out:
	spin_unlock_irqrestore(&subpage->lock, flags);
	return found;
}
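The search above boils down to "find the next set bit, then the next zero bit after it". A minimal userspace sketch of that scan follows, using plain loops in place of find_next_bit() and find_next_zero_bit(). The demo_* names, the 16K-page with 4K-sector geometry, and a locked bitmap that starts at bit 0 are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry for illustration: 16K page, 4K sectors, 4 bits used. */
#define DEMO_NR_BITS 4u
#define DEMO_SECTORSIZE 4096u

/*
 * Find the first run of set bits in @locked at or after @search_bit.
 * On success, report the run as a file offset and a byte length.
 */
bool demo_find_locked(uint32_t locked, unsigned int search_bit,
		      uint64_t folio_pos, uint64_t *found_start,
		      uint32_t *found_len)
{
	unsigned int first_set = search_bit;
	unsigned int first_zero;

	/* Equivalent of a "find next set bit" scan. */
	while (first_set < DEMO_NR_BITS && !(locked & (UINT32_C(1) << first_set)))
		first_set++;
	if (first_set >= DEMO_NR_BITS)
		return false;

	/* Equivalent of a "find next zero bit" scan, bounded by the bitmap. */
	first_zero = first_set;
	while (first_zero < DEMO_NR_BITS && (locked & (UINT32_C(1) << first_zero)))
		first_zero++;

	*found_start = folio_pos + (uint64_t)first_set * DEMO_SECTORSIZE;
	*found_len = (first_zero - first_set) * DEMO_SECTORSIZE;
	return true;
}
```

A caller that wants every locked range can loop, restarting the search at the end of the range just found, until the function returns false.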
/*
 * Unlike btrfs_folio_end_writer_lock() which unlocks a specified subpage range,
 * this ends all writer locked ranges of a page.
 *
 * This is for the locked page of __extent_writepage(), as the locked page
 * can contain several locked subpage ranges.
 */
void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info, struct folio *folio)
{
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K, 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page after the
last lock user has released it.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 09:39:32 +03:00
	struct btrfs_subpage *subpage = folio_get_private(folio);
2024-02-19 05:43:24 +03:00
	u64 folio_start = folio_pos(folio);
	u64 cur = folio_start;

	ASSERT(folio_test_locked(folio));
	if (!btrfs_is_subpage(fs_info, folio->mapping)) {
		folio_unlock(folio);
		return;
	}
2024-02-18 09:39:32 +03:00
	/* The page has no new delalloc range locked on it. Just plain unlock. */
	if (atomic_read(&subpage->writers) == 0) {
		folio_unlock(folio);
		return;
	}
2024-02-19 05:43:24 +03:00
	while (cur < folio_start + PAGE_SIZE) {
		u64 found_start;
		u32 found_len;
		bool found;
		bool last;

		found = btrfs_subpage_find_writer_locked(fs_info, folio, cur,
							 &found_start, &found_len);
		if (!found)
			break;
		last = btrfs_subpage_end_and_test_writer(fs_info, folio,
							 found_start, found_len);
		if (last) {
			folio_unlock(folio);
			break;
		}
		cur = found_start + found_len;
	}
}
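
The release loop above can be modelled in userspace to make the "unlock only when the last writer is gone" rule concrete. This is a hedged sketch under stated assumptions, not the kernel code: `struct demo_subpage`, `demo_end_and_test_writer()` and `demo_end_all_writers()` are hypothetical names, the writer count is assumed to be kept in units of sectors, the bitmap is a single 32-bit word, and the spinlock and atomics are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-folio subpage state. */
struct demo_subpage {
	uint32_t locked;		/* one bit per writer-locked sector */
	unsigned int writers;		/* locked sectors still held (assumed unit) */
	unsigned int nr_sectors;	/* at most 32 in this sketch */
	bool folio_locked;
};

/* Clear bits [first, first + nr) and report whether the writer count hit zero. */
static bool demo_end_and_test_writer(struct demo_subpage *sp, unsigned int first,
				     unsigned int nr)
{
	for (unsigned int i = first; i < first + nr; i++) {
		assert(sp->locked & (UINT32_C(1) << i));
		sp->locked &= ~(UINT32_C(1) << i);
	}
	assert(sp->writers >= nr);
	sp->writers -= nr;
	return sp->writers == 0;
}

/* Release every locked run; unlock the folio when the last one is dropped. */
static void demo_end_all_writers(struct demo_subpage *sp)
{
	unsigned int cur = 0;

	assert(sp->nr_sectors <= 32);
	assert(sp->folio_locked);

	/* Nothing was range-locked on this folio: plain unlock. */
	if (sp->writers == 0) {
		sp->folio_locked = false;
		return;
	}
	while (cur < sp->nr_sectors) {
		unsigned int first;
		unsigned int end;

		/* Find the next locked run starting at or after @cur. */
		for (first = cur; first < sp->nr_sectors; first++)
			if (sp->locked & (UINT32_C(1) << first))
				break;
		if (first == sp->nr_sectors)
			break;
		for (end = first; end < sp->nr_sectors; end++)
			if (!(sp->locked & (UINT32_C(1) << end)))
				break;

		if (demo_end_and_test_writer(sp, first, end - first)) {
			sp->folio_locked = false;
			break;
		}
		cur = end;
	}
}
```

If the count stays above zero after all runs in the bitmap are released, the folio remains locked in this model, mirroring the case where another holder still has to drop its range.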
2023-05-26 15:30:53 +03:00
#define GET_SUBPAGE_BITMAP(subpage, subpage_info, name, dst)		\
	bitmap_cut(dst, subpage->bitmaps, 0,				\
		   subpage_info->name##_offset, subpage_info->bitmap_nr_bits)

void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
2023-12-12 05:28:37 +03:00
				      struct folio *folio, u64 start, u32 len)
2023-05-26 15:30:53 +03:00
{
	struct btrfs_subpage_info *subpage_info = fs_info->subpage_info;
	struct btrfs_subpage *subpage;
	unsigned long uptodate_bitmap;
	unsigned long dirty_bitmap;
	unsigned long writeback_bitmap;
	unsigned long ordered_bitmap;
	unsigned long checked_bitmap;
	unsigned long locked_bitmap;
	unsigned long flags;

2023-11-17 06:54:14 +03:00
	ASSERT(folio_test_private(folio) && folio_get_private(folio));
2023-05-26 15:30:53 +03:00
	ASSERT(subpage_info);
2023-11-17 06:54:14 +03:00
	subpage = folio_get_private(folio);
2023-05-26 15:30:53 +03:00
	spin_lock_irqsave(&subpage->lock, flags);
	GET_SUBPAGE_BITMAP(subpage, subpage_info, uptodate, &uptodate_bitmap);
	GET_SUBPAGE_BITMAP(subpage, subpage_info, dirty, &dirty_bitmap);
	GET_SUBPAGE_BITMAP(subpage, subpage_info, writeback, &writeback_bitmap);
	GET_SUBPAGE_BITMAP(subpage, subpage_info, ordered, &ordered_bitmap);
	GET_SUBPAGE_BITMAP(subpage, subpage_info, checked, &checked_bitmap);
2024-02-17 09:29:49 +03:00
	/* Use a dedicated variable so the locked bitmap does not overwrite checked_bitmap. */
	GET_SUBPAGE_BITMAP(subpage, subpage_info, locked, &locked_bitmap);
2023-05-26 15:30:53 +03:00
	spin_unlock_irqrestore(&subpage->lock, flags);
2023-12-12 05:28:37 +03:00
	dump_page(folio_page(folio, 0), "btrfs subpage dump");
2023-05-26 15:30:53 +03:00
	btrfs_warn(fs_info,
2024-06-10 00:02:07 +03:00
"start=%llu len=%u page=%llu, bitmaps uptodate=%*pbl dirty=%*pbl writeback=%*pbl ordered=%*pbl checked=%*pbl locked=%*pbl",
2023-12-12 05:28:37 +03:00
		    start, len, folio_pos(folio),
2023-05-26 15:30:53 +03:00
		    subpage_info->bitmap_nr_bits, &uptodate_bitmap,
		    subpage_info->bitmap_nr_bits, &dirty_bitmap,
		    subpage_info->bitmap_nr_bits, &writeback_bitmap,
		    subpage_info->bitmap_nr_bits, &ordered_bitmap,
		    subpage_info->bitmap_nr_bits, &checked_bitmap,
		    subpage_info->bitmap_nr_bits, &locked_bitmap);
}
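
The dump helper reads one sub-bitmap per state out of a single packed bitmap. As a rough userspace illustration of that idea, the sketch below assumes the sub-bitmaps are stored back to back with the same number of bits each, in the order they are dumped above. `demo_get_bitmap()` and `enum demo_bitmap_index` are hypothetical names, and the bit-by-bit copy is a deliberately simple substitute for the kernel bitmap helpers, not a description of how they work.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Assumed packing order of the per-state bitmaps (illustrative only). */
enum demo_bitmap_index {
	DEMO_UPTODATE,
	DEMO_DIRTY,
	DEMO_WRITEBACK,
	DEMO_ORDERED,
	DEMO_CHECKED,
	DEMO_LOCKED,
};

/*
 * Copy the @nr_bits wide sub-bitmap number @idx out of @packed into the low
 * bits of the return value. Each state gets its own result, so one state
 * never overwrites another.
 */
static unsigned long demo_get_bitmap(const unsigned long *packed, unsigned int nr_bits,
				     enum demo_bitmap_index idx)
{
	unsigned long out = 0;
	unsigned int offset = (unsigned int)idx * nr_bits;

	assert(nr_bits > 0 && nr_bits <= DEMO_BITS_PER_LONG);

	for (unsigned int i = 0; i < nr_bits; i++) {
		unsigned int bit = offset + i;

		if (packed[bit / DEMO_BITS_PER_LONG] & (1UL << (bit % DEMO_BITS_PER_LONG)))
			out |= 1UL << i;
	}
	return out;
}
```

With 4 bits per state (a 16K page with 4K sectors), the dirty bits live at bit offset 4 and the locked bits at bit offset 20 of the packed value in this layout.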