2018-04-03 19:23:33 +02:00
// SPDX-License-Identifier: GPL-2.0
2007-06-12 09:07:21 -04:00
/*
 * Copyright (C) 2007 Oracle.  All rights reserved.
 */
2008-02-20 12:07:25 -05:00
#include <linux/bio.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch a large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 17:04:11 +09:00
#include <linux/slab.h>
2008-02-20 12:07:25 -05:00
#include <linux/pagemap.h>
#include <linux/highmem.h>
2019-04-01 11:29:58 +03:00
#include <linux/sched/mm.h>
2019-06-03 16:58:57 +02:00
#include <crypto/hash.h>
2007-03-15 19:03:33 -04:00
#include "ctree.h"
2007-03-26 16:00:06 -04:00
#include "disk-io.h"
2007-03-20 14:38:32 -04:00
#include "transaction.h"
2013-07-25 19:22:34 +08:00
#include "volumes.h"
2007-05-29 15:17:08 -04:00
#include "print-tree.h"
2016-03-10 17:26:59 +08:00
#include "compression.h"
2007-03-15 19:03:33 -04:00
2016-08-03 14:05:46 -07:00
#define __MAX_CSUM_ITEMS(r, size) ((unsigned long)(((BTRFS_LEAF_DATA_SIZE(r) - \
				   sizeof(struct btrfs_item) * 2) / \
				  size) - 1))
2009-01-06 11:42:00 -05:00
2012-09-20 14:33:00 -06:00
#define MAX_CSUM_ITEMS(r, size) (min_t(u32, __MAX_CSUM_ITEMS(r, size), \
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 15:29:47 +03:00
				       PAGE_SIZE))
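The arithmetic behind MAX_CSUM_ITEMS() can be checked in userspace. The following is only a sketch under stated assumptions: `max_csum_items()` is a hypothetical helper, and the leaf data size, item header size and page size are passed in as plain numbers rather than taken from BTRFS_LEAF_DATA_SIZE(), sizeof(struct btrfs_item) and PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of MAX_CSUM_ITEMS().
 *
 * leaf_data_size:   stand-in for BTRFS_LEAF_DATA_SIZE(r)
 * item_header_size: stand-in for sizeof(struct btrfs_item)
 * csum_size:        bytes per checksum
 * page_size:        stand-in for PAGE_SIZE, used as the upper clamp
 */
static uint32_t max_csum_items(uint32_t leaf_data_size, uint32_t item_header_size,
			       uint32_t csum_size, uint32_t page_size)
{
	/* Guard against division by zero and against underflow below. */
	assert(csum_size != 0);
	assert(leaf_data_size > 2 * item_header_size + csum_size);

	/* Room left after two item headers, in checksums, minus one. */
	unsigned long raw = (leaf_data_size - 2 * item_header_size) / csum_size - 1;

	/* Same clamp as min_t(u32, ..., PAGE_SIZE). */
	return raw < page_size ? (uint32_t)raw : page_size;
}
```

With assumed values of 16283 bytes of leaf data, a 25-byte item header and 4-byte checksums the limit comes from the leaf; with a much larger leaf the page-size clamp takes over.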
2012-01-31 20:19:02 -05:00
2020-01-17 09:02:21 -05:00
/**
2021-01-22 11:57:54 +02:00
 * Set inode's size according to filesystem options
 *
 * @inode:      inode we want to update the disk_i_size for
 * @new_i_size: i_size we want to set to, 0 if we use i_size
2020-01-17 09:02:21 -05:00
 *
 * With NO_HOLES set this simply sets the disk_i_size to whatever i_size_read()
 * returns as it is perfectly fine with a file that has holes without hole file
 * extent items.
 *
 * However without NO_HOLES we need to only return the area that is contiguous
 * from the 0 offset of the file.  Otherwise we could end up adjusting i_size up
 * to an extent that has a gap in between.
 *
 * Finally new_i_size should only be set in the case of truncate where we're not
 * ready to use i_size_read() as the limiter yet.
 */
2020-11-02 16:48:53 +02:00
void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_size)
2020-01-17 09:02:21 -05:00
{
2020-11-02 16:48:53 +02:00
	struct btrfs_fs_info *fs_info = inode->root->fs_info;
2020-01-17 09:02:21 -05:00
	u64 start, end, i_size;
	int ret;
2020-11-02 16:48:53 +02:00
	i_size = new_i_size ?: i_size_read(&inode->vfs_inode);
2020-01-17 09:02:21 -05:00
	if (btrfs_fs_incompat(fs_info, NO_HOLES)) {
2020-11-02 16:48:53 +02:00
		inode->disk_i_size = i_size;
2020-01-17 09:02:21 -05:00
		return;
	}
2020-11-02 16:48:53 +02:00
	spin_lock(&inode->lock);
	ret = find_contiguous_extent_bit(&inode->file_extent_tree, 0, &start,
					 &end, EXTENT_DIRTY);
2020-01-17 09:02:21 -05:00
	if (!ret && start == 0)
		i_size = min(i_size, end + 1);
	else
		i_size = 0;
2020-11-02 16:48:53 +02:00
	inode->disk_i_size = i_size;
	spin_unlock(&inode->lock);
2020-01-17 09:02:21 -05:00
}
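The size rule described in the comment above can be modelled in userspace. This is a minimal sketch, not the kernel implementation: `safe_disk_i_size()` and its parameters are hypothetical names, and the outcome of find_contiguous_extent_bit() is passed in as plain values (`found`, `start`, `end`) instead of being looked up in an extent tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the disk_i_size rule.
 *
 * no_holes:   whether the NO_HOLES feature is enabled
 * i_size:     candidate size (new_i_size, or the in-memory i_size)
 * found:      whether a contiguous marked range was found searching from 0
 * start, end: inclusive bounds of that range (only meaningful if found)
 */
static uint64_t safe_disk_i_size(bool no_holes, uint64_t i_size,
				 bool found, uint64_t start, uint64_t end)
{
	assert(!found || start <= end);

	/* With NO_HOLES, gaps are legal, so i_size can be used directly. */
	if (no_holes)
		return i_size;

	/* Only the prefix that is contiguous from offset 0 is safe. */
	if (found && start == 0)
		return i_size < end + 1 ? i_size : end + 1;

	/* Nothing is recorded at offset 0, so no byte is safe to expose. */
	return 0;
}
```

For example, a 10000-byte file whose only recorded extent covers bytes 0 to 4095 gets a size of 4096 in this model, while the same file with its first extent starting at 4096 gets 0.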
/**
2021-01-22 11:57:54 +02:00
 * Mark range within a file as having a new extent inserted
 *
 * @inode: inode being modified
 * @start: start file offset of the file extent we've inserted
 * @len:   logical length of the file extent item
2020-01-17 09:02:21 -05:00
 *
 * Call when we are inserting a new file extent where there was none before.
 * There is no need to call this when replacing an existing file extent;
 * however, if unsure, it's fine to call this multiple times.
 *
 * The start and len must match the file extent item, and thus must be
 * sectorsize aligned.
 */
int btrfs_inode_set_file_extent_range(struct btrfs_inode *inode, u64 start,
				      u64 len)
{
	if (len == 0)
		return 0;
	ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize));
	if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
		return 0;
	return set_extent_bits(&inode->file_extent_tree, start, start + len - 1,
			       EXTENT_DIRTY);
}
/**
2021-01-22 11:57:54 +02:00
 * Mark an inode range as not having a backing extent
 *
 * @inode: inode being modified
 * @start: start file offset of the file extent being dropped
 * @len:   logical length of the file extent item
2020-01-17 09:02:21 -05:00
 *
 * Called when we drop a file extent, for example when we truncate.  Doesn't
 * need to be called for cases where we're replacing a file extent, like when
 * we've COWed a file extent.
 *
 * The start and len must match the file extent item, and thus must be
 * sectorsize aligned.
 */
int btrfs_inode_clear_file_extent_range(struct btrfs_inode *inode, u64 start,
					u64 len)
{
	if (len == 0)
		return 0;
	ASSERT(IS_ALIGNED(start + len, inode->root->fs_info->sectorsize) ||
	       len == (u64)-1);
	if (btrfs_fs_incompat(inode->root->fs_info, NO_HOLES))
		return 0;
	return clear_extent_bit(&inode->file_extent_tree, start,
				start + len - 1, EXTENT_DIRTY, 0, 0, NULL);
}
2019-05-22 10:19:01 +02:00
static inline u32 max_ordered_sum_bytes(struct btrfs_fs_info *fs_info,
					u16 csum_size)
{
	u32 ncsums = (PAGE_SIZE - sizeof(struct btrfs_ordered_sum)) / csum_size;
	return ncsums * fs_info->sectorsize;
}
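As a worked example of the arithmetic in max_ordered_sum_bytes(), the sketch below redoes the calculation in userspace. It rests on assumptions: `max_ordered_sum_bytes_model()` is a hypothetical name, and `header_size` is an illustrative stand-in for sizeof(struct btrfs_ordered_sum), whose real value depends on the kernel version and architecture.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: how many bytes of data can one page-sized
 * ordered sum describe?
 *
 * page_size:   stand-in for PAGE_SIZE
 * header_size: assumed size of the structure header preceding the checksums
 * sectorsize:  bytes of data covered by one checksum
 * csum_size:   bytes per checksum
 */
static uint32_t max_ordered_sum_bytes_model(uint32_t page_size, uint32_t header_size,
					    uint32_t sectorsize, uint16_t csum_size)
{
	assert(csum_size != 0 && header_size < page_size);

	/* Checksums that fit in the page after the header. */
	uint32_t ncsums = (page_size - header_size) / csum_size;

	/* Each checksum covers one sector of data. */
	return ncsums * sectorsize;
}
```

With an assumed 40-byte header, a 4096-byte page and sector, and 4-byte checksums, 1014 checksums fit, covering 4153344 bytes of data.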
2009-01-06 11:42:00 -05:00
2007-04-17 13:26:50 -04:00
int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
2008-05-02 14:43:14 -04:00
			     struct btrfs_root *root,
			     u64 objectid, u64 pos,
			     u64 disk_offset, u64 disk_num_bytes,
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
			     u64 num_bytes, u64 offset, u64 ram_bytes,
			     u8 compression, u8 encryption, u16 other_encoding)
2007-03-20 14:38:32 -04:00
{
2007-03-26 16:00:06 -04:00
	int ret = 0;
	struct btrfs_file_extent_item *item;
	struct btrfs_key file_key;
2007-04-02 11:20:42 -04:00
	struct btrfs_path *path;
2007-10-15 16:14:19 -04:00
	struct extent_buffer *leaf;
2007-03-26 16:00:06 -04:00
2007-04-02 11:20:42 -04:00
	path = btrfs_alloc_path();
2011-03-23 08:14:16 +00:00
	if (!path)
		return -ENOMEM;
2007-03-26 16:00:06 -04:00
	file_key.objectid = objectid;
2007-04-17 13:26:50 -04:00
	file_key.offset = pos;
2014-06-04 18:41:45 +02:00
	file_key.type = BTRFS_EXTENT_DATA_KEY;
2007-03-26 16:00:06 -04:00
2007-04-02 11:20:42 -04:00
	ret = btrfs_insert_empty_item(trans, root, path, &file_key,
2007-03-26 16:00:06 -04:00
				      sizeof(*item));
2007-06-22 14:16:25 -04:00
	if (ret < 0)
		goto out;
2012-03-12 16:03:00 +01:00
	BUG_ON(ret); /* Can't happen */
2007-10-15 16:14:19 -04:00
	leaf = path->nodes[0];
	item = btrfs_item_ptr(leaf, path->slots[0],
2007-03-26 16:00:06 -04:00
			      struct btrfs_file_extent_item);
2008-05-02 14:43:14 -04:00
	btrfs_set_file_extent_disk_bytenr(leaf, item, disk_offset);
2007-10-15 16:15:53 -04:00
	btrfs_set_file_extent_disk_num_bytes(leaf, item, disk_num_bytes);
2008-05-02 14:43:14 -04:00
	btrfs_set_file_extent_offset(leaf, item, offset);
2007-10-15 16:15:53 -04:00
	btrfs_set_file_extent_num_bytes(leaf, item, num_bytes);
2008-10-29 14:49:59 -04:00
	btrfs_set_file_extent_ram_bytes(leaf, item, ram_bytes);
2007-10-15 16:14:19 -04:00
	btrfs_set_file_extent_generation(leaf, item, trans->transid);
	btrfs_set_file_extent_type(leaf, item, BTRFS_FILE_EXTENT_REG);
2008-10-29 14:49:59 -04:00
	btrfs_set_file_extent_compression(leaf, item, compression);
	btrfs_set_file_extent_encryption(leaf, item, encryption);
	btrfs_set_file_extent_other_encoding(leaf, item, other_encoding);
2007-10-15 16:14:19 -04:00
	btrfs_mark_buffer_dirty(leaf);
2007-06-22 14:16:25 -04:00
out:
2007-04-02 11:20:42 -04:00
	btrfs_free_path(path);
2007-06-22 14:16:25 -04:00
	return ret;
2007-03-20 14:38:32 -04:00
}
2007-03-26 16:00:06 -04:00
2013-04-25 20:41:01 +00:00
static struct btrfs_csum_item *
btrfs_lookup_csum(struct btrfs_trans_handle *trans,
		  struct btrfs_root *root,
		  struct btrfs_path *path,
		  u64 bytenr, int cow)
2007-04-16 09:22:45 -04:00
{
2016-06-22 18:54:23 -04:00
	struct btrfs_fs_info *fs_info = root->fs_info;
2007-04-16 09:22:45 -04:00
	int ret;
	struct btrfs_key file_key;
	struct btrfs_key found_key;
	struct btrfs_csum_item *item;
2007-10-15 16:14:19 -04:00
	struct extent_buffer *leaf;
2007-04-16 09:22:45 -04:00
	u64 csum_offset = 0;
2020-07-02 11:27:30 +02:00
	const u32 csum_size = fs_info->csum_size;
2007-04-18 16:15:28 -04:00
	int csums_in_item;
2007-04-16 09:22:45 -04:00
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:58:54 -05:00
	file_key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
	file_key.offset = bytenr;
2014-06-04 18:41:45 +02:00
	file_key.type = BTRFS_EXTENT_CSUM_KEY;
2007-04-17 13:26:50 -04:00
	ret = btrfs_search_slot(trans, root, &file_key, path, 0, cow);
2007-04-16 09:22:45 -04:00
	if (ret < 0)
		goto fail;
2007-10-15 16:14:19 -04:00
	leaf = path->nodes[0];
2007-04-16 09:22:45 -04:00
	if (ret > 0) {
		ret = 1;
2007-04-17 15:39:32 -04:00
		if (path->slots[0] == 0)
2007-04-16 09:22:45 -04:00
			goto fail;
		path->slots[0]--;
2007-10-15 16:14:19 -04:00
		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
2014-06-04 18:41:45 +02:00
		if (found_key.type != BTRFS_EXTENT_CSUM_KEY)
2007-04-16 09:22:45 -04:00
			goto fail;
2008-12-08 16:58:54 -05:00
		csum_offset = (bytenr - found_key.offset) >>
2020-07-01 21:19:09 +02:00
			fs_info->sectorsize_bits;
2007-10-15 16:14:19 -04:00
		csums_in_item = btrfs_item_size_nr(leaf, path->slots[0]);
2008-12-02 07:17:45 -05:00
		csums_in_item /= csum_size;
2007-04-18 16:15:28 -04:00
2013-03-28 08:12:15 +00:00
		if (csum_offset == csums_in_item) {
2007-04-18 16:15:28 -04:00
			ret = -EFBIG;
2007-04-16 09:22:45 -04:00
			goto fail;
2013-03-28 08:12:15 +00:00
		} else if (csum_offset > csums_in_item) {
			goto fail;
2007-04-16 09:22:45 -04:00
		}
	}
	item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_csum_item);
2007-05-10 12:36:17 -04:00
	item = (struct btrfs_csum_item *)((unsigned char *)item +
2008-12-02 07:17:45 -05:00
					  csum_offset * csum_size);
2007-04-16 09:22:45 -04:00
	return item;
fail:
	if (ret > 0)
2007-04-17 13:26:50 -04:00
		ret = -ENOENT;
2007-04-16 09:22:45 -04:00
	return ERR_PTR(ret);
}
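The slot arithmetic in btrfs_lookup_csum() can be isolated into a small userspace sketch. Everything here is an assumption made for illustration: `csum_slot()` is a hypothetical helper, and the item start and item size are passed in as plain numbers instead of being read from a leaf.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model: find the checksum slot for bytenr inside a csum item
 * that starts at item_start and holds item_size bytes of checksums.
 *
 * Returns the slot index (>= 0), -EFBIG if bytenr is the sector right after
 * the last one covered by the item, or -ENOENT if it lies further away.
 */
static int64_t csum_slot(uint64_t bytenr, uint64_t item_start, uint32_t item_size,
			 uint32_t csum_size, uint32_t sectorsize_bits)
{
	assert(csum_size != 0 && bytenr >= item_start);

	/* Distance from the item start, counted in sectors. */
	uint64_t csum_offset = (bytenr - item_start) >> sectorsize_bits;
	uint64_t csums_in_item = item_size / csum_size;

	if (csum_offset == csums_in_item)
		return -EFBIG;
	if (csum_offset > csums_in_item)
		return -ENOENT;
	return (int64_t)csum_offset;
}
```

The byte position of the checksum inside the item is then the slot index multiplied by `csum_size`, which matches the pointer adjustment at the end of the function in the listing.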
2007-03-26 16:00:06 -04:00
int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
			     struct btrfs_root *root,
			     struct btrfs_path *path, u64 objectid,
2007-03-27 11:26:26 -04:00
			     u64 offset, int mod)
2007-03-26 16:00:06 -04:00
{
	int ret;
	struct btrfs_key file_key;
	int ins_len = mod < 0 ? -1 : 0;
	int cow = mod != 0;
	file_key.objectid = objectid;
2007-04-17 15:39:32 -04:00
	file_key.offset = offset;
2014-06-04 18:41:45 +02:00
	file_key.type = BTRFS_EXTENT_DATA_KEY;
2007-03-26 16:00:06 -04:00
	ret = btrfs_search_slot(trans, root, &file_key, path, ins_len, cow);
	return ret;
}
2007-03-29 15:15:27 -04:00
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use a single variable as the source to calculate all other offsets
Instead of many different types of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variables will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother with file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
/*
 * Find checksums for logical bytenr range [disk_bytenr, disk_bytenr + len) and
 * store the result to @dst.
 *
 * Return >0 for the number of sectors we found.
 * Return 0 if the range [disk_bytenr, disk_bytenr + sectorsize) has no csum.
 * The caller may want to try the next sector until one range is hit.
 * Return <0 for fatal error.
 */
static int search_csum_tree(struct btrfs_fs_info *fs_info,
			    struct btrfs_path *path, u64 disk_bytenr,
			    u64 len, u8 *dst)
{
	struct btrfs_csum_item *item = NULL;
	struct btrfs_key key;
	const u32 sectorsize = fs_info->sectorsize;
	const u32 csum_size = fs_info->csum_size;
	u32 itemsize;
	int ret;
	u64 csum_start;
	u64 csum_len;

	ASSERT(IS_ALIGNED(disk_bytenr, sectorsize) &&
	       IS_ALIGNED(len, sectorsize));

	/* Check if the current csum item covers disk_bytenr */
	if (path->nodes[0]) {
		item = btrfs_item_ptr(path->nodes[0], path->slots[0],
				      struct btrfs_csum_item);
		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
		itemsize = btrfs_item_size_nr(path->nodes[0], path->slots[0]);

		csum_start = key.offset;
		csum_len = (itemsize / csum_size) * sectorsize;

		if (in_range(disk_bytenr, csum_start, csum_len))
			goto found;
	}

	/* Current item doesn't contain the desired range, search again */
	btrfs_release_path(path);
	item = btrfs_lookup_csum(NULL, fs_info->csum_root, path, disk_bytenr, 0);
	if (IS_ERR(item)) {
		ret = PTR_ERR(item);
		goto out;
	}
	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
	itemsize = btrfs_item_size_nr(path->nodes[0], path->slots[0]);

	csum_start = key.offset;
	csum_len = (itemsize / csum_size) * sectorsize;
	ASSERT(in_range(disk_bytenr, csum_start, csum_len));

found:
	ret = (min(csum_start + csum_len, disk_bytenr + len) -
	       disk_bytenr) >> fs_info->sectorsize_bits;
	read_extent_buffer(path->nodes[0], dst, (unsigned long)item,
			   ret * csum_size);
out:
	if (ret == -ENOENT)
		ret = 0;
	return ret;
}
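The return value computed at the found label above can be illustrated in isolation. The following is a minimal userspace sketch, not kernel code: csum_sectors_covered() and its parameters are hypothetical names, and plain division stands in for the sectorsize_bits shift. It only reproduces the arithmetic: a csum item holding item_size bytes of checksums covers (item_size / csum_size) sectors from csum_start, and the result is the number of requested sectors that fall inside that coverage.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of btrfs): number of sectors of
 * [disk_bytenr, disk_bytenr + len) covered by a csum item that starts at
 * csum_start and holds item_size bytes of checksums.
 * Returns 0 if disk_bytenr lies outside the item.
 */
static uint32_t csum_sectors_covered(uint64_t csum_start, uint32_t item_size,
				     uint32_t csum_size, uint32_t sectorsize,
				     uint64_t disk_bytenr, uint64_t len)
{
	/* Bytes of data described by this item. */
	uint64_t csum_len = (uint64_t)(item_size / csum_size) * sectorsize;
	uint64_t csum_end = csum_start + csum_len;
	uint64_t range_end = disk_bytenr + len;
	uint64_t end;

	/* Mirrors the alignment ASSERT() in the kernel function. */
	assert(disk_bytenr % sectorsize == 0 && len % sectorsize == 0);

	if (disk_bytenr < csum_start || disk_bytenr >= csum_end)
		return 0;

	/* Clamp to whichever ends first: the item or the requested range. */
	end = csum_end < range_end ? csum_end : range_end;
	return (uint32_t)((end - disk_bytenr) / sectorsize);
}
```

With 4-byte checksums and 4K sectors, a 32-byte item starting at 1M covers 8 sectors; a 64K request starting two sectors into that item would therefore get 6 sectors back and has to search again for the rest.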
/*
 * Locate the file_offset of @disk_bytenr of a @bio.
 *
 * Bio of btrfs represents read range of
 * [bi_sector << 9, bi_sector << 9 + bi_size).
 * Knowing this, we can iterate through each bvec to locate the page belonging
 * to @disk_bytenr and get the file offset.
 *
 * @inode is used to determine if the bvec page really belongs to @inode.
 *
 * Return 0 if we can't find the file offset
 * Return >0 if we find the file offset and store it in @file_offset_ret
 */
static int search_file_offset_in_bio(struct bio *bio, struct inode *inode,
				     u64 disk_bytenr, u64 *file_offset_ret)
{
	struct bvec_iter iter;
	struct bio_vec bvec;
	u64 cur = bio->bi_iter.bi_sector << SECTOR_SHIFT;
	int ret = 0;

	bio_for_each_segment(bvec, bio, iter) {
		struct page *page = bvec.bv_page;

		if (cur > disk_bytenr)
			break;
		if (cur + bvec.bv_len <= disk_bytenr) {
			cur += bvec.bv_len;
			continue;
		}
		ASSERT(in_range(disk_bytenr, cur, bvec.bv_len));
		if (page->mapping && page->mapping->host &&
		    page->mapping->host == inode) {
			ret = 1;
			*file_offset_ret = page_offset(page) + bvec.bv_offset +
					   disk_bytenr - cur;
			break;
		}
	}
	return ret;
}
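The segment walk above can be sketched outside the kernel as follows. This is only an illustration under stated assumptions: struct seg and find_file_offset() are hypothetical stand-ins for the bio_vec iteration, with file_offset playing the role of page_offset(page) + bv_offset and owned standing in for the page->mapping->host == inode check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one bio segment; field names are illustrative. */
struct seg {
	uint64_t file_offset;	/* file offset of the segment's first byte */
	uint32_t len;		/* segment length in bytes */
	int owned;		/* nonzero if the page belongs to the inode */
};

/*
 * Segments are contiguous on disk starting at disk_start, but may sit at
 * unrelated file offsets.  Translate disk_bytenr into a file offset.
 * Returns 1 and fills *file_offset_ret on success, 0 otherwise.
 */
static int find_file_offset(const struct seg *segs, size_t nr,
			    uint64_t disk_start, uint64_t disk_bytenr,
			    uint64_t *file_offset_ret)
{
	uint64_t cur = disk_start;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (cur > disk_bytenr)
			break;
		/* Segment ends before the target: skip it. */
		if (cur + segs[i].len <= disk_bytenr) {
			cur += segs[i].len;
			continue;
		}
		/* disk_bytenr falls inside this segment. */
		if (segs[i].owned) {
			*file_offset_ret = segs[i].file_offset +
					   (disk_bytenr - cur);
			return 1;
		}
		/* This sketch simply gives up on a foreign page. */
		break;
	}
	return 0;
}
```

The point of the walk is that the disk position advances by each segment's length while the file offset is taken from the matching segment alone, so two adjacent segments can map to file offsets that are far apart.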
2019-12-02 17:34:17 -08:00
/**
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-02 14:48:06 +08:00
 * Lookup the checksum for the read bio in csum tree.
2020-12-02 14:48:05 +08:00
*
2019-12-02 17:34:17 -08:00
 * @inode: inode that the bio is for.
2020-04-16 14:46:16 -07:00
 * @bio: bio to look up.
 * @dst: Buffer of size nblocks * btrfs_super_csum_size() used to return
 *       checksum (nblocks = bio->bi_iter.bi_size / fs_info->sectorsize). If
 *       NULL, the checksum buffer is allocated and returned in
 *       btrfs_io_bio(bio)->csum instead.
2019-12-02 17:34:17 -08:00
*
 * Return: BLK_STS_RESOURCE if allocating memory fails, BLK_STS_OK otherwise.
*/
2020-12-02 14:48:06 +08:00
blk_status_t btrfs_lookup_bio_sums(struct inode *inode, struct bio *bio, u8 *dst)
2008-07-31 15:42:53 -04:00
{
2016-06-22 18:54:23 -04:00
struct btrfs_fs_info * fs_info = btrfs_sb ( inode - > i_sb ) ;
2013-07-25 19:22:34 +08:00
struct extent_io_tree * io_tree = & BTRFS_I ( inode ) - > io_tree ;
struct btrfs_path * path ;
2020-12-02 14:48:06 +08:00
	const u32 sectorsize = fs_info->sectorsize;
	const u32 csum_size = fs_info->csum_size;
	u32 orig_len = bio->bi_iter.bi_size;
	u64 orig_disk_bytenr = bio->bi_iter.bi_sector << SECTOR_SHIFT;
	u64 cur_disk_bytenr;
2013-07-25 19:22:34 +08:00
u8 * csum ;
2020-12-02 14:48:06 +08:00
const unsigned int nblocks = orig_len > > fs_info - > sectorsize_bits ;
2017-05-15 15:33:27 -07:00
int count = 0 ;
2008-07-31 15:42:53 -04:00
2020-10-16 11:29:18 -04:00
if ( ! fs_info - > csum_root | | ( BTRFS_I ( inode ) - > flags & BTRFS_INODE_NODATASUM ) )
2020-10-16 11:29:14 -04:00
return BLK_STS_OK ;
2020-12-02 14:48:05 +08:00
	/*
	 * This function is only called for read bio.
	 *
	 * This means two things:
	 * - All our csums should only be in csum tree
	 *   No ordered extents csums, as ordered extents are only for write
	 *   path.
2020-12-02 14:48:06 +08:00
	 * - No need to bother any other info from bvec
	 *   Since we're looking up csums, the only important info is the
	 *   disk_bytenr and the length, which can be extracted from bi_iter
	 *   directly.
2020-12-02 14:48:05 +08:00
*/
ASSERT ( bio_op ( bio ) = = REQ_OP_READ ) ;
2008-07-31 15:42:53 -04:00
path = btrfs_alloc_path ( ) ;
2011-03-01 06:48:31 +00:00
if ( ! path )
2017-06-03 09:38:06 +02:00
return BLK_STS_RESOURCE ;
2013-07-25 19:22:34 +08:00
if ( ! dst ) {
2020-04-16 14:46:16 -07:00
struct btrfs_io_bio * btrfs_bio = btrfs_io_bio ( bio ) ;
2013-07-25 19:22:34 +08:00
if ( nblocks * csum_size > BTRFS_BIO_INLINE_CSUM_SIZE ) {
2018-11-22 17:16:46 +01:00
btrfs_bio - > csum = kmalloc_array ( nblocks , csum_size ,
GFP_NOFS ) ;
if ( ! btrfs_bio - > csum ) {
2013-07-25 19:22:34 +08:00
btrfs_free_path ( path ) ;
2017-06-03 09:38:06 +02:00
return BLK_STS_RESOURCE ;
2013-07-25 19:22:34 +08:00
}
} else {
btrfs_bio - > csum = btrfs_bio - > csum_inline ;
}
csum = btrfs_bio - > csum ;
} else {
2019-05-22 10:19:02 +02:00
csum = dst ;
2013-07-25 19:22:34 +08:00
}
btrfs: use nodesize to determine if we need readahead in btrfs_lookup_bio_sums
In btrfs_lookup_bio_sums() if the bio is pretty large, we want to
start readahead in the csum tree.
However the threshold is an immediate number, (PAGE_SIZE * 8), from the
initial btrfs merge.
The meaning of the value is pretty hard to guess, especially when the
immediate number is from the times when 4K sectorsize was the default
and only CRC32C was supported.
For the most common btrfs setup, CRC32 csum and 4K sectorsize,
it means just 32K read would kick readahead, while the csum itself is
only 32 bytes in size.
Now let's be more reasonable by taking both csum size and node size into
consideration.
If the csum size for the bio is larger than one leaf, then we kick the
readahead. This means for current default btrfs, the threshold will be
16M.
This change should not change performance observably, thus this is
mostly a readability enhancement.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-13 20:51:41 +08:00
	/*
	 * If requested number of sectors is larger than one leaf can contain,
	 * kick the readahead for csum tree.
	 */
	if (nblocks > fs_info->csums_per_leaf)
2015-11-27 16:31:35 +01:00
path - > reada = READA_FORWARD ;
2008-07-31 15:42:53 -04:00
2011-07-26 15:35:09 -04:00
	/*
	 * the free space stuff is only read when it hasn't been
	 * updated in the current transaction.  So, we can safely
	 * read from the commit root and sidestep a nasty deadlock
	 * between reading the free space cache and updating the csum tree.
	 */
2017-02-20 13:50:35 +02:00
if ( btrfs_is_free_space_inode ( BTRFS_I ( inode ) ) ) {
2011-07-26 15:35:09 -04:00
path - > search_commit_root = 1 ;
2011-09-11 10:52:24 -04:00
path - > skip_locking = 1 ;
}
2011-07-26 15:35:09 -04:00
2020-12-02 14:48:06 +08:00
	for (cur_disk_bytenr = orig_disk_bytenr;
	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
	     cur_disk_bytenr += (count * sectorsize)) {
		u64 search_len = orig_disk_bytenr + orig_len - cur_disk_bytenr;
		unsigned int sector_offset;
		u8 *csum_dst;
2016-01-21 15:55:54 +05:30
2008-07-31 15:42:53 -04:00
/*
2020-12-02 14:48:06 +08:00
		 * Although both cur_disk_bytenr and orig_disk_bytenr are u64,
		 * we're calculating the offset to the bio start.
		 *
		 * Bio size is limited to UINT_MAX, thus unsigned int is large
		 * enough to contain the raw result, not to mention the right
		 * shifted result.
2008-07-31 15:42:53 -04:00
*/
2020-12-02 14:48:06 +08:00
		ASSERT(cur_disk_bytenr - orig_disk_bytenr < UINT_MAX);
		sector_offset = (cur_disk_bytenr - orig_disk_bytenr) >>
				fs_info->sectorsize_bits;
		csum_dst = csum + sector_offset * csum_size;

		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
					 search_len, csum_dst);
		if (count <= 0) {
			/*
			 * Either we hit a critical error or we didn't find
			 * the csum.
			 * Either way, we put zero into the csums dst, and skip
			 * to the next sector.
			 */
			memset(csum_dst, 0, csum_size);
			count = 1;

			/*
			 * For data reloc inode, we need to mark the range
			 * NODATASUM so that balance won't report false csum
			 * error.
			 */
			if (BTRFS_I(inode)->root->root_key.objectid ==
			    BTRFS_DATA_RELOC_TREE_OBJECTID) {
				u64 file_offset;
				int ret;

				ret = search_file_offset_in_bio(bio, inode,
						cur_disk_bytenr, &file_offset);
				if (ret)
					set_extent_bits(io_tree, file_offset,
						file_offset + sectorsize - 1,
						EXTENT_NODATASUM);
			} else {
				btrfs_warn_rl(fs_info,
			"csum hole found for disk bytenr range [%llu, %llu)",
				cur_disk_bytenr, cur_disk_bytenr + sectorsize);
			}
2013-04-05 07:20:56 +00:00
}
2008-07-31 15:42:53 -04:00
}
2016-03-21 06:59:09 -07:00
2008-07-31 15:42:53 -04:00
btrfs_free_path ( path ) ;
2019-12-02 17:34:17 -08:00
return BLK_STS_OK ;
2010-05-23 11:00:55 -04:00
}
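The shape of the main loop in btrfs_lookup_bio_sums() can be shown with a small userspace sketch. Everything here is an assumption made for illustration: SECTORSIZE and CSUM_SIZE are fixed constants, csum_lookup_fn is a hypothetical callback standing in for search_csum_tree(), and demo_lookup() is a toy implementation that returns one sector at a time. The sketch keeps only the control flow: one cursor drives every offset, and a hole or error zero-fills one checksum slot and advances by exactly one sector.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTORSIZE 4096u	/* assumed sector size */
#define CSUM_SIZE  4u		/* assumed checksum size, e.g. CRC32C */

/*
 * Hypothetical lookup callback: fill dst with checksums for sectors
 * starting at bytenr, return the number of sectors found, 0 for a hole,
 * or a negative value on error.
 */
typedef int (*csum_lookup_fn)(uint64_t bytenr, uint64_t len, uint8_t *dst);

/* Fill csum[] for [start, start + len); return how many holes were zeroed. */
static unsigned int fill_csums(uint64_t start, uint32_t len, uint8_t *csum,
			       csum_lookup_fn lookup)
{
	unsigned int holes = 0;
	uint64_t cur;
	int count = 0;

	for (cur = start; cur < start + len;
	     cur += (uint64_t)count * SECTORSIZE) {
		uint64_t search_len = start + len - cur;
		uint32_t sector_offset = (uint32_t)((cur - start) / SECTORSIZE);
		uint8_t *dst = csum + sector_offset * CSUM_SIZE;

		count = lookup(cur, search_len, dst);
		if (count <= 0) {
			/* Hole or error: zero this slot, move one sector on. */
			memset(dst, 0, CSUM_SIZE);
			count = 1;
			holes++;
		}
	}
	return holes;
}

/* Toy lookup: every sector has a 0xAB checksum except the third one. */
static int demo_lookup(uint64_t bytenr, uint64_t len, uint8_t *dst)
{
	(void)len;
	if (bytenr == 2 * SECTORSIZE)
		return 0;
	memset(dst, 0xAB, CSUM_SIZE);
	return 1;
}
```

Calling fill_csums(0, 4 * SECTORSIZE, buf, demo_lookup) on a 16-byte buffer leaves the third 4-byte slot zeroed and the others filled, which is the same outcome the kernel loop produces for a csum hole before it warns or marks the range NODATASUM.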
2008-12-12 10:03:38 -05:00
int btrfs_lookup_csums_range ( struct btrfs_root * root , u64 start , u64 end ,
2011-03-08 14:14:00 +01:00
struct list_head * list , int search_commit )
2008-12-12 10:03:38 -05:00
{
2016-06-22 18:54:23 -04:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2008-12-12 10:03:38 -05:00
struct btrfs_key key ;
struct btrfs_path * path ;
struct extent_buffer * leaf ;
struct btrfs_ordered_sum * sums ;
struct btrfs_csum_item * item ;
2011-08-05 15:46:16 -07:00
LIST_HEAD ( tmplist ) ;
2008-12-12 10:03:38 -05:00
unsigned long offset ;
int ret ;
size_t size ;
u64 csum_end ;
2020-07-02 11:27:30 +02:00
const u32 csum_size = fs_info - > csum_size ;
2008-12-12 10:03:38 -05:00
2016-06-22 18:54:23 -04:00
ASSERT ( IS_ALIGNED ( start , fs_info - > sectorsize ) & &
IS_ALIGNED ( end + 1 , fs_info - > sectorsize ) ) ;
2013-10-15 09:36:40 -04:00
2008-12-12 10:03:38 -05:00
path = btrfs_alloc_path ( ) ;
btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-13 10:38:47 -07:00
if ( ! path )
return - ENOMEM ;
2008-12-12 10:03:38 -05:00
2011-03-08 14:14:00 +01:00
if ( search_commit ) {
path - > skip_locking = 1 ;
2015-11-27 16:31:35 +01:00
path - > reada = READA_FORWARD ;
2011-03-08 14:14:00 +01:00
path - > search_commit_root = 1 ;
}
2008-12-12 10:03:38 -05:00
	key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
	key.offset = start;
	key.type = BTRFS_EXTENT_CSUM_KEY;
2009-01-06 11:42:00 -05:00
ret = btrfs_search_slot ( NULL , root , & key , path , 0 , 0 ) ;
2008-12-12 10:03:38 -05:00
	if (ret < 0)
		goto fail;
	if (ret > 0 && path->slots[0] > 0) {
		leaf = path->nodes[0];
		btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
		if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
		    key.type == BTRFS_EXTENT_CSUM_KEY) {
2020-07-01 21:19:09 +02:00
offset = ( start - key . offset ) > > fs_info - > sectorsize_bits ;
2008-12-12 10:03:38 -05:00
			if (offset * csum_size <
			    btrfs_item_size_nr(leaf, path->slots[0] - 1))
				path->slots[0]--;
		}
	}

	while (start <= end) {
		leaf = path->nodes[0];
		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
2009-01-06 11:42:00 -05:00
ret = btrfs_next_leaf ( root , path ) ;
2008-12-12 10:03:38 -05:00
			if (ret < 0)
				goto fail;
			if (ret > 0)
				break;
			leaf = path->nodes[0];
		}

		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
		if (key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
2013-03-18 09:18:09 +00:00
key . type ! = BTRFS_EXTENT_CSUM_KEY | |
key . offset > end )
2008-12-12 10:03:38 -05:00
break ;
if ( key . offset > start )
start = key . offset ;
size = btrfs_item_size_nr ( leaf , path - > slots [ 0 ] ) ;
2016-06-22 18:54:23 -04:00
csum_end = key . offset + ( size / csum_size ) * fs_info - > sectorsize ;
2008-12-17 10:21:48 -05:00
if ( csum_end < = start ) {
path - > slots [ 0 ] + + ;
continue ;
}
2008-12-12 10:03:38 -05:00
2009-01-06 11:42:00 -05:00
csum_end = min ( csum_end , end + 1 ) ;
2008-12-12 10:03:38 -05:00
item = btrfs_item_ptr ( path - > nodes [ 0 ] , path - > slots [ 0 ] ,
struct btrfs_csum_item ) ;
2009-01-06 11:42:00 -05:00
while ( start < csum_end ) {
size = min_t ( size_t , csum_end - start ,
2019-05-22 10:19:01 +02:00
max_ordered_sum_bytes ( fs_info , csum_size ) ) ;
2016-06-22 18:54:23 -04:00
sums = kzalloc ( btrfs_ordered_sum_size ( fs_info , size ) ,
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
GFP_NOFS ) ;
2011-08-05 15:46:16 -07:00
if ( ! sums ) {
ret = - ENOMEM ;
goto fail ;
}
2008-12-12 10:03:38 -05:00
2009-01-06 11:42:00 -05:00
sums - > bytenr = start ;
2013-06-19 10:36:09 +08:00
sums - > len = ( int ) size ;
2009-01-06 11:42:00 -05:00
2020-07-01 21:19:09 +02:00
offset = ( start - key . offset ) > > fs_info - > sectorsize_bits ;
2009-01-06 11:42:00 -05:00
offset * = csum_size ;
2020-07-01 21:19:09 +02:00
size > > = fs_info - > sectorsize_bits ;
2009-01-06 11:42:00 -05:00
2013-06-19 10:36:09 +08:00
read_extent_buffer ( path - > nodes [ 0 ] ,
sums - > sums ,
( ( unsigned long ) item ) + offset ,
csum_size * size ) ;
2016-06-22 18:54:23 -04:00
start + = fs_info - > sectorsize * size ;
2011-08-05 15:46:16 -07:00
list_add_tail ( & sums - > list , & tmplist ) ;
2009-01-06 11:42:00 -05:00
}
2008-12-12 10:03:38 -05:00
path - > slots [ 0 ] + + ;
}
ret = 0 ;
fail :
2011-08-05 15:46:16 -07:00
while ( ret < 0 & & ! list_empty ( & tmplist ) ) {
2014-11-04 06:59:04 -08:00
sums = list_entry ( tmplist . next , struct btrfs_ordered_sum , list ) ;
2011-08-05 15:46:16 -07:00
list_del ( & sums - > list ) ;
kfree ( sums ) ;
}
list_splice_tail ( & tmplist , list ) ;
2008-12-12 10:03:38 -05:00
btrfs_free_path ( path ) ;
return ret ;
}
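The lookup loop above finds the first checksum to copy by turning a disk byte address into a byte offset inside the checksum item (`offset = (start - key.offset) >> sectorsize_bits`, then `offset *= csum_size`). A minimal sketch of that arithmetic follows. The helper names `csum_byte_offset` and `csum_bytes_for` are hypothetical and are not kernel functions; the sector size and checksum size are passed in explicitly rather than read from `fs_info`.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): byte offset, inside a
 * checksum item, of the checksum that covers disk byte `start`.
 *
 * item_start      - first disk byte covered by the item (key.offset above)
 * sectorsize_bits - log2 of the sector size (12 for 4 KiB sectors)
 * csum_size       - bytes per checksum (4 for a 32-bit checksum)
 */
static inline uint64_t csum_byte_offset(uint64_t start, uint64_t item_start,
					uint32_t sectorsize_bits,
					uint32_t csum_size)
{
	/* Index of the sector within the item, counted from item_start. */
	uint64_t sector_index = (start - item_start) >> sectorsize_bits;

	return sector_index * csum_size;
}

/*
 * Hypothetical helper: number of checksum bytes needed to cover `len`
 * bytes of data, assuming `len` is a multiple of the sector size.
 */
static inline uint64_t csum_bytes_for(uint64_t len, uint32_t sectorsize_bits,
				      uint32_t csum_size)
{
	return (len >> sectorsize_bits) * csum_size;
}
```

Assuming 4 KiB sectors and 4-byte checksums, a 65536-byte range needs 16 checksums, so `csum_bytes_for(65536, 12, 4)` is 64 bytes.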
2019-04-22 16:07:31 +03:00
/*
* btrfs_csum_one_bio - Calculates checksums of the data contained inside a bio
* @inode: Owner of the data inside the bio
* @bio: Contains the data to be checksummed
* @file_start: offset in file this bio begins to describe
* @contig: Boolean. If true/1, all bio vecs in this bio are
* contiguous and they begin at @file_start in the file. False/0
* means this bio can contain potentially discontiguous bio vecs,
* so the logical offset of each should be calculated separately.
*/
2020-06-03 08:55:07 +03:00
blk_status_t btrfs_csum_one_bio ( struct btrfs_inode * inode , struct bio * bio ,
2016-06-22 18:54:24 -04:00
u64 file_start , int contig )
2008-04-16 11:15:20 -04:00
{
2020-06-03 08:55:03 +03:00
struct btrfs_fs_info * fs_info = inode - > root - > fs_info ;
2019-06-03 16:58:57 +02:00
SHASH_DESC_ON_STACK ( shash , fs_info - > csum_shash ) ;
2008-07-17 12:53:50 -04:00
struct btrfs_ordered_sum * sums ;
2016-11-25 09:07:49 +01:00
struct btrfs_ordered_extent * ordered = NULL ;
2008-04-16 11:15:20 -04:00
char * data ;
2017-05-15 15:33:27 -07:00
struct bvec_iter iter ;
struct bio_vec bvec ;
2013-06-19 10:36:09 +08:00
int index ;
2016-01-21 15:55:54 +05:30
int nr_sectors ;
2008-07-18 06:17:13 -04:00
unsigned long total_bytes = 0 ;
unsigned long this_sum_bytes = 0 ;
2017-05-15 15:33:27 -07:00
int i ;
2008-07-18 06:17:13 -04:00
u64 offset ;
2019-04-01 11:29:58 +03:00
unsigned nofs_flag ;
nofs_flag = memalloc_nofs_save ( ) ;
sums = kvzalloc ( btrfs_ordered_sum_size ( fs_info , bio - > bi_iter . bi_size ) ,
GFP_KERNEL ) ;
memalloc_nofs_restore ( nofs_flag ) ;
2008-04-16 11:15:20 -04:00
if ( ! sums )
2017-06-03 09:38:06 +02:00
return BLK_STS_RESOURCE ;
2008-07-18 06:17:13 -04:00
2013-10-11 15:44:27 -07:00
sums - > len = bio - > bi_iter . bi_size ;
2008-07-17 12:53:50 -04:00
INIT_LIST_HEAD ( & sums - > list ) ;
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:58:54 -05:00
if ( contig )
offset = file_start ;
else
2016-11-25 09:07:49 +01:00
offset = 0 ; /* shut up gcc */
2008-12-08 16:58:54 -05:00
2020-11-26 15:41:27 +01:00
sums - > bytenr = bio - > bi_iter . bi_sector < < 9 ;
2013-06-19 10:36:09 +08:00
index = 0 ;
2008-04-16 11:15:20 -04:00
2019-06-03 16:58:57 +02:00
shash - > tfm = fs_info - > csum_shash ;
2017-05-15 15:33:27 -07:00
bio_for_each_segment ( bvec , bio , iter ) {
2008-12-08 16:58:54 -05:00
if ( ! contig )
2017-05-15 15:33:27 -07:00
offset = page_offset ( bvec . bv_page ) + bvec . bv_offset ;
2008-12-08 16:58:54 -05:00
2016-11-25 09:07:49 +01:00
if ( ! ordered ) {
ordered = btrfs_lookup_ordered_extent ( inode , offset ) ;
BUG_ON ( ! ordered ) ; /* Logic error */
}
2016-06-22 18:54:23 -04:00
nr_sectors = BTRFS_BYTES_TO_BLKS ( fs_info ,
2017-05-15 15:33:27 -07:00
bvec . bv_len + fs_info - > sectorsize
2016-06-22 18:54:23 -04:00
- 1 ) ;
2016-01-21 15:55:54 +05:30
for ( i = 0 ; i < nr_sectors ; i + + ) {
2019-12-02 17:34:19 -08:00
if ( offset > = ordered - > file_offset + ordered - > num_bytes | |
offset < ordered - > file_offset ) {
2016-01-21 15:55:54 +05:30
unsigned long bytes_left ;
sums - > len = this_sum_bytes ;
this_sum_bytes = 0 ;
2019-04-10 16:16:11 +03:00
btrfs_add_ordered_sum ( ordered , sums ) ;
2016-01-21 15:55:54 +05:30
btrfs_put_ordered_extent ( ordered ) ;
bytes_left = bio - > bi_iter . bi_size - total_bytes ;
2019-04-01 11:29:58 +03:00
nofs_flag = memalloc_nofs_save ( ) ;
sums = kvzalloc ( btrfs_ordered_sum_size ( fs_info ,
bytes_left ) , GFP_KERNEL ) ;
memalloc_nofs_restore ( nofs_flag ) ;
2016-01-21 15:55:54 +05:30
BUG_ON ( ! sums ) ; /* -ENOMEM */
sums - > len = bytes_left ;
ordered = btrfs_lookup_ordered_extent ( inode ,
offset ) ;
ASSERT ( ordered ) ; /* Logic error */
2020-11-26 15:41:27 +01:00
sums - > bytenr = ( bio - > bi_iter . bi_sector < < 9 )
2016-01-21 15:55:54 +05:30
+ total_bytes ;
index = 0 ;
}
2008-07-18 06:17:13 -04:00
2019-03-07 17:14:00 +01:00
data = kmap_atomic ( bvec . bv_page ) ;
2020-04-30 23:51:59 -07:00
crypto_shash_digest ( shash , data + bvec . bv_offset
2019-06-03 16:58:57 +02:00
+ ( i * fs_info - > sectorsize ) ,
2020-04-30 23:51:59 -07:00
fs_info - > sectorsize ,
sums - > sums + index ) ;
2019-03-07 17:14:00 +01:00
kunmap_atomic ( data ) ;
2020-06-30 18:04:02 +02:00
index + = fs_info - > csum_size ;
2016-06-22 18:54:23 -04:00
offset + = fs_info - > sectorsize ;
this_sum_bytes + = fs_info - > sectorsize ;
total_bytes + = fs_info - > sectorsize ;
2008-07-18 06:17:13 -04:00
}
2008-04-16 11:15:20 -04:00
}
2008-07-22 23:06:42 -04:00
this_sum_bytes = 0 ;
2019-04-10 16:16:11 +03:00
btrfs_add_ordered_sum ( ordered , sums ) ;
2008-07-18 06:17:13 -04:00
btrfs_put_ordered_extent ( ordered ) ;
2008-04-16 11:15:20 -04:00
return 0 ;
}
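The core of btrfs_csum_one_bio() above is a loop that digests one sector at a time and stores the results back to back, advancing `index` by the checksum size on each step. The sketch below shows only that pattern on a plain memory buffer. It makes several assumptions: `toy_digest` is a stand-in 32-bit hash and is not the checksum algorithm the filesystem uses, `csum_buffer` is a hypothetical name, and the ordered-extent bookkeeping and page mapping of the real function are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Stand-in digest (assumption): a toy 32-bit rolling sum, used here only so
 * the example is self-contained. It is NOT the filesystem's checksum.
 */
static inline uint32_t toy_digest(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31 + data[i];
	return sum;
}

/*
 * Hypothetical helper: checksum `len` bytes of `data` one sector at a time
 * and store the digests back to back in `sums`, the same flat layout the
 * kernel loop fills via sums->sums + index.
 *
 * `len` must be a multiple of `sectorsize`, and `sums` must have room for
 * (len / sectorsize) * sizeof(uint32_t) bytes.
 *
 * Returns the number of bytes written to `sums`.
 */
static inline size_t csum_buffer(const uint8_t *data, size_t len,
				 size_t sectorsize, uint8_t *sums)
{
	size_t index = 0;

	for (size_t off = 0; off < len; off += sectorsize) {
		uint32_t digest = toy_digest(data + off, sectorsize);

		memcpy(sums + index, &digest, sizeof(digest));
		index += sizeof(digest);	/* mirrors index += csum_size */
	}
	return index;
}
```

Because the digests are stored contiguously, the checksum for any sector can later be located from the starting byte address and the sector index alone, with no per-sector header.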
2008-12-10 09:10:46 -05:00
/*
* Helper function for csum removal. This expects the
* key to describe the csum pointed to by the path, and it expects
* the csum to overlap the range [bytenr, bytenr + len).
*
* The csum should not be entirely contained in the range and the
* range should not be entirely contained in the csum.
*
* This calls btrfs_truncate_item with the correct args based on the
* overlap, and fixes up the key as required.
*/
2016-06-22 18:54:24 -04:00
static noinline void truncate_one_csum ( struct btrfs_fs_info * fs_info ,
2012-03-01 14:56:26 +01:00
struct btrfs_path * path ,
struct btrfs_key * key ,
u64 bytenr , u64 len )
2008-12-10 09:10:46 -05:00
{
struct extent_buffer * leaf ;
2020-07-02 11:27:30 +02:00
const u32 csum_size = fs_info - > csum_size ;
2008-12-10 09:10:46 -05:00
u64 csum_end ;
u64 end_byte = bytenr + len ;
2020-07-01 21:19:09 +02:00
u32 blocksize_bits = fs_info - > sectorsize_bits ;
2008-12-10 09:10:46 -05:00
leaf = path - > nodes [ 0 ] ;
csum_end = btrfs_item_size_nr ( leaf , path - > slots [ 0 ] ) / csum_size ;
2020-07-01 21:19:09 +02:00
csum_end < < = blocksize_bits ;
2008-12-10 09:10:46 -05:00
csum_end + = key - > offset ;
if ( key - > offset < bytenr & & csum_end < = end_byte ) {
/*
*         [ bytenr - len ]
*         [   ]
*   [ csum     ]
*   A simple truncate off the end of the item
*/
u32 new_size = ( bytenr - key - > offset ) > > blocksize_bits ;
new_size * = csum_size ;
2019-03-20 14:49:12 +01:00
btrfs_truncate_item ( path , new_size , 1 ) ;
2008-12-10 09:10:46 -05:00
} else if ( key - > offset > = bytenr & & csum_end > end_byte & &
end_byte > key - > offset ) {
/*
* [ bytenr - len ]
*            [    ]
*            [ csum      ]
* we need to truncate from the beginning of the csum
*/
u32 new_size = ( csum_end - end_byte ) > > blocksize_bits ;
new_size * = csum_size ;
2019-03-20 14:49:12 +01:00
btrfs_truncate_item ( path , new_size , 0 ) ;
2008-12-10 09:10:46 -05:00
key - > offset = end_byte ;
2016-06-22 18:54:23 -04:00
btrfs_set_item_key_safe ( fs_info , path , key ) ;
2008-12-10 09:10:46 -05:00
} else {
BUG ( ) ;
}
}
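The two overlap cases handled by truncate_one_csum() above can be checked in isolation. The following sketch reproduces only the size and key arithmetic; it does not touch any tree. The names `plan_csum_truncate`, `struct csum_trunc_result` and the `CSUM_TRUNC_*` constants are hypothetical, and where the kernel function calls BUG() this sketch returns `CSUM_TRUNC_INVALID` instead.

```c
#include <stdint.h>

enum csum_trunc {
	CSUM_TRUNC_TAIL,	/* cut checksums off the end of the item */
	CSUM_TRUNC_HEAD,	/* cut checksums off the front, move the key */
	CSUM_TRUNC_INVALID,	/* neither case applies (kernel would BUG()) */
};

struct csum_trunc_result {
	enum csum_trunc kind;
	uint32_t new_size;	/* item size in bytes after truncation */
	uint64_t new_start;	/* key offset after truncation */
};

/*
 * Hypothetical helper: decide how a checksum item starting at disk byte
 * `item_start` with `item_size` bytes of checksums must be truncated when
 * the byte range [bytenr, bytenr + len) is deleted.
 */
static inline struct csum_trunc_result
plan_csum_truncate(uint64_t item_start, uint32_t item_size,
		   uint64_t bytenr, uint64_t len,
		   uint32_t sectorsize_bits, uint32_t csum_size)
{
	struct csum_trunc_result res = {
		CSUM_TRUNC_INVALID, item_size, item_start
	};
	uint64_t end_byte = bytenr + len;
	/* First disk byte past the data covered by the item. */
	uint64_t csum_end = item_start +
		((uint64_t)(item_size / csum_size) << sectorsize_bits);

	if (item_start < bytenr && csum_end <= end_byte) {
		/* Item starts before the range and ends inside it. */
		res.kind = CSUM_TRUNC_TAIL;
		res.new_size = (uint32_t)((bytenr - item_start) >>
					  sectorsize_bits) * csum_size;
	} else if (item_start >= bytenr && csum_end > end_byte &&
		   end_byte > item_start) {
		/* Item starts inside the range and ends after it. */
		res.kind = CSUM_TRUNC_HEAD;
		res.new_size = (uint32_t)((csum_end - end_byte) >>
					  sectorsize_bits) * csum_size;
		res.new_start = end_byte;
	}
	return res;
}
```

In the head case the key offset moves to `end_byte`, which corresponds to the `key->offset = end_byte` assignment followed by btrfs_set_item_key_safe() in the function above.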
/*
* deletes the csum items from the csum tree for a given
* range of bytes .
*/
int btrfs_del_csums ( struct btrfs_trans_handle * trans ,
Btrfs: fix missing data checksums after replaying a log tree
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for a checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
look up the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing and
resulting in an -EIO error.
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this simply by deleting first any checksums, from the log tree, for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long time issue that has been present since we have the clone
(and deduplication) ioctl, and can happen both when an extent is shared
between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-05 16:58:30 +00:00
struct btrfs_root * root , u64 bytenr , u64 len )
2008-12-10 09:10:46 -05:00
{
2019-12-05 16:58:30 +00:00
struct btrfs_fs_info * fs_info = trans - > fs_info ;
2008-12-10 09:10:46 -05:00
struct btrfs_path * path ;
struct btrfs_key key ;
u64 end_byte = bytenr + len ;
u64 csum_end ;
struct extent_buffer * leaf ;
int ret ;
2020-07-02 11:27:30 +02:00
const u32 csum_size = fs_info - > csum_size ;
2020-07-01 21:19:09 +02:00
u32 blocksize_bits = fs_info - > sectorsize_bits ;
2008-12-10 09:10:46 -05:00
Btrfs: fix missing data checksums after replaying a log tree
When logging a file that has shared extents (reflinked with other files or
with itself), we can end up logging multiple checksum items that cover
overlapping ranges. This confuses the search for checksums at log replay
time causing some checksums to never be added to the fs/subvolume tree.
Consider the following example of a file that shares the same extent at
offsets 0 and 256Kb:
[ bytenr 13893632, offset 64Kb, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 13893632, offset 0, len 256Kb ]
256Kb 512Kb
When logging the inode, at tree-log.c:copy_items(), when processing the
file extent item at offset 0, we log a checksum item covering the range
13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
64Kb + 64Kb, respectively.
Later when processing the extent item at offset 256K, we log the checksums
for the range from 13893632 to 14155776 (which corresponds to 13893632 +
256Kb). These checksums get merged with the checksum item for the range
from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
So after this we get the two following checksum items in the log tree:
(...)
item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
range start 13631488 end 14155776 length 524288
item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
range start 13959168 end 14024704 length 65536
The first one covers the range from the second one, they overlap.
So far this does not cause a problem after replaying the log, because
when replaying the file extent item for offset 256K, we copy all the
checksums for the extent 13893632 from the log tree to the fs/subvolume
tree, since searching for an checksum item for bytenr 13893632 leaves us
at the first checksum item, which covers the whole range of the extent.
However if we write 64Kb to file offset 256Kb for example, we will
not be able to find and copy the checksums for the last 128Kb of the
extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
After writing 64Kb into file offset 256Kb we get the following extent
layout for our file:
[ bytenr 13893632, offset 64K, len 64Kb ]
0 64Kb
[ bytenr 13631488, offset 64Kb, len 192Kb ]
64Kb 256Kb
[ bytenr 14155776, offset 0, len 64Kb ]
256Kb 320Kb
[ bytenr 13893632, offset 64Kb, len 192Kb ]
320Kb 512Kb
After fsync'ing the file, if we have a power failure and then mount
the filesystem to replay the log, the following happens:
1) When replaying the file extent item for file offset 320Kb, we
lookup for the checksums for the extent range from 13959168
(13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
to btrfs_lookup_csums_range();
2) btrfs_lookup_csums_range() finds the checksum item that starts
precisely at offset 13959168 (item 7 in the log tree, shown before);
3) However that checksum item only covers 64Kb of data, and not 192Kb
of data;
4) As a result only the checksums for the first 64Kb of data referenced
by the file extent item are found and copied to the fs/subvolume tree.
The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
the corresponding data checksums found and copied to the fs/subvolume
tree.
5) After replaying the log, userspace will not be able to read the file
range from 384Kb to 512Kb, because the checksums are missing, which
results in an -EIO error.
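Steps 1) to 4) can be illustrated with a simplified model of the lookup. This is only a sketch: covered_bytes() and struct csum_item are hypothetical names, and the walk below approximates the search behaviour described above rather than reproducing btrfs_lookup_csums_range():

```c
#include <stddef.h>
#include <stdint.h>

/* One checksum item: covers the byte range [start, start + len). */
struct csum_item {
	uint64_t start;
	uint64_t len;
};

/*
 * Simplified model of the lookup: position at the last item whose key is
 * <= start, then walk forward through the key-sorted items, counting how
 * many contiguous bytes of [start, end) have checksums.
 */
static uint64_t covered_bytes(const struct csum_item *items, size_t n,
			      uint64_t start, uint64_t end)
{
	size_t i = 0;
	uint64_t pos = start; /* first byte still lacking a checksum */

	while (i + 1 < n && items[i + 1].start <= start)
		i++;

	for (; i < n && pos < end; i++) {
		uint64_t item_end = items[i].start + items[i].len;

		if (items[i].start > pos)
			break;		/* hole: nothing covers pos */
		if (item_end <= pos)
			continue;	/* item ends before pos */
		pos = item_end < end ? item_end : end;
	}
	return pos - start;
}
```

With the two log tree items shown earlier, a lookup for 13959168 to 14155776 lands on item 7 and covers only 65536 bytes, leaving 131072 bytes (128Kb) without checksums. With item 6 alone, the same lookup covers all 196608 bytes.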
The following steps reproduce this scenario:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
$ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
$ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
$ xfs_io -c "fsync" /mnt/sdc/foobar
<power failure>
$ mount /dev/sdc /mnt/sdc
$ md5sum /mnt/sdc/foobar
md5sum: /mnt/sdc/foobar: Input/output error
$ dmesg | tail
[165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
[165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
[165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
[165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
[165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
[165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
[165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
[165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
[165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
[165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
Fix this by first deleting, from the log tree, any checksums for the
range of the extent we are logging at copy_items(). This ensures we do not
get checksum items in the log tree that have overlapping ranges.
This is a long-standing issue that has been present since the clone
(and deduplication) ioctl was introduced, and can happen both when an
extent is shared between different files and within the same file.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-05 16:58:30 +00:00
ASSERT ( root = = fs_info - > csum_root | |
root - > root_key . objectid = = BTRFS_TREE_LOG_OBJECTID ) ;
2008-12-10 09:10:46 -05:00
path = btrfs_alloc_path ( ) ;
2011-01-26 06:22:08 +00:00
if ( ! path )
return - ENOMEM ;
2008-12-10 09:10:46 -05:00
2009-01-05 21:25:51 -05:00
while ( 1 ) {
2008-12-10 09:10:46 -05:00
key . objectid = BTRFS_EXTENT_CSUM_OBJECTID ;
key . offset = end_byte - 1 ;
key . type = BTRFS_EXTENT_CSUM_KEY ;
ret = btrfs_search_slot ( trans , root , & key , path , - 1 , 1 ) ;
if ( ret > 0 ) {
if ( path - > slots [ 0 ] = = 0 )
2011-05-19 04:37:44 +00:00
break ;
2008-12-10 09:10:46 -05:00
path - > slots [ 0 ] - - ;
2011-01-28 18:44:44 +00:00
} else if ( ret < 0 ) {
2011-05-19 04:37:44 +00:00
break ;
2008-12-10 09:10:46 -05:00
}
2011-01-28 18:44:44 +00:00
2008-12-10 09:10:46 -05:00
leaf = path - > nodes [ 0 ] ;
btrfs_item_key_to_cpu ( leaf , & key , path - > slots [ 0 ] ) ;
if ( key . objectid ! = BTRFS_EXTENT_CSUM_OBJECTID | |
key . type ! = BTRFS_EXTENT_CSUM_KEY ) {
break ;
}
if ( key . offset > = end_byte )
break ;
csum_end = btrfs_item_size_nr ( leaf , path - > slots [ 0 ] ) / csum_size ;
csum_end < < = blocksize_bits ;
csum_end + = key . offset ;
/* this csum ends before we start, we're done */
if ( csum_end < = bytenr )
break ;
/* delete the entire item, it is inside our range */
if ( key . offset > = bytenr & & csum_end < = end_byte ) {
2017-01-28 01:47:56 +00:00
int del_nr = 1 ;
/*
* Check how many csum items preceding this one in this
* leaf correspond to our range and then delete them all
* at once .
*/
if ( key . offset > bytenr & & path - > slots [ 0 ] > 0 ) {
int slot = path - > slots [ 0 ] - 1 ;
while ( slot > = 0 ) {
struct btrfs_key pk ;
btrfs_item_key_to_cpu ( leaf , & pk , slot ) ;
if ( pk . offset < bytenr | |
pk . type ! = BTRFS_EXTENT_CSUM_KEY | |
pk . objectid ! =
BTRFS_EXTENT_CSUM_OBJECTID )
break ;
path - > slots [ 0 ] = slot ;
del_nr + + ;
key . offset = pk . offset ;
slot - - ;
}
}
ret = btrfs_del_items ( trans , root , path ,
path - > slots [ 0 ] , del_nr ) ;
2011-05-19 04:37:44 +00:00
if ( ret )
goto out ;
2008-12-16 13:51:01 -05:00
if ( key . offset = = bytenr )
break ;
2008-12-10 09:10:46 -05:00
} else if ( key . offset < bytenr & & csum_end > end_byte ) {
unsigned long offset ;
unsigned long shift_len ;
unsigned long item_offset ;
/*
* [ bytenr - len ]
* [ csum ]
*
* Our bytes are in the middle of the csum ,
* we need to split this item and insert a new one .
*
* But we can ' t drop the path because the
* csum could change , get removed , extended etc .
*
* The trick here is the max size of a csum item leaves
* enough room in the tree block for a single
* item header . So , we split the item in place ,
* adding a new header pointing to the existing
* bytes . Then we loop around again and we have
* a nicely formed csum item that we can neatly
* truncate .
*/
offset = ( bytenr - key . offset ) > > blocksize_bits ;
offset * = csum_size ;
shift_len = ( len > > blocksize_bits ) * csum_size ;
item_offset = btrfs_item_ptr_offset ( leaf ,
path - > slots [ 0 ] ) ;
2016-11-08 18:09:03 +01:00
memzero_extent_buffer ( leaf , item_offset + offset ,
2008-12-10 09:10:46 -05:00
shift_len ) ;
key . offset = bytenr ;
/*
* btrfs_split_item returns - EAGAIN when the
* item changed size or key
*/
ret = btrfs_split_item ( trans , root , path , & key , offset ) ;
2012-03-12 16:03:00 +01:00
if ( ret & & ret ! = - EAGAIN ) {
2016-06-10 18:19:25 -04:00
btrfs_abort_transaction ( trans , ret ) ;
2012-03-12 16:03:00 +01:00
goto out ;
}
2008-12-10 09:10:46 -05:00
key . offset = end_byte - 1 ;
} else {
2016-06-22 18:54:24 -04:00
truncate_one_csum ( fs_info , path , & key , bytenr , len ) ;
2008-12-16 13:51:01 -05:00
if ( key . offset < bytenr )
break ;
2008-12-10 09:10:46 -05:00
}
2011-04-21 01:20:15 +02:00
btrfs_release_path ( path ) ;
2008-12-10 09:10:46 -05:00
}
2011-05-19 04:37:44 +00:00
ret = 0 ;
2008-12-10 09:10:46 -05:00
out :
btrfs_free_path ( path ) ;
2011-05-19 04:37:44 +00:00
return ret ;
2008-12-10 09:10:46 -05:00
}
2008-02-20 12:07:25 -05:00
int btrfs_csum_file_blocks ( struct btrfs_trans_handle * trans ,
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:58:54 -05:00
struct btrfs_root * root ,
2008-07-17 12:53:50 -04:00
struct btrfs_ordered_sum * sums )
2007-03-29 15:15:27 -04:00
{
2016-06-22 18:54:23 -04:00
struct btrfs_fs_info * fs_info = root - > fs_info ;
2007-03-29 15:15:27 -04:00
struct btrfs_key file_key ;
2007-04-16 09:22:45 -04:00
struct btrfs_key found_key ;
2007-04-02 11:20:42 -04:00
struct btrfs_path * path ;
2007-03-29 15:15:27 -04:00
struct btrfs_csum_item * item ;
2008-02-20 12:07:25 -05:00
struct btrfs_csum_item * item_end ;
2007-10-15 16:22:25 -04:00
struct extent_buffer * leaf = NULL ;
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now we can get several checksum values at a time, which improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
u64 next_offset ;
u64 total_bytes = 0 ;
2007-04-16 09:22:45 -04:00
u64 csum_offset ;
2013-06-19 10:36:09 +08:00
u64 bytenr ;
2007-10-25 15:42:56 -04:00
u32 nritems ;
u32 ins_size ;
2013-06-19 10:36:09 +08:00
int index = 0 ;
int found_next ;
int ret ;
2020-07-02 11:27:30 +02:00
const u32 csum_size = fs_info - > csum_size ;
2008-02-20 12:07:25 -05:00
2007-04-02 11:20:42 -04:00
path = btrfs_alloc_path ( ) ;
btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-13 10:38:47 -07:00
if ( ! path )
return - ENOMEM ;
2008-02-20 12:07:25 -05:00
again :
next_offset = ( u64 ) - 1 ;
found_next = 0 ;
2013-06-19 10:36:09 +08:00
bytenr = sums - > bytenr + total_bytes ;
2008-12-08 16:58:54 -05:00
file_key . objectid = BTRFS_EXTENT_CSUM_OBJECTID ;
2013-06-19 10:36:09 +08:00
file_key . offset = bytenr ;
2014-06-04 18:41:45 +02:00
file_key . type = BTRFS_EXTENT_CSUM_KEY ;
2007-04-18 16:15:28 -04:00
2013-06-19 10:36:09 +08:00
item = btrfs_lookup_csum ( trans , root , path , bytenr , 1 ) ;
2007-10-15 16:22:25 -04:00
if ( ! IS_ERR ( item ) ) {
2008-08-28 06:15:25 -04:00
ret = 0 ;
2013-06-19 10:36:09 +08:00
leaf = path - > nodes [ 0 ] ;
item_end = btrfs_item_ptr ( leaf , path - > slots [ 0 ] ,
struct btrfs_csum_item ) ;
item_end = ( struct btrfs_csum_item * ) ( ( char * ) item_end +
btrfs_item_size_nr ( leaf , path - > slots [ 0 ] ) ) ;
2007-04-18 16:15:28 -04:00
goto found ;
2007-10-15 16:22:25 -04:00
}
2007-04-18 16:15:28 -04:00
ret = PTR_ERR ( item ) ;
2010-05-16 10:49:59 -04:00
if ( ret ! = - EFBIG & & ret ! = - ENOENT )
2020-05-18 12:15:18 +01:00
goto out ;
2010-05-16 10:49:59 -04:00
2007-04-18 16:15:28 -04:00
if ( ret = = - EFBIG ) {
u32 item_size ;
/* we found one, but it isn't big enough yet */
2007-10-15 16:14:19 -04:00
leaf = path - > nodes [ 0 ] ;
item_size = btrfs_item_size_nr ( leaf , path - > slots [ 0 ] ) ;
2008-12-02 07:17:45 -05:00
if ( ( item_size / csum_size ) > =
2016-06-22 18:54:23 -04:00
MAX_CSUM_ITEMS ( fs_info , csum_size ) ) {
2007-04-18 16:15:28 -04:00
/* already at max size, make a new one */
goto insert ;
}
} else {
2007-10-25 15:42:56 -04:00
int slot = path - > slots [ 0 ] + 1 ;
2007-04-18 16:15:28 -04:00
/* we didn't find a csum item, insert one */
2007-10-25 15:42:56 -04:00
nritems = btrfs_header_nritems ( path - > nodes [ 0 ] ) ;
2014-04-09 14:38:34 +01:00
if ( ! nritems | | ( path - > slots [ 0 ] > = nritems - 1 ) ) {
2007-10-25 15:42:56 -04:00
ret = btrfs_next_leaf ( root , path ) ;
2020-05-18 12:15:09 +01:00
if ( ret < 0 ) {
goto out ;
} else if ( ret > 0 ) {
2007-10-25 15:42:56 -04:00
found_next = 1 ;
goto insert ;
2020-05-18 12:15:09 +01:00
}
Btrfs: fix csum tree corruption, duplicate and outdated checksums
Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.
The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.
For example, consider the following scenario, where we have:
sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
Leaf N:
slot = 0 slot = btrfs_header_nritems() - 1
|-------------------------------------------------------------------|
| [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
|-------------------------------------------------------------------|
Leaf N + 1:
slot = 0 slot = btrfs_header_nritems() - 1
|--------------------------------------------------------------------|
| [(CSUM CSUM 40161280), size 32] ... [(CSUM CSUM 40615936), size 8] |
|--------------------------------------------------------------------|
Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will return us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:
slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1
|----------------------------------------------------------------------------------------------------|
| [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] [(CSUM CSUM 40161280), size 32] |
|----------------------------------------------------------------------------------------------------|
And incorrectly using slot 0 makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:
tmp = min((sums->len - total_bytes) >> blocksize_bits,
(next_offset - file_key.offset) >> blocksize_bits) =
min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
and
ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.
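The unsigned wraparound in the tmp computation above can be checked with a small standalone sketch. The helper names are hypothetical; the expression mirrors the one quoted in this message, not the exact kernel source:

```c
#include <stdint.h>

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Number of checksums placed in the newly inserted item. All operands are
 * unsigned 64-bit values, so next_offset - key_offset wraps around to a
 * huge number when next_offset is lower than key_offset.
 */
static uint64_t csums_to_insert(uint64_t sums_len, uint64_t total_bytes,
				uint64_t next_offset, uint64_t key_offset,
				unsigned int blocksize_bits)
{
	return min_u64((sums_len - total_bytes) >> blocksize_bits,
		       (next_offset - key_offset) >> blocksize_bits);
}
```

With the stale next_offset of 39239680, csums_to_insert(16384, 0, 39239680, 40157184, 12) yields 4, so all four checksums go into one new item. With the offset of the key that was actually moved into leaf N, 40161280, the same expression yields 1, so the new item would stop right before the existing item.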
So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:
Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-09 21:22:27 +01:00
slot = path - > slots [ 0 ] ;
2007-10-25 15:42:56 -04:00
}
btrfs_item_key_to_cpu ( path - > nodes [ 0 ] , & found_key , slot ) ;
2008-12-08 16:58:54 -05:00
if ( found_key . objectid ! = BTRFS_EXTENT_CSUM_OBJECTID | |
found_key . type ! = BTRFS_EXTENT_CSUM_KEY ) {
2007-10-25 15:42:56 -04:00
found_next = 1 ;
goto insert ;
}
next_offset = found_key . offset ;
found_next = 1 ;
2007-04-18 16:15:28 -04:00
goto insert ;
}
/*
btrfs: make checksum item extension more efficient
When we want to add checksums into the checksums tree, or a log tree, we
try whenever possible to extend existing checksum items, as this helps
reduce amount of metadata space used, since adding a new item uses extra
metadata space for a btrfs_item structure (25 bytes).
However we have two inefficiencies in the current approach:
1) After finding a checksum item that covers a range with an end offset
that matches the start offset of the checksum range we want to insert,
we release the search path populated by btrfs_lookup_csum() and then
do another COW search on tree with the goal of getting additional
space for at least one checksum. Doing this path release and then
searching again is a waste of time because very often the leaf already
has enough free space for at least one more checksum;
2) After the COW search that guarantees we get free space in the leaf for
at least one more checksum, we end up not doing the extension of the
previous checksum item, and fall back to insertion of a new checksum
item, if the leaf doesn't have an amount of free space larger than the
space required for 2 checksums plus one btrfs_item structure - this is
pointless for two reasons:
a) We want to extend an existing item, so we don't need to account for
a btrfs_item structure (25 bytes);
b) We made the COW search with an insertion size for 1 single checksum,
so if the leaf ends up with a free space amount smaller than 2
checksums plus the size of a btrfs_item structure, we give up on the
extension of the existing item and jump to the 'insert' label, where
we end up releasing the path and then doing yet another search to
insert a new checksum item for a single checksum.
Fix these inefficiencies by doing the following:
- For case 1), before releasing the path just check if the leaf already
has enough space for at least 1 more checksum, and if it does, jump
directly to the item extension code, without releasing our current path,
which was already COWed by btrfs_lookup_csum();
- For case 2), fix the logic so that for item extension we require only
that the leaf has enough free space for 1 checksum, and not a minimum
of 2 checksums plus space for a btrfs_item structure.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-18 12:15:00 +01:00
* At this point , we know the tree has a checksum item that ends at an
* offset matching the start of the checksum range we want to insert .
* We try to extend that item as much as possible and then add as many
* checksums to it as they fit .
*
* First check if the leaf has enough free space for at least one
* checksum . If it has , go directly to the item extension code , otherwise
* release the path and do a search for insertion before the extension .
2007-04-18 16:15:28 -04:00
*/
2020-05-18 12:15:00 +01:00
if (btrfs_leaf_free_space(leaf) >= csum_size) {
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
csum_offset = (bytenr - found_key.offset) >>
2020-07-01 21:19:09 +02:00
fs_info->sectorsize_bits;
2020-05-18 12:15:00 +01:00
goto extend_csum;
}
2011-04-21 01:20:15 +02:00
btrfs_release_path(path);
btrfs: correctly calculate item size used when item key collision happens
Item key collision is allowed for some item types, like dir item and
inode refs, but the overall item size is limited by the nodesize.
item size(ins_len) passed from btrfs_insert_empty_items to
btrfs_search_slot already contains size of btrfs_item.
When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
The check incorrectly reports that split leaf is required, because
it treats the space required by the newly inserted item as
btrfs_item + item data. But in item key collision case, only item data
is actually needed, the newly inserted item could merge into the existing
one. No new btrfs_item will be inserted.
And split_leaf returns EOVERFLOW from the following code:
if (extend && data_size + btrfs_item_size_nr(l, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
return -EOVERFLOW;
In most cases, when callers receive EOVERFLOW, they either return
this error or handle in different ways. For example, in normal dir item
creation the userspace will get errno EOVERFLOW; in inode ref case
INODE_EXTREF is used instead.
However, this is not the case for rename. To avoid the unrecoverable
situation in rename, btrfs_check_dir_item_collision is called in
early phase of rename. In this function, when item key collision is
detected leaf space is checked:
data_size = sizeof(*di) + name_len;
if (data_size + btrfs_item_size_nr(leaf, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
refers to existing item size, the condition here correctly calculates
the needed size for collision case rather than the wrong case above.
The consequence of inconsistent condition check between
btrfs_check_dir_item_collision and btrfs_search_slot when item key
collision happens is that we might pass check here but fail
later at btrfs_search_slot. Rename fails and the volume is forced read-only:
[436149.586170] ------------[ cut here ]------------
[436149.586173] BTRFS: Transaction aborted (error -75)
[436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1
[436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
[436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
[436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
[436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
[436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
[436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
[436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
[436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
[436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[436149.586296] Call Trace:
[436149.586311] vfs_rename+0x383/0x920
[436149.586313] ? vfs_rename+0x383/0x920
[436149.586315] do_renameat2+0x4ca/0x590
[436149.586317] __x64_sys_rename+0x20/0x30
[436149.586324] do_syscall_64+0x5a/0x120
[436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[436149.586332] RIP: 0033:0x7fa3133b1d37
[436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
[436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
[436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
[436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
Thanks to Hans van Kranenburg for information about crc32 hash collision
tools, I was able to reproduce the dir item collision with the following
python script:
https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py
Running it under a btrfs volume will trigger the transaction abort. It
simply creates files and renames them to forged names that lead to a
hash collision.
There are two ways to fix this. One is to simply revert the patch
878f2d2cb355 ("Btrfs: fix max dir item size calculation") to make the
condition consistent although that patch is correct about the size.
The other way is to handle the leaf space check correctly when
collision happens. I prefer the second one since it corrects the leaf
space check in the collision case. This fix will not account for
sizeof(struct btrfs_item) when the item already exists.
There are two places where ins_len doesn't contain
sizeof(struct btrfs_item), however.
1. extent-tree.c: lookup_inline_extent_backref
2. file-item.c: btrfs_csum_file_blocks
To make the logic of btrfs_search_slot clearer, we add a flag
search_for_extension in btrfs_path.
This flag indicates that ins_len passed to btrfs_search_slot doesn't
contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
will use the actual size needed to calculate the required leaf space.
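The accounting described above can be sketched as a standalone helper. It is only an illustration under assumptions: demo_required_leaf_space() and DEMO_ITEM_HDR_SIZE are hypothetical names, and the 25 byte header size stands in for sizeof(struct btrfs_item).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for sizeof(struct btrfs_item); assumed to be 25 bytes. */
#define DEMO_ITEM_HDR_SIZE 25u

/*
 * Hypothetical helper: how much free leaf space a search should demand.
 * When the key already exists the new data is merged into the existing
 * item, so no new item header is needed. If the caller's ins_len already
 * includes the header (search_for_extension not set), deduct it.
 */
static uint32_t demo_required_leaf_space(uint32_t ins_len, bool key_exists,
					 bool search_for_extension)
{
	if (key_exists && !search_for_extension) {
		assert(ins_len >= DEMO_ITEM_HDR_SIZE);
		return ins_len - DEMO_ITEM_HDR_SIZE;
	}
	/* New item, or ins_len never contained the header to begin with. */
	return ins_len;
}
```

For a colliding dir item with 100 bytes of data, a caller passing ins_len = 125 would then need only 100 free bytes, while a checksum extension passing ins_len = csum_size with the flag set is left unchanged.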
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: ethanwu <ethanwu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-01 17:25:12 +08:00
path->search_for_extension = 1;
2007-04-16 09:22:45 -04:00
ret = btrfs_search_slot(trans, root, &file_key, path,
2008-12-02 07:17:45 -05:00
csum_size, 1);
2020-12-01 17:25:12 +08:00
path->search_for_extension = 0;
2007-04-16 09:22:45 -04:00
if (ret < 0)
2020-05-18 12:15:18 +01:00
goto out;
2008-12-10 09:10:46 -05:00
if (ret > 0) {
if (path->slots[0] == 0)
goto insert;
path->slots[0]--;
2007-04-16 09:22:45 -04:00
}
2008-12-10 09:10:46 -05:00
2007-10-15 16:14:19 -04:00
leaf = path->nodes[0];
btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
2020-07-01 21:19:09 +02:00
csum_offset = (bytenr - found_key.offset) >> fs_info->sectorsize_bits;
2008-12-10 09:10:46 -05:00
2014-06-04 18:41:45 +02:00
if (found_key.type != BTRFS_EXTENT_CSUM_KEY ||
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
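The keying scheme above can be sketched in a few lines. Everything named demo_* or DEMO_* is hypothetical, and the two constants are placeholders rather than the real values from ctree.h; the index computation follows the csum_offset expression used in this file.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values, NOT the real magic numbers from ctree.h. */
#define DEMO_CSUM_OBJECTID 0x1234u
#define DEMO_CSUM_KEY_TYPE 1u

struct demo_csum_key {
	uint64_t objectid;	/* fixed for every checksum item */
	uint8_t type;
	uint64_t offset;	/* starting byte covered by the item */
};

/* Build the key of a checksum item that starts at extent_start. */
static struct demo_csum_key demo_csum_key_for(uint64_t extent_start)
{
	struct demo_csum_key key;

	key.objectid = DEMO_CSUM_OBJECTID;
	key.type = DEMO_CSUM_KEY_TYPE;
	key.offset = extent_start;
	return key;
}

/* Index of the checksum for byte address bytenr inside the item. */
static uint64_t demo_csum_index(struct demo_csum_key key, uint64_t bytenr,
				uint32_t sectorsize_bits)
{
	assert(bytenr >= key.offset);
	return (bytenr - key.offset) >> sectorsize_bits;
}
```

Because the key depends only on the on-disk byte address, the same item can be looked up from any tree that holds a copy of it.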
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:58:54 -05:00
found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
2016-06-22 18:54:23 -04:00
csum_offset >= MAX_CSUM_ITEMS(fs_info, csum_size)) {
2007-04-16 09:22:45 -04:00
goto insert;
}
2008-12-10 09:10:46 -05:00
2020-05-18 12:15:00 +01:00
extend_csum:
Btrfs: extend the checksum item as much as possible
For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.
When we add these checksums into the checksum tree, we use the ordered
sums list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.
The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.
Each insert or extend needs to update the tree with COW on, and we may
need to insert/extend many times.
That means what we've reserved for updating the checksum tree is NOT
enough.
The case is even more serious when several write threads run at the
same time: it can end up eating our reserved space quickly and start
eating the global reserve pool instead.
I haven't yet come up with a way to calculate the worst case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.
The idea behind this is that it can reduce the number of times we
insert/extend so that it saves us precious reserved space.
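A minimal sketch of how "as much as possible" might be computed, mirroring the clamping arithmetic in btrfs_csum_file_blocks(). The function name demo_extend_bytes() and its parameter list are hypothetical; the real code reads these values from the leaf, the ordered sum and fs_info.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes by which an existing checksum
 * item should grow. csum_offset is the index of the first new checksum
 * and is assumed to equal item_size / csum_size (append case).
 */
static uint32_t demo_extend_bytes(uint32_t csum_offset, uint32_t item_size,
				  uint64_t bytes_left, uint32_t sectorsize_bits,
				  uint32_t csum_size, uint32_t max_items,
				  uint32_t leaf_free)
{
	uint64_t nr = bytes_left >> sectorsize_bits;
	uint32_t diff;

	assert(csum_size > 0);
	if (nr < 1)
		nr = 1;			/* always add at least one checksum */

	/* Item size we would like to reach, capped at the per item maximum. */
	diff = (csum_offset + (uint32_t)nr) * csum_size;
	if (diff > max_items * csum_size)
		diff = max_items * csum_size;

	/* Keep only the growth, then cap it by the free space in the leaf. */
	assert(diff >= item_size);
	diff -= item_size;
	if (diff > leaf_free)
		diff = leaf_free;

	/* Round down to a whole number of checksums. */
	return diff / csum_size * csum_size;
}
```

For example, with three 4 byte checksums already stored, five sectors left to cover and only 10 free bytes in the leaf, the item grows by 8 bytes, that is two more checksums.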
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-04 13:12:18 +00:00
if (csum_offset == btrfs_item_size_nr(leaf, path->slots[0]) /
2008-12-02 07:17:45 -05:00
csum_size) {
2013-02-04 13:12:18 +00:00
int extend_nr;
u64 tmp;
u32 diff;
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.
By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum values at one time, which improved
the performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
tmp = sums->len - total_bytes;
2020-07-01 21:19:09 +02:00
tmp >>= fs_info->sectorsize_bits;
2013-02-04 13:12:18 +00:00
WARN_ON(tmp < 1);
extend_nr = max_t(int, 1, (int)tmp);
diff = (csum_offset + extend_nr) * csum_size;
2016-06-22 18:54:23 -04:00
diff = min(diff,
MAX_CSUM_ITEMS(fs_info, csum_size) * csum_size);
2008-12-10 09:10:46 -05:00
2007-10-15 16:14:19 -04:00
diff = diff - btrfs_item_size_nr(leaf, path->slots[0]);
2020-05-18 12:15:00 +01:00
diff = min_t(u32, btrfs_leaf_free_space(leaf), diff);
2013-02-04 13:12:18 +00:00
diff /= csum_size;
diff *= csum_size;
2008-12-10 09:10:46 -05:00
2019-03-20 14:51:10 +01:00
btrfs_extend_item(path, diff);
2013-06-19 10:36:09 +08:00
ret = 0;
2007-04-16 09:22:45 -04:00
goto csum;
}
insert:
2011-04-21 01:20:15 +02:00
btrfs_release_path(path);
2007-04-16 09:22:45 -04:00
csum_offset = 0;
2007-10-25 15:42:56 -04:00
if (found_next) {
2013-02-04 13:12:18 +00:00
u64 tmp;
Btrfs: move data checksumming into a dedicated tree
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have to touch
the subvolume and inodes as we cache extents.
* There is potentially one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by physical extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
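A minimal sketch of the indexing scheme described in this commit message, in stand-alone C. The struct, the function names, and the two constants are illustrative assumptions: the constants are intended to mirror the fixed objectid and key type that ctree.h defines for checksum items, but they should be treated as placeholders rather than authoritative values.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values; the real ones live in the btrfs headers. */
#define SKETCH_CSUM_OBJECTID ((uint64_t)-10)
#define SKETCH_CSUM_KEY_TYPE 128

/* Simplified stand-in for a btrfs key. */
struct sketch_key {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
};

/* Key of the checksum item that starts at disk_bytenr. */
static struct sketch_key make_csum_key(uint64_t disk_bytenr)
{
    struct sketch_key key = {
        .objectid = SKETCH_CSUM_OBJECTID, /* same for every csum item */
        .type = SKETCH_CSUM_KEY_TYPE,
        .offset = disk_bytenr,            /* starting byte on disk */
    };
    return key;
}

/*
 * Byte offset, inside an item starting at item_start, of the checksum
 * that covers the sector containing disk_bytenr.
 */
static uint64_t csum_byte_offset(uint64_t item_start, uint64_t disk_bytenr,
                                 uint32_t sectorsize_bits, uint32_t csum_size)
{
    assert(disk_bytenr >= item_start);
    return ((disk_bytenr - item_start) >> sectorsize_bits) * csum_size;
}
```

Because the key depends only on the on-disk byte address, the same item can be copied into another tree unchanged, which is the property the message relies on for the fsync log tree.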
2008-12-08 16:58:54 -05:00
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary: because the extents that btrfs_sector_sum points to are
contiguous, we can find the expected checksums from btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member left in the structure, so it makes
no sense to keep the structure; just remove it and use a u32 array to
store the checksum values.
With this change, we no longer use the while loop to get the checksums one
by one. Now we can get several checksum values at one time, which improved
performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-19 10:36:09 +08:00
tmp = sums - > len - total_bytes ;
2020-07-01 21:19:09 +02:00
tmp > > = fs_info - > sectorsize_bits ;
2013-02-04 13:12:18 +00:00
tmp = min ( tmp , ( next_offset - file_key . offset ) > >
2020-07-01 21:19:09 +02:00
fs_info - > sectorsize_bits ) ;
2013-02-04 13:12:18 +00:00
2016-12-15 14:38:28 +01:00
tmp = max_t ( u64 , 1 , tmp ) ;
tmp = min_t ( u64 , tmp , MAX_CSUM_ITEMS ( fs_info , csum_size ) ) ;
2008-12-02 07:17:45 -05:00
ins_size = csum_size * tmp ;
2007-10-25 15:42:56 -04:00
} else {
2008-12-02 07:17:45 -05:00
ins_size = csum_size ;
2007-10-25 15:42:56 -04:00
}
2007-04-02 11:20:42 -04:00
ret = btrfs_insert_empty_item ( trans , root , path , & file_key ,
2007-10-25 15:42:56 -04:00
ins_size ) ;
2007-06-22 14:16:25 -04:00
if ( ret < 0 )
2020-05-18 12:15:18 +01:00
goto out ;
2013-10-31 10:30:08 +05:30
if ( WARN_ON ( ret ! = 0 ) )
2020-05-18 12:15:18 +01:00
goto out ;
2007-10-15 16:14:19 -04:00
leaf = path - > nodes [ 0 ] ;
2013-06-19 10:36:09 +08:00
csum :
2007-10-15 16:14:19 -04:00
item = btrfs_item_ptr ( leaf , path - > slots [ 0 ] , struct btrfs_csum_item ) ;
2013-06-19 10:36:09 +08:00
item_end = ( struct btrfs_csum_item * ) ( ( unsigned char * ) item +
btrfs_item_size_nr ( leaf , path - > slots [ 0 ] ) ) ;
2007-05-10 12:36:17 -04:00
item = ( struct btrfs_csum_item * ) ( ( unsigned char * ) item +
2008-12-02 07:17:45 -05:00
csum_offset * csum_size ) ;
2007-04-17 13:26:50 -04:00
found :
2020-07-01 21:19:09 +02:00
ins_size = ( u32 ) ( sums - > len - total_bytes ) > > fs_info - > sectorsize_bits ;
2013-06-19 10:36:09 +08:00
ins_size * = csum_size ;
ins_size = min_t ( u32 , ( unsigned long ) item_end - ( unsigned long ) item ,
ins_size ) ;
write_extent_buffer ( leaf , sums - > sums + index , ( unsigned long ) item ,
ins_size ) ;
2019-05-22 10:19:01 +02:00
index + = ins_size ;
2013-06-19 10:36:09 +08:00
ins_size / = csum_size ;
2016-06-22 18:54:23 -04:00
total_bytes + = ins_size * fs_info - > sectorsize ;
2011-07-19 12:04:14 -04:00
2007-04-02 11:20:42 -04:00
btrfs_mark_buffer_dirty ( path - > nodes [ 0 ] ) ;
2008-07-17 12:53:50 -04:00
if ( total_bytes < sums - > len ) {
2011-04-21 01:20:15 +02:00
btrfs_release_path ( path ) ;
2009-03-13 11:00:37 -04:00
cond_resched ( ) ;
2008-02-20 12:07:25 -05:00
goto again ;
}
2008-08-15 15:34:18 -04:00
out :
2007-04-02 11:20:42 -04:00
btrfs_free_path ( path ) ;
2007-03-29 15:15:27 -04:00
return ret ;
}
2014-06-09 03:48:05 +01:00
2017-02-20 13:51:02 +02:00
void btrfs_extent_item_to_extent_map ( struct btrfs_inode * inode ,
2014-06-09 03:48:05 +01:00
const struct btrfs_path * path ,
struct btrfs_file_extent_item * fi ,
const bool new_inline ,
struct extent_map * em )
{
2018-06-29 10:56:42 +02:00
struct btrfs_fs_info * fs_info = inode - > root - > fs_info ;
2017-02-20 13:51:02 +02:00
struct btrfs_root * root = inode - > root ;
2014-06-09 03:48:05 +01:00
struct extent_buffer * leaf = path - > nodes [ 0 ] ;
const int slot = path - > slots [ 0 ] ;
struct btrfs_key key ;
u64 extent_start , extent_end ;
u64 bytenr ;
u8 type = btrfs_file_extent_type ( leaf , fi ) ;
int compress_type = btrfs_file_extent_compression ( leaf , fi ) ;
btrfs_item_key_to_cpu ( leaf , & key , slot ) ;
extent_start = key . offset ;
2020-03-09 12:41:06 +00:00
extent_end = btrfs_file_extent_end ( path ) ;
2014-06-09 03:48:05 +01:00
em - > ram_bytes = btrfs_file_extent_ram_bytes ( leaf , fi ) ;
if ( type = = BTRFS_FILE_EXTENT_REG | |
type = = BTRFS_FILE_EXTENT_PREALLOC ) {
em - > start = extent_start ;
em - > len = extent_end - extent_start ;
em - > orig_start = extent_start -
btrfs_file_extent_offset ( leaf , fi ) ;
em - > orig_block_len = btrfs_file_extent_disk_num_bytes ( leaf , fi ) ;
bytenr = btrfs_file_extent_disk_bytenr ( leaf , fi ) ;
if ( bytenr = = 0 ) {
em - > block_start = EXTENT_MAP_HOLE ;
return ;
}
if ( compress_type ! = BTRFS_COMPRESS_NONE ) {
set_bit ( EXTENT_FLAG_COMPRESSED , & em - > flags ) ;
em - > compress_type = compress_type ;
em - > block_start = bytenr ;
em - > block_len = em - > orig_block_len ;
} else {
bytenr + = btrfs_file_extent_offset ( leaf , fi ) ;
em - > block_start = bytenr ;
em - > block_len = em - > len ;
if ( type = = BTRFS_FILE_EXTENT_PREALLOC )
set_bit ( EXTENT_FLAG_PREALLOC , & em - > flags ) ;
}
} else if ( type = = BTRFS_FILE_EXTENT_INLINE ) {
em - > block_start = EXTENT_MAP_INLINE ;
em - > start = extent_start ;
em - > len = extent_end - extent_start ;
/*
* Initialize orig_start and block_len with the same values
* as in inode . c : btrfs_get_extent ( ) .
*/
em - > orig_start = EXTENT_MAP_HOLE ;
em - > block_len = ( u64 ) - 1 ;
if ( ! new_inline & & compress_type ! = BTRFS_COMPRESS_NONE ) {
set_bit ( EXTENT_FLAG_COMPRESSED , & em - > flags ) ;
em - > compress_type = compress_type ;
}
} else {
2016-06-22 18:54:23 -04:00
btrfs_err ( fs_info ,
2017-02-20 13:51:02 +02:00
" unknown file extent item type %d, inode %llu, offset %llu, "
" root %llu " , type , btrfs_ino ( inode ) , extent_start ,
2014-06-09 03:48:05 +01:00
root - > root_key . objectid ) ;
}
}
2020-03-09 12:41:06 +00:00
/*
* Returns the end offset ( non inclusive ) of the file extent item the given path
* points to . If it points to an inline extent , the returned offset is rounded
* up to the sector size .
*/
u64 btrfs_file_extent_end ( const struct btrfs_path * path )
{
const struct extent_buffer * leaf = path - > nodes [ 0 ] ;
const int slot = path - > slots [ 0 ] ;
struct btrfs_file_extent_item * fi ;
struct btrfs_key key ;
u64 end ;
btrfs_item_key_to_cpu ( leaf , & key , slot ) ;
ASSERT ( key . type = = BTRFS_EXTENT_DATA_KEY ) ;
fi = btrfs_item_ptr ( leaf , slot , struct btrfs_file_extent_item ) ;
if ( btrfs_file_extent_type ( leaf , fi ) = = BTRFS_FILE_EXTENT_INLINE ) {
end = btrfs_file_extent_ram_bytes ( leaf , fi ) ;
end = ALIGN ( key . offset + end , leaf - > fs_info - > sectorsize ) ;
} else {
end = key . offset + btrfs_file_extent_num_bytes ( leaf , fi ) ;
}
return end ;
}
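A minimal user-space sketch of the rounding rule documented for btrfs_file_extent_end ( ) above. The function names and the flattened parameter list are hypothetical; the real function reads these values from the leaf. It assumes the sector size is a power of two, which the bit-mask form of the alignment requires.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of a, where a is a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/*
 * Hypothetical flattened version of the end-offset calculation:
 * inline extents end at key_offset + ram_bytes rounded up to the
 * sector size; all other extents end at key_offset + num_bytes.
 */
static uint64_t file_extent_end(uint64_t key_offset, int is_inline,
                                uint64_t ram_bytes, uint64_t num_bytes,
                                uint32_t sectorsize)
{
    if (is_inline)
        return align_up(key_offset + ram_bytes, sectorsize);
    return key_offset + num_bytes;
}
```

For example, a 100-byte inline extent at offset 0 with 4096-byte sectors ends at 4096, while a regular 12288-byte extent at offset 8192 ends at 20480.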