License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file-by-file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
   SPDX license identifier                             # files
   ---------------------------------------------------|-------
   GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was on a */uapi/* path, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". The results were:
   SPDX license identifier                             # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
   SPDX license identifier                             # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                         270
   GPL-2.0+ WITH Linux-syscall-note                        169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)      21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)      17
   LGPL-2.1+ WITH Linux-syscall-note                        15
   GPL-1.0+ WITH Linux-syscall-note                         14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)      5
   LGPL-2.0+ WITH Linux-syscall-note                         4
   LGPL-2.1 WITH Linux-syscall-note                          3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)               1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
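The first heuristic above (no license traces found by either scanner) can be sketched in C as follows. This is only an illustration of the rule as described in this message, not the tooling that was actually used; the helper names path_is_uapi() and default_spdx_id() are hypothetical, and the */uapi/* match is approximated with a plain substring test.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if the path has a "uapi" directory component. */
static bool path_is_uapi(const char *path)
{
	assert(path != NULL);
	return strstr(path, "/uapi/") != NULL;
}

/*
 * Hypothetical helper: identifier for a file in which neither scanner
 * found any license traces, so the top level COPYING license applies.
 */
static const char *default_spdx_id(const char *path)
{
	return path_is_uapi(path) ? "GPL-2.0 WITH Linux-syscall-note"
				  : "GPL-2.0";
}
```

Under these assumptions, default_spdx_id("fs/ext4/file.c") yields "GPL-2.0", which matches the identifier this patch adds to that file.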
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
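The "different comment types" point can be sketched in C. The script itself is not shown in this message, so the following is an assumption-laden illustration rather than the script's logic: the helper names has_suffix() and spdx_tag_line() are hypothetical, and the two comment forms are the ones given in the kernel's license rules documentation (a // line for .c files, a /* */ line for headers).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true if name ends with the given suffix. */
static int has_suffix(const char *name, const char *suffix)
{
	size_t n = strlen(name), s = strlen(suffix);

	return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Hypothetical helper: format the SPDX line to place at the top of a file.
 * Returns what snprintf() returns, or -1 for a file type this sketch
 * does not handle.
 */
static int spdx_tag_line(const char *name, const char *id,
			 char *buf, size_t len)
{
	assert(name && id && buf);
	if (has_suffix(name, ".c"))
		return snprintf(buf, len, "// SPDX-License-Identifier: %s\n", id);
	if (has_suffix(name, ".h"))
		return snprintf(buf, len, "/* SPDX-License-Identifier: %s */\n", id);
	return -1;
}
```

For "fs/ext4/file.c" and "GPL-2.0" this produces the same first line that appears at the top of the file below.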
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 17:07:57 +03:00
// SPDX-License-Identifier: GPL-2.0
2006-10-11 12:20:50 +04:00
/*
2006-10-11 12:20:53 +04:00
 *  linux/fs/ext4/file.c
2006-10-11 12:20:50 +04:00
 *
 * Copyright (C) 1992, 1993, 1994, 1995
 * Remy Card (card@masi.ibp.fr)
 * Laboratoire MASI - Institut Blaise Pascal
 * Universite Pierre et Marie Curie (Paris VI)
 *
 *  from
 *
 *  linux/fs/minix/file.c
 *
 *  Copyright (C) 1991, 1992  Linus Torvalds
 *
2006-10-11 12:20:53 +04:00
 *  ext4 fs regular file handling primitives
2006-10-11 12:20:50 +04:00
 *
 *  64-bit file support on 64-bit platforms by Jakub Jelinek
 *	(jj@sunsite.ms.mff.cuni.cz)
 */
#include <linux/time.h>
#include <linux/fs.h>
2017-10-02 00:58:54 +03:00
#include <linux/iomap.h>
2009-06-13 18:09:48 +04:00
#include <linux/mount.h>
#include <linux/path.h>
2015-09-09 00:58:40 +03:00
#include <linux/dax.h>
2010-03-03 17:05:07 +03:00
#include <linux/quotaops.h>
2012-11-09 06:57:40 +04:00
#include <linux/pagevec.h>
2015-02-22 19:58:50 +03:00
#include <linux/uio.h>
2017-11-01 18:36:45 +03:00
#include <linux/mman.h>
2019-11-05 15:02:39 +03:00
#include <linux/backing-dev.h>
2008-04-30 02:13:32 +04:00
#include "ext4.h"
#include "ext4_jbd2.h"
2006-10-11 12:20:50 +04:00
#include "xattr.h"
#include "acl.h"
2019-11-05 15:01:51 +03:00
#include "truncate.h"
2006-10-11 12:20:50 +04:00
2022-08-27 09:58:47 +03:00
/*
 * Returns %true if the given DIO request should be attempted with DIO, or
 * %false if it should fall back to buffered I/O.
 *
 * DIO isn't well specified; when it's unsupported (either due to the request
 * being misaligned, or due to the file not supporting DIO at all), filesystems
 * either fall back to buffered I/O or return EINVAL.  For files that don't use
 * any special features like encryption or verity, ext4 has traditionally
 * returned EINVAL for misaligned DIO.  iomap_dio_rw() uses this convention too.
 * In this case, we should attempt the DIO, *not* fall back to buffered I/O.
 *
 * In contrast, in cases where DIO is unsupported due to ext4 features, ext4
 * traditionally falls back to buffered I/O.
 *
 * This function implements the traditional ext4 behavior in all these cases.
 */
static bool ext4_should_use_dio(struct kiocb *iocb, struct iov_iter *iter)
2019-11-05 15:01:37 +03:00
{
ext4: support direct I/O with fscrypt using blk-crypto
Encrypted files traditionally haven't supported DIO, due to the need to
encrypt/decrypt the data. However, when the encryption is implemented
using inline encryption (blk-crypto) instead of the traditional
filesystem-layer encryption, it is straightforward to support DIO.
Therefore, make ext4 support DIO on files that are using inline
encryption. Since ext4 uses iomap for DIO, and fscrypt support was
already added to iomap DIO, this just requires two small changes:
- Let DIO proceed when supported, by checking fscrypt_dio_supported()
instead of assuming that encrypted files never support DIO.
- In ext4_iomap_begin(), use fscrypt_limit_io_blocks() to limit the
length of the mapping in the rare case where a DUN discontiguity
occurs in the middle of an extent. The iomap DIO implementation
requires this, since it assumes that it can submit a bio covering (up
to) the whole mapping, without checking fscrypt constraints itself.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20220128233940.79464-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2022-01-29 02:39:38 +03:00
	struct inode *inode = file_inode(iocb->ki_filp);
2022-08-27 09:58:47 +03:00
	u32 dio_align = ext4_dio_alignment(inode);
2022-01-29 02:39:38 +03:00
2022-08-27 09:58:47 +03:00
	if (dio_align == 0)
2019-11-05 15:01:37 +03:00
		return false;
2022-08-27 09:58:47 +03:00
	if (dio_align == 1)
		return true;
	return IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter), dio_align);
2019-11-05 15:01:37 +03:00
}
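The three-way decision in ext4_should_use_dio() can be modelled in userspace as a minimal sketch. Everything here is a stand-in: is_aligned() approximates the kernel's IS_ALIGNED() for power-of-two alignments, iter_align is assumed to play the role of iov_iter_alignment() (roughly the OR of the buffer addresses and lengths), and the reading of dio_align 0 as "DIO unsupported" and 1 as "no alignment restriction checked here" is inferred from the comment above rather than from ext4_dio_alignment() itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for IS_ALIGNED(); align must be a power of two. */
static bool is_aligned(uint64_t x, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x & (uint64_t)(align - 1)) == 0;
}

/*
 * Hypothetical model of the decision: pos is the file offset and
 * iter_align stands in for the result of iov_iter_alignment().
 */
static bool should_use_dio(uint64_t pos, uint64_t iter_align,
			   uint32_t dio_align)
{
	if (dio_align == 0)	/* assumed: DIO unsupported, use buffered I/O */
		return false;
	if (dio_align == 1)	/* assumed: attempt DIO whatever the alignment */
		return true;
	return is_aligned(pos | iter_align, dio_align);
}
```

ORing the offset with the buffer alignment lets a single mask test reject a request when either one has a low bit set.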
static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	ssize_t ret;
	struct inode *inode = file_inode(iocb->ki_filp);
	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock_shared(inode))
			return -EAGAIN;
	} else {
		inode_lock_shared(inode);
	}
2022-08-27 09:58:47 +03:00
	if (!ext4_should_use_dio(iocb, to)) {
2019-11-05 15:01:37 +03:00
		inode_unlock_shared(inode);
		/*
		 * Fallback to buffered I/O if the operation being performed on
		 * the inode is not supported by direct I/O. The IOCB_DIRECT
		 * flag needs to be cleared here in order to ensure that the
		 * direct I/O path within generic_file_read_iter() is not
		 * taken.
		 */
		iocb->ki_flags &= ~IOCB_DIRECT;
		return generic_file_read_iter(iocb, to);
	}
2022-05-05 23:11:11 +03:00
	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, NULL, 0);
2019-11-05 15:01:37 +03:00
	inode_unlock_shared(inode);
	file_accessed(iocb->ki_filp);
	return ret;
}
2016-11-21 01:36:06 +03:00
#ifdef CONFIG_FS_DAX
static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;
2019-12-12 08:55:55 +03:00
	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock_shared(inode))
2017-06-20 15:05:47 +03:00
			return -EAGAIN;
2019-12-12 08:55:55 +03:00
	} else {
2017-06-20 15:05:47 +03:00
		inode_lock_shared(inode);
	}
2016-11-21 01:36:06 +03:00
	/*
	 * Recheck under inode lock - at this point we are sure it cannot
	 * change anymore
	 */
	if (!IS_DAX(inode)) {
		inode_unlock_shared(inode);
		/* Fallback to buffered IO in case we cannot support DAX */
		return generic_file_read_iter(iocb, to);
	}
	ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
	inode_unlock_shared(inode);
	file_accessed(iocb->ki_filp);
	return ret;
}
#endif
static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
2019-11-05 15:01:37 +03:00
	struct inode *inode = file_inode(iocb->ki_filp);
2023-06-16 19:50:49 +03:00
	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
2017-02-05 09:28:48 +03:00
		return -EIO;
2016-11-21 01:36:06 +03:00
	if (!iov_iter_count(to))
		return 0; /* skip atime */
#ifdef CONFIG_FS_DAX
2019-11-05 15:01:37 +03:00
	if (IS_DAX(inode))
2016-11-21 01:36:06 +03:00
		return ext4_dax_read_iter(iocb, to);
#endif
2019-11-05 15:01:37 +03:00
	if (iocb->ki_flags & IOCB_DIRECT)
		return ext4_dio_read_iter(iocb, to);
2016-11-21 01:36:06 +03:00
	return generic_file_read_iter(iocb, to);
}
2023-05-22 16:50:05 +03:00
static ssize_t ext4_file_splice_read(struct file *in, loff_t *ppos,
				     struct pipe_inode_info *pipe,
				     size_t len, unsigned int flags)
{
	struct inode *inode = file_inode(in);
2023-06-16 19:50:49 +03:00
	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
2023-05-22 16:50:05 +03:00
		return -EIO;
2023-05-22 16:50:15 +03:00
	return filemap_splice_read(in, ppos, pipe, len, flags);
2023-05-22 16:50:05 +03:00
}
2006-10-11 12:20:50 +04:00
/*
 * Called when an inode is released. Note that this is different
2006-10-11 12:20:53 +04:00
 * from ext4_file_open: open gets called at every open, but release
2006-10-11 12:20:50 +04:00
 * gets called only when /all/ the files are closed.
 */
2008-09-09 06:25:24 +04:00
static int ext4_release_file(struct inode *inode, struct file *filp)
2006-10-11 12:20:50 +04:00
{
2010-01-24 22:34:07 +03:00
	if (ext4_test_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE)) {
2009-02-24 16:21:14 +03:00
		ext4_alloc_da_blocks(inode);
2010-01-24 22:34:07 +03:00
		ext4_clear_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE);
2009-02-24 16:21:14 +03:00
	}
2006-10-11 12:20:50 +04:00
	/* if we are the last writer on the inode, drop the block reservation */
	if ((filp->f_mode & FMODE_WRITE) &&
2009-03-28 05:36:43 +03:00
			(atomic_read(&inode->i_writecount) == 1) &&
2020-06-14 07:45:44 +03:00
			!EXT4_I(inode)->i_reserved_data_blocks) {
2008-01-29 07:58:26 +03:00
		down_write(&EXT4_I(inode)->i_data_sem);
2024-01-05 12:21:01 +03:00
		ext4_discard_preallocations(inode);
2008-01-29 07:58:26 +03:00
		up_write(&EXT4_I(inode)->i_data_sem);
2006-10-11 12:20:50 +04:00
	}
	if (is_dx(inode) && filp->private_data)
2006-10-11 12:20:53 +04:00
		ext4_htree_free_dir_info(filp->private_data);
2006-10-11 12:20:50 +04:00
	return 0;
}
ext4: serialize unaligned asynchronous DIO
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.
The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.
Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.
I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.
This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.
Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.
[tytso@mit.edu: Keep the mutex as a hashed array instead
of bloating the ext4 inode]
[tytso@mit.edu: Fix up namespace issues so that global
variables are protected with an "ext4_" prefix.]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 16:17:34 +03:00
/*
 * This tests whether the IO in question is block-aligned or not.
 * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
 * are converted to written only after the IO is complete.  Until they are
 * mapped, these blocks appear as holes, so dio_zero_block() will assume that
 * it needs to zero out portions of the start and/or end block.  If 2 AIO
 * threads are at work on the same unwritten block, they must be synchronized
 * or one thread will zero the other's data, causing corruption.
*/
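The block-alignment test described in this comment comes down to one mask operation. A minimal userspace sketch follows, assuming a power-of-two block size; unaligned_io() is a hypothetical stand-in for the kernel function, and iter_align stands in for the value of iov_iter_alignment().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: true if either the file offset or the buffer
 * alignment has any bit set below the block size.
 */
static bool unaligned_io(uint64_t pos, uint64_t iter_align,
			 uint32_t blocksize)
{
	uint64_t blockmask;

	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	blockmask = (uint64_t)blocksize - 1;
	return ((pos | iter_align) & blockmask) != 0;
}
```

As a worked case, the sector-63 layout mentioned in the commit message above starts at byte 63 * 512 = 32256. With 4096-byte blocks, 32256 & 4095 = 3584, which is non-zero, so such a write counts as unaligned; with 512-byte blocks the same offset is aligned.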
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
Earlier there was no shared lock in DIO read path. But this patch
(16c54688592ce: ext4: Allow parallel DIO reads)
simplified some of the locking mechanism while still allowing for parallel DIO
reads by adding shared lock in inode DIO read path.
But this created a problem with mixed read/write workloads. It is due to the fact
that in the DIO path, we first start with an exclusive lock and only when we determine
that it is an overwrite IO, we downgrade the lock. This causes the problem, since
we still have shared locking in DIO reads.
So, this patch tries to fix this issue by starting with shared lock and then
switching to exclusive lock only when required based on ext4_dio_write_checks().
Other than that, it also simplifies below cases:-
1. Simplified ext4_unaligned_aio API to ext4_unaligned_io. The previous API was
abused in the sense that it was not really checking for AIO anywhere; it also
used to check for extending writes. So this API was renamed and simplified to
ext4_unaligned_io() which actually only checks if the IO is really unaligned.
Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of the partial
block and that will require serialization against other direct IOs in the same
block. So we take an exclusive inode lock for any unaligned DIO. In case of AIO
we also need to wait for any outstanding IOs to complete so that conversion from
unwritten to written is completed before anyone tries to map the overlapping block.
Hence we take the exclusive inode lock and also wait in inode_dio_wait() for the
unaligned DIO case. Please note since we are anyway taking an exclusive lock in
unaligned IO, inode_dio_wait() becomes a no-op in case of non-AIO DIO.
2. Added ext4_extending_io(). This checks if the IO is extending the file.
3. Added ext4_dio_write_checks(). In this we start with shared inode lock and
only switch to exclusive lock if required. So in most cases with aligned,
non-extending, dioread_nolock & overwrites, it tries to write with a shared
lock. If not, then we restart the operation in ext4_dio_write_checks(), after
acquiring exclusive lock.
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-3-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-12 08:55:56 +03:00
static bool
ext4_unaligned_io(struct inode *inode, struct iov_iter *from, loff_t pos)
2011-02-12 16:17:34 +03:00
{
	struct super_block *sb = inode->i_sb;
2019-12-12 08:55:56 +03:00
	unsigned long blockmask = sb->s_blocksize - 1;
2011-02-12 16:17:34 +03:00
2014-04-18 00:09:22 +04:00
	if ((pos | iov_iter_alignment(from)) & blockmask)
2019-12-12 08:55:56 +03:00
		return true;
2011-02-12 16:17:34 +03:00
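As a rough illustration of what "non-block-aligned" means in the message above, the following userspace sketch flags an I/O whose start or end does not fall on a block boundary. The name io_is_unaligned and the power-of-two block size are assumptions made for this example; this is not the ext4 implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from ext4): returns true when the byte
 * range [pos, pos + len) does not both start and end on a block boundary.
 * Assumes blocksize is a power of two.
 */
static bool io_is_unaligned(uint64_t pos, size_t len, uint32_t blocksize)
{
	uint64_t mask = (uint64_t)blocksize - 1;

	/* Any low bit set in either the start or the end offset means unaligned. */
	return ((pos | (pos + len)) & mask) != 0;
}
```

For instance, a partition that starts at sector 63 with 512-byte sectors begins at byte 32256, which is not a multiple of 4096, so a check like this would report I/O at that offset as unaligned for 4096-byte blocks.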
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
Earlier there was no shared lock in DIO read path. But this patch
(16c54688592ce: ext4: Allow parallel DIO reads)
simplified some of the locking mechanism while still allowing for parallel DIO
reads by adding shared lock in inode DIO read path.
But this created a problem with mixed read/write workloads. It is due to the
fact that in the DIO path we first start with an exclusive lock, and only when
we determine that it is an overwrite IO do we downgrade the lock. This causes
the problem, since we still have shared locking in DIO reads.
So, this patch tries to fix this issue by starting with shared lock and then
switching to exclusive lock only when required based on ext4_dio_write_checks().
Other than that, it also simplifies the cases below:
1. Simplified ext4_unaligned_aio API to ext4_unaligned_io. The previous API was
abused in the sense that it was not really checking for AIO anywhere, and it
also used to check for extending writes. So this API was renamed and simplified
to ext4_unaligned_io(), which actually only checks if the IO is really unaligned.
Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of the
partial block, and that will require serialization against other direct IOs in
the same block. So we take an exclusive inode lock for any unaligned DIO. In
case of AIO we also need to wait for any outstanding IOs to complete so that
conversion from unwritten to written is completed before anyone tries to map the
overlapping block. Hence we take the exclusive inode lock and also call
inode_dio_wait() for the unaligned DIO case. Please note that since we are
taking an exclusive lock anyway for unaligned IO, inode_dio_wait() becomes a
no-op in case of non-AIO DIO.
2. Added ext4_extending_io(). This checks if the IO is extending the file.
3. Added ext4_dio_write_checks(). In this we start with shared inode lock and
only switch to exclusive lock if required. So in most cases with aligned,
non-extending, dioread_nolock & overwrites, it tries to write with a shared
lock. If not, then we restart the operation in ext4_dio_write_checks(), after
acquiring exclusive lock.
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-3-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-12 08:55:56 +03:00
	return false;
}
static bool
ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
{
	if (offset + len > i_size_read(inode) ||
	    offset + len > EXT4_I(inode)->i_disksize)
		return true;
	return false;
2011-02-12 16:17:34 +03:00
}
ext4: dio take shared inode lock when overwriting preallocated blocks
In the dio write path, we only take shared inode lock for the case of
aligned overwriting initialized blocks inside EOF. But for overwriting
preallocated blocks, it may only need to split unwritten extents, this
procedure has been protected under i_data_sem lock, it's safe to
release the exclusive inode lock and take shared inode lock.
This could give a significant speed up for multi-threaded writes. Test
on Intel Xeon Gold 6140 and nvme SSD with below fio parameters.
direct=1
ioengine=libaio
iodepth=10
numjobs=10
runtime=60
rw=randwrite
size=100G
And the test results are:
Before:
bs=4k IOPS=11.1k, BW=43.2MiB/s
bs=16k IOPS=11.1k, BW=173MiB/s
bs=64k IOPS=11.2k, BW=697MiB/s
After:
bs=4k IOPS=41.4k, BW=162MiB/s
bs=16k IOPS=41.3k, BW=646MiB/s
bs=64k IOPS=13.5k, BW=843MiB/s
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221226062015.3479416-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-12-26 09:20:15 +03:00
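The figures quoted above can be sanity-checked with a little arithmetic: bandwidth should be roughly IOPS times block size, and the speedup is the ratio of the "after" and "before" IOPS. The helper names below are illustrative only, and the inputs are the rounded numbers from the message, so small differences from the quoted bandwidths are expected.

```c
/* Ratio of the "after" figure to the "before" figure. */
static double speedup(double before, double after)
{
	return after / before;
}

/* Bandwidth in MiB/s implied by an IOPS figure and a block size in KiB. */
static double implied_bw_mib(double iops, double bs_kib)
{
	return iops * bs_kib / 1024.0;
}
```

With the quoted values, 4k and 16k writes come out at roughly 3.7 times the earlier IOPS, while 64k writes improve by about 1.2 times.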
/* Is IO overwriting allocated or initialized blocks? */
static bool ext4_overwrite_io(struct inode *inode,
			      loff_t pos, loff_t len, bool *unwritten)
2016-11-21 01:29:51 +03:00
{
	struct ext4_map_blocks map;
	unsigned int blkbits = inode->i_blkbits;
	int err, blklen;
	if (pos + len > i_size_read(inode))
		return false;
	map.m_lblk = pos >> blkbits;
	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
	blklen = map.m_len;
	err = ext4_map_blocks(NULL, inode, &map, 0);
2022-12-26 09:20:15 +03:00
	if (err != blklen)
		return false;
2016-11-21 01:29:51 +03:00
	/*
	 * 'err == len' means that all of the blocks have been preallocated,
2022-12-26 09:20:15 +03:00
	 * regardless of whether they have been initialized or not. We need to
	 * check m_flags to distinguish the unwritten extents.
2016-11-21 01:29:51 +03:00
	 */
2022-12-26 09:20:15 +03:00
	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
	return true;
2016-11-21 01:29:51 +03:00
}
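The commit messages in this file describe a policy in which a direct I/O write keeps the shared inode lock only for aligned, non-extending overwrites. The sketch below models that policy as a pure function in userspace C. The enum, the function name, and the reduction to three boolean inputs are assumptions made for illustration; the real ext4_dio_write_checks() looks at more state than this.

```c
#include <stdbool.h>

enum lock_mode { LOCK_SHARED, LOCK_EXCLUSIVE };

/*
 * Simplified, hypothetical model of the lock choice:
 *  - unaligned I/O needs the exclusive lock (partial block zeroing),
 *  - extending I/O needs it (i_disksize update, orphan handling),
 *  - anything that is not an overwrite needs it,
 *  - otherwise the shared lock is enough.
 */
static enum lock_mode dio_write_lock_mode(bool unaligned, bool extending,
					  bool overwrite)
{
	if (unaligned || extending || !overwrite)
		return LOCK_EXCLUSIVE;
	return LOCK_SHARED;
}
```

Modelling the decision as a function of a few flags makes it easy to see that the shared lock is the narrow case and that every other combination falls back to the exclusive lock.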
2019-12-12 08:55:56 +03:00
static ssize_t ext4_generic_write_checks(struct kiocb *iocb,
					 struct iov_iter *from)
2016-11-21 01:29:51 +03:00
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;
2019-11-05 15:02:39 +03:00
	if (unlikely(IS_IMMUTABLE(inode)))
		return -EPERM;
2016-11-21 01:29:51 +03:00
	ret = generic_write_checks(iocb, from);
	if (ret <= 0)
		return ret;
2019-06-10 05:04:33 +03:00
2016-11-21 01:29:51 +03:00
	/*
	 * If we have encountered a bitmap-format file, the size limit
	 * is smaller than s_maxbytes, which is for extent-mapped files.
	 */
	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
		struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
		if (iocb->ki_pos >= sbi->s_bitmap_maxbytes)
			return -EFBIG;
		iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
	}
2019-11-05 15:02:39 +03:00
2019-12-12 08:55:56 +03:00
	return iov_iter_count(from);
}
static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
{
	ssize_t ret, count;
	count = ext4_generic_write_checks(iocb, from);
	if (count <= 0)
		return count;
2019-11-05 15:02:39 +03:00
	ret = file_modified(iocb->ki_filp);
	if (ret)
		return ret;
2019-12-12 08:55:56 +03:00
	return count;
2016-11-21 01:29:51 +03:00
}
2019-11-05 15:02:39 +03:00
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
					struct iov_iter *from)
{
	ssize_t ret;
	struct inode *inode = file_inode(iocb->ki_filp);
	if (iocb->ki_flags & IOCB_NOWAIT)
		return -EOPNOTSUPP;
	inode_lock(inode);
	ret = ext4_write_checks(iocb, from);
	if (ret <= 0)
		goto out;
2022-02-20 07:19:49 +03:00
	ret = generic_perform_write(iocb, from);
2019-11-05 15:02:39 +03:00
out:
	inode_unlock(inode);
2023-06-01 17:58:55 +03:00
	if (unlikely(ret <= 0))
		return ret;
	return generic_write_sync(iocb, ret);
2019-11-05 15:02:39 +03:00
}
2019-11-05 15:01:51 +03:00
static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
2023-10-13 15:13:50 +03:00
					   ssize_t count)
2019-11-05 15:01:51 +03:00
{
	handle_t *handle;
2023-10-13 15:13:50 +03:00
	lockdep_assert_held_write(&inode->i_rwsem);
2019-11-05 15:01:51 +03:00
	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
2023-10-13 15:13:50 +03:00
	if (IS_ERR(handle))
		return PTR_ERR(handle);
2019-11-05 15:01:51 +03:00
2023-10-13 15:13:50 +03:00
	if (ext4_update_inode_size(inode, offset + count)) {
		int ret = ext4_mark_inode_dirty(handle, inode);
2020-04-27 04:34:37 +03:00
		if (unlikely(ret)) {
			ext4_journal_stop(handle);
2023-10-13 15:13:50 +03:00
			return ret;
2020-04-27 04:34:37 +03:00
		}
	}
2019-11-05 15:01:51 +03:00
2023-10-13 15:13:50 +03:00
	if (inode->i_nlink)
2019-11-05 15:01:51 +03:00
		ext4_orphan_del(handle, inode);
	ext4_journal_stop(handle);
2023-10-13 15:13:50 +03:00
	return count;
}
/*
 * Clean up the inode after DIO or DAX extending write has completed and the
 * inode size has been updated using ext4_handle_inode_extension().
 */
static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
{
	lockdep_assert_held_write(&inode->i_rwsem);
	if (count < 0) {
2019-11-05 15:01:51 +03:00
		ext4_truncate_failed_write(inode);
		/*
		 * If the truncate operation failed early, then the inode may
		 * still be on the orphan list. In that case, we need to try
		 * remove the inode from the in-memory linked list.
		 */
		if (inode->i_nlink)
			ext4_orphan_del(NULL, inode);
2023-10-13 15:13:50 +03:00
		return;
2019-11-05 15:01:51 +03:00
	}
2023-10-13 15:13:50 +03:00
	/*
2023-11-30 12:56:53 +03:00
	 * If i_disksize got extended either due to writeback of delalloc
	 * blocks or extending truncate while the DIO was running we could fail
	 * to cleanup the orphan list in ext4_handle_inode_extension(). Do it
	 * now.
2023-10-13 15:13:50 +03:00
	 */
	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
		handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
2019-11-05 15:01:51 +03:00
2023-10-13 15:13:50 +03:00
		if (IS_ERR(handle)) {
			/*
			 * The write has successfully completed. Not much to
			 * do with the error here so just cleanup the orphan
			 * list and hope for the best.
			 */
			ext4_orphan_del(NULL, inode);
			return;
		}
		ext4_orphan_del(handle, inode);
		ext4_journal_stop(handle);
	}
2019-11-05 15:01:51 +03:00
}
2019-11-05 15:02:39 +03:00
static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
				 int error, unsigned int flags)
{
2021-04-15 18:54:17 +03:00
	loff_t pos = iocb->ki_pos;
2019-11-05 15:02:39 +03:00
	struct inode *inode = file_inode(iocb->ki_filp);
2023-10-13 15:13:50 +03:00
	if (!error && size && flags & IOMAP_DIO_UNWRITTEN)
		error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
2019-11-05 15:02:39 +03:00
	if (error)
		return error;
2021-04-15 18:54:17 +03:00
	/*
2023-10-13 15:13:50 +03:00
	 * Note that EXT4_I(inode)->i_disksize can get extended up to
	 * inode->i_size while the I/O was running due to writeback of delalloc
	 * blocks. But the code in ext4_iomap_alloc() is careful to use
	 * zeroed/unwritten extents if this is possible; thus we won't leave
	 * uninitialized blocks in a file even if we didn't succeed in writing
2023-11-30 12:56:53 +03:00
	 * as much as we intended. Also we can race with truncate or write
	 * expanding the file so we have to be a bit careful here.
2021-04-15 18:54:17 +03:00
	 */
2023-11-30 12:56:53 +03:00
	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize) &&
	    pos + size <= i_size_read(inode))
2023-10-13 15:13:50 +03:00
		return size;
	return ext4_handle_inode_extension(inode, pos, size);
2019-11-05 15:02:39 +03:00
}
static const struct iomap_dio_ops ext4_dio_write_ops = {
	.end_io = ext4_dio_write_end_io,
};
2019-12-12 08:55:56 +03:00
/*
 * The intention here is to start with shared lock acquired then see if any
 * condition requires an exclusive inode lock. If yes, then we restart the
 * whole operation by releasing the shared lock and acquiring exclusive lock.
 *
 * - For unaligned_io we never take shared lock as it may cause data corruption
 *   when two unaligned IO tries to modify the same block e.g. while zeroing.
 *
 * - For extending writes case we don't take the shared lock, since it requires
 *   updating inode i_disksize and/or orphan handling with exclusive lock.
 *
2022-12-26 09:20:15 +03:00
 * - shared locking will only be true mostly with overwrites, including
 *   initialized blocks and unwritten blocks. For overwrite unwritten blocks
 *   we protect splitting extents by i_data_sem in ext4_inode_info, so we can
 *   also release exclusive i_rwsem lock.
 *
 * - Otherwise we will switch to exclusive i_rwsem lock.
2019-12-12 08:55:56 +03:00
 */
static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
2022-12-26 09:20:15 +03:00
				     bool *ilock_shared, bool *extend,
2023-03-14 16:07:59 +03:00
				     bool *unwritten, int *dio_flags)
2019-12-12 08:55:56 +03:00
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
loff_t offset;
size_t count;
ssize_t ret;
2023-03-14 16:07:59 +03:00
bool overwrite, unaligned_io;
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
2019-12-12 08:55:56 +03:00
restart:
ret = ext4_generic_write_checks(iocb, from);
if (ret <= 0)
goto out;
offset = iocb->ki_pos;
count = ret;
2023-03-14 16:07:59 +03:00
unaligned_io = ext4_unaligned_io(inode, from, offset);
*extend = ext4_extending_io(inode, offset, count);
overwrite = ext4_overwrite_io(inode, offset, count, unwritten);
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
2019-12-12 08:55:56 +03:00
/*
2023-03-14 16:07:59 +03:00
* Determine whether we need to upgrade to an exclusive lock. This is
* required to change security info in file_modified(), for extending
* I/O, any form of non-overwrite I/O, and unaligned I/O to unwritten
* extents (as partial block zeroing may be required).
ext4: drop dio overwrite only flag and associated warning
The commit referenced below opened up concurrent unaligned dio under
shared locking for pure overwrites. In doing so, it enabled use of
the IOMAP_DIO_OVERWRITE_ONLY flag and added a warning on unexpected
-EAGAIN returns as an extra precaution, since ext4 does not retry
writes in such cases. The flag itself is advisory in this case since
ext4 checks for unaligned I/Os and uses appropriate locking up
front, rather than on a retry in response to -EAGAIN.
As it turns out, the warning check is susceptible to false positives
because there are scenarios where -EAGAIN can be expected from lower
layers without necessarily having IOCB_NOWAIT set on the iocb. For
example, one instance of the warning has been seen where io_uring
sets IOCB_HIPRI, which in turn results in REQ_POLLED|REQ_NOWAIT on
the bio. This results in -EAGAIN if the block layer is unable to
allocate a request, etc. [Note that there is an outstanding patch to
untangle REQ_POLLED and REQ_NOWAIT such that the latter relies on
IOCB_NOWAIT, which would also address this instance of the warning.]
Another instance of the warning has been reproduced by syzbot. A dio
write is interrupted down in __get_user_pages_locked() waiting on
the mm lock and returns -EAGAIN up the stack. If the iomap dio
iteration layer has made no progress on the write to this point,
-EAGAIN returns up to the filesystem and triggers the warning.
This use of the overwrite flag in ext4 is precautionary and
half-baked. I.e., ext4 doesn't actually implement overwrite checking
in the iomap callbacks when the flag is set, so the only extra
verification it provides are i_size checks in the generic iomap dio
layer. Combined with the tendency for false positives, the added
verification is not worth the extra trouble. Remove the flag,
associated warning, and update the comments to document when
concurrent unaligned dio writes are allowed and why said flag is not
used.
Cc: stable@kernel.org
Reported-by: syzbot+5050ad0fb47527b1808a@syzkaller.appspotmail.com
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Fixes: 310ee0902b8d ("ext4: allow concurrent unaligned dio overwrites")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230810165559.946222-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-08-10 19:55:59 +03:00
*
* Note that unaligned writes are allowed under shared lock so long as
* they are pure overwrites. Otherwise, concurrent unaligned writes risk
* data corruption due to partial block zeroing in the dio layer, and so
* the I/O must occur exclusively.
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
2019-12-12 08:55:56 +03:00
*/
2023-03-14 16:07:59 +03:00
if (*ilock_shared &&
((!IS_NOSEC(inode) || *extend || !overwrite ||
(unaligned_io && *unwritten)))) {
2020-07-08 18:35:16 +03:00
if (iocb->ki_flags & IOCB_NOWAIT) {
ret = -EAGAIN;
goto out;
}
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
2019-12-12 08:55:56 +03:00
inode_unlock_shared(inode);
*ilock_shared = false;
inode_lock(inode);
goto restart;
}
2023-03-14 16:07:59 +03:00
/*
* Now that locking is settled, determine dio flags and exclusivity
ext4: drop dio overwrite only flag and associated warning
2023-08-10 19:55:59 +03:00
* requirements. We don't use DIO_OVERWRITE_ONLY because we enforce
* behavior already. The inode lock is already held exclusive if the
* write is non-overwrite or extending, so drain all outstanding dio and
* set the force wait dio flag.
2023-03-14 16:07:59 +03:00
*/
ext4: drop dio overwrite only flag and associated warning
2023-08-10 19:55:59 +03:00
if (!*ilock_shared && (unaligned_io || *extend)) {
2023-03-14 16:07:59 +03:00
if (iocb->ki_flags & IOCB_NOWAIT) {
ret = -EAGAIN;
goto out;
}
if (unaligned_io && (!overwrite || *unwritten))
inode_dio_wait(inode);
*dio_flags = IOMAP_DIO_FORCE_WAIT;
}
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
2019-12-12 08:55:56 +03:00
ret = file_modified(file);
if (ret < 0)
goto out;
return count;
out:
if (*ilock_shared)
inode_unlock_shared(inode);
else
inode_unlock(inode);
return ret;
}
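The two lock decisions made in ext4_dio_write_checks() above can be modelled in user space as plain boolean predicates. This is only a sketch for reasoning about the conditions, not kernel code: `struct dio_write_state` and the helpers `needs_exclusive_lock()`, `needs_force_wait()` and `needs_dio_drain()` are hypothetical names introduced here, and the `nosec` field merely stands in for the result of IS_NOSEC(inode).

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model of the values that
 * ext4_dio_write_checks() computes before deciding on locking.
 * None of these names exist in the kernel.
 */
struct dio_write_state {
	bool nosec;        /* stands in for IS_NOSEC(inode) */
	bool extend;       /* the write extends the file */
	bool overwrite;    /* the range is a pure overwrite */
	bool unaligned_io; /* the write is not block aligned */
	bool unwritten;    /* the range touches unwritten extents */
};

/*
 * Mirrors the first condition: a writer holding the shared lock must
 * drop it, take the exclusive lock and restart the checks.
 */
bool needs_exclusive_lock(const struct dio_write_state *s)
{
	return !s->nosec || s->extend || !s->overwrite ||
	       (s->unaligned_io && s->unwritten);
}

/*
 * Mirrors the second condition: with the exclusive lock held, unaligned
 * or extending writes set the force-wait dio flag.
 */
bool needs_force_wait(bool ilock_shared, const struct dio_write_state *s)
{
	return !ilock_shared && (s->unaligned_io || s->extend);
}

/*
 * Mirrors the inner check: outstanding dio is drained only for
 * unaligned writes that are not pure overwrites of written blocks.
 */
bool needs_dio_drain(bool ilock_shared, const struct dio_write_state *s)
{
	return needs_force_wait(ilock_shared, s) &&
	       s->unaligned_io && (!s->overwrite || s->unwritten);
}
```

In this model, an unaligned pure overwrite of already written blocks on a NOSEC inode stays under the shared lock, while the same write to unwritten extents requires the exclusive lock, a drain of outstanding dio, and the force-wait flag.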
2019-11-05 15:02:39 +03:00
static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
ssize_t ret;
handle_t *handle;
struct inode *inode = file_inode(iocb->ki_filp);
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
2019-12-12 08:55:56 +03:00
loff_t offset = iocb->ki_pos;
size_t count = iov_iter_count(from);
2019-12-18 20:44:33 +03:00
const struct iomap_ops *iomap_ops = &ext4_iomap_ops;
2023-03-14 16:07:59 +03:00
bool extend = false, unwritten = false;
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
2019-12-12 08:55:56 +03:00
bool ilock_shared = true;
2023-03-14 16:07:59 +03:00
int dio_flags = 0;
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
2019-12-12 08:55:56 +03:00
/*
* Quick check here without any i_rwsem lock to see if it is extending
* IO. A more reliable check is done in ext4_dio_write_checks() with
* proper locking in place.
*/
if (offset + count > i_size_read(inode))
ilock_shared = false;
2019-11-05 15:02:39 +03:00
if (iocb->ki_flags & IOCB_NOWAIT) {
ext4: Start with shared i_rwsem in case of DIO instead of exclusive
Earlier there was no shared lock in DIO read path. But this patch
(16c54688592ce: ext4: Allow parallel DIO reads)
simplified some of the locking mechanism while still allowing for parallel DIO
reads by adding shared lock in inode DIO read path.
But this created problem with mixed read/write workload. It is due to the fact
that in DIO path, we first start with exclusive lock and only when we determine
that it is an overwrite IO, we downgrade the lock. This causes the problem, since
we still have shared locking in DIO reads.
So, this patch tries to fix this issue by starting with shared lock and then
switching to exclusive lock only when required based on ext4_dio_write_checks().
Other than that, it also simplifies below cases:-
1. Simplified ext4_unaligned_aio API to ext4_unaligned_io. Previous API was
abused in the sense that it was not really checking for AIO anywhere; it also
used to check for extending writes. So this API was renamed and simplified to
ext4_unaligned_io() which actually only checks if the IO is really unaligned.
Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of partial
block and that will require serialization against other direct IOs in the same
block. So we take an exclusive inode lock for any unaligned DIO. In case of AIO
we also need to wait for any outstanding IOs to complete so that conversion from
unwritten to written is completed before anyone tries to map the overlapping block.
Hence we take exclusive inode lock and also wait for inode_dio_wait() for
unaligned DIO case. Please note since we are anyway taking an exclusive lock in
unaligned IO, inode_dio_wait() becomes a no-op in case of non-AIO DIO.
2. Added ext4_extending_io(). This checks if the IO is extending the file.
3. Added ext4_dio_write_checks(). In this we start with shared inode lock and
only switch to exclusive lock if required. So in most cases with aligned,
non-extending, dioread_nolock & overwrites, it tries to write with a shared
lock. If not, then we restart the operation in ext4_dio_write_checks(), after
acquiring exclusive lock.
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-3-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-12 08:55:56 +03:00
		if (ilock_shared) {
			if (!inode_trylock_shared(inode))
				return -EAGAIN;
		} else {
			if (!inode_trylock(inode))
				return -EAGAIN;
		}
2019-11-05 15:02:39 +03:00
	} else {
2019-12-12 08:55:56 +03:00
		if (ilock_shared)
			inode_lock_shared(inode);
		else
			inode_lock(inode);
2019-11-05 15:02:39 +03:00
	}
2019-12-12 08:55:56 +03:00
	/* Fallback to buffered I/O if the inode does not support direct I/O. */
2022-08-27 09:58:47 +03:00
	if (!ext4_should_use_dio(iocb, from)) {
2019-12-12 08:55:56 +03:00
		if (ilock_shared)
			inode_unlock_shared(inode);
		else
			inode_unlock(inode);
2019-11-05 15:02:39 +03:00
		return ext4_buffered_write_iter(iocb, from);
	}
2023-10-02 21:50:20 +03:00
	/*
	 * Prevent inline data from being created since we are going to allocate
	 * blocks for DIO. We know the inode does not currently have inline data
	 * because ext4_should_use_dio() checked for it, but we have to clear
	 * the state flag before the write checks because a lock cycle could
	 * introduce races with other writers.
	 */
	ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA);
2023-03-14 16:07:59 +03:00
	ret = ext4_dio_write_checks(iocb, from, &ilock_shared, &extend,
				    &unwritten, &dio_flags);
2019-12-12 08:55:56 +03:00
	if (ret <= 0)
2019-11-05 15:02:39 +03:00
		return ret;
	offset = iocb->ki_pos;
2019-12-12 08:55:56 +03:00
	count = ret;
2019-11-05 15:02:39 +03:00
2019-12-12 08:55:56 +03:00
	if (extend) {
2019-11-05 15:02:39 +03:00
		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
		if (IS_ERR(handle)) {
			ret = PTR_ERR(handle);
			goto out;
		}
		ret = ext4_orphan_add(handle, inode);
		if (ret) {
			ext4_journal_stop(handle);
			goto out;
		}
		ext4_journal_stop(handle);
	}
ext4: dio take shared inode lock when overwriting preallocated blocks
In the dio write path, we only take shared inode lock for the case of
aligned overwriting initialized blocks inside EOF. But overwriting
preallocated blocks may only need to split unwritten extents. Since this
procedure is already protected by the i_data_sem lock, it is safe to
release the exclusive inode lock and take the shared inode lock.
This could give a significant speed up for multi-threaded writes. Test
on Intel Xeon Gold 6140 and nvme SSD with below fio parameters.
direct=1
ioengine=libaio
iodepth=10
numjobs=10
runtime=60
rw=randwrite
size=100G
And the test results are:
Before:
bs=4k IOPS=11.1k, BW=43.2MiB/s
bs=16k IOPS=11.1k, BW=173MiB/s
bs=64k IOPS=11.2k, BW=697MiB/s
After:
bs=4k IOPS=41.4k, BW=162MiB/s
bs=16k IOPS=41.3k, BW=646MiB/s
bs=64k IOPS=13.5k, BW=843MiB/s
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221226062015.3479416-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-12-26 09:20:15 +03:00
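The IOPS figures quoted in the commit message above work out to a speedup of roughly 3.7x at bs=4k and bs=16k and roughly 1.2x at bs=64k. A small sketch of that arithmetic follows; the struct and function names are hypothetical, and the numbers are copied from the message (in thousands of IOPS).

```c
#include <stddef.h>

/* One row of the fio results quoted above, in thousands of IOPS. */
struct demo_fio_result {
	const char *bs;
	double before_kiops;
	double after_kiops;
};

static const struct demo_fio_result demo_results[] = {
	{ "4k",  11.1, 41.4 },
	{ "16k", 11.1, 41.3 },
	{ "64k", 11.2, 13.5 },
};

static const size_t demo_result_count =
	sizeof(demo_results) / sizeof(demo_results[0]);

/* Ratio of IOPS after the patch to IOPS before it. */
static double demo_speedup(const struct demo_fio_result *r)
{
	return r->after_kiops / r->before_kiops;
}
```

The much smaller gain at bs=64k is consistent with the bandwidth numbers in the message, where the "After" run at 64k is already at 843MiB/s; the message itself does not state what limits that case.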
	if (ilock_shared && !unwritten)
2019-12-18 20:44:33 +03:00
		iomap_ops = &ext4_iomap_overwrite_ops;
	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
2023-03-14 16:07:59 +03:00
			   dio_flags, NULL, 0);
2020-07-24 08:45:59 +03:00
	if (ret == -ENOTBLK)
		ret = 0;
2023-10-13 15:13:50 +03:00
	if (extend) {
		/*
		 * We always perform extending DIO write synchronously so by
		 * now the IO is completed and ext4_handle_inode_extension()
		 * was called. Cleanup the inode in case of error or race with
		 * writeback of delalloc blocks.
		 */
		WARN_ON_ONCE(ret == -EIOCBQUEUED);
		ext4_inode_extension_cleanup(inode, ret);
	}
2019-11-05 15:02:39 +03:00
out:
2019-12-12 08:55:56 +03:00
	if (ilock_shared)
2019-11-05 15:02:39 +03:00
		inode_unlock_shared(inode);
	else
		inode_unlock(inode);
	if (ret >= 0 && iov_iter_count(from)) {
		ssize_t err;
		loff_t endbyte;
		offset = iocb->ki_pos;
		err = ext4_buffered_write_iter(iocb, from);
		if (err < 0)
			return err;
		/*
		 * We need to ensure that the pages within the page cache for
		 * the range covered by this I/O are written to disk and
		 * invalidated. This is in attempt to preserve the expected
		 * direct I/O semantics in the case we fallback to buffered I/O
		 * to complete off the I/O request.
		 */
		ret += err;
		endbyte = offset + err - 1;
		err = filemap_write_and_wait_range(iocb->ki_filp->f_mapping,
						   offset, endbyte);
		if (!err)
			invalidate_mapping_pages(iocb->ki_filp->f_mapping,
						 offset >> PAGE_SHIFT,
						 endbyte >> PAGE_SHIFT);
	}
	return ret;
}
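The tail of the function above turns the byte range written by the buffered fallback into an inclusive range of page indexes before invalidating it. A minimal sketch of that arithmetic, assuming 4 KiB pages and a positive byte count; `DEMO_PAGE_SHIFT`, `struct demo_page_range` and `demo_fallback_range()` are hypothetical names, not kernel symbols:

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Inclusive range of page indexes touched by a write. */
struct demo_page_range {
	uint64_t first;
	uint64_t last;
};

/*
 * Map "written" bytes starting at "offset" to page indexes.
 * The caller must pass written > 0 so that the last byte is well defined.
 */
static struct demo_page_range demo_fallback_range(uint64_t offset,
						  uint64_t written)
{
	uint64_t endbyte = offset + written - 1; /* last byte, inclusive */
	struct demo_page_range r = {
		.first = offset >> DEMO_PAGE_SHIFT,
		.last = endbyte >> DEMO_PAGE_SHIFT,
	};

	return r;
}
```

For example, 200 bytes written at offset 4000 end at byte 4199 and therefore touch pages 0 and 1, while a page-aligned 4096-byte write at offset 4096 touches only page 1.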
2016-11-21 02:09:11 +03:00
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	ssize_t ret;
2019-11-05 15:01:51 +03:00
	size_t count;
	loff_t offset;
2019-11-05 15:02:08 +03:00
	handle_t *handle;
	bool extend = false;
2019-11-05 15:01:51 +03:00
	struct inode *inode = file_inode(iocb->ki_filp);
2016-11-21 02:09:11 +03:00
2019-12-12 08:55:55 +03:00
	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock(inode))
2017-06-20 15:05:47 +03:00
			return -EAGAIN;
2019-12-12 08:55:55 +03:00
	} else {
2017-06-20 15:05:47 +03:00
		inode_lock(inode);
	}
2019-11-05 15:02:39 +03:00
2016-11-21 02:09:11 +03:00
	ret = ext4_write_checks(iocb, from);
	if (ret <= 0)
		goto out;
2019-11-05 15:01:51 +03:00
	offset = iocb->ki_pos;
	count = iov_iter_count(from);
2019-11-05 15:02:08 +03:00
	if (offset + count > EXT4_I(inode)->i_disksize) {
		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
		if (IS_ERR(handle)) {
			ret = PTR_ERR(handle);
			goto out;
		}
		ret = ext4_orphan_add(handle, inode);
		if (ret) {
			ext4_journal_stop(handle);
			goto out;
		}
		extend = true;
		ext4_journal_stop(handle);
	}
2016-11-21 02:09:11 +03:00
	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
2019-11-05 15:02:08 +03:00
2023-10-13 15:13:50 +03:00
	if (extend) {
		ret = ext4_handle_inode_extension(inode, offset, ret);
		ext4_inode_extension_cleanup(inode, ret);
	}
2016-11-21 02:09:11 +03:00
out:
2017-02-08 22:39:27 +03:00
	inode_unlock(inode);
2016-11-21 02:09:11 +03:00
	if (ret > 0)
		ret = generic_write_sync(iocb, ret);
	return ret;
}
#endif
2006-10-11 12:20:50 +04:00
static ssize_t
2014-04-18 00:09:22 +04:00
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
2006-10-11 12:20:50 +04:00
{
2014-04-21 22:26:57 +04:00
	struct inode *inode = file_inode(iocb->ki_filp);
2014-04-21 22:26:28 +04:00
2023-06-16 19:50:49 +03:00
	if (unlikely(ext4_forced_shutdown(inode->i_sb)))
2017-02-05 09:28:48 +03:00
		return -EIO;
2016-11-21 02:09:11 +03:00
#ifdef CONFIG_FS_DAX
	if (IS_DAX(inode))
		return ext4_dax_write_iter(iocb, from);
#endif
2019-11-05 15:02:39 +03:00
	if (iocb->ki_flags & IOCB_DIRECT)
		return ext4_dio_write_iter(iocb, from);
2020-10-15 23:37:57 +03:00
	else
		return ext4_buffered_write_iter(iocb, from);
2006-10-11 12:20:50 +04:00
}
2015-02-17 02:59:38 +03:00
#ifdef CONFIG_FS_DAX
2023-08-18 23:23:35 +03:00
static vm_fault_t ext4_dax_huge_fault(struct vm_fault *vmf, unsigned int order)
2015-02-17 02:59:38 +03:00
{
2018-05-13 23:01:49 +03:00
	int error = 0;
	vm_fault_t result;
2018-01-08 00:41:01 +03:00
	int retries = 0;
2017-05-13 01:46:54 +03:00
	handle_t *handle = NULL;
2017-02-25 01:56:41 +03:00
	struct inode *inode = file_inode(vmf->vma->vm_file);
2015-12-07 22:28:03 +03:00
	struct super_block *sb = inode->i_sb;
2017-08-24 22:26:01 +03:00
	/*
	 * We have to distinguish real writes from writes which will result in a
	 * COW page; COW writes should *not* poke the journal (the file will not
	 * be changed). Doing so would cause unintended failures when mounted
	 * read-only.
	 *
	 * We check for VM_SHARED rather than vmf->cow_page since the latter is
2023-08-18 23:23:35 +03:00
	 * unset for order != 0 (i.e. only in do_cow_fault); for
2017-08-24 22:26:01 +03:00
	 * other sizes, dax_iomap_fault will handle splitting / fallback so that
	 * we eventually come back with a COW page.
	 */
	bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
		(vmf->vma->vm_flags & VM_SHARED);
2021-02-04 20:05:42 +03:00
	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
2017-11-01 18:36:45 +03:00
	pfn_t pfn;
2015-09-09 00:59:22 +03:00
	if (write) {
		sb_start_pagefault(sb);
2017-02-25 01:56:41 +03:00
		file_update_time(vmf->vma->vm_file);
2021-02-04 20:05:42 +03:00
		filemap_invalidate_lock_shared(mapping);
2018-01-08 00:41:01 +03:00
retry:
2017-05-13 01:46:54 +03:00
		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
					       EXT4_DATA_TRANS_BLOCKS(sb));
2017-11-01 18:36:44 +03:00
		if (IS_ERR(handle)) {
2021-02-04 20:05:42 +03:00
			filemap_invalidate_unlock_shared(mapping);
2017-11-01 18:36:44 +03:00
			sb_end_pagefault(sb);
			return VM_FAULT_SIGBUS;
		}
2017-05-13 01:46:54 +03:00
	} else {
2021-02-04 20:05:42 +03:00
		filemap_invalidate_lock_shared(mapping);
2016-10-21 12:33:49 +03:00
	}
2023-08-18 23:23:35 +03:00
	result = dax_iomap_fault(vmf, order, &pfn, &error, &ext4_iomap_ops);
2017-05-13 01:46:54 +03:00
	if (write) {
2017-11-01 18:36:44 +03:00
		ext4_journal_stop(handle);
2018-01-08 00:41:01 +03:00
		if ((result & VM_FAULT_ERROR) && error == -ENOSPC &&
		    ext4_should_retry_alloc(sb, &retries))
			goto retry;
2017-11-01 18:36:45 +03:00
		/* Handling synchronous page fault? */
		if (result & VM_FAULT_NEEDDSYNC)
2023-08-18 23:23:35 +03:00
			result = dax_finish_sync_fault(vmf, order, pfn);
2021-02-04 20:05:42 +03:00
		filemap_invalidate_unlock_shared(mapping);
2015-09-09 00:59:22 +03:00
		sb_end_pagefault(sb);
2017-05-13 01:46:54 +03:00
	} else {
2021-02-04 20:05:42 +03:00
		filemap_invalidate_unlock_shared(mapping);
2017-05-13 01:46:54 +03:00
	}
2015-09-09 00:59:22 +03:00
	return result;
2015-02-17 02:59:38 +03:00
}
2018-05-13 23:01:49 +03:00
static vm_fault_t ext4_dax_fault(struct vm_fault *vmf)
2017-02-25 01:57:08 +03:00
{
2023-08-18 23:23:35 +03:00
	return ext4_dax_huge_fault(vmf, 0);
2017-02-25 01:57:08 +03:00
}
2015-02-17 02:59:38 +03:00
static const struct vm_operations_struct ext4_dax_vm_ops = {
	.fault = ext4_dax_fault,
2017-02-25 01:57:08 +03:00
	.huge_fault = ext4_dax_huge_fault,
2016-02-27 22:01:13 +03:00
	.page_mkwrite = ext4_dax_fault,
dax: use common 4k zero page for dax mmap reads
When servicing mmap() reads from file holes the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree.
This has three major drawbacks:
1) It consumes memory unnecessarily. For every 4k page that is read via
a DAX mmap() over a hole, we allocate a new page cache page. This
means that if you read 1GiB worth of pages, you end up using 1GiB of
zeroed memory. This is easily visible by looking at the overall
memory consumption of the system or by looking at /proc/[pid]/smaps:
7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12 /root/dax/data
Size: 1048576 kB
Rss: 1048576 kB
Pss: 1048576 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 1048576 kB
Private_Dirty: 0 kB
Referenced: 1048576 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
2) It is slower than using a common zero page because each page fault
has more work to do. Instead of just inserting a common zero page we
have to allocate a page cache page, zero it, and then insert it. Here
are the average latencies of dax_load_hole() as measured by ftrace on
a random test box:
Old method, using zeroed page cache pages: 3.4 us
New method, using the common 4k zero page: 0.8 us
This was the average latency over 1 GiB of sequential reads done by
this simple fio script:
[global]
size=1G
filename=/root/dax/data
fallocate=none
[io]
rw=read
ioengine=mmap
3) The fact that we had to check for both DAX exceptional entries and
for page cache pages in the radix tree made the DAX code more
complex.
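The numbers in drawbacks 1) and 2) above can be checked with a little arithmetic: 1 GiB read in 4k pages is 262144 pages, which at one zeroed page cache page each gives the 1048576 kB of Rss shown in the smaps output. The sketch below also scales the two quoted latencies by the page count, under the assumption (not stated in the message) that dax_load_hole() runs once per 4k page; on that assumption the totals come to roughly 891 ms for the old method and roughly 210 ms for the new one. All names in it are hypothetical.

```c
#include <stdint.h>

#define DEMO_REGION_BYTES (1ULL << 30) /* 1 GiB read over a hole */
#define DEMO_PAGE_BYTES   4096ULL      /* 4k pages, as in the message */

/* Number of 4k pages covering the 1 GiB region. */
static uint64_t demo_hole_pages(void)
{
	return DEMO_REGION_BYTES / DEMO_PAGE_BYTES;
}

/* Memory consumed if every page over the hole gets its own zeroed page, in kB. */
static uint64_t demo_zeroed_kb(void)
{
	return demo_hole_pages() * DEMO_PAGE_BYTES / 1024;
}

/*
 * Total time in milliseconds spent at the given average per-call latency,
 * assuming one call per 4k page (an assumption made for this sketch).
 */
static double demo_total_ms(double per_call_us)
{
	return (double)demo_hole_pages() * per_call_us / 1000.0;
}
```

Calling demo_total_ms(3.4) and demo_total_ms(0.8) gives the two totals mentioned above; the ratio between them is simply 3.4 / 0.8, independent of the page count.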
Solve these issues by following the lead of the DAX PMD code and using a
common 4k zero page instead. As with the PMD code we will now insert a
DAX exceptional entry into the radix tree instead of a struct page
pointer which allows us to remove all the special casing in the DAX
code.
Note that we do still pretty aggressively check for regular pages in the
DAX radix tree, especially where we take action based on the bits set in
the page. If we ever find a regular page in our radix tree now that
most likely means that someone besides DAX is inserting pages (which has
happened lots of times in the past), and we want to find that out early
and fail loudly.
This solution also removes the extra memory consumption. Here is that
same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
code:
7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12 /root/dax/data
Size: 1048576 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
Overall system memory consumption is similarly improved.
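A check like the one above can be scripted. The following is a minimal sketch, not part of the patch, of a helper that pulls one field out of smaps-style text so that a test can compare Rss before and after the change. The helper name smaps_field_kb() is hypothetical, and it only inspects text passed in by the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the value in kB of field `key` (for example "Rss") found in
 * smaps-style `text`, or -1 if the field is missing or malformed.
 * The key must match at the start of a line, so "Pss" does not match
 * the "SwapPss" line.
 */
long smaps_field_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
			long kb;

			if (sscanf(line + klen + 1, "%ld kB", &kb) == 1)
				return kb;
			return -1;
		}
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A caller would read the relevant mapping's block from /proc/[pid]/smaps into a buffer and pass it in with "Rss" or "Pss" as the key.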
Another major change is that we remove dax_pfn_mkwrite() from our fault
flow, and instead rely on the page fault itself to make the PTE dirty
and writeable. The following description from the patch adding the
vm_insert_mixed_mkwrite() call explains this a little more:
"To be able to use the common 4k zero page in DAX we need to have our
PTE fault path look more like our PMD fault path where a PTE entry
can be marked as dirty and writeable as it is first inserted rather
than waiting for a follow-up dax_pfn_mkwrite() =>
finish_mkwrite_fault() call.
Right now we can rely on having a dax_pfn_mkwrite() call because we
can distinguish between these two cases in do_wp_page():
case 1: 4k zero page => writable DAX storage
case 2: read-only DAX storage => writeable DAX storage
This distinction is made via vm_normal_page(). vm_normal_page()
returns false for the common 4k zero page, though, just as it does
for DAX ptes. Instead of special casing the DAX + 4k zero page case
we will simplify our DAX PTE page fault sequence so that it matches
our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
We will instead use dax_iomap_fault() to handle write-protection
faults.
This means that insert_pfn() needs to follow the lead of
insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
'mkwrite' is set insert_pfn() will do the work that was previously
done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
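The effect of the 'mkwrite' flag described in the quote can be illustrated with a toy model. This is a userspace sketch only: struct toy_pte and toy_insert_pfn() are hypothetical names, they are not the kernel's pte_t or insert_pfn(), and the handling of a mismatched pfn is an arbitrary choice made for this toy.

```c
#include <stdbool.h>

/* Toy stand-in for a page table entry; not the kernel's pte_t. */
struct toy_pte {
	bool present;
	bool write;
	bool dirty;
	unsigned long pfn;
};

/*
 * Toy model of an insert helper that takes a 'mkwrite' flag.
 * With mkwrite set, the entry is made writeable and dirty at insert
 * time instead of waiting for a separate follow-up call.
 * Returns 0 on success, -1 if a different pfn is already present
 * (a simplification chosen for this sketch).
 */
int toy_insert_pfn(struct toy_pte *pte, unsigned long pfn, bool mkwrite)
{
	if (pte->present) {
		if (pte->pfn != pfn)
			return -1;
		if (mkwrite) {
			/* Upgrade a read-only entry in place. */
			pte->write = true;
			pte->dirty = true;
		}
		return 0;
	}

	pte->present = true;
	pte->pfn = pfn;
	pte->write = mkwrite;
	pte->dirty = mkwrite;
	return 0;
}
```

In this model a read fault calls toy_insert_pfn() with mkwrite false, and a later write-protection fault on the same pfn calls it again with mkwrite true, which is the single-path behaviour the quoted text is aiming for.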
Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07 02:18:43 +03:00
. pfn_mkwrite = ext4_dax_fault ,
2015-02-17 02:59:38 +03:00
} ;
# else
# define ext4_dax_vm_ops ext4_file_vm_ops
# endif
2009-09-27 22:29:37 +04:00
static const struct vm_operations_struct ext4_file_vm_ops = {
2021-02-04 20:05:42 +03:00
. fault = filemap_fault ,
2014-04-08 02:37:19 +04:00
. map_pages = filemap_map_pages ,
2008-07-12 03:27:31 +04:00
. page_mkwrite = ext4_page_mkwrite ,
} ;
static int ext4_file_mmap ( struct file * file , struct vm_area_struct * vma )
{
2015-04-12 07:56:10 +03:00
struct inode * inode = file->f_mapping->host ;
2023-06-16 19:50:49 +03:00
struct dax_device * dax_dev = EXT4_SB ( inode->i_sb )->s_daxdev ;
2015-04-12 07:56:10 +03:00
2023-06-16 19:50:49 +03:00
if ( unlikely ( ext4_forced_shutdown ( inode->i_sb ) ) )
2017-02-05 09:28:48 +03:00
return - EIO ;
2017-11-01 18:36:45 +03:00
/*
2019-07-05 17:03:27 +03:00
* We don't support synchronous mappings for non-DAX files and
* for DAX files if underneath dax_device is not synchronous.
2017-11-01 18:36:45 +03:00
*/
2019-07-05 17:03:27 +03:00
if ( ! daxdev_mapping_supported ( vma , dax_dev ) )
2017-11-01 18:36:45 +03:00
return - EOPNOTSUPP ;
2008-07-12 03:27:31 +04:00
file_accessed ( file ) ;
2015-02-17 02:59:38 +03:00
if ( IS_DAX ( file_inode ( file ) ) ) {
vma->vm_ops = & ext4_dax_vm_ops ;
2023-01-26 22:37:49 +03:00
vm_flags_set ( vma , VM_HUGEPAGE ) ;
2015-02-17 02:59:38 +03:00
} else {
vma->vm_ops = & ext4_file_vm_ops ;
}
2008-07-12 03:27:31 +04:00
return 0 ;
}
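From userspace, the -EOPNOTSUPP path in ext4_file_mmap() is what a caller sees when it asks for a synchronous mapping that the file or device cannot provide. The sketch below shows one way an application might request MAP_SYNC and fall back to a plain shared mapping. The helper name map_sync_or_fallback() is hypothetical, and the fallback #define values are the generic Linux ones, supplied only as an assumption in case the libc headers do not expose them.

```c
#include <stddef.h>
#include <sys/mman.h>

/* Assumed generic Linux values, used only if the headers lack them. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Try a MAP_SYNC mapping first; if the kernel refuses (for example
 * with EOPNOTSUPP on a non-DAX file), fall back to MAP_SHARED.
 * Sets *is_sync to 1 or 0 accordingly. Returns NULL if both fail.
 */
void *map_sync_or_fallback(int fd, size_t len, int *is_sync)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

	if (p != MAP_FAILED) {
		*is_sync = 1;
		return p;
	}

	*is_sync = 0;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

MAP_SYNC has to be combined with MAP_SHARED_VALIDATE so that kernels and filesystems that do not know the flag reject it instead of silently ignoring it.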
2018-05-14 05:44:23 +03:00
static int ext4_sample_last_mounted ( struct super_block * sb ,
struct vfsmount * mnt )
2009-06-13 18:09:48 +04:00
{
2018-05-14 05:44:23 +03:00
struct ext4_sb_info * sbi = EXT4_SB ( sb ) ;
2009-06-13 18:09:48 +04:00
struct path path ;
char buf [ 64 ] , * cp ;
2018-05-14 05:44:23 +03:00
handle_t * handle ;
int err ;
2020-11-06 06:59:09 +03:00
if ( likely ( ext4_test_mount_flag ( sb , EXT4_MF_MNTDIR_SAMPLED ) ) )
2018-05-14 05:44:23 +03:00
return 0 ;
2018-05-14 05:54:44 +03:00
if ( sb_rdonly ( sb ) || ! sb_start_intwrite_trylock ( sb ) )
2018-05-14 05:44:23 +03:00
return 0 ;
2020-11-06 06:59:09 +03:00
ext4_set_mount_flag ( sb , EXT4_MF_MNTDIR_SAMPLED ) ;
2018-05-14 05:44:23 +03:00
/*
* Sample where the filesystem has been mounted and
* store it in the superblock for sysadmin convenience
* when trying to sort through large numbers of block
* devices or filesystem images.
*/
memset ( buf , 0 , sizeof ( buf ) ) ;
path . mnt = mnt ;
path . dentry = mnt->mnt_root ;
cp = d_path ( & path , buf , sizeof ( buf ) ) ;
2018-05-14 05:54:44 +03:00
err = 0 ;
2018-05-14 05:44:23 +03:00
if ( IS_ERR ( cp ) )
2018-05-14 05:54:44 +03:00
goto out ;
2018-05-14 05:44:23 +03:00
handle = ext4_journal_start_sb ( sb , EXT4_HT_MISC , 1 ) ;
2018-05-14 05:54:44 +03:00
err = PTR_ERR ( handle ) ;
2018-05-14 05:44:23 +03:00
if ( IS_ERR ( handle ) )
2018-05-14 05:54:44 +03:00
goto out ;
2018-05-14 05:44:23 +03:00
BUFFER_TRACE ( sbi->s_sbh , "get_write_access" ) ;
2021-08-16 12:57:04 +03:00
err = ext4_journal_get_write_access ( handle , sb , sbi->s_sbh ,
EXT4_JTR_NONE ) ;
2018-05-14 05:44:23 +03:00
if ( err )
2018-05-14 05:54:44 +03:00
goto out_journal ;
2020-12-16 13:18:39 +03:00
lock_buffer ( sbi->s_sbh ) ;
2020-12-17 21:24:15 +03:00
strncpy ( sbi->s_es->s_last_mounted , cp ,
2018-05-14 05:44:23 +03:00
sizeof ( sbi->s_es->s_last_mounted ) ) ;
2020-12-16 13:18:39 +03:00
ext4_superblock_csum_set ( sb ) ;
unlock_buffer ( sbi->s_sbh ) ;
2020-12-16 13:18:44 +03:00
ext4_handle_dirty_metadata ( handle , NULL , sbi->s_sbh ) ;
2018-05-14 05:54:44 +03:00
out_journal :
2018-05-14 05:44:23 +03:00
ext4_journal_stop ( handle ) ;
2018-05-14 05:54:44 +03:00
out :
sb_end_intwrite ( sb ) ;
2018-05-14 05:44:23 +03:00
return err ;
}
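The strncpy() into s_last_mounted above writes into a fixed-width field, so it is worth recalling what strncpy() does at the boundaries. The sketch below uses a hypothetical 8-byte field (struct toy_sb is invented for this example and far smaller than the real superblock field) to show that short strings are zero-padded and long strings are truncated with no terminating NUL.

```c
#include <string.h>

/* Hypothetical stand-in for a fixed-width on-disk text field. */
struct toy_sb {
	char last_mounted[8];
};

/* Copy `path` into the field: zero-pads if short, truncates if long. */
void toy_set_last_mounted(struct toy_sb *sb, const char *path)
{
	strncpy(sb->last_mounted, path, sizeof(sb->last_mounted));
}

/*
 * Length of the stored text, bounded by the field width because the
 * field is not guaranteed to contain a NUL terminator.
 */
size_t toy_last_mounted_len(const struct toy_sb *sb)
{
	const char *nul = memchr(sb->last_mounted, '\0',
				 sizeof(sb->last_mounted));

	return nul ? (size_t)(nul - sb->last_mounted)
		   : sizeof(sb->last_mounted);
}
```

Any reader of such a field has to bound its scan by the field width, as toy_last_mounted_len() does, and must not treat the field as a C string.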
2020-06-14 07:45:44 +03:00
static int ext4_file_open ( struct inode * inode , struct file * filp )
2018-05-14 05:44:23 +03:00
{
2015-04-12 07:56:10 +03:00
int ret ;
2009-06-13 18:09:48 +04:00
2023-06-16 19:50:49 +03:00
if ( unlikely ( ext4_forced_shutdown ( inode->i_sb ) ) )
2017-02-05 09:28:48 +03:00
return - EIO ;
2018-05-14 05:44:23 +03:00
ret = ext4_sample_last_mounted ( inode->i_sb , filp->f_path . mnt ) ;
if ( ret )
return ret ;
2016-03-26 23:14:41 +03:00
2017-10-19 03:21:57 +03:00
ret = fscrypt_file_open ( inode , filp ) ;
if ( ret )
2019-07-22 19:26:24 +03:00
return ret ;
ret = fsverity_file_open ( inode , filp ) ;
if ( ret )
2017-10-19 03:21:57 +03:00
return ret ;
2011-01-10 20:29:43 +03:00
/*
* Set up the jbd2_inode if we are opening the inode for
* writing and the journal is present
*/
2013-08-17 05:19:41 +04:00
if ( filp->f_mode & FMODE_WRITE ) {
2015-04-12 07:56:10 +03:00
ret = ext4_inode_attach_jinode ( inode ) ;
2013-08-17 05:19:41 +04:00
if ( ret < 0 )
return ret ;
2011-01-10 20:29:43 +03:00
}
2017-06-20 15:05:47 +03:00
2023-03-07 19:40:28 +03:00
filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC |
FMODE_DIO_PARALLEL_WRITE ;
2015-05-31 20:35:39 +03:00
return dquot_file_open ( inode , filp ) ;
2009-06-13 18:09:48 +04:00
}
2010-10-28 05:30:06 +04:00
/*
2012-04-30 22:14:03 +04:00
* ext4_llseek() handles both block-mapped and extent-mapped maxbytes values
* by calling generic_file_llseek_size() with the appropriate maxbytes
* value for each.
2010-10-28 05:30:06 +04:00
*/
2012-12-18 03:59:39 +04:00
loff_t ext4_llseek ( struct file * file , loff_t offset , int whence )
2010-10-28 05:30:06 +04:00
{
struct inode * inode = file->f_mapping->host ;
loff_t maxbytes ;
if ( ! ( ext4_test_inode_flag ( inode , EXT4_INODE_EXTENTS ) ) )
maxbytes = EXT4_SB ( inode->i_sb )->s_bitmap_maxbytes ;
else
maxbytes = inode->i_sb->s_maxbytes ;
2012-12-18 03:59:39 +04:00
switch ( whence ) {
2017-10-02 00:58:54 +03:00
default :
2012-12-18 03:59:39 +04:00
return generic_file_llseek_size ( file , offset , whence ,
2012-11-09 06:57:40 +04:00
maxbytes , i_size_read ( inode ) ) ;
case SEEK_HOLE :
2017-10-02 00:58:54 +03:00
inode_lock_shared ( inode ) ;
2019-11-05 15:03:31 +03:00
offset = iomap_seek_hole ( inode , offset ,
& ext4_iomap_report_ops ) ;
2017-10-02 00:58:54 +03:00
inode_unlock_shared ( inode ) ;
break ;
case SEEK_DATA :
inode_lock_shared ( inode ) ;
2019-11-05 15:03:31 +03:00
offset = iomap_seek_data ( inode , offset ,
& ext4_iomap_report_ops ) ;
2017-10-02 00:58:54 +03:00
inode_unlock_shared ( inode ) ;
break ;
2012-11-09 06:57:40 +04:00
}
2017-10-02 00:58:54 +03:00
if ( offset < 0 )
return offset ;
return vfs_setpos ( file , offset , maxbytes ) ;
2010-10-28 05:30:06 +04:00
}
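The SEEK_HOLE and SEEK_DATA cases handled by ext4_llseek() are reached from userspace through lseek(2). As a sketch of how a caller might use them, the helper below walks a file and adds up the bytes reported as data. The name count_data_bytes() is hypothetical, and the fallback #define values are the Linux ones, given as an assumption in case the headers hide them without _GNU_SOURCE.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed Linux values, used only if the headers do not expose them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Sum the sizes of all data extents in `fd` by alternating
 * SEEK_DATA and SEEK_HOLE. Returns -1 on an unexpected error.
 * The file offset of `fd` is changed by this function.
 */
off_t count_data_bytes(int fd)
{
	off_t end = lseek(fd, 0, SEEK_END);
	off_t pos = 0, total = 0;

	if (end < 0)
		return -1;

	while (pos < end) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		off_t hole;

		if (data < 0)	/* ENXIO means no data at or after pos */
			return errno == ENXIO ? total : -1;

		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;

		total += hole - data;
		pos = hole;
	}
	return total;
}
```

The end of the file always counts as a hole, so a fully written file reports a single data extent covering its whole size. How finely holes inside a sparse file are reported depends on the filesystem.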
2006-10-11 12:20:53 +04:00
const struct file_operations ext4_file_operations = {
2010-10-28 05:30:06 +04:00
. llseek = ext4_llseek ,
2016-11-21 01:36:06 +03:00
. read_iter = ext4_file_read_iter ,
2014-04-18 00:09:22 +04:00
. write_iter = ext4_file_write_iter ,
2021-10-12 14:12:24 +03:00
. iopoll = iocb_bio_iopoll ,
2008-04-30 06:03:54 +04:00
. unlocked_ioctl = ext4_ioctl ,
2006-10-11 12:20:50 +04:00
# ifdef CONFIG_COMPAT
2006-10-11 12:20:53 +04:00
. compat_ioctl = ext4_compat_ioctl ,
2006-10-11 12:20:50 +04:00
# endif
2008-07-12 03:27:31 +04:00
. mmap = ext4_file_mmap ,
2017-11-01 18:36:45 +03:00
. mmap_supported_flags = MAP_SYNC ,
2009-06-13 18:09:48 +04:00
. open = ext4_file_open ,
2006-10-11 12:20:53 +04:00
. release = ext4_release_file ,
. fsync = ext4_sync_file ,
2016-10-08 02:59:59 +03:00
. get_unmapped_area = thp_get_unmapped_area ,
2023-05-22 16:50:05 +03:00
. splice_read = ext4_file_splice_read ,
2014-04-05 12:27:08 +04:00
. splice_write = iter_file_splice_write ,
2011-01-14 15:07:43 +03:00
. fallocate = ext4_fallocate ,
2006-10-11 12:20:50 +04:00
} ;
2007-02-12 11:55:38 +03:00
const struct inode_operations ext4_file_inode_operations = {
2006-10-11 12:20:53 +04:00
. setattr = ext4_setattr ,
2017-03-31 20:31:56 +03:00
. getattr = ext4_file_getattr ,
2006-10-11 12:20:53 +04:00
. listxattr = ext4_listxattr ,
2022-09-22 18:17:00 +03:00
. get_inode_acl = ext4_get_acl ,
2013-12-20 17:16:44 +04:00
. set_acl = ext4_set_acl ,
2008-10-07 08:46:36 +04:00
. fiemap = ext4_fiemap ,
2021-04-07 15:36:43 +03:00
. fileattr_get = ext4_fileattr_get ,
. fileattr_set = ext4_fileattr_set ,
2006-10-11 12:20:50 +04:00
} ;