IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit ce9f24cccdc019229b70a5c15e2b09ad9c0ab5d1 upstream.
Currently, system zones just track ranges of block, that are "important"
fs metadata (bitmaps, group descriptors, journal blocks, etc.). This
however complicates how extent tree (or indirect blocks) can be checked
for inodes that actually track such metadata - currently the journal
inode but arguably we should be treating quota files or resize inode
similarly. We cannot run __ext4_ext_check() on such metadata inodes when
loading their extents as that would immediately trigger the validity
checks and so we just hack around that and special-case the journal
inode. This however leads to a situation that a journal inode which has
extent tree of depth at least one can have invalid extent tree that gets
unnoticed until ext4_cache_extents() crashes.
To overcome this limitation, track inode number each system zone belongs
to (0 is used for zones not belonging to any inode). We can then verify
inode number matches the expected one when verifying extent tree and
thus avoid the false errors. With this there's no need to to
special-case journal inode during extent tree checking anymore so remove
it.
Fixes: 0a944e8a6c66 ("ext4: don't perform block validity checks on the journal inode")
Reported-by: Wolfgang Frisch <wolfgang.frisch@suse.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f902b216501094495ff75834035656e8119c537f upstream.
The idea of the warning in ext4_update_dx_flag() is that we should warn
when we are clearing EXT4_INODE_INDEX on a filesystem with metadata
checksums enabled since after clearing the flag, checksums for internal
htree nodes will become invalid. So there's no need to warn (or actually
do anything) when EXT4_INODE_INDEX is not set.
Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz
Fixes: 48a34311953d ("ext4: fix checksum errors with indexed dirs")
Reported-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit db717d8e26c2d1b0dba3e08668a1e6a7f665adde upstream.
Multiple bugs were recently fixed in the "set encryption policy" ioctl.
To make it clear that fscrypt_process_policy() and fscrypt_get_policy()
implement ioctls and therefore their implementations must take standard
security and correctness precautions, rename them to
fscrypt_ioctl_set_policy() and fscrypt_ioctl_get_policy(). Make the
latter take in a struct file * to make it consistent with the former.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8a363970d1dc38c4ec4ad575c862f776f468d057 upstream.
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
#define _GNU_SOURCE
#include <fcntl.h>
struct handle {
struct file_handle fh;
unsigned char fid[MAX_HANDLE_SZ];
};
int main(int argc, char **argv)
{
struct handle h = {{8, 1 }, { 12, }};
open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
return 0;
}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Ashwin H <ashwinh@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit df3da4ea5a0fc5d115c90d5aa6caa4dd433750a7 ]
During an online resize an array of pointers to s_group_info gets replaced
so it can get enlarged. If there is a concurrent access to the array in
ext4_get_group_info() and this memory has been reused then this can lead to
an invalid memory access.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Balbir Singh <sblbir@amazon.com>
Cc: stable@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7c990728b99ed6fbe9c75fc202fce1172d9916da upstream.
During an online resize an array of s_flex_groups structures gets replaced
so it can get enlarged. If there is a concurrent access to the array and
this memory has been reused then this can lead to an invalid memory access.
The s_flex_group array has been converted into an array of pointers rather
than an array of structures. This is to ensure that the information
contained in the structures cannot get out of sync during a resize due to
an accessor updating the value in the old structure after it has been
copied but before the array pointer is updated. Since the structures them-
selves are no longer copied but only the pointers to them this case is
mitigated.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.4.x
Cc: stable@kernel.org # 4.9.x
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 1d0c3924a92e69bfa91163bda83c12a994b4d106 upstream.
During an online resize an array of pointers to buffer heads gets
replaced so it can get enlarged. If there is a racing block
allocation or deallocation which uses the old array, and the old array
has gotten reused this can lead to a GPF or some other random kernel
memory getting modified.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu
Reported-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.4.x
Cc: stable@kernel.org # 4.9.x
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit cb85f4d23f794e24127f3e562cb3b54b0803f456 upstream.
If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
on it, the following warning in ext4_add_complete_io() can be hit:
WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120
Here's a minimal reproducer (not 100% reliable) (root isn't required):
while true; do
sync
done &
while true; do
rm -f file
touch file
chattr -e file
echo X >> file
chattr +e file
done
The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
(which only returns true on extent-based files) is checked once to set
the number of reserved journal credits, and also again later to select
the flags for ext4_map_blocks() and copy the reserved journal handle to
ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set,
the first check can see dioread_nolock disabled while the later one can
see it enabled, causing the reserved handle to unexpectedly be NULL.
Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
related to doing so as well, fix this by synchronizing changing
EXT4_EXTENTS_FL with ext4_writepages() via the existing
s_writepages_rwsem (previously called s_journal_flag_rwsem).
This was originally reported by syzbot without a reproducer at
https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
but now that dioread_nolock is the default I also started seeing this
when running syzkaller locally.
Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bbd55937de8f2754adc5792b0f8e5ff7d9c0420e upstream.
In preparation for making s_journal_flag_rwsem synchronize
ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
flags (rather than just JOURNAL_DATA as it does currently), rename it to
s_writepages_rwsem.
Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 35df4299a6487f323b0aca120ea3f485dfee2ae3 upstream.
EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by
KCSAN,
BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4]
write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127:
ext4_write_end+0x4e3/0x750 [ext4]
ext4_update_i_disksize at fs/ext4/ext4.h:3032
(inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046
(inlined by) ext4_write_end at fs/ext4/inode.c:1287
generic_perform_write+0x208/0x2a0
ext4_buffered_write_iter+0x11f/0x210 [ext4]
ext4_file_write_iter+0xce/0x9e0 [ext4]
new_sync_write+0x29c/0x3b0
__vfs_write+0x92/0xa0
vfs_write+0x103/0x260
ksys_write+0x9d/0x130
__x64_sys_write+0x4c/0x60
do_syscall_64+0x91/0xb47
entry_SYSCALL_64_after_hwframe+0x49/0xbe
read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37:
ext4_writepages+0x10ac/0x1d00 [ext4]
mpage_map_and_submit_extent at fs/ext4/inode.c:2468
(inlined by) ext4_writepages at fs/ext4/inode.c:2772
do_writepages+0x5e/0x130
__writeback_single_inode+0xeb/0xb20
writeback_sb_inodes+0x429/0x900
__writeback_inodes_wb+0xc4/0x150
wb_writeback+0x4bd/0x870
wb_workfn+0x6b4/0x960
process_one_work+0x54c/0xbe0
worker_thread+0x80/0x650
kthread+0x1e0/0x200
ret_from_fork+0x27/0x50
Reported by Kernel Concurrency Sanitizer on:
CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G W O L 5.5.0-next-20200204+ #5
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
Workqueue: writeback wb_workfn (flush-7:0)
Since only the read is operating as lockless (outside of the
"i_data_sem"), load tearing could introduce a logic bug. Fix it by
adding READ_ONCE() for the read and WRITE_ONCE() for the write.
Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 48a34311953d921235f4d7bbd2111690d2e469cf upstream.
DIR_INDEX has been introduced as a compat ext4 feature. That means that
even kernels / tools that don't understand the feature may modify the
filesystem. This works because for kernels not understanding indexed dir
format, internal htree nodes appear just as empty directory entries.
Index dir aware kernels then check the htree structure is still
consistent before using the data. This all worked reasonably well until
metadata checksums were introduced. The problem is that these
effectively made DIR_INDEX only ro-compatible because internal htree
nodes store checksums in a different place than normal directory blocks.
Thus any modification ignorant to DIR_INDEX (or just clearing
EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch
and trigger kernel errors. So we have to be more careful when dealing
with indexed directories on filesystems with checksumming enabled.
1) We just disallow loading any directory inodes with EXT4_INDEX_FL when
DIR_INDEX is not enabled. This is harsh but it should be very rare (it
means someone disabled DIR_INDEX on existing filesystem and didn't run
e2fsck), e2fsck can fix the problem, and we don't want to answer the
difficult question: "Should we rather corrupt the directory more or
should we ignore that DIR_INDEX feature is not set?"
2) When we find out htree structure is corrupted (but the filesystem and
the directory should in support htrees), we continue just ignoring htree
information for reading but we refuse to add new entries to the
directory to avoid corrupting it more.
Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz
Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes")
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8bc1379b82b8e809eef77a9fedbb75c6c297be19 upstream.
Use a separate journal transaction if it turns out that we need to
convert an inline file to use an data block. Otherwise we could end
up failing due to not having journal credits.
This addresses CVE-2018-10883.
https://bugzilla.kernel.org/show_bug.cgi?id=200071
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
[fengc@google.com: 4.4 and 4.9 backport: adjust context]
Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c37e9e013469521d9adb932d17a1795c139b36db upstream.
If there is a directory entry pointing to a system inode (such as a
journal inode), complain and declare the file system to be corrupted.
Also, if the superblock's first inode number field is too small,
refuse to mount the file system.
This addresses CVE-2018-10882.
https://bugzilla.kernel.org/show_bug.cgi?id=200069
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Create a macro to calculate length + offset -> maximum blocks
This adds more readability.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Use the ext4_{has,set,clear}_feature_* helpers to replace the old
feature helpers.
Signed-off-by: Kaho Ng <ngkaho1234@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
When quota information is stored in quota files, we enable only quota
accounting on mount and enforcement is enabled only in response to
Q_QUOTAON quotactl. To make ext4 behavior consistent with XFS, we add a
possibility to enable quota enforcement on mount by specifying
corresponding quota mount option (usrquota, grpquota, prjquota).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If there are no pending blocks to be released after a commit, forcing
a journal commit has no hope of helping. It's possible that a commit
had just completed, so if there are now free blocks available for
allocation, it's worth retrying the commit.
Reported-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently ext4 treats DAX IO the same way as direct IO. I.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However this way DAX IO can race with page fault to
the same area:
ext4_ext_direct_IO() dax_fault()
dax_io()
get_block() - allocates unwritten extent
copy_from_iter_pmem()
get_block() - converts
unwritten block to
written and zeroes it
out
ext4_convert_unwritten_extents()
So data written with DAX IO gets lost. Similarly dax_new_buf() called
from dax_io() can overwrite data that has been already written to the
block via mmap.
Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well however we would need to
lock-out faults for the whole range written to by DAX IO. And that is
not easy to do without locking-out faults for the whole file which seems
too aggressive.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
and ext4_ind_direct_IO(). However the extent based function calls into
the indirect based one for some cases and for example it is not able to
handle file extending. Previously it was not also properly handling
retries in case of ENOSPC errors. With DAX things would get even more
contrieved so just refactor the direct IO code and instead of indirect /
extent split do the split to read vs writes.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In ext4, there is a race condition between changing inode journal mode
and ext4_writepages(). While ext4_writepages() is executed on a
non-journalled mode inode, the inode's journal mode could be enabled
by ioctl() and then, some pages dirtied after switching the journal
mode will be still exposed to ext4_writepages() in non-journaled mode.
To resolve this problem, we use fs-wide per-cpu rw semaphore by Jan
Kara's suggestion because we don't want to waste ext4_inode_info's
space for this extra rare case.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Currently we ask jbd2 to write all dirty allocated buffers before
committing a transaction when doing writeback of delay allocated blocks.
However this is unnecessary since we move all pages to writeback state
before dropping a transaction handle and then submit all the necessary
IO. We still need the transaction commit to wait for all the outstanding
writeback before flushing disk caches during transaction commit to avoid
data exposure issues though. Use the new jbd2 capability and ask it to
only wait for outstanding writeback during transaction commit when
writing back data in ext4_writepages().
Tested-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This flag is just duplicating what ext4_should_order_data() tells you
and is used in a single place. Furthermore it doesn't reflect changes to
inode data journalling flag so it may be possibly misleading. Just
remove it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(badly behaved) dentry code in various file systems. These have been
reviewed by Al and the respective file system mtinainers and are going
through the ext4 tree for convenience.
This also has a few ext4 encryption bug fixes that were discovered in
Android testing (yes, we will need to get these sync'ed up with the
fs/crypto code; I'll take care of that). It also has some bug fixes
and a change to ignore the legacy quota options to allow for xfstests
regression testing of ext4's internal quota feature and to be more
consistent with how xfs handles this case.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXBn4aAAoJEPL5WVaVDYGjHWgH/2wXnlQnC2ndJhblBWtPzprz
OQW4dawdnhxqbTEGUqWe942tZivSb/liu/lF+urCGbWsbgz9jNOCmEAg7JPwlccY
mjzwDvtVq5U4d2rP+JDWXLy/Gi8XgUclhbQDWFVIIIea6fS7IuFWqoVBR+HPMhra
9tEygpiy5lNtJA/hqq3/z9x0AywAjwrYR491CuWreo2Uu1aeKg0YZsiDsuAcGioN
Waa2TgbC/ZZyJuJcPBP8If+VOFAa0ea3F+C/o7Tb9bOqwuz0qSTcaMRgt6eQ2KUt
P4b9Ecp1XLjJTC7IYOknUOScY3lCyREx/Xya9oGZfFNTSHzbOlLBoplCr3aUpYQ=
=/HHR
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"These changes contains a fix for overlayfs interacting with some
(badly behaved) dentry code in various file systems. These have been
reviewed by Al and the respective file system mtinainers and are going
through the ext4 tree for convenience.
This also has a few ext4 encryption bug fixes that were discovered in
Android testing (yes, we will need to get these sync'ed up with the
fs/crypto code; I'll take care of that). It also has some bug fixes
and a change to ignore the legacy quota options to allow for xfstests
regression testing of ext4's internal quota feature and to be more
consistent with how xfs handles this case"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: ignore quota mount options if the quota feature is enabled
ext4 crypto: fix some error handling
ext4: avoid calling dquot_get_next_id() if quota is not enabled
ext4: retry block allocation for failed DIO and DAX writes
ext4: add lockdep annotations for i_data_sem
ext4: allow readdir()'s of large empty directories to be interrupted
btrfs: fix crash/invalid memory access on fsync when using overlayfs
ext4 crypto: use dget_parent() in ext4_d_revalidate()
ext4: use file_dentry()
ext4: use dget_parent() in ext4_file_open()
nfs: use file_dentry()
fs: add file_dentry()
ext4 crypto: don't let data integrity writebacks fail with ENOMEM
ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
Mostly direct substitution with occasional adjustment or removing
outdated comments.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the internal Quota feature, mke2fs creates empty quota inodes and
quota usage tracking is enabled as soon as the file system is mounted.
Since quotacheck is no longer preallocating all of the blocks in the
quota inode that are likely needed to be written to, we are now seeing
a lockdep false positive caused by needing to allocate a quota block
from inside ext4_map_blocks(), while holding i_data_sem for a data
inode. This results in this complaint:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&ei->i_data_sem);
lock(&s->s_dquot.dqio_mutex);
lock(&ei->i_data_sem);
lock(&s->s_dquot.dqio_mutex);
Google-Bug-Id: 27907753
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
We don't want the writeback triggered from the journal commit (in
data=writeback mode) to cause the journal to abort due to
generic_writepages() returning an ENOMEM error. In addition, if
fsync() fails with ENOMEM, most applications will probably not do the
right thing.
So if we are doing a data integrity sync, and ext4_encrypt() returns
ENOMEM, we will submit any queued I/O to date, and then retry the
allocation using GFP_NOFAIL.
Google-Bug-Id: 27641567
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Change summary:
o error propagation for direct IO failures fixes for both XFS and ext4
o new quota interfaces and XFS implementation for iterating all the quota IDs
in the filesystem
o locking fixes for real-time device extent allocation
o reduction of duplicate information in the xfs and vfs inode, saving roughly
100 bytes of memory per cached inode.
o buffer flag cleanup
o rework of the writepage code to use the generic write clustering mechanisms
o several fixes for inode flag based DAX enablement
o rework of remount option parsing
o compile time verification of on-disk format structure sizes
o delayed allocation reservation overrun fixes
o lots of little error handling fixes
o small memory leak fixes
o enable xfsaild freezing again
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW71DQAAoJEK3oKUf0dfodyiwP/0Tou9f1huzLC0kd7kmEoKKC
BWQmtJGEdo0iSpJNZhg/EJmjvRtbBiOB9CRcEyG8d71kqZ+MKW7t/4JjNvNG34aE
vHjhwMBVVqkw/q6azi2LiEDsVcOe5bXxUrXNZi18/09OAl4pHm+X8VERLnnC5y+i
QIHAOdB5R+36cXcceJm1HR6jTZedbNdQkT/ndhm5S60FGhvVI29cs9NwYwoi5aif
O55r6krSWBj6U/X6MsLvr+lNb6+1Sd1hyE8dGTE7lOUX/crFIysaDPEuQmWvDjsO
M1ulVfzKoBJHcyvpbdHwdBEyiBjzvETcrgndMRoWOjZiOLqNtWYsgIEiC+Nlidwd
+T4XhkJJJg5UUQ4r6Hs85SQn/THanzR5KoN5nbTsFtFkCKw1DRkUSNuh2mXP2xVG
JcNDCjDvvHG76EfQ1otlYf7ru79Ck+hjVs+szaEVPpOzAwz8yOtD+L7I8f73gQ6a
ayP8W2oZQpYvQRv+smgvt+HwQA4fNJk9ZseY3QD5+z5snJz7JEhZogqW+ngFYkNQ
dtA5Y7gpTkKfo3mKO0XmE5+3fcSXhGHGYQzmUgJFlgWTK7+E8fuDhn6D66wFcZSq
QhyRk9J7Xb7ZWuP5PlOkxb9DLd4hnuyie2bYw/0hVtOatjE/Em4gRJ3Oq3ZANwZx
OeMGj4Uyb3/MKAJwy3Gq
=ZoiX
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"There's quite a lot in this request, and there's some cross-over with
ext4, dax and quota code due to the nature of the changes being made.
As for the rest of the XFS changes, there are lots of little things
all over the place, which add up to a lot of changes in the end.
The major changes are that we've reduced the size of the struct
xfs_inode by ~100 bytes (gives an inode cache footprint reduction of
>10%), the writepage code now only does a single set of mapping tree
lockups so uses less CPU, delayed allocation reservations won't
overrun under random write loads anymore, and we added compile time
verification for on-disk structure sizes so we find out when a commit
or platform/compiler change breaks the on disk structure as early as
possible.
Change summary:
- error propagation for direct IO failures fixes for both XFS and
ext4
- new quota interfaces and XFS implementation for iterating all the
quota IDs in the filesystem
- locking fixes for real-time device extent allocation
- reduction of duplicate information in the xfs and vfs inode, saving
roughly 100 bytes of memory per cached inode.
- buffer flag cleanup
- rework of the writepage code to use the generic write clustering
mechanisms
- several fixes for inode flag based DAX enablement
- rework of remount option parsing
- compile time verification of on-disk format structure sizes
- delayed allocation reservation overrun fixes
- lots of little error handling fixes
- small memory leak fixes
- enable xfsaild freezing again"
* tag 'xfs-for-linus-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (66 commits)
xfs: always set rvalp in xfs_dir2_node_trim_free
xfs: ensure committed is initialized in xfs_trans_roll
xfs: borrow indirect blocks from freed extent when available
xfs: refactor delalloc indlen reservation split into helper
xfs: update freeblocks counter after extent deletion
xfs: debug mode forced buffered write failure
xfs: remove impossible condition
xfs: check sizes of XFS on-disk structures at compile time
xfs: ioends require logically contiguous file offsets
xfs: use named array initializers for log item dumping
xfs: fix computation of inode btree maxlevels
xfs: reinitialise per-AG structures if geometry changes during recovery
xfs: remove xfs_trans_get_block_res
xfs: fix up inode32/64 (re)mount handling
xfs: fix format specifier , should be %llx and not %llu
xfs: sanitize remount options
xfs: convert mount option parsing to tokens
xfs: fix two memory leaks in xfs_attr_list.c error paths
xfs: XFS_DIFLAG2_DAX limited by PAGE_SIZE
xfs: dynamically switch modes when XFS_DIFLAG2_DAX is set/cleared
...
the error is:
fs/ext4/mballoc.c:475:43: error: 'struct ext4_group_info' has
no member named 'bb_bitmap'.
so, the definition of macro DOUBLE_CHECK should before
'struct ext4_group_info', I fixed it, and I moved the macro
AGGRESSIVE_CHECK together, because I think they shoule be together.
Signed-off-by: Aihua Zhang <zhangaihua1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Using SEEK_DATA in a huge sparse file can easily lead to sotflockups as
ext4_seek_data() iterates hole block-by-block. Fix the problem by using
returned hole size from ext4_map_blocks() and thus skip the hole in one
go.
Update also SEEK_HOLE implementation to follow the same pattern as
SEEK_DATA to make future maintenance easier.
Furthermore we add cond_resched() to both ext4_seek_data() and
ext4_seek_hole() to avoid softlockups in case evil user creates huge
fragmented file and we have to go through lots of extents.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When mapping blocks for direct IO, we allocate io_end structure before
mapping blocks and store pointer to it in the inode. This creates a
requirement that any AIO DIO using io_end must be protected by i_mutex.
This created problems in the past with dioread_nolock mode which was
corrupting io_end pointers. Also io_end is allocated unnecessarily in
case where we don't need to convert any extents (which is a common case
for example when overwriting file).
We fix the problem by allocating io_end only once we return unwritten
extent from block mapping function for AIO DIO (so we can save some
pointless io_end allocations) and we pass pointer to it in bh->b_private
which generic DIO code later passes to our end IO callback. That way we
remove any need for global pointer to io_end structure and thus fix the
races.
The downside of this change is that the checking for unwritten IO in
flight in ext4_extents_can_be_merged() is more racy since we now
increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after dropping
i_data_sem. However the check has been racy already before because
ext4_writepages() already increment i_unwritten after dropping
i_data_sem and reserved blocks save us from hitting ENOSPC in the worst
case.
Signed-off-by: Jan Kara <jack@suse.cz>
Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better
describe what it does. Also split out get blocks functions for direct
IO. Later we move functionality from _ext4_get_blocks() there. There's no
functional change in this patch.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
However the code cleanups that happened after 2011 when the lock was
introduced made aio_mutex acquired at almost the same places where we
already have exclusion using i_mutex. So just use i_mutex for the
exclusion of unaligned AIO DIO.
The change moves waiting for pending unwritten extent conversion under
i_mutex. That makes special handling of O_APPEND writes unnecessary and
also avoids possible livelocking of unaligned AIO DIO with aligned one
(nothing was preventing contiguous stream of aligned AIO DIOs to let
unaligned AIO DIO wait forever).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
On 64-bit architectures we have two 4-byte holes in struct ext4_io_end.
Order entries better to avoid this and thus make the structure occupy
64 instead of 72 bytes for 64-bit architectures.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When AIO DIO fails e.g. due to IO error, we must not convert unwritten
extents as that will expose uninitialized data. Handle this case
by clearing unwritten flag from io_end in case of error and thus
preventing extent conversion.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Since old mbcache code is gone, let's rename new code to mbcache since
number 2 is now meaningless. This is just a mechanical replacement.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The conversion is generally straightforward. The only tricky part is
that xattr block corresponding to found mbcache entry can get freed
before we get buffer lock for that block. So we have to check whether
the entry is still valid after getting buffer lock.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add a validation check for dentries for encrypted directory to make
sure we're not caching stale data after a key has been added or removed.
Also check to make sure that status of the encryption key is updated
when readdir(2) is executed.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch adds FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR ioctl interface
support for ext4. The interface is kept consistent with
XFS_IOC_FSGETXATTR/XFS_IOC_FSGETXATTR.
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
This patch adds mount options for enabling/disabling project quota
accounting and enforcement. A new specific inode is also used for
project quota accounting.
[ Includes fix from Dan Carpenter to crrect error checking from dqget(). ]
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
A number of functions include ext4_add_dx_entry, make_indexed_dir,
etc. are being passed a dentry even though the only thing they use is
the containing parent. We can shrink the code size slightly by making
this replacement. This will also be useful in cases where we don't
have a dentry as the argument to the directory entry insert functions.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Make DAX fault path use pre-zeroed blocks to avoid races with extent
conversion and zeroing when two page faults to the same block happen.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
DAX page fault path needs to get blocks that are pre-zeroed to avoid
races when two concurrent page faults happen in the same block of a
file. Implement support for this in ext4_map_blocks().
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Create new function ext4_issue_zeroout() to zeroout contiguous (both
logically and physically) part of inode data. We will need to issue
zeroout when extent structure is not readily available and this function
will allow us to do it without making up fake extent structures.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When dioread_nolock mode is enabled, we grab i_data_sem in
ext4_ext_direct_IO() and therefore we need to instruct _ext4_get_block()
not to grab i_data_sem again using EXT4_GET_BLOCKS_NO_LOCK. However
holding i_data_sem over overwrite direct IO isn't needed these days. We
have exclusion against truncate / hole punching because we increase
i_dio_count under i_mutex in ext4_ext_direct_IO() so once
ext4_file_write_iter() verifies blocks are allocated & written, they are
guaranteed to stay so during the whole direct IO even after we drop
i_mutex.
So we can just remove this locking abuse and the no longer necessary
EXT4_GET_BLOCKS_NO_LOCK flag.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When doing delayed allocation, update of on-disk inode size is postponed
until IO submission time. However hole punch or zero range fallocate
calls can end up discarding the tail page cache page and thus on-disk
inode size would never be properly updated.
Make sure the on-disk inode size is updated before truncating page
cache.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>