IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Merge misc fixes from Andrew Morton:
"6 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
sh: add copy_user_page() alias for __copy_user()
lib/Kconfig: ZLIB_DEFLATE must select BITREVERSE
mm, dax: fix DAX deadlocks
memcg: convert threshold to bytes
builddeb: remove debian/files before build
mm, fs: obey gfp_mapping for add_to_page_cache()
Commit 6afdb859b710 ("mm: do not ignore mapping_gfp_mask in page cache
allocation paths") has caught some users of hardcoded GFP_KERNEL used in
the page cache allocation paths. This, however, wasn't complete and
there were others which went unnoticed.
Dave Chinner has reported the following deadlock for xfs on loop device:
: With the recent merge of the loop device changes, I'm now seeing
: XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
:
: The deadlocked is as follows:
:
: kloopd1: loop_queue_read_work
: xfs_file_iter_read
: lock XFS inode XFS_IOLOCK_SHARED (on image file)
: page cache read (GFP_KERNEL)
: radix tree alloc
: memory reclaim
: reclaim XFS inodes
: log force to unpin inodes
: <wait for log IO completion>
:
: xfs-cil/loop1: <does log force IO work>
: xlog_cil_push
: xlog_write
: <loop issuing log writes>
: xlog_state_get_iclog_space()
: <blocks due to all log buffers under write io>
: <waits for IO completion>
:
: kloopd1: loop_queue_write_work
: xfs_file_write_iter
: lock XFS inode XFS_IOLOCK_EXCL (on image file)
: <wait for inode to be unlocked>
:
: i.e. the kloopd, with it's split read and write work queues, has
: introduced a dependency through memory reclaim. i.e. that writes
: need to be able to progress for reads make progress.
:
: The problem, fundamentally, is that mpage_readpages() does a
: GFP_KERNEL allocation, rather than paying attention to the inode's
: mapping gfp mask, which is set to GFP_NOFS.
:
: The didn't used to happen, because the loop device used to issue
: reads through the splice path and that does:
:
: error = add_to_page_cache_lru(page, mapping, index,
: GFP_KERNEL & mapping_gfp_mask(mapping));
This has changed by commit aa4d86163e4 ("block: loop: switch to VFS
ITER_BVEC").
This patch changes mpage_readpage{s} to follow gfp mask set for the
mapping. There are, however, other places which are doing basically the
same.
lustre:ll_dir_filler is doing GFP_KERNEL from the function which
apparently uses GFP_NOFS for other allocations so let's make this
consistent.
cifs:readpages_get_pages is called from cifs_readpages and
__cifs_readpages_from_fscache called from the same path obeys mapping
gfp.
ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well
regardless it uses mapping_gfp_mask for the page allocation.
ext4_mpage_readpages is the called from the page cache allocation path
same as read_pages and read_cache_pages
As I've noticed in my previous post I cannot say I would be happy about
sprinkling mapping_gfp_mask all over the place and it sounds like we
should drop gfp_mask argument altogether and use it internally in
__add_to_page_cache_locked that would require all the filesystems to use
mapping gfp consistently which I am not sure is the case here. From a
quick glance it seems that some file system use it all the time while
others are selective.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If there is a error while copying data from userspace into the page
cache during a write(2) system call, in data=journal mode, in
ext4_journalled_write_end() were using page_zero_new_buffers() from
fs/buffer.c. Unfortunately, this sets the buffer dirty flag, which is
no good if journalling is enabled. This is a long-standing bug that
goes back for years and years in ext3, but a combination of (a)
data=journal not being very common, (b) in many case it only results
in a warning message. and (c) only very rarely causes the kernel hang,
means that we only really noticed this as a problem when commit
998ef75ddb caused this failure to happen frequently enough to cause
generic/208 to fail when run in data=journal mode.
The fix is to have our own version of this function that doesn't call
mark_dirty_buffer(), since we will end up calling
ext4_handle_dirty_metadata() on the buffer head(s) in questions very
shortly afterwards in ext4_journalled_write_end().
Thanks to Dave Hansen and Linus Torvalds for helping to identify the
root cause of the problem.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.com>
Fix multiple bugs in ext4_encrypted_zeroout(), including one that
could cause us to write an encrypted zero page to the wrong location
on disk, potentially causing data and file system corruption.
Fortunately, this tends to only show up in stress tests, but even with
these fixes, we are seeing some test failures with generic/127 --- but
these are now caused by data failures instead of metadata corruption.
Since ext4_encrypted_zeroout() is only used for some optimizations to
keep the extent tree from being too fragmented, and
ext4_encrypted_zeroout() itself isn't all that optimized from a time
or IOPS perspective, disable the extent tree optimization for
encrypted inodes for now. This prevents the data corruption issues
reported by generic/127 until we can figure out what's going wrong.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Since ext4_page_crypto() doesn't need an encryption context (at least
not any more), this allows us to simplify a number function signature
and also allows us to avoid needing to allocate a context in
ext4_block_write_begin(). It also means we no longer need a separate
ext4_decrypt_one() function.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In cases where the file system block size is the same as the page
size, and ext4_writepage() is asked to write out a page which is
either has the unwritten bit set in the extent tree, or which does not
yet have a block assigned due to delayed allocation, we can bail out
early and, unlocking the page earlier and avoiding a round trip
through ext4_bio_write_page() with the attendant calls to
set_page_writeback() and redirty_page_for_writeback().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There are times when ext4_bio_write_page() is called even though we
don't actually need to do any I/O. This happens when ext4_writepage()
gets called by the jbd2 commit path when an inode needs to force its
pages written out in order to provide data=ordered guarantees --- and
a page is backed by an unwritten (e.g., uninitialized) block on disk,
or if delayed allocation means the page's backing store hasn't been
allocated yet. In that case, we need to skip the call to
ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
bounce page and an ext4 crypto context getting leaked.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
IS_ERR(_OR_NULL) already contain an 'unlikely' compiler flag and there
is no need to do that again from its callers. Drop it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve French <smfrench@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Configuration option EXT4_USE_FOR_EXT2 has no effect on ext3 support.
Support for ext3 is always included now.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Fixes: c290ea01ab ("fs: Remove ext3 filesystem driver")
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.com>
This allows us to refactor the procfs code, which saves a bit of
compiled space. More importantly it isolates most of the procfs
support code into a single file, so it's easier to #ifdef it out if
the proc file system has been disabled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Also statically allocate the ext4_kset and ext4_feat objects, since we
only need exactly one of each, and it's simpler and less code if we
drop the dynamic allocation and deallocation when it's not needed.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara pointed out that in the case where we are writing to a hole, we
can end up with a lock inversion between the page lock and the journal
lock. We can avoid this by starting the transaction in ext4 before
calling into DAX. The journal lock nests inside the superblock
pagefault lock, so we have to duplicate that code from dax_fault, like
XFS does.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DAX wants different semantics from any currently-existing ext4 get_block
callback. Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able to
return unwritten extents. So introduce a new ext4_get_block_dax() which
has those semantics.
We could also change ext4_get_block_write() to honour the 'create' flag,
but that might have consequences on other users that I do not currently
understand.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DAX relies on the get_block function either zeroing newly allocated
blocks before they're findable by subsequent calls to get_block, or
marking newly allocated blocks as unwritten. ext4_get_block() cannot
create unwritten extents, but ext4_get_block_write() can.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reported-by: Andy Rudoff <andy.rudoff@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use DAX to provide support for huge pages.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h. Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>. We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g. new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else. This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.
Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
of "sudo" is something more sneaky:
$ BASE="ovl"
$ MNT="$BASE/mnt"
$ LOW="$BASE/lower"
$ UP="$BASE/upper"
$ WORK="$BASE/work/ 0 0
none /proc fuse.pwn user_id=1000"
$ mkdir -p "$LOW" "$UP" "$WORK"
$ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
$ cat /proc/mounts
none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
none /proc fuse.pwn user_id=1000 0 0
$ fusermount -u /proc
$ cat /proc/mounts
cat: /proc/mounts: No such file or directory
This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed. Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.
[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
features and other churn going into 4.2.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJV55TlAAoJEPL5WVaVDYGjyzYH/1WtZpIzRjp7o+3H4/vFqONg
R1Fsw785C1w8WX2QuIK/m31u4XO+VeCV4jWA9DuqnSzWm9w9C/4kTqITd4Hwp416
/9gJvYoZHHaDikxpWWADptDi8IoLohTlcFVCHIvvf53cGehVEEsc2WOijUZo7Cgv
O454Nm3tK0CQ3yrCIlf5SyvkUZSMTiawLLJJzd4GCyvU13C1SnABNQj8UxKisBA5
cP8q4O2nPg/S9rkYxnFAifQyZppd3jMvorUaq9eHiWMjl95o6e/6+wYGnHhoFUvr
/P1dNjJYbzk+TUzlsDkq2zANK2UsB3iNNi8YwLFOpfFcuYopmUAYRIWOgIZWYUQ=
=ijuI
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Pretty much all bug fixes and clean ups for 4.3, after a lot of
features and other churn going into 4.2"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Revert "ext4: remove block_device_ejected"
ext4: ratelimit the file system mounted message
ext4: silence a format string false positive
ext4: simplify some code in read_mmp_block()
ext4: don't manipulate recovery flag when freezing no-journal fs
jbd2: limit number of reserved credits
ext4 crypto: remove duplicate header file
ext4: update c/mtime on truncate up
jbd2: avoid infinite loop when destroying aborted journal
ext4, jbd2: add REQ_FUA flag when recording an error in the superblock
ext4 crypto: fix spelling typo in comment
ext4 crypto: exit cleanly if ext4_derive_key_aes() fails
ext4: reject journal options for ext2 mounts
ext4: implement cgroup writeback support
ext4: replace ext4_io_submit->io_op with ->io_wbc
ext4 crypto: check for too-short encrypted file names
ext4 crypto: use a jbd2 transaction when adding a crypto policy
jbd2: speedup jbd2_journal_dirty_metadata()
Pull ext3 removal, quota & udf fixes from Jan Kara:
"The biggest change in the pull is the removal of ext3 filesystem
driver (~28k lines removed). Ext4 driver is a full featured
replacement these days and both RH and SUSE use it for several years
without issues. Also there are some workarounds in VM & block layer
mainly for ext3 which we could eventually get rid of.
Other larger change is addition of proper error handling for
dquot_initialize(). The rest is small fixes and cleanups"
[ I wasn't convinced about the ext3 removal and worried about things
falling through the cracks for legacy users, but ext4 maintainers
piped up and were all unanimously in favor of removal, and maintaining
all legacy ext3 support inside ext4. - Linus ]
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Don't modify filesystem for read-only mounts
quota: remove an unneeded condition
ext4: memory leak on error in ext4_symlink()
mm/Kconfig: NEED_BOUNCE_POOL: clean-up condition
ext4: Improve ext4 Kconfig test
block: Remove forced page bouncing under IO
fs: Remove ext3 filesystem driver
doc: Update doc about journalling layer
jfs: Handle error from dquot_initialize()
reiserfs: Handle error from dquot_initialize()
ocfs2: Handle error from dquot_initialize()
ext4: Handle error from dquot_initialize()
ext2: Handle error from dquot_initalize()
quota: Propagate error from ->acquire_dquot()
The xfstests ext4/305 will mount and unmount the same file system over
4,000 times, and each one of these will cause a system log message.
Ratelimit this message since if we are getting more than a few dozen
of these messages, they probably aren't going to be helpful.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Static checkers complain that the format string should be "%s". It does
not make a difference for the current code.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
My static check complains because we have:
if (!*bh)
return -ENOMEM;
if (*bh) {
The second check is unnecessary.
I've simplified this code by moving the "if (!*bh)" checks around. Also
Andreas Dilger says we should probably print a warning if sb_getblk()
fails.
[ Restructured the code so that we print a warning message as well if
the mmp block doesn't check out, and to print the error code to
disambiguate between the error cases. - TYT ]
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
At some point along this sequence of changes:
f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops
bb04457 ext4: support freezing ext2 (nojournal) file systems
9ca9238 ext4: Use separate super_operations structure for no_journal filesystems
ext4 started setting needs_recovery on filesystems without journals
when they are unfrozen. This makes no sense, and in fact confuses
blkid to the point where it doesn't recognize the filesystem at all.
(freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
see needs_recovery set on fs w/ no journal).
To fix this, don't manipulate the INCOMPAT_RECOVER feature on
filesystems without journals.
Reported-by: Stu Mark <smark@datto.com>
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <= isize")
introduced a bug that c/mtime is not updated on truncate up.
Fix the issue by setting c/mtime explicitly in the truncate up case.
Note that ftruncate(2) is not affected, so you won't see this bug using
truncate(1) and xfs_io(1).
Signed-off-by: Zirong Lang <zorro.lang@gmail.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We should release "sd" before returning.
Fixes: 0fa12ad1b285 ('ext4: Handle error from dquot_initialize()')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jan Kara <jack@suse.com>
Now that ext4 driver must be used to access ext3 filesystems, improve
the Kconfig help text to better explain that using ext4 driver to access
the filesystem is fully compatible with the old ext3 driver.
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.com>
The functionality of ext3 is fully supported by ext4 driver. Major
distributions (SUSE, RedHat) already use ext4 driver to handle ext3
filesystems for quite some time. There is some ugliness in mm resulting
from jbd cleaning buffers in a dirty page without cleaning page dirty
bit and also support for buffer bouncing in the block layer when stable
pages are required is there only because of jbd. So let's remove the
ext3 driver. This saves us some 28k lines of duplicated code.
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
When an error condition is detected, an error status should be recorded into
superblocks of EXT4 or JBD2. However, the write request is submitted now
without REQ_FUA flag, even in "barrier=1" mode, which is followed by
panic() function in "errors=panic" mode. On mobile devices which make
whole system reset as soon as kernel panic occurs, this write request
containing an error flag will disappear just from storage cache without
written to the physical cells. Therefore, when next start, even forever,
the error flag cannot be shown in both superblocks, and e2fsck cannot fix
the filesystem problems automatically, unless e2fsck is executed in
force checking mode.
[ Changed use test_opt(sb, BARRIER) of checking the journal flags -- TYT ]
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Return value of ext4_derive_key_aes() is stored but not used.
Add test to exit cleanly if ext4_derive_key_aes() fail.
Also fix coverity CID 1309760.
Signed-off-by: Laurent Navet <laurent.navet@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There is no reason to allow ext2 filesystems be mounted with journal
mount options. So, this patch adds them to the MOPT_NO_EXT2 mount
options list.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
For ordered and writeback data modes, all data IOs go through
ext4_io_submit. This patch adds cgroup writeback support by invoking
wbc_init_bio() from io_submit_init_bio() and wbc_account_io() in
io_submit_add_bh(). Journal data which is written by jbd2 worker is
left alone by this patch and will always be written out from the root
cgroup.
ext4_fill_super() is updated to set MS_CGROUPWB when data mode is
either ordered or writeback. In journaled data mode, most IOs become
synchronous through the journal and enabling cgroup writeback support
doesn't make much sense or difference. Journaled data mode is left
alone.
Lightly tested with sequential data write workload. Behaves as
expected.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_io_submit_init() takes the pointer to writeback_control to test
its sync_mode and determine between WRITE and WRITE_SYNC and records
the result in ->io_op. This patch makes it record the pointer
directly and moves the test to ext4_io_submit().
This doesn't cause any noticeable differences now but having
writeback_control available throughout IO submission path will be
depended upon by the planned cgroup writeback support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
An encrypted file name should never be shorter than an 16 bytes, the
AES block size. The 3.10 crypto layer will oops and crash the kernel
if ciphertext shorter than the block size is passed to it.
Fortunately, in modern kernels the crypto layer will not crash the
kernel in this scenario, but nevertheless, it represents a corrupted
directory, and we should detect it and mark the file system as
corrupted so that e2fsck can fix this.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Start a jbd2 transaction, and mark the inode dirty on the inode under
that transaction after setting the encrypt flag. Otherwise if the
directory isn't modified after setting the crypto policy, the
encrypted flag might not survive the inode getting pushed out from
memory, or the the file system getting unmounted and remounted.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The FITRIM ioctl has the same arguments on 32-bit and 64-bit
architectures, so we can add it to the list of compatible ioctls and
drop it from compat_ioctl method of various filesystems.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ted Ts'o <tytso@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* address corner cases for indirect blocks->extent migration
* fix reserved block accounting invalidate_page when
page_size != block_size (i.e., ppc or 1k block size file systems)
* fix deadlocks when a memcg is under heavy memory pressure
* fix fencepost error in lazytime optimization
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJVmW27AAoJEPL5WVaVDYGjmEkIAJsGHVIKur1Kp//FhejSB/wI
B0d+UuQt5kdAE3lNxC7lHO1NqIhvnS7eBho+52LG8V4JDRrzTbE1GdbsBhAIk6FW
CcsQvsHAI99QJMdqOCachu/+nhCwIINGkxmbumhNaZoJPn6wmGQzCA3Cn5qmnGnK
Ctbk6li1HuMXyzbbvxCLfaD/xCUs1NCdufEnRU44i0U4OfaYNpiAhddeGIQ8WMEQ
G14l2JvhIfye6fG8lnCzfacFvnT9zvvSGfRO3ZQjC4Az1EogIUbhCPLvq0ebDbPp
i4eRfrSRdXmMojqmW/knET8skXQVZVnD7LWuvkue+n47UbTH2c0roTbp4l76W+U=
=x8Cc
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"Bug fixes (all for stable kernels) for ext4:
- address corner cases for indirect blocks->extent migration
- fix reserved block accounting invalidate_page when
page_size != block_size (i.e., ppc or 1k block size file systems)
- fix deadlocks when a memcg is under heavy memory pressure
- fix fencepost error in lazytime optimization"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: replace open coded nofail allocation in ext4_free_blocks()
ext4: correctly migrate a file with a hole at the beginning
ext4: be more strict when migrating to non-extent based file
ext4: fix reservation release on invalidatepage for delalloc fs
ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
bufferhead: Add _gfp version for sb_getblk()
ext4: fix fencepost error in lazytime optimization
ext4_free_blocks is looping around the allocation request and mimics
__GFP_NOFAIL behavior without any allocation fallback strategy. Let's
remove the open coded loop and replace it with __GFP_NOFAIL. Without the
flag the allocator has no way to find out never-fail requirement and
cannot help in any way.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Pull more vfs updates from Al Viro:
"Assorted VFS fixes and related cleanups (IMO the most interesting in
that part are f_path-related things and Eric's descriptor-related
stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
fs-cache series, DAX patches, Jan's file_remove_suid() work"
[ I'd say this is much more than "fixes and related cleanups". The
file_table locking rule change by Eric Dumazet is a rather big and
fundamental update even if the patch isn't huge. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
9p: cope with bogus responses from server in p9_client_{read,write}
p9_client_write(): avoid double p9_free_req()
9p: forgetting to cancel request on interrupted zero-copy RPC
dax: bdev_direct_access() may sleep
block: Add support for DAX reads/writes to block devices
dax: Use copy_from_iter_nocache
dax: Add block size note to documentation
fs/file.c: __fget() and dup2() atomicity rules
fs/file.c: don't acquire files->file_lock in fd_install()
fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
vfs: avoid creation of inode number 0 in get_next_ino
namei: make set_root_rcu() return void
make simple_positive() public
ufs: use dir_pages instead of ufs_dir_pages()
pagemap.h: move dir_pages() over there
remove the pointless include of lglock.h
fs: cleanup slight list_entry abuse
xfs: Correctly lock inode when removing suid and file capabilities
fs: Call security_ops->inode_killpriv on truncate
fs: Provide function telling whether file_remove_privs() will do anything
...
Currently ext4_ind_migrate() doesn't correctly handle a file which
contains a hole at the beginning of the file. This caused the migration
to be done incorrectly, and then if there is a subsequent following
delayed allocation write to the "hole", this would reclaim the same data
blocks again and results in fs corruption.
# assmuing 4k block size ext4, with delalloc enabled
# skip the first block and write to the second block
xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile
# converting to indirect-mapped file, which would move the data blocks
# to the beginning of the file, but extent status cache still marks
# that region as a hole
chattr -e /mnt/ext4/testfile
# delayed allocation writes to the "hole", reclaim the same data block
# again, results in i_blocks corruption
xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
umount /mnt/ext4
e2fsck -nf /dev/sda6
...
Inode 53, i_blocks is 16, should be 8. Fix? no
...
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Currently the check in ext4_ind_migrate() is not enough before doing the
real conversion:
a) delayed allocated extents could bypass the check on eh->eh_entries
and eh->eh_depth
This can be demonstrated by this script
xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
chattr -e /mnt/ext4/testfile
where testfile has two extents but still be converted to non-extent
based file format.
b) only extent length is checked but not the offset, which would result
in data lose (delalloc) or fs corruption (nodelalloc), because
non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
blocks
This can be demostrated by
xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
chattr -e /mnt/ext4/testfile
sync
If delalloc is enabled, dmesg prints
EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
EXT4-fs (dm-4): This should not happen!! Data will be lost
If delalloc is disabled, e2fsck -nf shows corruption
Inode 53, i_size is 5497558142976, should be 4096. Fix? no
Fix the two issues by
a) forcing all delayed allocation blocks to be allocated before checking
eh->eh_depth and eh->eh_entries
b) limiting the last logical block of the extent is within direct map
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
On delalloc enabled file system on invalidatepage operation
in ext4_da_page_release_reservation() we want to clear the delayed
buffer and remove the extent covering the delayed buffer from the extent
status tree.
However currently there is a bug where on the systems with page size >
block size we will always remove extents from the start of the page
regardless where the actual delayed buffers are positioned in the page.
This leads to the errors like this:
EXT4-fs warning (device loop0): ext4_da_release_space:1225:
ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
blocks
This however can cause data loss on writeback time if the file system is
in ENOSPC condition because we're releasing reservation for someones
else delayed buffer.
Fix this by only removing extents that corresponds to the part of the
page we want to invalidate.
This problem is reproducible by the following fio receipt (however I was
only able to reproduce it with fio-2.1 or older.
[global]
bs=8k
iodepth=1024
iodepth_batch=60
randrepeat=1
size=1m
directory=/mnt/test
numjobs=20
[job1]
ioengine=sync
bs=1k
direct=1
rw=randread
filename=file1:file2
[job2]
ioengine=libaio
rw=randwrite
direct=1
filename=file1:file2
[job3]
bs=1k
ioengine=posixaio
rw=randwrite
direct=1
filename=file1:file2
[job5]
bs=1k
ioengine=sync
rw=randread
filename=file1:file2
[job7]
ioengine=libaio
rw=randwrite
filename=file1:file2
[job8]
ioengine=posixaio
rw=randwrite
filename=file1:file2
[job10]
ioengine=mmap
rw=randwrite
bs=1k
filename=file1:file2
[job11]
ioengine=mmap
rw=randwrite
direct=1
filename=file1:file2
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Switch ext4 to using sb_getblk_gfp with GFP_NOFS added to fix possible
deadlocks in the page writeback path.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org