IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 613c5a85898d1cd44e68f28d65eccf64a8ace9cf upstream.
Currently the Orlov inode allocator searches for free inodes for a
directory only in flex block groups with at most inodes_per_group/16
more directory inodes than average per flex block group. However with
growing size of flex block group this becomes unnecessarily strict.
Scale allowed difference from average directory count per flex block
group with flex block group size as we do with other metrics.
Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908092136.11770-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c89849cc0259f3d33624cc3bd127685c3c0fa25d upstream.
The avefreec should be average free clusters instead
of average free blocks, otherwize Orlov's allocator
will not work properly when bigalloc enabled.
Cc: stable@kernel.org
Signed-off-by: Pan Dong <pandong.peter@bytedance.com>
Link: https://lore.kernel.org/r/20210525073656.31594-1-pandong.peter@bytedance.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a149d2a5cabbf6507a7832a1c4fd2593c55fd450 upstream.
Commit <50122847007> ("ext4: fix check to prevent initializing reserved
inodes") check the block group zero and prevent initializing reserved
inodes. But in some special cases, the reserved inode may not all belong
to the group zero, it may exist into the second group if we format
filesystem below.
mkfs.ext4 -b 4096 -g 8192 -N 1024 -I 4096 /dev/sda
So, it will end up triggering a false positive report of a corrupted
file system. This patch fix it by avoid check reserved inodes if no free
inode blocks will be zeroed.
Cc: stable@kernel.org
Fixes: 50122847007 ("ext4: fix check to prevent initializing reserved inodes")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210331121516.2243099-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8a363970d1dc38c4ec4ad575c862f776f468d057 upstream.
If we receive a file handle, either from NFS or open_by_handle_at(2),
and it points at an inode which has not been initialized, and the file
system has metadata checksums enabled, we shouldn't try to get the
inode, discover the checksum is invalid, and then declare the file
system as being inconsistent.
This can be reproduced by creating a test file system via "mke2fs -t
ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
directory, and then running the following program.
#define _GNU_SOURCE
#include <fcntl.h>
struct handle {
struct file_handle fh;
unsigned char fid[MAX_HANDLE_SZ];
};
int main(int argc, char **argv)
{
struct handle h = {{8, 1 }, { 12, }};
open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
return 0;
}
Google-Bug-Id: 120690101
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Ashwin H <ashwinh@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c990728b99ed6fbe9c75fc202fce1172d9916da upstream.
During an online resize an array of s_flex_groups structures gets replaced
so it can get enlarged. If there is a concurrent access to the array and
this memory has been reused then this can lead to an invalid memory access.
The s_flex_group array has been converted into an array of pointers rather
than an array of structures. This is to ensure that the information
contained in the structures cannot get out of sync during a resize due to
an accessor updating the value in the old structure after it has been
copied but before the array pointer is updated. Since the structures them-
selves are no longer copied but only the pointers to them this case is
mitigated.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 4.4.x
Cc: stable@kernel.org # 4.9.x
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5012284700775a4e6e3fbe7eac4c543c4874b559 upstream.
Commit 8844618d8aa7: "ext4: only look at the bg_flags field if it is
valid" will complain if block group zero does not have the
EXT4_BG_INODE_ZEROED flag set. Unfortunately, this is not correct,
since a freshly created file system has this flag cleared. It gets
almost immediately after the file system is mounted read-write --- but
the following somewhat unlikely sequence will end up triggering a
false positive report of a corrupted file system:
mkfs.ext4 /dev/vdc
mount -o ro /dev/vdc /vdc
mount -o remount,rw /dev/vdc
Instead, when initializing the inode table for block group zero, test
to make sure that itable_unused count is not too large, since that is
the case that will result in some or all of the reserved inodes
getting cleared.
This fixes the failures reported by Eric Whiteney when running
generic/230 and generic/231 in the the nojournal test case.
Fixes: 8844618d8aa7 ("ext4: only look at the bg_flags field if it is valid")
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8d5a803c6a6ce4ec258e31f76059ea5153ba46ef upstream.
With commit 044e6e3d74a3: "ext4: don't update checksum of new
initialized bitmaps" the buffer valid bit will get set without
actually setting up the checksum for the allocation bitmap, since the
checksum will get calculated once we actually allocate an inode or
block.
If we are doing this, then we need to (re-)check the verified bit
after we take the block group lock. Otherwise, we could race with
another process reading and verifying the bitmap, which would then
complain about the checksum being invalid.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8844618d8aa7a9973e7b527d038a2a589665002c upstream.
The bg_flags field in the block group descripts is only valid if the
uninit_bg or metadata_csum feature is enabled. We were not
consistently looking at this field; fix this.
Also block group #0 must never have uninitialized allocation bitmaps,
or need to be zeroed, since that's where the root inode, and other
special inodes are set up. Check for these conditions and mark the
file system as corrupted if they are detected.
This addresses CVE-2018-10876.
https://bugzilla.kernel.org/show_bug.cgi?id=199403
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7dac4a1726a9c64a517d595c40e95e2d0d135f6f upstream.
An privileged attacker can cause a crash by mounting a crafted ext4
image which triggers a out-of-bounds read in the function
ext4_valid_block_bitmap() in fs/ext4/balloc.c.
This issue has been assigned CVE-2018-1093.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199181
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1560782
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 044e6e3d74a3d7103a0c8a9305dfd94d64000660 upstream.
When reading the inode or block allocation bitmap, if the bitmap needs
to be initialized, do not update the checksum in the block group
descriptor. That's because we're not set up to journal those changes.
Instead, just set the verified bit on the bitmap block, so that it's
not necessary to validate the checksum.
When a block or inode allocation actually happens, at that point the
checksum will be calculated, and update of the bg descriptor block
will be properly journalled.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 54475f531bb8d7078f63c159e5e0615d486c498c ]
As part of an effort to clean up fscrypt-related error codes, make
attempting to create a file in an encrypted directory that hasn't been
"unlocked" fail with ENOKEY. Previously, several error codes were used
for this case, including ENOENT, EACCES, and EPERM, and they were not
consistent between and within filesystems. ENOKEY is a better choice
because it expresses that the failure is due to lacking the encryption
key. It also matches the error code returned when trying to open an
encrypted regular file without the key.
I am not aware of any users who might be relying on the previous
inconsistent error codes, which were never documented anywhere.
This failure case will be exercised by an xfstest.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use the ext4_{has,set,clear}_feature_* helpers to replace the old
feature helpers.
Signed-off-by: Kaho Ng <ngkaho1234@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
encryption code and switching things over to using the copies in
fs/crypto. I've updated the MAINTAINERS file to add an entry for
fs/crypto listing Jaeguk Kim and myself as the maintainers.
There are also a number of bug fixes, most notably for some problems
found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum. Also
fixed is a writeback deadlock detected by generic/130, and some
potential races in the metadata checksum code.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXlbP9AAoJEPL5WVaVDYGjGxgIAJ9YIqme//yix63oHYLhDNea
lY/TLqZrb9/TdDRvGyZa3jYaKaIejL53eEQS9nhEB/JI0sEiDpHmOrDOxdj8Hlsw
fm7nJyh1u4vFKPyklCbIvLAje1vl8X/6OvqQiwh45gIxbbsFftaBWtccW+UtEkIP
Fx65Vk7RehJ/sNrM0cRrwB79YAmDS8P6BPyzdMRk+vO/uFqyq7Auc+pkd+bTlw/m
TDAEIunlk0Ovjx75ru1zaemL1JJx5ffehrJmGCcSUPHVbMObOEKIrlV50gAAKVhO
qbZAri3mhDvyspSLuS/73L9skeCiWFLhvojCBGu4t2aa3JJolmItO7IpKi4HdRU=
=bxGK
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"The major change this cycle is deleting ext4's copy of the file system
encryption code and switching things over to using the copies in
fs/crypto. I've updated the MAINTAINERS file to add an entry for
fs/crypto listing Jaeguk Kim and myself as the maintainers.
There are also a number of bug fixes, most notably for some problems
found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum. Also
fixed is a writeback deadlock detected by generic/130, and some
potential races in the metadata checksum code"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
ext4: verify extent header depth
ext4: short-cut orphan cleanup on error
ext4: fix reference counting bug on block allocation error
MAINTAINRES: fs-crypto maintainers update
ext4 crypto: migrate into vfs's crypto engine
ext2: fix filesystem deadlock while reading corrupted xattr block
ext4: fix project quota accounting without quota limits enabled
ext4: validate s_reserved_gdt_blocks on mount
ext4: remove unused page_idx
ext4: don't call ext4_should_journal_data() on the journal inode
ext4: Fix WARN_ON_ONCE in ext4_commit_super()
ext4: fix deadlock during page writeback
ext4: correct error value of function verifying dx checksum
ext4: avoid modifying checksum fields directly during checksum verification
ext4: check for extents that wrap around
jbd2: make journal y2038 safe
jbd2: track more dependencies on transaction commit
jbd2: move lockdep tracking to journal_s
jbd2: move lockdep instrumentation for jbd2 handles
ext4: respect the nobarrier mount option in nojournal mode
...
This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This has submit_bh users pass in the operation and flags separately,
so submit_bh_wbc can setup the bio op and bi_rw flags on the bio that
is submitted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of just printing warning messages, if the orphan list is
corrupted, declare the file system is corrupted. If there are any
reserved inodes in the orphaned inode list, declare the file system
corrupted and stop right away to avoid doing more potential damage to
the file system.
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If the orphaned inode list contains inode #5, ext4_iget() returns a
bad inode (since the bootloader inode should never be referenced
directly). Because of the bad inode, we end up processing the inode
repeatedly and this hangs the machine.
This can be reproduced via:
mke2fs -t ext4 /tmp/foo.img 100
debugfs -w -R "ssv last_orphan 5" /tmp/foo.img
mount -o loop /tmp/foo.img /mnt
(But don't do this if you are using an unpatched kernel if you care
about the system staying functional. :-)
This bug was found by the port of American Fuzzy Lop into the kernel
to find file system problems[1]. (Since it *only* happens if inode #5
shows up on the orphan list --- 3, 7, 8, etc. won't do it, it's not
surprising that AFL needed two hours before it found it.)
[1] http://events.linuxfoundation.org/sites/events/files/slides/AFL%20filesystem%20fuzzing%2C%20Vault%202016_0.pdf
Cc: stable@vger.kernel.org
Reported by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When block group checksum is wrong, we call ext4_error() while holding
group spinlock from ext4_init_block_bitmap() or
ext4_init_inode_bitmap() which results in scheduling while in atomic.
Fix the issue by calling ext4_error() later after dropping the spinlock.
CC: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Make the bitmap reaading routines return real error codes (EIO,
EFSCORRUPTED, EFSBADCRC) which can then be reflected back to
userspace for more precise diagnosis work.
In particular, this means that mballoc no longer claims that we're out
of memory if the block bitmaps become corrupt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros. Furthermore, clean out the
places where we open-coded feature tests.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of overloading EIO for CRC errors and corrupt structures,
return the same error codes that XFS returns for the same issues.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Factor out calls to ext4_inherit_context() and move them to
__ext4_new_inode(); this fixes a problem where ext4_tmpfile() wasn't
calling calling ext4_inherit_context(), so the temporary file wasn't
getting protected. Since the blocks for the tmpfile could end up on
disk, they really should be protected if the tmpfile is created within
the context of an encrypted directory.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The superblock fields s_file_encryption_mode and s_dir_encryption_mode
are vestigal, so remove them as a cleanup. While we're at it, allow
file systems with both encryption and inline_data enabled at the same
time to work correctly. We can't have encrypted inodes with inline
data, but there's no reason to prohibit unencrypted inodes from using
the inline data feature.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
Also add the test dummy encryption mode flag so we can more easily
test the encryption patches using xfstests.
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
that's the bulk of filesystem drivers dealing with inodes of their own
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pulls block_write_begin() into fs/ext4/inode.c because it might need
to do a low-level read of the existing data, in which case we need to
decrypt it.
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove unused header files and header files which are included in
ext4.h.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we fail to load block bitmap in __ext4_new_inode() we will
dereference NULL pointer in ext4_journal_get_write_access(). So check
for error from ext4_read_block_bitmap().
Coverity-id: 989065
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Besides the fact that this replacement improves code readability
it also protects from errors caused direct EXT4_S(sb)->s_es manipulation
which may result attempt to use uninitialized csum machinery.
#Testcase_BEGIN
IMG=/dev/ram0
MNT=/mnt
mkfs.ext4 $IMG
mount $IMG $MNT
#Enable feature directly on disk, on mounted fs
tune2fs -O metadata_csum $IMG
# Provoke metadata update, likey result in OOPS
touch $MNT/test
umount $MNT
#Testcase_END
# Replacement script
@@
expression E;
@@
- EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
+ ext4_has_metadata_csum(E)
https://bugzilla.kernel.org/show_bug.cgi?id=82201
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
The first time that we allocate from an uninitialized inode allocation
bitmap, if the block allocation bitmap is also uninitalized, we need
to get write access to the block group descriptor before we start
modifying the block group descriptor flags and updating the free block
count, etc. Otherwise, there is the potential of a bad journal
checksum (if journal checksums are enabled), and of the file system
becoming inconsistent if we crash at exactly the wrong time.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
We should decrement free clusters counter when block bitmap is marked
as corrupt and free inodes counter when the allocation bitmap is
marked as corrupt to avoid misunderstanding due to incorrect available
size in statfs result. User can get immediately ENOSPC error from
write begin without reaching for the writepages.
Cc: Darrick J. Wong<darrick.wong@oracle.com>
Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Many of the uses of get_random_bytes() do not actually need
cryptographically secure random numbers. Replace those uses with a
call to prandom_u32(), which is faster and which doesn't consume
entropy from the /dev/random driver.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If the group descriptor fails validation, mark the whole blockgroup
corrupt so that the inode/block allocators skip this group. The
previous approach takes the risk of writing to a damaged group
descriptor; hopefully it was never the case that the [ib]bitmap fields
pointed to another valid block and got dirtied, since the memset would
fill the page with 1s.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we detect either a discrepancy between the inode bitmap and the
inode counts or the inode bitmap fails to pass validation checks, mark
the block group corrupt and refuse to allocate or deallocate inodes
from the group.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In no journal mode, if an inode has recently been deleted, we
shouldn't reuse it right away. Otherwise it's possible, after an
unclean shutdown, to hit a situation where a recently deleted inode
gets reused for some other purpose before the inode table block has
been written to disk. However, if the directory entry has been
updated, then the directory entry will be pointing at the old inode
contents.
E2fsck will make sure the file system is consistent after the
unclean shutdown. However, if the recently deleted inode is a
character mode device, or an inode with the immutable bit set, even
after the file system has been fixed up by e2fsck, it can be
possible for a *.pyc file to be pointing at a character mode
device, and when python tries to open the *.pyc file, Hilarity
Ensues. We could change all of userspace to be very suspicious
about stat'ing files before opening them, and clearing the
immutable flag if necessary --- or we can just avoid reusing an
inode number if it has been recently deleted.
Google-Bug-Id: 10017573
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When we try to allocate an inode, and there is a race between two
CPU's trying to grab the same inode, _and_ this inode is the last free
inode in the block group, make sure the group number is bumped before
we continue searching the rest of the block groups. Otherwise, we end
up searching the current block group twice, and we end up skipping
searching the last block group. So in the unlikely situation where
almost all of the inodes are allocated, it's possible that we will
return ENOSPC even though there might be free inodes in that last
block group.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
As Dave Chinner pointed out at the 2013 LSF/MM workshop, it's
important that metadata I/O requests are marked as such to avoid
priority inversions caused by I/O bandwidth throttling.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Inode allocation transaction is pretty heavy (246 credits with quotas
and extents before previous patch, still around 200 after it). This is
mostly due to credits required for allocation of quota structures
(credits there are heavily overestimated but it's difficult to make
better estimates if we don't want to wire non-trivial assumptions about
quota format into filesystem).
So move quota initialization out of allocation transaction. That way
transaction for quota structure allocation will be started only if we
need to look up quota structure on disk (rare) and furthermore it will
be started for each quota type separately, not for all of them at once.
This reduces maximum transaction size to 34 is most cases and to 73 in
the worst case.
[ Modified by tytso to clean up the cleanup paths for error handling.
Also use a separate call to ext4_std_error() for each failure so it
is easier for someone who is debugging a problem in this function to
determine which function call failed. ]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
A user who was using a 8TB+ file system and with a very large flexbg
size (> 65536) could cause the atomic_t used in the struct flex_groups
to overflow. This was detected by PaX security patchset:
http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551
This bug was introduced in commit 9f24e4208f7e, so it's been around
since 2.6.30. :-(
Fix this by using an atomic64_t for struct orlav_stats's
free_clusters.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
Some messages printed related to a WARN_ON(1) were printed using
KERN_NOTICE. Use KERN_WARNING or ext4_warning() instead so that
context related to the WARN_ON() is printed at the same printk warning
level (and log files, etc.)
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle
until the inode has been succesfully allocated. In order to do this,
we need to start the handle in the ext4_new_inode(). So create a new
variant of this function, ext4_new_inode_start_handle(), so the handle
can be created at the last possible minute, before we need to modify
the inode allocation bitmap block.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.
The recommended way for finding the longer-running handles is:
T=/sys/kernel/debug/tracing
EVENT=$T/events/jbd2/jbd2_handle_stats
echo "interval > 5" > $EVENT/filter
echo 1 > $EVENT/enable
./run-my-fs-benchmark
cat $T/trace > /tmp/problem-handles
This will list handles that were active for longer than 20ms. Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation. Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:
postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
dirtied_blocks 0
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>