commit effa24f689ce0948f68c754991a445a8d697d3a8 upstream.
extent_write_cache_pages stops writing pages as soon as nr_to_write hits
zero. That is the right thing for opportunistic writeback, but incorrect
for data integrity writeback, which needs to ensure that no dirty pages
are left in the range. Thus only stop the writeback for WB_SYNC_NONE
if nr_to_write hits 0.
This is a port of write_cache_pages changes in commit 05fe478dd04e
("mm: write_cache_pages integrity fix").
Note that I've only triggered the problem with other changes to the btrfs
writeback code, but this condition seems worthwhile fixing anyway.
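As a rough illustration of the stop condition described above, here is a minimal user-space sketch. The enum and the function name `sketch_wb_should_stop` are hypothetical stand-ins and are not taken from the btrfs code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's writeback sync modes. */
enum sketch_sync_mode { SKETCH_WB_SYNC_NONE, SKETCH_WB_SYNC_ALL };

/*
 * Returns true when the page-writing loop may stop early.
 * Opportunistic writeback (SYNC_NONE) stops once the budget is used up;
 * integrity writeback (SYNC_ALL) must keep going until the range is clean.
 */
bool sketch_wb_should_stop(enum sketch_sync_mode mode, long nr_to_write)
{
    return nr_to_write <= 0 && mode == SKETCH_WB_SYNC_NONE;
}
```

With this shape, an exhausted budget ends the loop only for opportunistic writeback.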
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f8654743a0e6909dc634cbfad6db6816f10f3399 upstream.
During unmount process of nilfs2, nothing holds nilfs_root structure after
nilfs2 detaches its writer in nilfs_detach_log_writer(). Previously,
nilfs_evict_inode() could cause use-after-free read for nilfs_root if
inodes are left in "garbage_list" and released by nilfs_dispose_list at
the end of nilfs_detach_log_writer(), and this bug was fixed by commit
9b5a04ac3ad9 ("nilfs2: fix use-after-free bug of nilfs_root in
nilfs_evict_inode()").
However, it turned out that there is another possibility of UAF in the
call path where mark_inode_dirty_sync() is called from iput():
nilfs_detach_log_writer()
  nilfs_dispose_list()
    iput()
      mark_inode_dirty_sync()
        __mark_inode_dirty()
          nilfs_dirty_inode()
            __nilfs_mark_inode_dirty()
              nilfs_load_inode_block() --> causes UAF of nilfs_root struct
This can happen after commit 0ae45f63d4ef ("vfs: add support for a
lazytime mount option"), which changed iput() to call
mark_inode_dirty_sync() on its final reference if i_state has I_DIRTY_TIME
flag and i_nlink is non-zero.
This issue appears after commit 28a65b49eb53 ("nilfs2: do not write dirty
data after degenerating to read-only") when using the syzbot reproducer,
but the issue has potentially existed before.
Fix this issue by adding a "purging flag" to the nilfs structure, setting
that flag while disposing the "garbage_list" and checking it in
__nilfs_mark_inode_dirty().
Unlike commit 9b5a04ac3ad9 ("nilfs2: fix use-after-free bug of nilfs_root
in nilfs_evict_inode()"), this patch does not rely on ns_writer to
determine whether to skip operations, so as not to break recovery on
mount. The nilfs_salvage_orphan_logs routine dirties the buffer of
salvaged data before attaching the log writer, so changing
__nilfs_mark_inode_dirty() to skip the operation when ns_writer is NULL
will cause recovery write to fail. The purpose of using the cleanup-only
flag is to allow for narrowing of such conditions.
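The idea of a cleanup-only flag can be sketched in user space as follows. All names here (`sketch_fs`, `sketch_mark_inode_dirty`, `sketch_dispose_list`) are hypothetical and the model is heavily simplified; it only shows the flag being set around the disposal loop and checked in the dirtying path.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of the file system state. */
struct sketch_fs {
    bool purging;     /* set only while the garbage list is being disposed */
    const int *root;  /* stands in for nilfs_root; NULL once released */
    int dirtied;      /* counts inodes actually marked dirty */
};

/* Models the dirtying path: bail out early while purging. */
int sketch_mark_inode_dirty(struct sketch_fs *fs)
{
    if (fs->purging)
        return 0;     /* skip: the root may already be gone */
    if (fs->root == NULL)
        return -1;    /* the real code would touch freed memory here */
    fs->dirtied++;
    return 0;
}

/* Models disposing the garbage list, where iput() may dirty the inode. */
int sketch_dispose_list(struct sketch_fs *fs, int ninodes)
{
    int err = 0;

    fs->purging = true;
    for (int i = 0; i < ninodes; i++)
        err |= sketch_mark_inode_dirty(fs);
    fs->purging = false;
    return err;
}
```

Because the flag is set only during disposal, other callers that run without a log writer (such as recovery on mount) are not affected by the early return.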
Link: https://lkml.kernel.org/r/20230728191318.33047-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+74db8b3087f293d3a13a@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000b4e906060113fd63@google.com
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org> # 4.0+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 404615d7f1dcd4cca200e9a7a9df3a1dcae1dd62 upstream.
Ext2 has fields in the superblock reserved for subblock allocation support.
However, that never landed. Drop the code, which has been dead for many years.
Reported-by: syzbot+af5e10f73dbff48f70af@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ea2b62f305893992156a798f665847e0663c9f41 upstream.
sb_getblk(inode->i_sb, parent) can return a null pointer, and taking a lock
on it leads to a null-ptr-deref bug.
Reported-by: syzbot+aad58150cbc64ba41bdc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=aad58150cbc64ba41bdc
Signed-off-by: Prince Kumar Maurya <princekumarmaurya06@gmail.com>
Message-Id: <20230531013141.19487-1-princekumarmaurya06@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b28ff3a7d7e97456fd86b68d24caa32e1cfa7064 upstream.
btrfs_attach_transaction_barrier() is used to get a handle pointing to the
current running transaction if the transaction has not started its commit
yet (its state is < TRANS_STATE_COMMIT_START). If the transaction commit
has started, then we wait for the transaction to commit and finish before
returning - however we completely ignore if the transaction was aborted
due to some error during its commit, we simply return ERR_PTR(-ENOENT),
which makes the caller assume everything is fine and no errors happened.
This could make an fsync return success (0) to user space when in fact we
had a transaction abort and the target inode changes were therefore not
persisted.
Fix this by checking for the return value from btrfs_wait_for_commit(),
and if it returned an error, return it back to the caller.
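A minimal user-space sketch of this error propagation is shown below. The helpers imitate the kernel's ERR_PTR()/PTR_ERR() convention; `sketch_attach_barrier` and its two integer parameters are hypothetical simplifications, not the btrfs function signature.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space imitations of the kernel's error-pointer helpers. */
void *sketch_err_ptr(long err) { return (void *)(intptr_t)err; }
long sketch_ptr_err(const void *p) { return (long)(intptr_t)p; }
bool sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-4095;
}

static int sketch_handle; /* dummy transaction handle */

/*
 * commit_started: non-zero if the running transaction already began its commit.
 * wait_result:    hypothetical result of waiting for that commit (0 or -errno).
 */
void *sketch_attach_barrier(int commit_started, int wait_result)
{
    if (!commit_started)
        return &sketch_handle;              /* join the running transaction */
    if (wait_result)
        return sketch_err_ptr(wait_result); /* propagate the abort error */
    return sketch_err_ptr(-ENOENT);         /* committed fine, nothing to join */
}
```

A caller can then treat -ENOENT as "nothing to do" while still seeing a real error such as -EIO.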
Fixes: d4edf39bd5db ("Btrfs: fix uncompleted transaction")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c4d13222afd8a64bf11bc7ec68645496ee8b54b9 ]
freeze_bdev() can fail for many reasons, so its return value needs to be
checked before proceeding.
Fixes: 783d94854499 ("ext4: add EXT4_IOC_GOINGDOWN ioctl")
Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230606073203.1310389-1-chao@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 59e45c758ca1b9893ac923dd63536da946ac333b ]
If a posix lock request is waiting for a result from user space
(dlm_controld), do not let it be interrupted unless the process
is killed. This reverts commit a6b1533e9a57 ("dlm: make posix locks
interruptible"). The problem with the interruptible change is
that all locks were cleared on any signal interrupt. If a signal
was received that did not terminate the process, the process
could continue running after all its dlm posix locks had been
cleared. A future patch will add cancelation to allow proper
interruption.
Cc: stable@vger.kernel.org
Fixes: a6b1533e9a57 ("dlm: make posix locks interruptible")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a800ba77fd285c6391a82819867ac64e9ab3af46 ]
This patch moves the return of FILE_LOCK_DEFERRED a little bit earlier
than checking afterwards again if the request was an asynchronous request.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Stable-dep-of: 59e45c758ca1 ("fs: dlm: interrupt posix locks only when process is killed")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bcbb4ba6c9ba81e6975b642a2cade68044cd8a66 ]
Lately, the casts between plock_op and plock_xop, and the list holders
involved, have caused issues that were hard to spot.
This patch removes the "plock_xop" structure and introduces a
"struct plock_async_data". This structure will be set in "struct plock_op"
in case of asynchronous lock handling as the original "plock_xop" was
made for. There is no need anymore to cast pointers around for
additional fields in case of asynchronous lock handling. The disadvantage
is that another allocation is introduced, but it is only needed in the
asynchronous case, which is currently only used in combination with nfs lockd.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Stable-dep-of: 59e45c758ca1 ("fs: dlm: interrupt posix locks only when process is killed")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 26fb5290240dc31cae99b8b4dd2af7f46dfcba6b ]
The following process makes ext4 load stale buffer heads from the last
failed mount in a new mount operation:
mount_bdev
ext4_fill_super
| ext4_load_and_init_journal
| ext4_load_journal
| jbd2_journal_load
| load_superblock
| journal_get_superblock
| set_buffer_verified(bh) // buffer head is verified
| jbd2_journal_recover // failed caused by EIO
| goto failed_mount3a // skip 'sb->s_root' initialization
deactivate_locked_super
kill_block_super
generic_shutdown_super
if (sb->s_root)
// false, skip ext4_put_super->invalidate_bdev->
// invalidate_mapping_pages->mapping_evict_folio->
// filemap_release_folio->try_to_free_buffers, which
// cannot drop buffer head.
blkdev_put
blkdev_put_whole
if (atomic_dec_and_test(&bdev->bd_openers))
// false, systemd-udev happens to open the device. Then
// blkdev_flush_mapping->kill_bdev->truncate_inode_pages->
// truncate_inode_folio->truncate_cleanup_folio->
// folio_invalidate->block_invalidate_folio->
// filemap_release_folio->try_to_free_buffers will be skipped,
// dropping buffer head is missed again.
Second mount:
ext4_fill_super
ext4_load_and_init_journal
ext4_load_journal
ext4_get_journal
jbd2_journal_init_inode
journal_init_common
bh = getblk_unmovable
bh = __find_get_block // Found stale bh in last failed mounting
journal->j_sb_buffer = bh
jbd2_journal_load
load_superblock
journal_get_superblock
if (buffer_verified(bh))
// true, skip journal->j_format_version = 2, value is 0
jbd2_journal_recover
do_one_pass
next_log_block += count_tags(journal, bh)
// According to journal_tag_bytes(), 'tag_bytes' calculating is
// affected by jbd2_has_feature_csum3(), jbd2_has_feature_csum3()
// returns false because 'j->j_format_version >= 2' is not true,
// then we get wrong next_log_block. The do_one_pass may exit
// early when it encounters a non-JBD2_MAGIC_NUMBER value in 'next_log_block'.
The filesystem is corrupted here: the journal is only partially replayed,
and the new journal sequence number has already been used by the last mount.
invalidate_bdev() can drop all buffer heads even when racing with a bare
read of the block device (e.g. by systemd-udev), so we can fix it by
invalidating the bdev in the error handling path in __ext4_fill_super().
A reproducer is available at [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217171
Fixes: 25ed6e8a54df ("jbd2: enable journal clients to enable v2 checksumming")
Cc: stable@vger.kernel.org # v3.5
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230315013128.3911115-2-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ede600e497b1461d06d22a7d17703d9096868bc3 ]
At split_node(), if we fail to log the tree mod log copy operation, we
return without unlocking the split extent buffer we just allocated and
without decrementing the reference we own on it. Fix this by unlocking
it and decrementing the ref count before returning.
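The cleanup-on-error pattern described above can be sketched in user space like this. The structure and function names are hypothetical; a boolean and a counter stand in for the extent buffer lock and reference count.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an extent buffer. */
struct sketch_buf {
    bool locked;
    int refs;
};

/* Simulates logging the copy operation; fails on request. */
int sketch_log_copy(bool fail)
{
    return fail ? -ENOMEM : 0;
}

/* Allocates (here: initializes) the split buffer and logs the copy. */
int sketch_split(struct sketch_buf *split, bool log_fails)
{
    int ret;

    split->locked = true; /* freshly allocated buffers come back locked */
    split->refs = 1;      /* and with one reference owned by us */

    ret = sketch_log_copy(log_fails);
    if (ret) {
        split->locked = false; /* unlock before returning */
        split->refs--;         /* drop the reference we own */
        return ret;
    }
    return 0;
}
```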
Fixes: 5de865eebb83 ("Btrfs: fix tree mod logging")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 6909cf5c4101214f4305a62d582a5b93c7e1eb9a upstream.
When run on a file system where the inline_data feature has been
enabled, xfstests generic/269, generic/270, and generic/476 cause ext4
to emit error messages indicating that inline directory entries are
corrupted. This occurs because the inline offset used to locate
inline directory entries in the inode body is not updated when an
xattr in that shared region is deleted and the region is shifted in
memory to recover the space it occupied. If the deleted xattr precedes
the system.data attribute, which points to the inline directory entries,
that attribute will be moved further up in the region. The inline
offset continues to point to whatever is located in system.data's former
location, with unfortunate effects when used to access directory entries
or (presumably) inline data in the inode body.
Cc: stable@kernel.org
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20230522181520.1570360-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a9d1c4c6df0e568207907c04aed9e7beb1294c42 upstream.
If the LOOKUP request triggered from fuse_dentry_revalidate() is
interrupted, then the dentry will be invalidated, possibly resulting in
submounts being unmounted.
Reported-by: Xu Rongbo <xurongbo@baidu.com>
Closes: https://lore.kernel.org/all/CAJfpegswN_CJJ6C3RZiaK6rpFmNyWmXfaEpnQUJ42KCwNF5tWw@mail.gmail.com/
Fixes: 9e6268db496a ("[PATCH] FUSE - read-write operations")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 257e6172ab36ebbe295a6c9ee9a9dd0fe54c1dc2 upstream.
If a client sends out a cap update dropping caps with the prior 'seq'
just before an incoming cap revoke request, then the client may drop
the revoke because it believes it's already released the requested
capabilities.
This causes the MDS to wait indefinitely for the client to respond
to the revoke. It's therefore always a good idea to ack the cap
revoke request with the bumped up 'seq'.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/61782
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 92655fbda5c05950a411eaabc19e025e86e2a291 upstream.
The GETLK pid values have all been negated since commit 9d5b86ac13c5
("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks").
Revert this for local pids, and leave in place negative pids for remote
owners.
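The sign convention can be illustrated with a small sketch. The function name and the boolean parameter are hypothetical; how the real code decides whether an owner is remote is not shown here.

```c
#include <stdbool.h>

/*
 * Returns the pid value to report for a conflicting lock:
 * positive for a local owner, negated for a remote owner.
 */
int sketch_getlk_pid(int owner_pid, bool remote_owner)
{
    return remote_owner ? -owner_pid : owner_pid;
}
```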
Cc: stable@vger.kernel.org
Fixes: 9d5b86ac13c5 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 11509910c599cbd04585ec35a6d5e1a0053d84c1 upstream.
In jfs_dmap.c at line 381, BLKTODMAP is used to get a logical block
number inside dbFree(). db_l2nbperpage, which is the log2 number of
blocks per page, is passed as an argument to BLKTODMAP which uses it
for shifting.
Syzbot reported a shift out-of-bounds crash because db_l2nbperpage is
too big. This happens because the large value is set without any
validation in dbMount() at line 181.
Thus, make sure that db_l2nbperpage is correct while mounting.
Max number of blocks per page = Page size / Min block size
=> log2(Max num_block per page) = log2(Page size / Min block size)
= log2(Page size) - log2(Min block size)
=> Max db_l2nbperpage = L2PSIZE - L2MINBLOCKSIZE
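The bound derived above can be sketched as a small range check. The two constants are assumptions for illustration (a 4 KiB page and a 512-byte minimum block size), and the function name is hypothetical.

```c
#include <stdbool.h>

/* Assumed values for illustration: log2(4096) and log2(512). */
#define SKETCH_L2PSIZE         12
#define SKETCH_L2MINBLOCKSIZE   9

/* True if the on-disk log2(blocks per page) value is safe to shift by. */
bool sketch_l2nbperpage_ok(int l2nbperpage)
{
    return l2nbperpage >= 0 &&
           l2nbperpage <= SKETCH_L2PSIZE - SKETCH_L2MINBLOCKSIZE;
}
```

Under these assumed constants the largest accepted value is 3, i.e. at most 8 blocks per page.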
Reported-and-tested-by: syzbot+d2cd27dcf8e04b232eb2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=2a70a453331db32ed491f5cbb07e81bf2d225715
Cc: stable@vger.kernel.org
Suggested-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit de25d6e9610a8b30cce9bbb19b50615d02ebca02 upstream.
In our fault injection test, we create an ext4 file, migrate it to
non-extent based file, then punch a hole and finally trigger a WARN_ON
in the ext4_da_update_reserve_space():
EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
ino 14, used 11 with only 10 reserved data blocks
When writing back a non-extent based file, if we enable delalloc, the
number of reserved blocks will be subtracted from the number of blocks
mapped by ext4_ind_map_blocks(), and the extent status tree will be
updated. We update the extent status tree by first removing the old
extent_status and then inserting the new extent_status. If the block range
we remove happens to be in an extent, then we need to allocate another
extent_status with ext4_es_alloc_extent().
   use old    to remove   to add new
|----------|------------|------------|
          old extent_status
The problem is that the allocation of a new extent_status failed due to a
fault injection, and __es_shrink() did not get free memory, resulting in
a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
we map to the same extent again, and the number of reserved blocks is again
subtracted from the number of blocks in that extent. Since the blocks in
the same extent are subtracted twice, we end up triggering WARN_ON at
ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
For non-extent based file, we update the number of reserved blocks after
ext4_ind_map_blocks() is executed, which causes a problem that when we call
ext4_ind_map_blocks() to create a block, it doesn't always create a block,
but we always reduce the number of reserved blocks. So we move the logic
for updating reserved blocks to ext4_ind_map_blocks() to ensure that the
number of reserved blocks is updated only after we do succeed in allocating
some new blocks.
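Why tying the accounting to actual allocation makes a retry harmless can be sketched in user space as follows. The structure, the fixed block count and the function name are hypothetical; the model only tracks which blocks are mapped and how many are still reserved.

```c
#include <stdbool.h>

#define SKETCH_NBLOCKS 16

/* Hypothetical model of a file with delayed-allocation reservations. */
struct sketch_file {
    bool mapped[SKETCH_NBLOCKS];
    int reserved;
};

/*
 * Maps blocks [start, start + len), assuming start >= 0.
 * Only blocks that were not mapped before consume a reservation,
 * so mapping the same range again changes nothing.
 * Returns the number of newly allocated blocks.
 */
int sketch_map_blocks(struct sketch_file *f, int start, int len)
{
    int allocated = 0;

    for (int i = start; i < start + len && i < SKETCH_NBLOCKS; i++) {
        if (!f->mapped[i]) {
            f->mapped[i] = true;
            allocated++;
        }
    }
    f->reserved -= allocated;
    return allocated;
}
```

In this model a second pass over the same range, as happens when writeback retries after -ENOMEM, returns 0 and leaves the reserved count untouched.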
Fixes: 5f634d064c70 ("ext4: Fix quota accounting error with fallocate")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230424033846.4732-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1168f095417643f663caa341211e117db552989f upstream.
Use kcalloc() for allocation/flush of 128 pointers table to
reduce stack usage.
Function now returns -ENOMEM or 0 on success.
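The change in shape, from an on-stack pointer table to a heap allocation with an -ENOMEM return, can be sketched in user space as follows. calloc() stands in for kcalloc(), and the allocator parameter exists only so the failure path can be exercised; all names are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

#define SKETCH_TABLE_SIZE 128

/* Allocator signature matching calloc(), so a failing one can be injected. */
typedef void *(*sketch_alloc_fn)(size_t nmemb, size_t size);

/* Always fails; used to exercise the error path. */
void *sketch_failing_alloc(size_t nmemb, size_t size)
{
    (void)nmemb;
    (void)size;
    return NULL;
}

/* Builds and tears down a zeroed table of 128 pointers on the heap. */
int sketch_build_table(sketch_alloc_fn alloc)
{
    void **table = alloc(SKETCH_TABLE_SIZE, sizeof(*table));

    if (!table)
        return -ENOMEM;

    /* ... the table would be filled and flushed here ... */

    free(table);
    return 0;
}
```

Passing `calloc` gives the normal path, and passing `sketch_failing_alloc` shows the new error return.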
stackusage
Before:
./fs/jffs2/xattr.c:775 jffs2_build_xattr_subsystem 1208 dynamic,bounded
After:
./fs/jffs2/xattr.c:775 jffs2_build_xattr_subsystem 192 dynamic,bounded
Also update definition when CONFIG_JFFS2_FS_XATTR is not enabled
Tested with an MTD mount point and some user set/getfattr.
Many current targets on OpenWrt also suffer from a compilation warning
(that becomes an error with CONFIG_WERROR) with the following output:
fs/jffs2/xattr.c: In function 'jffs2_build_xattr_subsystem':
fs/jffs2/xattr.c:887:1: error: the frame size of 1088 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
887 | }
| ^
Using dynamic allocation fixes this compilation warning.
Fixes: c9f700f840bd ("[JFFS2][XATTR] using 'delete marker' for xdatum/xref deletion")
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Ron Economos <re@w6rz.net>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Cc: stable@vger.kernel.org
Message-Id: <20230506045612.16616-1-ansuelsmth@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 58f5d894006d82ed7335e1c37182fbc5f08c2f51 upstream.
Modified nfsd4_encode_open to encode the op_recall flag properly
for OPEN result with write delegation granted.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0135c482fa97e2fd8245cb462784112a00ed1211 ]
If truncate_node() fails in truncate_dnode(), f2fs_put_page() is never
called. Fix it.
Fixes: 7735730d39d7 ("f2fs: fix to propagate error from __get_meta_page()")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 3f649ab728cda8038259d8f14492fe400fbab911 upstream.
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 504a10d9e46bc37b23d0a1ae2f28973c8516e636 upstream.
On corrupt gfs2 file systems the evict code can try to reference the
journal descriptor structure, jdesc, after it has been freed and set to
NULL. The sequence of events is:
init_journal()
...
fail_jindex:
   gfs2_jindex_free(sdp); <------frees journals, sets jdesc = NULL
   if (gfs2_holder_initialized(&ji_gh))
      gfs2_glock_dq_uninit(&ji_gh);
fail:
   iput(sdp->sd_jindex); <--references jdesc in evict_linked_inode
      evict()
         gfs2_evict_inode()
            evict_linked_inode()
               ret = gfs2_trans_begin(sdp, 0, sdp->sd_jdesc->jd_blocks);
               <------references the now freed/zeroed sd_jdesc pointer.
The call to gfs2_trans_begin is done because the truncate_inode_pages
call can cause gfs2 events that require a transaction, such as removing
journaled data (jdata) blocks from the journal.
This patch fixes the problem by adding a check for sdp->sd_jdesc to
function gfs2_evict_inode. In theory, this should only happen to corrupt
gfs2 file systems, when gfs2 detects the problem, reports it, then tries
to evict all the system inodes it has read in up to that point.
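The guard can be sketched in user space as a plain NULL check before the journal descriptor is dereferenced. The structures and the function name are hypothetical, and the return value merely reports which path was taken.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the journal descriptor and superblock data. */
struct sketch_jdesc { unsigned int jd_blocks; };
struct sketch_sbd { struct sketch_jdesc *sd_jdesc; };

/*
 * Returns the number of blocks a transaction would reserve,
 * or 0 if the journal descriptor is gone and the work is skipped.
 */
unsigned int sketch_evict_inode(const struct sketch_sbd *sdp)
{
    if (sdp->sd_jdesc == NULL)
        return 0; /* journal already freed: do not dereference it */
    return sdp->sd_jdesc->jd_blocks;
}
```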
Reported-by: Yang Lan <lanyang0908@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
[DP: adjusted context]
Signed-off-by: Dragos-Marian Panait <dragos.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 782e53d0c14420858dbf0f8f797973c150d3b6d7 upstream.
In a syzbot stress test that deliberately causes file system errors on
nilfs2 with a corrupted disk image, it has been reported that
nilfs_clear_dirty_page() called from nilfs_clear_dirty_pages() can cause a
general protection fault.
In nilfs_clear_dirty_pages(), when looking up dirty pages from the page
cache and calling nilfs_clear_dirty_page() for each dirty page/folio
retrieved, the back reference from the argument page to "mapping" may have
been changed to NULL (and possibly others). It is necessary to check this
after locking the page/folio.
So, fix this issue by not calling nilfs_clear_dirty_page() on a page/folio
after locking it in nilfs_clear_dirty_pages() if the back reference
"mapping" from the page/folio is different from the "mapping" that held
the page/folio just before.
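The re-check after locking can be sketched as follows. The structures and the function name are hypothetical, and the locking itself is only indicated by comments.

```c
#include <stddef.h>

/* Hypothetical stand-ins for an address space and a page. */
struct sketch_mapping { int id; };
struct sketch_page {
    struct sketch_mapping *mapping; /* back reference; may change or become NULL */
    int dirty;
};

/*
 * Clears the dirty state only if the page still belongs to the mapping
 * it was looked up from. Returns 1 if cleared, 0 if skipped.
 */
int sketch_clear_if_still_ours(struct sketch_page *page,
                               const struct sketch_mapping *expected)
{
    /* the real code would lock the page here */
    if (page->mapping != expected)
        return 0; /* truncated or moved in the meantime: leave it alone */
    page->dirty = 0;
    /* ... and unlock it here */
    return 1;
}
```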
Link: https://lkml.kernel.org/r/20230612021456.3682-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+53369d11851d8f26735c@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000da4f6b05eb9bf593@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 679bd7ebdd315bf457a4740b306ae99f1d0a403d upstream.
As a result of analysis of a syzbot report, it turned out that in three
cases where nilfs2 allocates block device buffers directly via sb_getblk,
concurrent reads to the device can corrupt the allocated buffers.
Nilfs2 uses sb_getblk for segment summary blocks, that make up a log
header, and the super root block, that is the trailer, and when moving and
writing the second super block after fs resize.
In any of these, since the uptodate flag is not set when storing metadata
to be written in the allocated buffers, the stored metadata will be
overwritten if a device read of the same block occurs concurrently before
the write. This causes metadata corruption and misbehavior in the log
write itself, causing warnings in nilfs_btree_assign() as reported.
Fix these issues by setting an uptodate flag on the buffer head on the
first or before modifying each buffer obtained with sb_getblk, and
clearing the flag on failure.
When setting the uptodate flag, the lock_buffer/unlock_buffer pair is used
to perform necessary exclusive control, and the buffer is filled to ensure
that uninitialized bytes are not mixed into the data read from others. As
for buffers for segment summary blocks, they are filled incrementally, so
if the uptodate flag was unset on their allocation, set the flag and zero
fill the buffer once at that point.
Also, regarding the superblock move routine, the starting point of the
memset call to zerofill the block is incorrectly specified, which can
cause a buffer overflow on file systems with block sizes greater than
4KiB. In addition, if the superblock is moved within a large block, it is
necessary to assume the possibility that the data in the superblock will
be destroyed by zero-filling before copying. So fix these potential
issues as well.
Link: https://lkml.kernel.org/r/20230609035732.20426-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+31837fe952932efc8fb9@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/00000000000030000a05e981f475@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 92c5d1b860e9581d64baca76779576c0ab0d943d upstream.
The current sanity check for nilfs2 geometry information lacks checks for
the number of segments stored in superblocks, so even for device images
that have been destructively truncated or have an unusually high number of
segments, the mount operation may succeed.
This causes out-of-bounds block I/O on file system block reads or log
writes to the segments, the latter in particular causing
"a_ops->writepages" to fail repeatedly, which makes sync_inodes_sb()
hang.
Fix this issue by checking the number of segments stored in the superblock
and avoiding mounting devices that can cause out-of-bounds accesses. To
eliminate the possibility of overflow when calculating the number of
blocks required for the device from the number of segments, this also adds
a helper function to calculate the upper bound on the number of segments
and inserts a check using it.
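An overflow-safe version of such a check can be sketched as below. This is a simplified model, not the nilfs2 helper: it ignores the exact on-disk layout (for example, space reserved for superblocks), and the function name and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if nsegments segments of blocks_per_segment blocks each fit on a
 * device of dev_blocks blocks. The upper bound is computed by division
 * first, so the multiplication below cannot overflow.
 */
bool sketch_nsegments_ok(uint64_t nsegments, uint64_t blocks_per_segment,
                         uint64_t dev_blocks)
{
    uint64_t max_segments;

    if (blocks_per_segment == 0)
        return false;

    max_segments = UINT64_MAX / blocks_per_segment;
    if (nsegments == 0 || nsegments > max_segments)
        return false;

    return nsegments * blocks_per_segment <= dev_blocks;
}
```

Bounding the segment count before multiplying is what keeps an absurdly large on-disk value from wrapping around and passing the size comparison.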
Link: https://lkml.kernel.org/r/20230526021332.3431-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+7d50f1e54a12ba3aeae2@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=7d50f1e54a12ba3aeae2
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fee5eaecca86afa544355569b831c1f90f334b85 upstream.
Syzbot reports that in its stress test for resize ioctl, the log writing
function nilfs_segctor_do_construct hits a WARN_ON in
nilfs_segctor_truncate_segments().
It turned out that there is a problem with the current implementation of
the resize ioctl, which changes the writable range on the device (the
range of allocatable segments) at the end of the resize process.
This order is necessary for file system expansion to avoid corrupting the
superblock at trailing edge. However, in the case of a file system
shrink, if log writes occur after truncating out-of-bounds trailing
segments and before the resize is complete, segments may be allocated from
the truncated space.
The userspace resize tool was fine as it limits the range of allocatable
segments before performing the resize, but it can run into this issue if
the resize ioctl is called alone.
Fix this issue by changing nilfs_sufile_resize() to update the range of
allocatable segments immediately after successful truncation of segment
space in case of file system shrink.
Link: https://lkml.kernel.org/r/20230524094348.3784-1-konishi.ryusuke@gmail.com
Fixes: 4e33f9eab07e ("nilfs2: implement resize ioctl")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+33494cd0df2ec2931851@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/0000000000005434c405fbbafdc5@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2f012f2baca140c488e43d27a374029c1e59098d upstream.
A syzbot fault injection test reported that nilfs_btnode_create_block, a
helper function that allocates a new node block for b-trees, causes a
kernel BUG for disk images where the file system block size is smaller
than the page size.
This was due to unexpected flags on the newly allocated buffer head, and
it turned out to be because the buffer flags were not cleared by
nilfs_btnode_abort_change_key() after an error occurred during a b-tree
update operation and the buffer was later reused in that state.
Fix this issue by using nilfs_btnode_delete() to abandon the unused
preallocated buffer in nilfs_btnode_abort_change_key().
Link: https://lkml.kernel.org/r/20230513102428.10223-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+b0a35a5c1f7e846d3b09@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000d1d6c205ebc4d512@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 26a6ffff7de5dd369cdb12e38ba11db682f1dec0 upstream.
When changing a file size with fallocate() the new size isn't being
checked. In particular, the FSIZE ulimit isn't being checked, which makes
fstest generic/228 fail. Simply adding a call to inode_newsize_ok() fixes
this issue.
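The kind of check that was missing can be sketched in user space as follows. This is a simplified model of a size-limit check, not the kernel's inode_newsize_ok(); the function name and the explicit limit parameter are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Returns 0 if the file may change from cur_size to new_size under the
 * given file size limit, or -EFBIG if growing it would exceed the limit.
 */
int sketch_newsize_ok(uint64_t cur_size, uint64_t new_size,
                      uint64_t fsize_limit)
{
    if (new_size <= cur_size)
        return 0;      /* shrinking or unchanged: nothing to enforce here */
    if (new_size > fsize_limit)
        return -EFBIG; /* growing past the limit is refused */
    return 0;
}
```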
Link: https://lkml.kernel.org/r/20230529152645.32680-1-lhenriques@suse.de
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Mark Fasheh <mark@fasheh.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 50d927880e0f90d5cb25e897e9d03e5edacc79a8 upstream.
It's trivial to trigger a use-after-free bug in the ocfs2 quotas code using
fstest generic/452. After a read-only remount, quotas are suspended and
ocfs2_mem_dqinfo is freed through ->ocfs2_local_free_info(). When unmounting
the filesystem, a UAF access to the oinfo will eventually cause a crash.
BUG: KASAN: slab-use-after-free in timer_delete+0x54/0xc0
Read of size 8 at addr ffff8880389a8208 by task umount/669
...
Call Trace:
<TASK>
...
timer_delete+0x54/0xc0
try_to_grab_pending+0x31/0x230
__cancel_work_timer+0x6c/0x270
ocfs2_disable_quotas.isra.0+0x3e/0xf0 [ocfs2]
ocfs2_dismount_volume+0xdd/0x450 [ocfs2]
generic_shutdown_super+0xaa/0x280
kill_block_super+0x46/0x70
deactivate_locked_super+0x4d/0xb0
cleanup_mnt+0x135/0x1f0
...
</TASK>
Allocated by task 632:
kasan_save_stack+0x1c/0x40
kasan_set_track+0x21/0x30
__kasan_kmalloc+0x8b/0x90
ocfs2_local_read_info+0xe3/0x9a0 [ocfs2]
dquot_load_quota_sb+0x34b/0x680
dquot_load_quota_inode+0xfe/0x1a0
ocfs2_enable_quotas+0x190/0x2f0 [ocfs2]
ocfs2_fill_super+0x14ef/0x2120 [ocfs2]
mount_bdev+0x1be/0x200
legacy_get_tree+0x6c/0xb0
vfs_get_tree+0x3e/0x110
path_mount+0xa90/0xe10
__x64_sys_mount+0x16f/0x1a0
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Freed by task 650:
kasan_save_stack+0x1c/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x50
__kasan_slab_free+0xf9/0x150
__kmem_cache_free+0x89/0x180
ocfs2_local_free_info+0x2ba/0x3f0 [ocfs2]
dquot_disable+0x35f/0xa70
ocfs2_susp_quotas.isra.0+0x159/0x1a0 [ocfs2]
ocfs2_remount+0x150/0x580 [ocfs2]
reconfigure_super+0x1a5/0x3a0
path_mount+0xc8a/0xe10
__x64_sys_mount+0x16f/0x1a0
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Link: https://lkml.kernel.org/r/20230522102112.9031-1-lhenriques@suse.de
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 85f02d6c856b9f3a0acf5219de6e32f58b9778eb upstream.
In btrfs_relocate_block_group(), the rc is allocated. Then
btrfs_relocate_block_group() calls
relocate_block_group()
  prepare_to_relocate()
    set_reloc_control()
that assigns rc to the variable fs_info->reloc_ctl. When
prepare_to_relocate() returns, it calls
btrfs_commit_transaction()
  btrfs_start_dirty_block_groups()
    btrfs_alloc_path()
      kmem_cache_zalloc()
which may fail, for example (other errors are possible too). When that
happens, btrfs_relocate_block_group() detects the error and frees rc
but does not set fs_info->reloc_ctl to NULL. After that, in
btrfs_init_reloc_root(), rc is retrieved from fs_info->reloc_ctl and
then used, which may cause a use-after-free bug.
This possible bug can be triggered by calling btrfs_ioctl_balance()
before calling btrfs_ioctl_defrag().
To fix this possible bug, in prepare_to_relocate(), check if
btrfs_commit_transaction() fails. If the failure occurs,
unset_reloc_control() is called to set fs_info->reloc_ctl to NULL.
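The pattern behind the fix can be modelled in user space as follows. This is only a sketch: the demo_* types and function are hypothetical stand-ins, and commit_ret simulates the result of the commit step rather than calling any real btrfs code.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the relocation control and fs_info. */
struct demo_reloc_control { int stage; };
struct demo_fs_info { struct demo_reloc_control *reloc_ctl; };

/*
 * Publish rc through fs->reloc_ctl, then run a step that may fail
 * (simulated by commit_ret). On failure, unpublish rc before
 * returning, so that no later reader can pick up a pointer the
 * caller is about to free.
 */
int demo_prepare_to_relocate(struct demo_fs_info *fs,
                             struct demo_reloc_control *rc,
                             int commit_ret)
{
    fs->reloc_ctl = rc;
    if (commit_ret) {
        fs->reloc_ctl = NULL;
        return commit_ret;
    }
    return 0;
}
```

The point is the ordering: the shared pointer is cleared on the error path inside the function that set it, before the caller frees the object.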
The error log in our fault-injection testing is shown as follows:
[ 58.751070] BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
...
[ 58.753577] Call Trace:
...
[ 58.755800] kasan_report+0x45/0x60
[ 58.756066] btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
[ 58.757304] record_root_in_trans+0x792/0xa10 [btrfs]
[ 58.757748] btrfs_record_root_in_trans+0x463/0x4f0 [btrfs]
[ 58.758231] start_transaction+0x896/0x2950 [btrfs]
[ 58.758661] btrfs_defrag_root+0x250/0xc00 [btrfs]
[ 58.759083] btrfs_ioctl_defrag+0x467/0xa00 [btrfs]
[ 58.759513] btrfs_ioctl+0x3c95/0x114e0 [btrfs]
...
[ 58.768510] Allocated by task 23683:
[ 58.768777] ____kasan_kmalloc+0xb5/0xf0
[ 58.769069] __kmalloc+0x227/0x3d0
[ 58.769325] alloc_reloc_control+0x10a/0x3d0 [btrfs]
[ 58.769755] btrfs_relocate_block_group+0x7aa/0x1e20 [btrfs]
[ 58.770228] btrfs_relocate_chunk+0xf1/0x760 [btrfs]
[ 58.770655] __btrfs_balance+0x1326/0x1f10 [btrfs]
[ 58.771071] btrfs_balance+0x3150/0x3d30 [btrfs]
[ 58.771472] btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
[ 58.771902] btrfs_ioctl+0x4caa/0x114e0 [btrfs]
...
[ 58.773337] Freed by task 23683:
...
[ 58.774815] kfree+0xda/0x2b0
[ 58.775038] free_reloc_control+0x1d6/0x220 [btrfs]
[ 58.775465] btrfs_relocate_block_group+0x115c/0x1e20 [btrfs]
[ 58.775944] btrfs_relocate_chunk+0xf1/0x760 [btrfs]
[ 58.776369] __btrfs_balance+0x1326/0x1f10 [btrfs]
[ 58.776784] btrfs_balance+0x3150/0x3d30 [btrfs]
[ 58.777185] btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
[ 58.777621] btrfs_ioctl+0x4caa/0x114e0 [btrfs]
...
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Stefan Ghinea <stefan.ghinea@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fb686c6824dd6294ca772b92424b8fba666e7d00 upstream.
There are a few places where we don't check the return value of
btrfs_commit_transaction in relocation.c. Thankfully all these places
have straightforward error handling, so simply change all of the sites
at once.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Stefan Ghinea <stefan.ghinea@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dea9d8f7643fab07bf89a1155f1f94f37d096a5e upstream.
ext4_xattr_block_set() relies on its caller to call dquot_initialize()
on the inode. To ensure that this has happened there are WARN_ON
checks. Unfortunately, this is subject to false positives if there is
an antagonist thread which is flipping the file system at high rates
between r/o and rw. So only do the check if EXT4_XATTR_DEBUG is
enabled.
Link: https://lore.kernel.org/r/20230608044056.GA1418535@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 409e873ea3c1fd3079909718bbeb06ac1ec7f38b upstream.
There is a race between capsnaps flush and removing the inode from
'mdsc->snap_flush_list' list:
      == Thread A ==                      == Thread B ==
ceph_queue_cap_snap()
 -> allocate 'capsnapA'
 ->ihold('&ci->vfs_inode')
 ->add 'capsnapA' to 'ci->i_cap_snaps'
 ->add 'ci' to 'mdsc->snap_flush_list'
    ...
      == Thread C ==
ceph_flush_snaps()
 ->__ceph_flush_snaps()
  ->__send_flush_snap()
                                          handle_cap_flushsnap_ack()
                                           ->iput('&ci->vfs_inode')
                                             this also will release 'ci'
                                            ...
                                          == Thread D ==
                                          ceph_handle_snap()
                                           ->flush_snaps()
                                            ->iterate 'mdsc->snap_flush_list'
                                             ->get the stale 'ci'
 ->remove 'ci' from                          ->ihold(&ci->vfs_inode) this
   'mdsc->snap_flush_list'                     will WARNING
To fix this we will increase the inode's i_count ref when adding 'ci'
to the 'mdsc->snap_flush_list' list.
[ idryomov: need_put int -> bool ]
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2209299
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dd3e7cba16274831f5a69f071ed3cf13ffb352ea ]
The BITS_TO_BYTES() macro already has users, and more are coming. Move
it to bitops.h for wider use.
In the case of ocfs2 the replacement is identical.
As for bnx2x, there are two places where the floor version is used. The
first calculates the number of structures that fit in one memory page.
Here the ceiling variant is obviously correct, and the original code
might have a potential bug if the number of bits % 8 is not 0. The
second uses the macro to calculate the bytes transmitted in one
microsecond. This works without any change for all speeds that are a
multiple of 1Gbps; for the rest, the new code gives the ceiling value.
For instance, 100Mbps gives 13 bytes, while the old code gives 12 bytes
and the arithmetically correct value is 12.5 bytes. The value is then
used to set up a timer threshold, which in any case has its own margins
due to its limited resolution. I don't see an issue with slightly
shifting the thresholds for low-speed connections; the card is supposed
to use the highest available rate, which is usually 10Gbps.
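The difference between the two roundings can be checked with a small user-space sketch. The function names are illustrative and are not the kernel macro itself; the inputs are bits per microsecond (100Mbps is 100 bits per microsecond, 1Gbps is 1000).

```c
/* Floor variant: drops any partial byte. */
unsigned long bits_to_bytes_floor(unsigned long bits)
{
    return bits / 8;
}

/* Ceiling variant: a partial byte counts as a whole byte. */
unsigned long bits_to_bytes_ceil(unsigned long bits)
{
    return (bits + 7) / 8;
}
```

For 100 bits the floor variant yields 12 and the ceiling variant 13; for 1000 bits both yield 125, which is why multiples of 1Gbps are unaffected.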
Link: http://lkml.kernel.org/r/20200108121316.22411-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable-dep-of: f4e4534850a9 ("net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report")
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 597441b3436a43011f31ce71dc0a6c0bf5ce958a upstream.
Our CI system caught a lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
6.3.0-rc7+ #1167 Not tainted
------------------------------------------------------
kswapd0/46 is trying to acquire lock:
ffff8c6543abd650 (sb_internal#2){++++}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x5f/0x120
but task is already holding lock:
ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0xa5/0xe0
kmem_cache_alloc+0x31/0x2c0
alloc_extent_state+0x1d/0xd0
__clear_extent_bit+0x2e0/0x4f0
try_release_extent_mapping+0x216/0x280
btrfs_release_folio+0x2e/0x90
invalidate_inode_pages2_range+0x397/0x470
btrfs_cleanup_dirty_bgs+0x9e/0x210
btrfs_cleanup_one_transaction+0x22/0x760
btrfs_commit_transaction+0x3b7/0x13a0
create_subvol+0x59b/0x970
btrfs_mksubvol+0x435/0x4f0
__btrfs_ioctl_snap_create+0x11e/0x1b0
btrfs_ioctl_snap_create_v2+0xbf/0x140
btrfs_ioctl+0xa45/0x28f0
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
-> #0 (sb_internal#2){++++}-{0:0}:
__lock_acquire+0x1435/0x21a0
lock_acquire+0xc2/0x2b0
start_transaction+0x401/0x730
btrfs_commit_inode_delayed_inode+0x5f/0x120
btrfs_evict_inode+0x292/0x3d0
evict+0xcc/0x1d0
inode_lru_isolate+0x14d/0x1e0
__list_lru_walk_one+0xbe/0x1c0
list_lru_walk_one+0x58/0x80
prune_icache_sb+0x39/0x60
super_cache_scan+0x161/0x1f0
do_shrink_slab+0x163/0x340
shrink_slab+0x1d3/0x290
shrink_node+0x300/0x720
balance_pgdat+0x35c/0x7a0
kswapd+0x205/0x410
kthread+0xf0/0x120
ret_from_fork+0x29/0x50
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(sb_internal#2);
                               lock(fs_reclaim);
  lock(sb_internal#2);
*** DEADLOCK ***
3 locks held by kswapd0/46:
#0: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0
#1: ffffffffabe50270 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x113/0x290
#2: ffff8c6543abd0e0 (&type->s_umount_key#44){++++}-{3:3}, at: super_cache_scan+0x38/0x1f0
stack backtrace:
CPU: 0 PID: 46 Comm: kswapd0 Not tainted 6.3.0-rc7+ #1167
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x58/0x90
check_noncircular+0xd6/0x100
? save_trace+0x3f/0x310
? add_lock_to_list+0x97/0x120
__lock_acquire+0x1435/0x21a0
lock_acquire+0xc2/0x2b0
? btrfs_commit_inode_delayed_inode+0x5f/0x120
start_transaction+0x401/0x730
? btrfs_commit_inode_delayed_inode+0x5f/0x120
btrfs_commit_inode_delayed_inode+0x5f/0x120
btrfs_evict_inode+0x292/0x3d0
? lock_release+0x134/0x270
? __pfx_wake_bit_function+0x10/0x10
evict+0xcc/0x1d0
inode_lru_isolate+0x14d/0x1e0
__list_lru_walk_one+0xbe/0x1c0
? __pfx_inode_lru_isolate+0x10/0x10
? __pfx_inode_lru_isolate+0x10/0x10
list_lru_walk_one+0x58/0x80
prune_icache_sb+0x39/0x60
super_cache_scan+0x161/0x1f0
do_shrink_slab+0x163/0x340
shrink_slab+0x1d3/0x290
shrink_node+0x300/0x720
balance_pgdat+0x35c/0x7a0
kswapd+0x205/0x410
? __pfx_autoremove_wake_function+0x10/0x10
? __pfx_kswapd+0x10/0x10
kthread+0xf0/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x29/0x50
</TASK>
This happens because when we abort the transaction in the transaction
commit path we call invalidate_inode_pages2_range on our block group
cache inodes (if we have space cache v1) and any delalloc inodes we may
have. The plain invalidate_inode_pages2_range() call passes through
GFP_KERNEL, which makes sense in most cases, but not here. Wrap these
two invalidate callees with memalloc_nofs_save/memalloc_nofs_restore to
make sure we don't end up with the fs reclaim dependency under the
transaction dependency.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9b5a04ac3ad9898c4745cba46ea26de74ba56a8e upstream.
During unmount process of nilfs2, nothing holds nilfs_root structure after
nilfs2 detaches its writer in nilfs_detach_log_writer(). However, since
nilfs_evict_inode() uses nilfs_root for some cleanup operations, it may
cause use-after-free read if inodes are left in "garbage_list" and
released by nilfs_dispose_list() at the end of nilfs_detach_log_writer().
Fix this issue by modifying nilfs_evict_inode() to only clear inode
without additional metadata changes that use nilfs_root if the file system
is degraded to read-only or the writer is detached.
Link: https://lkml.kernel.org/r/20230509152956.8313-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+78d4495558999f55d1da@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/00000000000099e5ac05fb1c3b85@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4cafd0400bcb6187c0d4ab4d4b0229a89ac4f8c2 upstream.
When the MClientSnap request's op is not CEPH_SNAP_OP_SPLIT the
request may still contain a list of 'split_realms', and we need
to skip it anyway. Otherwise it will be parsed as a corrupt snaptrace.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/61200
Reported-by: Frank Schilder <frans@dtu.dk>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ed40866ec7d328b3dfb70db7e2011640a16202c3 upstream.
s390's struct statfs and struct statfs64 contain padding, which
field-by-field copying does not set. Initialize the respective structs
with zeros before filling them and copying them to userspace, like it's
already done for the compat versions of these structs.
Found by KMSAN.
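The idea of zeroing a structure before filling it field by field can be sketched in user space as below. struct demo_statfs and both functions are hypothetical and much smaller than the real s390 structures. Note that strictly speaking the C standard leaves padding contents unspecified after a member store; the sketch relies on the common compiler behaviour of leaving padding untouched.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical struct: on most ABIs padding follows f_type. */
struct demo_statfs {
    char f_type;
    long f_bsize;
};

/* Zero the whole object first, then copy the fields one by one. */
void fill_demo_statfs(struct demo_statfs *out, char type, long bsize)
{
    memset(out, 0, sizeof(*out));
    out->f_type = type;
    out->f_bsize = bsize;
}

/* Return 1 if every byte between f_type and f_bsize is zero. */
int padding_is_zero(const struct demo_statfs *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = offsetof(struct demo_statfs, f_type) + sizeof(s->f_type);

    for (; i < offsetof(struct demo_statfs, f_bsize); i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

Without the memset(), the padding bytes would keep whatever was previously in that memory, which is the kind of leak to userspace the patch avoids.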
[agordeev@linux.ibm.com: fixed typo in patch description]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20230504144021.808932-2-iii@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0004ff15ea26015a0a3a6182dca3b9d1df32e2b7 ]
When loading a free space cache from disk, at __load_free_space_cache(),
if we fail to insert a bitmap entry, we still increment the number of
total bitmaps in the btrfs_free_space_ctl structure, which is incorrect
since we failed to add the bitmap entry. On error we then empty the
cache by calling __btrfs_remove_free_space_cache(), which will result
in getting the total bitmaps counter set to 1.
A failure to load a free space cache is not critical, so if a failure
happens we just rebuild the cache by scanning the extent tree, which
happens at block-group.c:caching_thread(). Yet the failure will result
in having the total bitmaps of the btrfs_free_space_ctl always bigger
by 1 than the number of bitmap entries we have. So fix this by having
the total bitmaps counter be incremented only if we successfully added
the bitmap entry.
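A minimal user-space model of the corrected ordering is shown below. The demo_* structure and helpers are hypothetical; in particular, the capacity field exists only to make the insert fail on demand and has no counterpart in btrfs.

```c
/* Hypothetical control structure with a counter and an entry store. */
struct demo_free_space_ctl {
    int total_bitmaps;
    int nr_entries;
    int capacity;      /* sketch-only: forces insert failures */
};

/* Hypothetical insert helper: fails with -1 once the store is full. */
int demo_link_bitmap(struct demo_free_space_ctl *ctl)
{
    if (ctl->nr_entries >= ctl->capacity)
        return -1;
    ctl->nr_entries++;
    return 0;
}

/* Bump the counter only after the entry was really added. */
int demo_load_bitmap(struct demo_free_space_ctl *ctl)
{
    int ret = demo_link_bitmap(ctl);

    if (ret)
        return ret;    /* counter stays untouched on failure */
    ctl->total_bitmaps++;
    return 0;
}
```

With this ordering the counter and the number of entries stay equal whether or not the insert succeeds.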
Fixes: a67509c30079 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit abadc1fcd72e887a8f875dabe4a07aa8c28ac8af ]
The former is going away as part of the inode map removal so switch
callers to btrfs_find_free_objectid. No functional changes since with
INODE_MAP disabled (default) find_free_objectid was called anyway.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 0004ff15ea26 ("btrfs: fix space cache inconsistency after error loading it from disk")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c9b3649a934d131151111354bcbb638076f03a30 ]
xfstest generic/361 reports a bug as below:
f2fs_bug_on(sbi, sbi->fsync_node_num);
kernel BUG at fs/f2fs/super.c:1627!
RIP: 0010:f2fs_put_super+0x3a8/0x3b0
Call Trace:
generic_shutdown_super+0x8c/0x1b0
kill_block_super+0x2b/0x60
kill_f2fs_super+0x87/0x110
deactivate_locked_super+0x39/0x80
deactivate_super+0x46/0x50
cleanup_mnt+0x109/0x170
__cleanup_mnt+0x16/0x20
task_work_run+0x65/0xa0
exit_to_user_mode_prepare+0x175/0x190
syscall_exit_to_user_mode+0x25/0x50
do_syscall_64+0x4c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
During umount(), if cp_error is set, f2fs_wait_on_all_pages() should
not stop waiting for all F2FS_WB_CP_DATA pages to be written back;
otherwise, fsync_node_num can be non-zero after f2fs_wait_on_all_pages(),
causing this bug.
In this case, to avoid an endless loop in f2fs_wait_on_all_pages(), it
needs to drop all dirty pages rather than redirtying them.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 93cdf49f6eca5e23f6546b8f28457b2e6a6961d9 ]
When the length of best extent found is less than the length of goal extent
we need to make sure that the best extent at least covers the start of the
original request. This is done by adjusting the ac_b_ex.fe_logical (logical
start) of the extent.
While doing so, the current logic sometimes results in the best extent's
logical range overflowing the goal extent. Since this best extent is later
added to the inode preallocation list, we have a possibility of introducing
overlapping preallocations. This is discussed in detail here [1].
As per Jan's suggestion, to fix this, replace the existing logic with the
below logic for adjusting best extent as it keeps fragmentation in check
while ensuring logical range of best extent doesn't overflow out of goal
extent:
1. Check if best extent can be kept at end of goal range and still cover
original start.
2. Else, check if best extent can be kept at start of goal range and still
cover original start.
3. Else, keep the best extent at start of original request.
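The three steps above can be sketched as a small user-space function. The name demo_place_best_extent() and its flat integer parameters are hypothetical simplifications of the ext4 allocation context; the sketch assumes 0 < best_len < goal_len and goal_start <= orig_start < goal_start + goal_len.

```c
#include <stdint.h>

/*
 * Return the logical start chosen for a best extent of best_len
 * blocks inside the goal range [goal_start, goal_start + goal_len),
 * such that the extent still covers orig_start.
 */
uint64_t demo_place_best_extent(uint64_t goal_start, uint64_t goal_len,
                                uint64_t orig_start, uint64_t best_len)
{
    uint64_t goal_end = goal_start + goal_len;
    uint64_t start;

    /* 1. Place the extent at the end of the goal range. */
    start = goal_end - best_len;
    if (orig_start >= start)
        return start;

    /* 2. Else place it at the start of the goal range. */
    start = goal_start;
    if (orig_start < start + best_len)
        return start;

    /* 3. Else place it at the start of the original request. */
    return orig_start;
}
```

In step 3 the extent cannot run past the goal range: step 1 failing means orig_start is below goal_end - best_len, so orig_start + best_len stays below goal_end.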
Also, add a few extra BUG_ONs that might help catch errors faster.
[1] https://lore.kernel.org/r/Y+OGkVvzPN0RMv0O@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/f96aca6d415b36d1f90db86c1a8cd7e2e9d7ab0e.1679731817.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b07ffe6927c75d99af534d685282ea188d9f71a6 ]
We need to set ac_g_ex to notify the goal start used in
ext4_mb_find_by_goal. Set ac_g_ex instead of ac_f_ex in
ext4_mb_normalize_request.
Besides, we should ensure the goal start is in the range
[first_data_block, blocks_count), as ext4_mb_initialize_context does.
[ Added a check to make sure size is less than ar->pright; otherwise
we could end up passing an underflowed value of ar->pright - size to
ext4_get_group_no_and_offset(), which will trigger a BUG_ON later on.
- TYT ]
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230303172120.3800725-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit cfcdb5bad34f600aed7613c3c1a5e618111f77b7 ]
The maximum allowed height of an inode's metadata tree depends on the
filesystem block size; it is lower for bigger-block filesystems. When
reading in an inode, make sure that the height doesn't exceed the
maximum allowed height.
Arrays like sd_heightsize are sized to be big enough for any filesystem
block size; they will often be slightly bigger than what's needed for a
specific filesystem.
Reported-by: syzbot+45d4691b1ed3c48eba05@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 62aeb94433fcec80241754b70d0d1836d5926b0a ]
Check that the log of the block size stored in the superblock has a
sensible value. Otherwise the shift computing the block size can
overflow, leading to undefined behavior.
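A validation of this kind might look like the following user-space sketch. The limits DEMO_MIN_BLOCK_LOG and DEMO_MAX_BLOCK_LOG, and the assumption that the on-disk field stores the log relative to the minimum, are illustrative choices and are not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical limits: 1 KiB minimum, 64 KiB maximum block size. */
#define DEMO_MIN_BLOCK_LOG 10u
#define DEMO_MAX_BLOCK_LOG 16u

/*
 * Convert an on-disk log value (assumed relative to the minimum)
 * into a block size in bytes. Returns 0 for out-of-range input,
 * so the shift is never executed with an oversized count.
 */
uint32_t demo_block_size(uint32_t log_block_size)
{
    if (log_block_size > DEMO_MAX_BLOCK_LOG - DEMO_MIN_BLOCK_LOG)
        return 0;
    return UINT32_C(1) << (DEMO_MIN_BLOCK_LOG + log_block_size);
}
```

A mount path built on this sketch would treat a return value of 0 as a corrupted superblock and refuse to mount.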
Reported-by: syzbot+4fec412f59eba8c01b77@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>