IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
dump_vma_snapshot() allocs memory for *vma_meta, when dump_vma_snapshot()
returns -EFAULT, the memory will be leaked, so we free it correctly.
Link: https://lkml.kernel.org/r/20210810020441.62806-1-qiuxi1@huawei.com
Fixes: a07279c9a8cd7 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Signed-off-by: QiuXi <qiuxi1@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For obvious security reasons, a core dump is aborted if the filesystem
cannot preserve ownership or permissions of the dump file.
This affects filesystems like e.g. vfat, but also something like a 9pfs
share in a Qemu test setup, running as a regular user, depending on the
security model used. In those cases, the result is an empty core file and
a confused user.
To hopefully save other people a lot of time figuring out the cause, this
patch adds a simple log message for those specific cases.
[akpm@linux-foundation.org: s/|%s/%s/ in printk text]
Link: https://lkml.kernel.org/r/20210701233151.102720-1-david.oberhollenzer@sigma-star.at
Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the refcount is decreased to 0, the resource reclamation branch is
entered. Before CPU0 reaches the race point (1), CPU1 may obtain the
spinlock and traverse the rbtree to find 'root', see
nilfs_lookup_root().
Although CPU1 will call refcount_inc() to increase the refcount, it is
obviously too late. CPU0 will release 'root' directly, CPU1 then
accesses 'root' and triggers UAF.
Use refcount_dec_and_lock() to ensure that both the operations of
decrease refcount to 0 and link deletion are lock protected eliminates
this risk.
CPU0 CPU1
nilfs_put_root():
<-------- (1)
spin_lock(&nilfs->ns_cptree_lock);
rb_erase(&root->rb_node, &nilfs->ns_cptree);
spin_unlock(&nilfs->ns_cptree_lock);
kfree(root);
<-------- use-after-free
refcount_t: underflow; use-after-free.
WARNING: CPU: 2 PID: 9476 at lib/refcount.c:28 \
refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
Modules linked in:
CPU: 2 PID: 9476 Comm: syz-executor.0 Not tainted 5.10.45-rc1+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
RIP: 0010:refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
... ...
Call Trace:
__refcount_sub_and_test include/linux/refcount.h:283 [inline]
__refcount_dec_and_test include/linux/refcount.h:315 [inline]
refcount_dec_and_test include/linux/refcount.h:333 [inline]
nilfs_put_root+0xc1/0xd0 fs/nilfs2/the_nilfs.c:795
nilfs_segctor_destroy fs/nilfs2/segment.c:2749 [inline]
nilfs_detach_log_writer+0x3fa/0x570 fs/nilfs2/segment.c:2812
nilfs_put_super+0x2f/0xf0 fs/nilfs2/super.c:467
generic_shutdown_super+0xcd/0x1f0 fs/super.c:464
kill_block_super+0x4a/0x90 fs/super.c:1446
deactivate_locked_super+0x6a/0xb0 fs/super.c:335
deactivate_super+0x85/0x90 fs/super.c:366
cleanup_mnt+0x277/0x2e0 fs/namespace.c:1118
__cleanup_mnt+0x15/0x20 fs/namespace.c:1125
task_work_run+0x8e/0x110 kernel/task_work.c:151
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
exit_to_user_mode_prepare+0x13c/0x170 kernel/entry/common.c:191
syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
do_syscall_64+0x45/0x80 arch/x86/entry/common.c:56
entry_SYSCALL_64_after_hwframe+0x44/0xa9
There is no reproduction program, and the above is only theoretical
analysis.
Link: https://lkml.kernel.org/r/1629859428-5906-1-git-send-email-konishi.ryusuke@gmail.com
Fixes: ba65ae4729bf ("nilfs2: add checkpoint tree to nilfs object")
Link: https://lkml.kernel.org/r/20210723012317.4146-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In nilfs_##name##_attr_release, kobj->parent should not be referenced
because it is a NULL pointer. The release() method of kobject is always
called in kobject_put(kobj), in the implementation of kobject_put(), the
kobj->parent will be assigned as NULL before call the release() method.
So just use kobj to get the subgroups, which is more efficient and can fix
a NULL pointer reference problem.
Link: https://lkml.kernel.org/r/20210629022556.3985106-3-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-3-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "nilfs2: fix incorrect usage of kobject".
This patchset from Nanyong Sun fixes memory leak issues and a NULL
pointer dereference issue caused by incorrect usage of kboject in nilfs2
sysfs implementation.
This patch (of 6):
Reported by syzkaller:
BUG: memory leak
unreferenced object 0xffff888100ca8988 (size 8):
comm "syz-executor.1", pid 1930, jiffies 4294745569 (age 18.052s)
hex dump (first 8 bytes):
6c 6f 6f 70 31 00 ff ff loop1...
backtrace:
kstrdup+0x36/0x70 mm/util.c:60
kstrdup_const+0x35/0x60 mm/util.c:83
kvasprintf_const+0xf1/0x180 lib/kasprintf.c:48
kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289
kobject_add_varg lib/kobject.c:384 [inline]
kobject_init_and_add+0xc9/0x150 lib/kobject.c:473
nilfs_sysfs_create_device_group+0x150/0x7d0 fs/nilfs2/sysfs.c:986
init_nilfs+0xa21/0xea0 fs/nilfs2/the_nilfs.c:637
nilfs_fill_super fs/nilfs2/super.c:1046 [inline]
nilfs_mount+0x7b4/0xe80 fs/nilfs2/super.c:1316
legacy_get_tree+0x105/0x210 fs/fs_context.c:592
vfs_get_tree+0x8e/0x2d0 fs/super.c:1498
do_new_mount fs/namespace.c:2905 [inline]
path_mount+0xf9b/0x1990 fs/namespace.c:3235
do_mount+0xea/0x100 fs/namespace.c:3248
__do_sys_mount fs/namespace.c:3456 [inline]
__se_sys_mount fs/namespace.c:3433 [inline]
__x64_sys_mount+0x14b/0x1f0 fs/namespace.c:3433
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
If kobject_init_and_add return with error, then the cleanup of kobject
is needed because memory may be allocated in kobject_init_and_add
without freeing.
And the place of cleanup_dev_kobject should use kobject_put to free the
memory associated with the kobject. As the section "Kobject removal" of
"Documentation/core-api/kobject.rst" says, kobject_del() just makes the
kobject "invisible", but it is not cleaned up. And no more cleanup will
do after cleanup_dev_kobject, so kobject_put is needed here.
Link: https://lkml.kernel.org/r/1625651306-10829-1-git-send-email-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/1625651306-10829-2-git-send-email-konishi.ryusuke@gmail.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Link: https://lkml.kernel.org/r/20210629022556.3985106-2-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This counter tracks the number of watches a user has, to compare against
the 'max_user_watches' limit. This causes a scalability bottleneck on
SPECjbb2015 on large systems as there is only one user. Changing to a
per-cpu counter increases throughput of the benchmark by about 30% on a
16-socket, > 1000 thread system.
[rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n]
[npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message]
Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none
[akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter]
Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net
Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reported-by: Anton Blanchard <anton@ozlabs.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While comm change event via prctl has been reported to proc connector by
'commit f786ecba4158 ("connector: add comm change event report to proc
connector")', connector listeners were missing comm changes by explicit
writes on /proc/[pid]/comm.
Let explicit writes on /proc/[pid]/comm report to proc connector.
Link: https://lkml.kernel.org/r/20210701133458epcms1p68e9eb9bd0eee8903ba26679a37d9d960@epcms1p6
Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use seq_escape_str and seq_printf instead of poking holes into the
seq_file abstraction.
Link: https://lkml.kernel.org/r/20210810151945.1795567-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No need to check for 64BIT. While at it, let's just select
ARCH_SUPPORTS_HUGETLBFS from arch/s390/Kconfig.
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20210908154506.20764-1-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test
fs/io_uring.c:1151 [inline]
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test
fs/io_uring.c:1146 [inline]
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151
io_req_complete_post+0xf5b/0x1190 fs/io_uring.c:1794
Modules linked in:
Call Trace:
tctx_task_work+0x1e5/0x570 fs/io_uring.c:2158
task_work_run+0xe0/0x1a0 kernel/task_work.c:164
tracehook_notify_signal include/linux/tracehook.h:212 [inline]
handle_signal_work kernel/entry/common.c:146 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x232/0x2a0 kernel/entry/common.c:209
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
When io_wqe_enqueue() -> io_wqe_create_worker() fails, we can't just
call io_run_cancel() to clean up the request, it's already enqueued via
io_wqe_insert_work() and will be executed either by some other worker
during cancellation (e.g. in io_wq_put_and_exit()).
Reported-by: Hao Sun <sunhao.th@gmail.com>
Fixes: 3146cba99aa28 ("io-wq: make worker creation resilient against signals")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93b9de0fcf657affab0acfd675d4abcd273ee863.1631092071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mixing NULL and ERR_PTR() just in case is a Bad Idea(tm). For
struct filename the former is wrong - failures are reported
as ERR_PTR(...), not as NULL.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
filename_create() has two variants, one which drops the caller's
reference to filename (filename_create) and one which does
not (__filename_create). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_create, rename __filename_create
to filename_create, and convert all callers.
Link: https://lore.kernel.org/linux-fsdevel/f6238254-35bd-7e97-5b27-21050c745874@oracle.com/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
filename_lookup() has two variants, one which drops the caller's
reference to filename (filename_lookup), and one which does
not (__filename_lookup). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_lookup, rename __filename_lookup
to filename_lookup, and convert all callers. The cost is a few slightly
longer functions, but the clarity is greater.
[AV: consuming a reference is not at all unusual, actually; look at
e.g. do_mkdirat(), for example. It's more that we want non-consuming
variant for close relative of that function...]
Link: https://lore.kernel.org/linux-fsdevel/YS+dstZ3xfcLxhoB@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In 0ee50b47532a ("namei: change filename_parentat() calling
conventions"), filename_parentat() was made to always call putname() on
the filename before returning, and kern_path_locked() was migrated to
this calling convention. However, kern_path_locked() uses the "last"
parameter to lookup and potentially create a new dentry. The last
parameter contains the last component of the path and points within the
filename, which was recently freed at the end of filename_parentat().
Thus, when kern_path_locked() calls __lookup_hash(), it is using the
filename after it has already been freed.
In other words, these calling conventions had been wrong for the
only remaining caller of filename_parentat(). Everything else
is using __filename_parentat(), which does not drop the reference;
so should kern_path_locked().
Switch kern_path_locked() to use of __filename_parentat() and move
getting/dropping struct filename into wrapper. Remove filename_parentat(),
now that we have no remaining callers.
Fixes: 0ee50b47532a ("namei: change filename_parentat() calling conventions")
Link: https://lore.kernel.org/linux-fsdevel/YS9D4AlEsaCxLFV0@infradead.org/
Link: https://lore.kernel.org/linux-fsdevel/YS+csMTV2tTXKg3s@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+fb0d60a179096e8c2731@syzkaller.appspotmail.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Co-authored-by: Dmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYTcNYgAKCRDh3BK/laaZ
PCaHAQCTfaWYK+H6kNIffxnRG+yGsMAmoZLGbc5l9M5GeTEmtgEA0vkmWbDt4h/I
PSA0Q2T0gzSsBdxkF4jQHqKO+HGMVg8=
=Vh9A
-----END PGP SIGNATURE-----
Merge tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Allow mounting an active fuse device. Previously the fuse device
would always be mounted during initialization, and sharing a fuse
superblock was only possible through mount or namespace cloning
- Fix data flushing in syncfs (virtiofs only)
- Fix data flushing in copy_file_range()
- Fix a possible deadlock in atomic O_TRUNC
- Misc fixes and cleanups
* tag 'fuse-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: remove unused arg in fuse_write_file_get()
fuse: wait for writepages in syncfs
fuse: flush extending writes
fuse: truncate pagecache on atomic_o_trunc
fuse: allow sharing existing sb
fuse: move fget() to fuse_get_tree()
fuse: move option checking into fuse_fill_super()
fuse: name fs_context consistently
fuse: fix use after free in fuse_read_interrupt()
This reverts commit b655843444152c0a14b749308e4cb35d91cbcf0b.
Just like with the memcg lock accounting, the kernel test robot reports
a sizeable performance regression for this commit, and while it clearly
does the rigth thing in theory, we'll need to look at just how to avoid
or minimize the performance overhead of the memcg accounting.
People already have suggestions on how to do that, but it's "future
work".
So revert it for now.
[ Note: the first link below is for this same commit but a different
commit ID, because it's the kernel test robot ended up noticing it in
Andrew Morton's patch queue ]
Link: https://lore.kernel.org/lkml/20210905132732.GC15026@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 0f12156dff2862ac54235fc72703f18770769042.
The kernel test robot reports a sizeable performance regression for this
commit, and while it clearly does the rigth thing in theory, we'll need
to look at just how to avoid or minimize the performance overhead of the
memcg accounting.
People already have suggestions on how to do that, but it's "future
work".
So revert it for now.
Link: https://lore.kernel.org/lkml/20210907150757.GE17617@xsang-OptiPlex-9020/
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 9857a17f206ff374aea78bccfb687f145368be2e.
That commit was completely broken, and I should have caught on to it
earlier. But happily, the kernel test robot noticed the breakage fairly
quickly.
The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".
In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.
So all the commentary about how
"try_get_page() has fallen a little behind in terms of maintenance,
try_get_compound_head() handles speculative page references more
thoroughly"
was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.
So there's no lack of maintainance - there are fundamentally different
semantics.
A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()". It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".
The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().
Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new block/fops.c for all the file and address_space operations
that provide the block special file support.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-2-hch@lst.de
[axboe: correct trailing whitespace while at it]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
btrfs_add_ordered_extent_*() add num_bytes to fs_info->ordered_bytes.
Then, splitting an ordered extent will call btrfs_add_ordered_extent_*()
again for split extents, leading to double counting of the region of
a split extent. These leaked bytes are finally reported at unmount time
as follow:
BTRFS info (device dm-1): at unmount dio bytes count 364544
Fix the double counting by subtracting split extent's size from
fs_info->ordered_bytes.
Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When removing the device we call blkdev_put() on the device once we've
removed it, and because we have an EXCL open we need to take the
->open_mutex on the block device to clean it up. Unfortunately during
device remove we are holding the sb writers lock, which results in the
following lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #407 Not tainted
------------------------------------------------------
losetup/11595 is trying to acquire lock:
ffff973ac35dd138 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
but task is already holding lock:
ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_put+0x3a/0x220
btrfs_rm_device.cold+0x62/0xe5
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/11595:
#0: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
stack backtrace:
CPU: 0 PID: 11595 Comm: losetup Not tainted 5.14.0-rc2+ #407
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc21255d4cb
So instead save the bdev and do the put once we've dropped the sb
writers lock in order to avoid the lockdep recursion.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We update the ctime/mtime of a block device when we remove it so that
blkid knows the device changed. However we do this by re-opening the
block device and calling filp_update_time. This is more correct because
it'll call the inode->i_op->update_time if it exists, but the block dev
inodes do not do this. Instead call generic_update_time() on the
bd_inode in order to avoid the blkdev_open path and get rid of the
following lockdep splat:
======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #406 Not tainted
------------------------------------------------------
losetup/11596 is trying to acquire lock:
ffff939640d2f538 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
but task is already holding lock:
ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
lo_open+0x28/0x60 [loop]
blkdev_get_whole+0x25/0xf0
blkdev_get_by_dev.part.0+0x168/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
do_sys_openat2+0x7b/0x130
__x64_sys_openat+0x46/0x70
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #3 (&disk->open_mutex){+.+.}-{3:3}:
__mutex_lock+0x7d/0x750
blkdev_get_by_dev.part.0+0x56/0x3c0
blkdev_open+0xd2/0xe0
do_dentry_open+0x161/0x390
path_openat+0x3cc/0xa20
do_filp_open+0x96/0x120
file_open_name+0xc7/0x170
filp_open+0x2c/0x50
btrfs_scratch_superblocks.part.0+0x10f/0x170
btrfs_rm_device.cold+0xe8/0xed
btrfs_ioctl+0x2a31/0x2e70
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
-> #2 (sb_writers#12){.+.+}-{0:0}:
lo_write_bvec+0xc2/0x240 [loop]
loop_process_work+0x238/0xd00 [loop]
process_one_work+0x26b/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
process_one_work+0x245/0x560
worker_thread+0x55/0x3c0
kthread+0x140/0x160
ret_from_fork+0x1f/0x30
-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
flush_workqueue+0x91/0x5e0
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
other info that might help us debug this:
Chain exists of:
(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&lo->lo_mutex);
lock(&disk->open_mutex);
lock(&lo->lo_mutex);
lock((wq_completion)loop0);
*** DEADLOCK ***
1 lock held by losetup/11596:
#0: ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
stack backtrace:
CPU: 1 PID: 11596 Comm: losetup Not tainted 5.14.0-rc2+ #406
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
check_noncircular+0xcf/0xf0
? stack_trace_save+0x3b/0x50
__lock_acquire+0x10ea/0x1d90
lock_acquire+0xb5/0x2b0
? flush_workqueue+0x67/0x5e0
? lockdep_init_map_type+0x47/0x220
flush_workqueue+0x91/0x5e0
? flush_workqueue+0x67/0x5e0
? verify_cpu+0xf0/0x100
drain_workqueue+0xa0/0x110
destroy_workqueue+0x36/0x250
__loop_clr_fd+0x9a/0x660 [loop]
? blkdev_ioctl+0x8d/0x2a0
block_ioctl+0x3f/0x50
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
asm/do_div.h is for div_u64, but it is found in math64.h. This change
will make compiler job easier and prevent compiler errors in situation
where compiler will not find math64.h from another paths.
Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mount option max_inline ranges from 0 to the sectorsize (which is
now equal to page size). But we parse the mount options too early and
before the actual sectorsize is read from the superblock. So the upper
limit of max_inline is unaware of the actual sectorsize and is limited
by the temporary sectorsize 4096, even on a system where the default
sectorsize is 64K.
Fix this by reading the superblock sectorsize before the mount option
parse.
Reported-by: Alexander Tsvetkov <alexander.tsvetkov@oracle.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEz5eEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmk1D/wML8Im2erR5s0PaWZgYxXlgEKrJDwJm/p+
2Uixrn/9kQAhwH+0kJnCiI+HwlL3LU+5/iAdeGtdYMcVaotPPmm5V3jfud8+RuAi
E+uIOdULXgQKj8pkiQ2h5mvYd0BxGkGH38gUqilSwFrY2HTpbfxreCHhYoQaE/7o
DiGNgbhJglSFIBuIgS4cfpLkI3FdaAmrCydZ9zaqEv/G/bx9aA9lwSbAJadhTbmt
Qc1vvbh2FB9YvgZX8qfaneyDKzQbwqTvKxCe2SOVMOp/X0feJym7WZUvrPr04EoZ
zBaLDkmn44re4iWPbide7+KQJ8NMQQDBiuxwF5WxdF3hrcsiwqmKgDtBEGWXFMeV
CUZ9Osrfb480UKsDExtxLhQqGz1JZqIPZdtDvSJb8MunPZtvTz27NNFyyb9aBrlX
WiwEHqAOE1W33buPCNyuYLGDVYis4/TkwF0NZpMwsyPdN0Iz/M8Z5F5BHhC7BYoP
U8KMsX3XvddxB113U+IMVqI/SuvT125U65brklQlQeLEHnH57ceII9mNGfNic6LR
bcIu7Fb5J1U5nAMeeLCSXsEYXs+peYgI1UOWXaWgSVixUAyU8H+OqsBVIl8eiMjr
TTbdIMmfWqENE3wBM709FQQLoMmGl1YjBkGmBXKZjNHcDrf9X56rimSxRD2i2okg
r2JczxQ5uQ==
=QoQg
-----END PGP SIGNATURE-----
Merge tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"As sometimes happens, two reports came in around the merge window open
that led to some fixes. Hence this one is a bit bigger than usual
followup fixes, but most of it will be going towards stable, outside
of the fixes that are addressing regressions from this merge window.
In detail:
- postgres is a heavy user of signals between tasks, and if we're
unlucky this can interfere with io-wq worker creation. Make sure
we're resilient against unrelated signal handling. This set of
changes also includes hardening against allocation failures, which
could previously had led to stalls.
- Some use cases that end up having a mix of bounded and unbounded
work would have starvation issues related to that. Split the
pending work lists to handle that better.
- Completion trace int -> unsigned -> long fix
- Fix issue with REGISTER_IOWQ_MAX_WORKERS and SQPOLL
- Fix regression with hash wait lock in this merge window
- Fix retry issued on block devices (Ming)
- Fix regression with links in this merge window (Pavel)
- Fix race with multi-shot poll and completions (Xiaoguang)
- Ensure regular file IO doesn't inadvertently skip completion
batching (Pavel)
- Ensure submissions are flushed after running task_work (Pavel)"
* tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block:
io_uring: io_uring_complete() trace should take an integer
io_uring: fix possible poll event lost in multi shot mode
io_uring: prolong tctx_task_work() with flushing
io_uring: don't disable kiocb_done() CQE batching
io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
io-wq: make worker creation resilient against signals
io-wq: get rid of FIXED worker flag
io-wq: only exit on fatal signals
io-wq: split bounded and unbounded work into separate lists
io-wq: fix queue stalling race
io_uring: don't submit half-prepared drain request
io_uring: fix queueing half-created requests
io-wq: ensure that hash wait lock is IRQ disabling
io_uring: retry in case of short read on block device
io_uring: IORING_OP_WRITE needs hash_reg_file set
io-wq: fix race between adding work and activating a free worker
In case of fuse the MM subsystem doesn't guarantee that page writeback
completes by the time ->sync_fs() is called. This is because fuse
completes page writeback immediately to prevent DoS of memory reclaim by
the userspace file server.
This means that fuse itself must ensure that writes are synced before
sending the SYNCFS request to the server.
Introduce sync buckets, that hold a counter for the number of outstanding
write requests. On syncfs replace the current bucket with a new one and
wait until the old bucket's counter goes down to zero.
It is possible to have multiple syncfs calls in parallel, in which case
there could be more than one waited-on buckets. Descendant buckets must
not complete until the parent completes. Add a count to the child (new)
bucket until the (parent) old bucket completes.
Use RCU protection to dereference the current bucket and to wake up an
emptied bucket. Use fc->lock to protect against parallel assignments to
the current bucket.
This leaves just the counter to be a possible scalability issue. The
fc->num_waiting counter has a similar issue, so both should be addressed at
the same time.
Reported-by: Amir Goldstein <amir73il@gmail.com>
Fixes: 2d82ab251ef0 ("virtiofs: propagate sync() to file server")
Cc: <stable@vger.kernel.org> # v5.14
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Export receive_fd() so that some modules can use
it to pass file descriptor between processes without
missing any security stuffs.
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20210831103634.33-4-xieyongji@bytedance.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Export eventfd_wake_count so that some modules can use
the eventfd_signal_count() to check whether the
eventfd_signal() call should be deferred to a safe context.
NB(mst): this patch is not needed in Linus tree since there
eventfd_signal_count() has been superseded by an already exported
eventfd_signal_allowed().
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Link: https://lore.kernel.org/r/20210831103634.33-3-xieyongji@bytedance.com
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
fs/binfmt_aout.c: In function ‘load_aout_library’:
fs/binfmt_aout.c:311:27: error: expected ‘)’ before ‘;’ token
311 | MAP_FIXED | MAP_PRIVATE;
| ^
fs/binfmt_aout.c:309:10: error: too few arguments to function ‘vm_mmap’
309 | error = vm_mmap(file, start_addr, ex.a_text + ex.a_data,
| ^~~~~~~
In file included from fs/binfmt_aout.c:12:
include/linux/mm.h:2626:35: note: declared here
2626 | extern unsigned long __must_check vm_mmap(struct file *, unsigned long,
| ^~~~~~~
Fix this by reverting the accidental replacement of a comma by a
semicolon.
Fixes: 42be8b42535183f8 ("binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()")
Reported-by: noreply@ellerman.id.au
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After we factor out the inline data write procedure from
ext4_da_write_end(), we don't need to start journal handle for the cases
of both buffer overwrite and append-write. If we need to update
i_disksize, mark_inode_dirty() do start handle and update inode buffer.
So we could just remove all the journal handle codes in the delalloc
write procedure.
After this patch, we could get a lot of performance improvement. Below
is the Unixbench comparison data test on my machine with 'Intel Xeon
Gold 5120' CPU and nvme SSD backend.
Test cmd:
./Run -c 56 -i 3 fstime fsbuffer fsdisk
Before this patch:
System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 422965.0 1068.1
File Copy 256 bufsize 500 maxblocks 1655.0 105077.0 634.9
File Copy 4096 bufsize 8000 maxblocks 5800.0 1429092.0 2464.0
======
System Benchmarks Index Score (Partial Only) 1186.6
After this patch:
System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 732716.0 1850.3
File Copy 256 bufsize 500 maxblocks 1655.0 184940.0 1117.5
File Copy 4096 bufsize 8000 maxblocks 5800.0 2427152.0 4184.7
======
System Benchmarks Index Score (Partial Only) 2053.0
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-5-yi.zhang@huawei.com
Now that the inline_data file write end procedure are falled into the
common write end functions, it is not clear. Factor them out and do
some cleanup. This patch also drop ext4_da_write_inline_data_end()
and switch to use ext4_write_inline_data_end() instead because we also
need to do the same error processing if we failed to write data into
inline entry.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-4-yi.zhang@huawei.com
Current error path of ext4_write_inline_data_end() is not correct.
Firstly, it should pass out the error value if ext4_get_inode_loc()
return fail, or else it could trigger infinite loop if we inject error
here. And then it's better to add inode to orphan list if it return fail
in ext4_journal_stop(), otherwise we could not restore inline xattr
entry after power failure. Finally, we need to reset the 'ret' value if
ext4_write_inline_data_end() return success in ext4_write_end() and
ext4_journalled_write_end(), otherwise we could not get the error return
value of ext4_journal_stop().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-3-yi.zhang@huawei.com
After commit 3da40c7b0898 ("ext4: only call ext4_truncate when size <=
isize"), i_disksize could always be updated to i_size in ext4_setattr(),
and we could sure that i_disksize <= i_size since holding inode lock and
if i_disksize < i_size there are delalloc writes pending in the range
upto i_size. If the end of the current write is <= i_size, there's no
need to touch i_disksize since writeback will push i_disksize upto
i_size eventually. So we can switch to check i_size instead of
i_disksize in ext4_da_write_end() when write to the end of the file.
we also could remove ext4_mark_inode_dirty() together because we defer
inode dirtying to generic_write_end() or ext4_da_write_inline_data_end().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-2-yi.zhang@huawei.com
Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
VM_DENYWRITE.
There are some (minor) user-visible changes:
- We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g. dlopen().
- We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often
is).
- We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
file cannot be denied, and write access to the file will remain
denied until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
s390x, ...) and verified via ltp that especially the relevant tests
(i.e., creat07 and execve04) continue working as expected"
* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
fs: update documentation of get_write_access() and friends
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
mm: remove VM_DENYWRITE
binfmt: remove in-tree usage of MAP_DENYWRITE
kernel/fork: always deny write access to current MM exe_file
kernel/fork: factor out replacing the current MM exe_file
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
Merge NTFSv3 filesystem from Konstantin Komarov:
"This patch adds NTFS Read-Write driver to fs/ntfs3.
Having decades of expertise in commercial file systems development and
huge test coverage, we at Paragon Software GmbH want to make our
contribution to the Open Source Community by providing implementation
of NTFS Read-Write driver for the Linux Kernel.
This is fully functional NTFS Read-Write driver. Current version works
with NTFS (including v3.1) and normal/compressed/sparse files and
supports journal replaying.
We plan to support this version after the codebase once merged, and
add new features and fix bugs. For example, full journaling support
over JBD will be added in later updates"
Link: https://lore.kernel.org/lkml/20210729134943.778917-1-almaz.alexandrovich@paragon-software.com/
Link: https://lore.kernel.org/lkml/aa4aa155-b9b2-9099-b7a2-349d8d9d8fbd@paragon-software.com/
* git://github.com/Paragon-Software-Group/linux-ntfs3: (35 commits)
fs/ntfs3: Change how module init/info messages are displayed
fs/ntfs3: Remove GPL boilerplates from decompress lib files
fs/ntfs3: Remove unnecessary condition checking from ntfs_file_read_iter
fs/ntfs3: Fix integer overflow in ni_fiemap with fiemap_prep()
fs/ntfs3: Restyle comments to better align with kernel-doc
fs/ntfs3: Rework file operations
fs/ntfs3: Remove fat ioctl's from ntfs3 driver for now
fs/ntfs3: Restyle comments to better align with kernel-doc
fs/ntfs3: Fix error handling in indx_insert_into_root()
fs/ntfs3: Potential NULL dereference in hdr_find_split()
fs/ntfs3: Fix error code in indx_add_allocate()
fs/ntfs3: fix an error code in ntfs_get_acl_ex()
fs/ntfs3: add checks for allocation failure
fs/ntfs3: Use kcalloc/kmalloc_array over kzalloc/kmalloc
fs/ntfs3: Do not use driver own alloc wrappers
fs/ntfs3: Use kernel ALIGN macros over driver specific
fs/ntfs3: Restyle comment block in ni_parse_reparse()
fs/ntfs3: Remove unused including <linux/version.h>
fs/ntfs3: Fix fall-through warnings for Clang
fs/ntfs3: Fix one none utf8 char in source file
...
In this cycle, we've addressed some performance issues such as lock contention,
misbehaving compress_cache, allowing extent_cache for compressed files, and new
sysfs to adjust ra_size for fadvise. In order to diagnose the performance issues
quickly, we also added an iostat which shows the IO latencies periodically. On
the stability side, we've found two memory leakage cases in the error path in
compression flow. And, we've also fixed various corner cases in fiemap, quota,
checkpoint=disable, zstd, and so on.
Enhancement:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device
Bug fix:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option
In addition to the above major updates, we've cleaned up several code paths such
as dio, unnecessary operations, debugfs/f2fs/status, sanity check, and typos.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmEyw1sACgkQQBSofoJI
UNLJmA/+NHUgwUjLMcHvmLyp6QYpQDZtKj93/sRDo+YHOYNdYFjWWUb329PYTKWS
kEdzApCP+KHfVxeSkiL/x3qWP+RlTkIf96P0kR3/BKi0tjg25G2riFWztusDDFpt
xi+AW5sUFDvIx1tFumvQHAQedSwBgcZ96ovT5EwxEuONkljhZC9phEC6vSXz9nOR
e2EQIyezbC5O21np1KSeqSgqRMpVkJkVcEHy4VmpMBCLMOOYPepWwKw+yPaV/jR/
zUXdo2/53vma50M5LCDPCtjCtWQgLoeNeGLxyjfzQuTJU6TmtPY65JObLPt6pUSj
fRW6qIziTZbVYXzOWBD0EYilv2N4c3BNJdhQCpx2Vyjw9/LLxzqKPOUyzBoa1kjY
eZVvmaLXVCKsoJdHDSi7OH/4BqS6SuSZE8eO/nGkgswqiErHZ0Vwl3bFCWC7r/Bk
r2U5spJx/83XO6c9H1bzeWEies1DRtwnCDIRRuw35RtJ4uHZaqCfkuJ7rOBwC90X
4SpaAKdUxP2RWc3GKELBIhaqPn7vyMy9ile6VU14PjM8UcY5hyE87T2azqR8gGut
nVjRL4cbMGTPj6m1Qj8KqBRSaLuShe6AncUy7bNGiM+JlcLcdB6OJ1ZYLl9hjx2r
TbIouXThgcZ4SIK0DEaBLKz2b9/0TfaO9gw1XzpRma+bWA1pApM=
=W67o
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we've addressed some performance issues such as lock
contention, misbehaving compress_cache, allowing extent_cache for
compressed files, and new sysfs to adjust ra_size for fadvise.
In order to diagnose the performance issues quickly, we also added an
iostat which shows the IO latencies periodically.
On the stability side, we've found two memory leakage cases in the
error path in compression flow. And, we've also fixed various corner
cases in fiemap, quota, checkpoint=disable, zstd, and so on.
Enhancements:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device
Bug fixes:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option
In addition to the above major updates, we've cleaned up several code
paths such as dio, unnecessary operations, debugfs/f2fs/status, sanity
check, and typos"
* tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits)
f2fs: should put a page beyond EOF when preparing a write
f2fs: deallocate compressed pages when error happens
f2fs: enable realtime discard iff device supports discard
f2fs: guarantee to write dirty data when enabling checkpoint back
f2fs: fix to unmap pages from userspace process in punch_hole()
f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
f2fs: fix to account missing .skipped_gc_rwsem
f2fs: adjust unlock order for cleanup
f2fs: Don't create discard thread when device doesn't support realtime discard
f2fs: rebuild nat_bits during umount
f2fs: introduce periodic iostat io latency traces
f2fs: separate out iostat feature
f2fs: compress: do sanity check on cluster
f2fs: fix description about main_blkaddr node
f2fs: convert S_IRUGO to 0444
f2fs: fix to keep compatibility of fault injection interface
f2fs: support fault injection for f2fs_kmem_cache_alloc()
f2fs: compress: allow write compress released file after truncate to zero
f2fs: correct comment in segment.h
f2fs: improve sbi status info in debugfs/f2fs/status
...
- New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`
- Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers
-----BEGIN PGP SIGNATURE-----
iQIyBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmExP7AACgkQ18tUv7Cl
QOshTg/zBz7OfrS23CcLLgNidTJ6S7JOuj1DShG+YzsYXT8f9Nl1DadLM7yAEyok
6JZzC8rXYzJcmYztHZzRyTuzj1+tGGb0u/MrD0bBk42VEel6eOjH/Y9ybn12Gf/E
aqlcJh8hPx44U8oo5EFjRJsg2h28O06vywqhJz+sTbkqKN4hlAgMOo5ysAB+1thg
BrTlR84EKBw5QqxPJ1WPmq9tEyGebU9Yrj1p8f0Uf015IeRNeTOXx3NzmdPshphf
2yJvjumwEzqkcHXTJFDfP6ikIcGPPMNVAOK8DHb+vDGzNsOXW7dDM7GuWA3U8DlU
ZHvyyb05Wwe6Wwg8xwx90FEXcYZFfZbSKmI9z2uoOuGFzNG07zWzPDzRft+qrOvU
VMMwP9oEh71+qesmWTvqIbR2RjxqbCYlTcc8mBrD66DROi6jZ2jznraNC85sxG0Q
b8GE+2SnYr2Q25yehj2xrRlOXyiYNkeeYmIpIquEqH9o7cSyDNJhBWbzIv6x+ith
O/S06ZVKMc9X1nH5t5121XcHrSTMMVA/67WMyKfKMxWnrADAWPQALG+ttoTcbRu7
Txew3Jb+hB8+ZdHAqbPf1l1i+7USQl1CRHMw3GRvNjCL2qcjZb1R7eyJRSQQtUyw
q6SJRGe6Sn1FTUnn96Hv15Zy8VHx+q0cOL/EQVzL1RzJIXYcag==
=Ad/3
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`
Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers"
* tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (39 commits)
NFS: Always provide aligned buffers to the RPC read layers
NFSv4.1 add network transport when session trunking is detected
SUNRPC enforce creation of no more than max_connect xprts
NFSv4 introduce max_connect mount options
SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs
SUNRPC keep track of number of transports to unique addresses
NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox
SUNRPC: Tweak TCP socket shutdown in the RPC client
SUNRPC: Simplify socket shutdown when not reusing TCP ports
NFSv4.2: remove restriction of copy size for inter-server copy.
NFS: Clean up the synopsis of callback process_op()
NFS: Extract the xdr_init_encode/decode() calls from decode_compound
NFS: Remove unused callback void decoder
NFS: Add a private local dispatcher for NFSv4 callback operations
SUNRPC: Eliminate the RQ_AUTHERR flag
SUNRPC: Set rq_auth_stat in the pg_authenticate() callout
SUNRPC: Add svc_rqst::rq_auth_stat
SUNRPC: Add dst_port to the sysfs xprt info file
SUNRPC: Add srcaddr as a file in sysfs
sunrpc: Fix return value of get_srcport()
...
If ndr->length is smaller than expected size, ksmbd can access invalid
access in ndr->data. This patch add validation to check ndr->offset is
over ndr->length. and added exception handling to check return value of
ndr read/write function.
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd_file_table_flush is a leftover from SMB1. This function is no longer
needed as SMB1 has been removed from ksmbd.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Becase smb direct header is mapped and msg->num_sge
already is incremented, the decrement should be
removed from the condition.
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
This log happens on servers with a network bridge since
the bridge does not have a specified link speed.
This is not a real error so change the error log to debug instead.
Signed-off-by: Per Forlin <perfn@axis.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When ownership is changed we might in certain scenarios loose the
ability to alter the inode after we changed ownership. This can e.g.
happen when we are on an idmapped mount where uid 0 is mapped to uid
1000 and uid 1000 is mapped to uid 0.
A caller with fs*id 1000 will be able to create files as *id 1000 on
disk. They will also be able to change ownership of files owned by *id 0
to *id 1000 but they won't be able to change ownership in the other
direction. This means acl operations following notify_change() would
fail. Move the notify_change() call after the acls have been updated.
This guarantees that we don't end up with spurious "hash value diff"
warnings later on because we managed to change ownership but didn't
manage to alter acls.
Cc: Steve French <stfrench@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>