linux

iv/linux

Author	SHA1	Message	Date
Zhihao Cheng	233221d284	proc: fix a dentry lock race between release_task and lookup [ Upstream commit `d919a1e79b` ] Commit `7bc3e6e55a` ("proc: Use a list of inodes to flush from proc") moved proc_flush_task() behind __exit_signal(). Then, process systemd can take long period high cpu usage during releasing task in following concurrent processes: systemd ps kernel_waitid stat(/proc/tgid) do_wait filename_lookup wait_consider_task lookup_fast release_task __exit_signal __unhash_process detach_pid __change_pid // remove task->pid_links d_revalidate -> pid_revalidate // 0 d_invalidate(/proc/tgid) shrink_dcache_parent(/proc/tgid) d_walk(/proc/tgid) spin_lock_nested(/proc/tgid/fd) // iterating opened fd proc_flush_pid \| d_invalidate (/proc/tgid/fd) \| shrink_dcache_parent(/proc/tgid/fd) \| shrink_dentry_list(subdirs) ↓ shrink_lock_dentry(/proc/tgid/fd) --> race on dentry lock Function d_invalidate() will remove dentry from hash firstly, but why does proc_flush_pid() process dentry '/proc/tgid/fd' before dentry '/proc/tgid'? That's because proc_pid_make_inode() adds proc inode in reverse order by invoking hlist_add_head_rcu(). But proc should not add any inodes under '/proc/tgid' except '/proc/tgid/task/pid', fix it by adding inode into 'pid->inodes' only if the inode is /proc/tgid or /proc/tgid/task/pid. Performance regression: Create 200 tasks, each task open one file for 50,000 times. Kill all tasks when opened files exceed 10,000,000 (cat /proc/sys/fs/file-nr). Before fix: $ time killall -wq aa real 4m40.946s # During this period, we can see 'ps' and 'systemd' taking high cpu usage. After fix: $ time killall -wq aa real 1m20.732s # During this period, we can see 'systemd' taking high cpu usage. Link: https://lkml.kernel.org/r/20220713130029.4133533-1-chengzhihao1@huawei.com Fixes: `7bc3e6e55a` ("proc: Use a list of inodes to flush from proc") Link: https://bugzilla.kernel.org/show_bug.cgi?id=216054 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Suggested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Baoquan He <bhe@redhat.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:15:54 +02:00
Zhihao Cheng	ddd896792e	jbd2: fix assertion 'jh->b_frozen_data == NULL' failure when journal aborted [ Upstream commit `4a734f0869` ] Following process will fail assertion 'jh->b_frozen_data == NULL' in jbd2_journal_dirty_metadata(): jbd2_journal_commit_transaction unlink(dir/a) jh->b_transaction = trans1 jh->b_jlist = BJ_Metadata journal->j_running_transaction = NULL trans1->t_state = T_COMMIT unlink(dir/b) handle->h_trans = trans2 do_get_write_access jh->b_modified = 0 jh->b_frozen_data = frozen_buffer jh->b_next_transaction = trans2 jbd2_journal_dirty_metadata is_handle_aborted is_journal_aborted // return false --> jbd2 abort <-- while (commit_transaction->t_buffers) if (is_journal_aborted) jbd2_journal_refile_buffer __jbd2_journal_refile_buffer WRITE_ONCE(jh->b_transaction, jh->b_next_transaction) WRITE_ONCE(jh->b_next_transaction, NULL) __jbd2_journal_file_buffer(jh, BJ_Reserved) J_ASSERT_JH(jh, jh->b_frozen_data == NULL) // assertion failure ! The reproducer (See detail in [Link]) reports: ------------[ cut here ]------------ kernel BUG at fs/jbd2/transaction.c:1629! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 2 PID: 584 Comm: unlink Tainted: G W 5.19.0-rc6-00115-g4a57a8400075-dirty #697 RIP: 0010:jbd2_journal_dirty_metadata+0x3c5/0x470 RSP: 0018:ffffc90000be7ce0 EFLAGS: 00010202 Call Trace: <TASK> __ext4_handle_dirty_metadata+0xa0/0x290 ext4_handle_dirty_dirblock+0x10c/0x1d0 ext4_delete_entry+0x104/0x200 __ext4_unlink+0x22b/0x360 ext4_unlink+0x275/0x390 vfs_unlink+0x20b/0x4c0 do_unlinkat+0x42f/0x4c0 __x64_sys_unlink+0x37/0x50 do_syscall_64+0x35/0x80 After journal aborting, __jbd2_journal_refile_buffer() is executed with holding @jh->b_state_lock, we can fix it by moving 'is_handle_aborted()' into the area protected by @jh->b_state_lock. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216251 Fixes: `470decc613` ("[PATCH] jbd2: initial copy of files from jbd") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20220715125152.4022726-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:15:43 +02:00
Li Lingfeng	21197733e6	ext4: recover csum seed of tmp_inode after migrating to extents [ Upstream commit `07ea7a617d` ] When migrating to extents, the checksum seed of temporary inode need to be replaced by inode's, otherwise the inode checksums will be incorrect when swapping the inodes data. However, the temporary inode can not match it's checksum to itself since it has lost it's own checksum seed. mkfs.ext4 -F /dev/sdc mount /dev/sdc /mnt/sdc xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/sdc/testfile chattr -e /mnt/sdc/testfile chattr +e /mnt/sdc/testfile umount /dev/sdc fsck -fn /dev/sdc ======== ... Pass 1: Checking inodes, blocks, and sizes Inode 13 passes checks, but checksum does not match inode. Fix? no ... ======== The fix is simple, save the checksum seed of temporary inode, and recover it after migrating to extents. Fixes: `e81c9302a6` ("ext4: set csum seed in tmp inode while migrating to extents") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220617062515.2113438-1-lilingfeng3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:15:43 +02:00
Zhang Yi	d65c932900	jbd2: fix outstanding credits assert in jbd2_journal_commit_transaction() [ Upstream commit `a89573ce4a` ] We catch an assert problem in jbd2_journal_commit_transaction() when doing fsstress and request falut injection tests. The problem is happened in a race condition between jbd2_journal_commit_transaction() and ext4_end_io_end(). Firstly, ext4_writepages() writeback dirty pages and start reserved handle, and then the journal was aborted due to some previous metadata IO error, jbd2_journal_abort() start to commit current running transaction, the committing procedure could be raced by ext4_end_io_end() and lead to subtract j_reserved_credits twice from commit_transaction->t_outstanding_credits, finally the t_outstanding_credits is mistakenly smaller than t_nr_buffers and trigger assert. kjournald2 kworker jbd2_journal_commit_transaction() write_unlock(&journal->j_state_lock); atomic_sub(j_reserved_credits, t_outstanding_credits); //sub once jbd2_journal_start_reserved() start_this_handle() //detect aborted journal jbd2_journal_free_reserved() //get running transaction read_lock(&journal->j_state_lock) __jbd2_journal_unreserve_handle() atomic_sub(j_reserved_credits, t_outstanding_credits); //sub again read_unlock(&journal->j_state_lock); journal->j_running_transaction = NULL; J_ASSERT(t_nr_buffers <= t_outstanding_credits) //bomb!!! Fix this issue by using journal->j_state_lock to protect the subtraction in jbd2_journal_commit_transaction(). Fixes: `96f1e09745` ("jbd2: avoid long hold times of j_state_lock while committing a transaction") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220611130426.2013258-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:15:43 +02:00
Yushan Zhou	4a9f35b872	kernfs: fix potential NULL dereference in __kernfs_remove [ Upstream commit `72b5d5aef2` ] When lockdep is enabled, lockdep_assert_held_write would cause potential NULL pointer dereference. Fix the following smatch warnings: fs/kernfs/dir.c:1353 __kernfs_remove() warn: variable dereferenced before check 'kn' (see line 1346) Fixes: `393c371408` ("kernfs: switch global kernfs_rwsem lock to per-fs lock") Signed-off-by: Yushan Zhou <katrinzhou@tencent.com> Link: https://lore.kernel.org/r/20220630082512.3482581-1-zys.zljxml@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:15:26 +02:00
Qu Wenruo	7470f4a1c2	btrfs: update stripe_sectors::uptodate in steal_rbio [ Upstream commit `4d10046613` ] [BUG] With added debugging, it turns out the following write sequence would cause extra read which is unnecessary: # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \ -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \ $mnt/file The debug message looks like this (btrfs header skipped): partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536 full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536 full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536 ^^^^ Still partial read, even 389152768 is already cached by the first. write. full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536 ^^^^ Still partial read for 298844160. full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768 This means every 32K writes, even they are in the same full stripe, still trigger read for previously cached data. This would cause extra RAID56 IO, making the btrfs raid56 cache useless. [CAUSE] Commit `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible") tries to make steal_rbio() subpage compatible, but during that conversion, there is one thing missing. We no longer rely on PageUptodate(rbio->stripe_pages[i]), but rbio->stripe_nsectors[i].uptodate to determine if a sector is uptodate. This means, previously if we switch the pointer, everything is done, as the PageUptodate flag is still bound to that page. But now we have to manually mark the involved sectors uptodate, or later raid56_rmw_stripe() will find the stolen sector is not uptodate, and assemble the read bio for it, wasting IO. [FIX] We can easily fix the bug, by also update the rbio->stripe_sectors[].uptodate in steal_rbio(). With this fixed, now the same write pattern no longer leads to the same unnecessary read: partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768 partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536 full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768 partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536 full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768 ^^^ No more partial read, directly into the write path. full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768 full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768 full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768 Fixes: `d4e28d9b5f` ("btrfs: raid56: make steal_rbio() subpage compatible") Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:14:54 +02:00
Jason A. Donenfeld	fd0a6e99b6	fs: check FMODE_LSEEK to control internal pipe splicing [ Upstream commit `97ef77c52b` ] The original direct splicing mechanism from Jens required the input to be a regular file because it was avoiding the special socket case. It also recognized blkdevs as being close enough to a regular file. But it forgot about chardevs, which behave the same way and work fine here. This is an okayish heuristic, but it doesn't totally work. For example, a few chardevs should be spliceable here. And a few regular files shouldn't. This patch fixes this by instead checking whether FMODE_LSEEK is set, which represents decently enough what we need rewinding for when splicing to internal pipes. Fixes: `b92ce55893` ("[PATCH] splice: add direct fd <-> fd splicing support") Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:14:49 +02:00
Hongnan Li	e66205ee2d	erofs: update ctx->pos for every emitted dirent [ Upstream commit `ecce9212d0` ] erofs_readdir update ctx->pos after filling a batch of dentries and it may cause dir/files duplication for NFS readdirplus which depends on ctx->pos to fill dir correctly. So update ctx->pos for every emitted dirent in erofs_fill_dentries to fix it. Also fix the update of ctx->pos when the initial file position has exceeded nameoff. Fixes: `3e917cc305` ("erofs: make filesystem exportable") Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20220722082732.30935-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:14:21 +02:00
Jens Axboe	95c74abe40	io_uring: move to separate directory [ Upstream commit `ed29b0b4fd` ] In preparation for splitting io_uring up a bit, move it into its own top level directory. It didn't really belong in fs/ anyway, as it's not a file system only API. This adds io_uring/ and moves the core files in there, and updates the MAINTAINERS file for the new location. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:14:20 +02:00
Gao Xiang	4b3dd66596	erofs: avoid consecutive detection for Highmem memory [ Upstream commit `448b5a1548` ] Currently, vmap()s are avoided if physical addresses are consecutive for decompressed buffers. I observed that is very common for 4KiB pclusters since the numbers of decompressed pages are almost 2 or 3. However, such detection doesn't work for Highmem pages on 32-bit machines, let's fix it now. Reported-by: Liu Jinbao <liujinbao1@xiaomi.com> Fixes: `7fc45dbc93` ("staging: erofs: introduce generic decompression backend") Link: https://lore.kernel.org/r/20220708101001.21242-1-hsiangkao@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:14:16 +02:00
Yuwen Chen	96aa2a6a89	erofs: wake up all waiters after z_erofs_lzma_head ready [ Upstream commit `2df7c4bd7c` ] When the user mounts the erofs second times, the decompression thread may hung. The problem happens due to a sequence of steps like the following: 1) Task A called z_erofs_load_lzma_config which obtain all of the node from the z_erofs_lzma_head. 2) At this time, task B called the z_erofs_lzma_decompress and wanted to get a node. But the z_erofs_lzma_head was empty, the Task B had to sleep. 3) Task A release nodes and push nodes into the z_erofs_lzma_head. But task B was still sleeping. One example report when the hung happens: task:kworker/u3:1 state:D stack:14384 pid: 86 ppid: 2 flags:0x00004000 Workqueue: erofs_unzipd z_erofs_decompressqueue_work Call Trace: <TASK> __schedule+0x281/0x760 schedule+0x49/0xb0 z_erofs_lzma_decompress+0x4bc/0x580 ? cpu_core_flags+0x10/0x10 z_erofs_decompress_pcluster+0x49b/0xba0 ? __update_load_avg_se+0x2b0/0x330 ? __update_load_avg_se+0x2b0/0x330 ? update_load_avg+0x5f/0x690 ? update_load_avg+0x5f/0x690 ? set_next_entity+0xbd/0x110 ? _raw_spin_unlock+0xd/0x20 z_erofs_decompress_queue.isra.0+0x2e/0x50 z_erofs_decompressqueue_work+0x30/0x60 process_one_work+0x1d3/0x3a0 worker_thread+0x45/0x3a0 ? process_one_work+0x3a0/0x3a0 kthread+0xe2/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x22/0x30 </TASK> Signed-off-by: Yuwen Chen <chenyuwen1@meizu.com> Fixes: `622ceaddb7` ("erofs: lzma compression support") Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20220626224041.4288-1-chenyuwen1@meizu.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:14:16 +02:00
Jan Kara	5e63c5fe91	ext2: Add more validity checks for inode counts [ Upstream commit `fa78f33693` ] Add checks verifying number of inodes stored in the superblock matches the number computed from number of inodes per group. Also verify we have at least one block worth of inodes per group. This prevents crashes on corrupted filesystems. Reported-by: syzbot+d273f7d7f58afd93be48@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-08-17 15:14:00 +02:00
Benjamin Segall	503dde6122	epoll: autoremove wakers even more aggressively commit `a16ceb1396` upstream. If a process is killed or otherwise exits while having active network connections and many threads waiting on epoll_wait, the threads will all be woken immediately, but not removed from ep->wq. Then when network traffic scans ep->wq in wake_up, every wakeup attempt will fail, and will not remove the entries from the list. This means that the cost of the wakeup attempt is far higher than usual, does not decrease, and this also competes with the dying threads trying to actually make progress and remove themselves from the wq. Handle this by removing visited epoll wq entries unconditionally, rather than only when the wakeup succeeds - the structure of ep_poll means that the only potential loss is the timed_out->eavail heuristic, which now can race and result in a redundant ep_send_events attempt. (But only when incoming data and a timeout actually race, not on every timeout) Shakeel added: : We are seeing this issue in production with real workloads and it has : caused hard lockups. Particularly network heavy workloads with a lot : of threads in epoll_wait() can easily trigger this issue if they get : killed (oom-killed in our case). Link: https://lkml.kernel.org/r/xm26fsjotqda.fsf@google.com Signed-off-by: Ben Segall <bsegall@google.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Khazhismel Kumykov <khazhy@google.com> Cc: Heiher <r@hev.cc> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:59 +02:00
Jan Kara	546a2d5c10	mbcache: add functions to delete entry if unused commit `3dc96bba65` upstream. Add function mb_cache_entry_delete_or_get() to delete mbcache entry if it is unused and also add a function to wait for entry to become unused - mb_cache_entry_wait_unused(). We do not share code between the two deleting function as one of them will go away soon. CC: stable@vger.kernel.org Fixes: `82939d7999` ("ext4: convert to mbcache2") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:56 +02:00
Jan Kara	6f01f9e166	mbcache: don't reclaim used entries commit `5831891418` upstream. Do not reclaim entries that are currently used by somebody from a shrinker. Firstly, these entries are likely useful. Secondly, we will need to keep such entries to protect pending increment of xattr block refcount. CC: stable@vger.kernel.org Fixes: `82939d7999` ("ext4: convert to mbcache2") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220712105436.32204-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:56 +02:00
Miklos Szeredi	1fdbbe246d	fuse: fix deadlock between atomic O_TRUNC and page invalidation commit `2fdbb8dd01` upstream. fuse_finish_open() will be called with FUSE_NOWRITE set in case of atomic O_TRUNC open(), so commit `76224355db` ("fuse: truncate pagecache on atomic_o_trunc") replaced invalidate_inode_pages2() by truncate_pagecache() in such a case to avoid the A-A deadlock. However, we found another A-B-B-A deadlock related to the case above, which will cause the xfstests generic/464 testcase hung in our virtio-fs test environment. For example, consider two processes concurrently open one same file, one with O_TRUNC and another without O_TRUNC. The deadlock case is described below, if open(O_TRUNC) is already set_nowrite(acquired A), and is trying to lock a page (acquiring B), open() could have held the page lock (acquired B), and waiting on the page writeback (acquiring A). This would lead to deadlocks. open(O_TRUNC) ---------------------------------------------------------------- fuse_open_common inode_lock [C acquire] fuse_set_nowrite [A acquire] fuse_finish_open truncate_pagecache lock_page [B acquire] truncate_inode_page unlock_page [B release] fuse_release_nowrite [A release] inode_unlock [C release] ---------------------------------------------------------------- open() ---------------------------------------------------------------- fuse_open_common fuse_finish_open invalidate_inode_pages2 lock_page [B acquire] fuse_launder_page fuse_wait_on_page_writeback [A acquire & release] unlock_page [B release] ---------------------------------------------------------------- Besides this case, all calls of invalidate_inode_pages2() and invalidate_inode_pages2_range() in fuse code also can deadlock with open(O_TRUNC). Fix by moving the truncate_pagecache() call outside the nowrite protected region. The nowrite protection is only for delayed writeback (writeback_cache) case, where inode lock does not protect against truncation racing with writes on the server. Write syscalls racing with page cache truncation still get the inode lock protection. This patch also changes the order of filemap_invalidate_lock() vs. fuse_set_nowrite() in fuse_open_common(). This new order matches the order found in fuse_file_fallocate() and fuse_do_setattr(). Reported-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Tested-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Fixes: `e4648309b8` ("fuse: truncate pending writes on O_TRUNC") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:55 +02:00
Miklos Szeredi	4bd9d5d20f	fuse: write inode in fuse_release() commit `035ff33cf4` upstream. A race between write(2) and close(2) allows pages to be dirtied after fuse_flush -> write_inode_now(). If these pages are not flushed from fuse_release(), then there might not be a writable open file later. So any remaining dirty pages must be written back before the file is released. This is a partial revert of the blamed commit. Reported-by: syzbot+6e1efbd8efaaa6860e91@syzkaller.appspotmail.com Fixes: `36ea23374d` ("fuse: write inode in fuse_vma_close() instead of fuse_release()") Cc: <stable@vger.kernel.org> # v5.16 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:55 +02:00
Miklos Szeredi	ed2ac89f29	fuse: ioctl: translate ENOSYS commit `02c0cab8e7` upstream. Overlayfs may fail to complete updates when a filesystem lacks fileattr/xattr syscall support and responds with an ENOSYS error code, resulting in an unexpected "Function not implemented" error. This bug may occur with FUSE filesystems, such as davfs2. Steps to reproduce: # install davfs2, e.g., apk add davfs2 mkdir /test mkdir /test/lower /test/upper /test/work /test/mnt yes '' \| mount -t davfs -o ro http://some-web-dav-server/path \ /test/lower mount -t overlay -o upperdir=/test/upper,lowerdir=/test/lower \ -o workdir=/test/work overlay /test/mnt # when "some-file" exists in the lowerdir, this fails with "Function # not implemented", with dmesg showing "overlayfs: failed to retrieve # lower fileattr (/some-file, err=-38)" touch /test/mnt/some-file The underlying cause of this regresion is actually in FUSE, which fails to translate the ENOSYS error code returned by userspace filesystem (which means that the ioctl operation is not supported) to ENOTTY. Reported-by: Christian Kohlschütter <christian@kohlschutter.com> Fixes: `72db82115d` ("ovl: copy up sync/noatime fileattr flags") Fixes: `59efec7b90` ("fuse: implement ioctl support") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:55 +02:00
Miklos Szeredi	19d747bf8a	fuse: limit nsec commit `47912eaa06` upstream. Limit nanoseconds to 0..999999999. Fixes: `d8a5ba4545` ("[PATCH] FUSE - core") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:55 +02:00
Namjae Jeon	8e33102309	ksmbd: fix heap-based overflow in set_ntacl_dacl() commit `8f0541186e` upstream. The testcase use SMB2_SET_INFO_HE command to set a malformed file attribute under the label `security.NTACL`. SMB2_QUERY_INFO_HE command in testcase trigger the following overflow. [ 4712.003781] ================================================================== [ 4712.003790] BUG: KASAN: slab-out-of-bounds in build_sec_desc+0x842/0x1dd0 [ksmbd] [ 4712.003807] Write of size 1060 at addr ffff88801e34c068 by task kworker/0:0/4190 [ 4712.003813] CPU: 0 PID: 4190 Comm: kworker/0:0 Not tainted 5.19.0-rc5 #1 [ 4712.003850] Workqueue: ksmbd-io handle_ksmbd_work [ksmbd] [ 4712.003867] Call Trace: [ 4712.003870] <TASK> [ 4712.003873] dump_stack_lvl+0x49/0x5f [ 4712.003935] print_report.cold+0x5e/0x5cf [ 4712.003972] ? ksmbd_vfs_get_sd_xattr+0x16d/0x500 [ksmbd] [ 4712.003984] ? cmp_map_id+0x200/0x200 [ 4712.003988] ? build_sec_desc+0x842/0x1dd0 [ksmbd] [ 4712.004000] kasan_report+0xaa/0x120 [ 4712.004045] ? build_sec_desc+0x842/0x1dd0 [ksmbd] [ 4712.004056] kasan_check_range+0x100/0x1e0 [ 4712.004060] memcpy+0x3c/0x60 [ 4712.004064] build_sec_desc+0x842/0x1dd0 [ksmbd] [ 4712.004076] ? parse_sec_desc+0x580/0x580 [ksmbd] [ 4712.004088] ? ksmbd_acls_fattr+0x281/0x410 [ksmbd] [ 4712.004099] smb2_query_info+0xa8f/0x6110 [ksmbd] [ 4712.004111] ? psi_group_change+0x856/0xd70 [ 4712.004148] ? update_load_avg+0x1c3/0x1af0 [ 4712.004152] ? asym_cpu_capacity_scan+0x5d0/0x5d0 [ 4712.004157] ? xas_load+0x23/0x300 [ 4712.004162] ? smb2_query_dir+0x1530/0x1530 [ksmbd] [ 4712.004173] ? _raw_spin_lock_bh+0xe0/0xe0 [ 4712.004179] handle_ksmbd_work+0x30e/0x1020 [ksmbd] [ 4712.004192] process_one_work+0x778/0x11c0 [ 4712.004227] ? _raw_spin_lock_irq+0x8e/0xe0 [ 4712.004231] worker_thread+0x544/0x1180 [ 4712.004234] ? __cpuidle_text_end+0x4/0x4 [ 4712.004239] kthread+0x282/0x320 [ 4712.004243] ? process_one_work+0x11c0/0x11c0 [ 4712.004246] ? kthread_complete_and_exit+0x30/0x30 [ 4712.004282] ret_from_fork+0x1f/0x30 This patch add the buffer validation for security descriptor that is stored by malformed SMB2_SET_INFO_HE command. and allocate large response buffer about SMB2_O_INFO_SECURITY file info class. Fixes: `e2f34481b2` ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17771 Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:54 +02:00
Namjae Jeon	02ed2a9b78	ksmbd: fix use-after-free bug in smb2_tree_disconect commit `cf6531d981` upstream. smb2_tree_disconnect() freed the struct ksmbd_tree_connect, but it left the dangling pointer. It can be accessed again under compound requests. This bug can lead an oops looking something link: [ 1685.468014 ] BUG: KASAN: use-after-free in ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd] [ 1685.468068 ] Read of size 4 at addr ffff888102172180 by task kworker/1:2/4807 ... [ 1685.468130 ] Call Trace: [ 1685.468132 ] <TASK> [ 1685.468135 ] dump_stack_lvl+0x49/0x5f [ 1685.468141 ] print_report.cold+0x5e/0x5cf [ 1685.468145 ] ? ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd] [ 1685.468157 ] kasan_report+0xaa/0x120 [ 1685.468194 ] ? ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd] [ 1685.468206 ] __asan_report_load4_noabort+0x14/0x20 [ 1685.468210 ] ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd] [ 1685.468222 ] smb2_tree_disconnect+0x175/0x250 [ksmbd] [ 1685.468235 ] handle_ksmbd_work+0x30e/0x1020 [ksmbd] [ 1685.468247 ] process_one_work+0x778/0x11c0 [ 1685.468251 ] ? _raw_spin_lock_irq+0x8e/0xe0 [ 1685.468289 ] worker_thread+0x544/0x1180 [ 1685.468293 ] ? __cpuidle_text_end+0x4/0x4 [ 1685.468297 ] kthread+0x282/0x320 [ 1685.468301 ] ? process_one_work+0x11c0/0x11c0 [ 1685.468305 ] ? kthread_complete_and_exit+0x30/0x30 [ 1685.468309 ] ret_from_fork+0x1f/0x30 Fixes: `e2f34481b2` ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17816 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:54 +02:00
Hyunchul Lee	0f1c9908c8	ksmbd: prevent out of bound read for SMB2_WRITE commit `ac60778b87` upstream. OOB read memory can be written to a file, if DataOffset is 0 and Length is too large in SMB2_WRITE request of compound request. To prevent this, when checking the length of the data area of SMB2_WRITE in smb2_get_data_area_len(), let the minimum of DataOffset be the size of SMB2 header + the size of SMB2_WRITE header. This bug can lead an oops looking something like: [ 798.008715] BUG: KASAN: slab-out-of-bounds in copy_page_from_iter_atomic+0xd3d/0x14b0 [ 798.008724] Read of size 252 at addr ffff88800f863e90 by task kworker/0:2/2859 ... [ 798.008754] Call Trace: [ 798.008756] <TASK> [ 798.008759] dump_stack_lvl+0x49/0x5f [ 798.008764] print_report.cold+0x5e/0x5cf [ 798.008768] ? __filemap_get_folio+0x285/0x6d0 [ 798.008774] ? copy_page_from_iter_atomic+0xd3d/0x14b0 [ 798.008777] kasan_report+0xaa/0x120 [ 798.008781] ? copy_page_from_iter_atomic+0xd3d/0x14b0 [ 798.008784] kasan_check_range+0x100/0x1e0 [ 798.008788] memcpy+0x24/0x60 [ 798.008792] copy_page_from_iter_atomic+0xd3d/0x14b0 [ 798.008795] ? pagecache_get_page+0x53/0x160 [ 798.008799] ? iov_iter_get_pages_alloc+0x1590/0x1590 [ 798.008803] ? ext4_write_begin+0xfc0/0xfc0 [ 798.008807] ? current_time+0x72/0x210 [ 798.008811] generic_perform_write+0x2c8/0x530 [ 798.008816] ? filemap_fdatawrite_wbc+0x180/0x180 [ 798.008820] ? down_write+0xb4/0x120 [ 798.008824] ? down_write_killable+0x130/0x130 [ 798.008829] ext4_buffered_write_iter+0x137/0x2c0 [ 798.008833] ext4_file_write_iter+0x40b/0x1490 [ 798.008837] ? __fsnotify_parent+0x275/0xb20 [ 798.008842] ? __fsnotify_update_child_dentry_flags+0x2c0/0x2c0 [ 798.008846] ? ext4_buffered_write_iter+0x2c0/0x2c0 [ 798.008851] __kernel_write+0x3a1/0xa70 [ 798.008855] ? __x64_sys_preadv2+0x160/0x160 [ 798.008860] ? security_file_permission+0x4a/0xa0 [ 798.008865] kernel_write+0xbb/0x360 [ 798.008869] ksmbd_vfs_write+0x27e/0xb90 [ksmbd] [ 798.008881] ? ksmbd_vfs_read+0x830/0x830 [ksmbd] [ 798.008892] ? _raw_read_unlock+0x2a/0x50 [ 798.008896] smb2_write+0xb45/0x14e0 [ksmbd] [ 798.008909] ? __kasan_check_write+0x14/0x20 [ 798.008912] ? _raw_spin_lock_bh+0xd0/0xe0 [ 798.008916] ? smb2_read+0x15e0/0x15e0 [ksmbd] [ 798.008927] ? memcpy+0x4e/0x60 [ 798.008931] ? _raw_spin_unlock+0x19/0x30 [ 798.008934] ? ksmbd_smb2_check_message+0x16af/0x2350 [ksmbd] [ 798.008946] ? _raw_spin_lock_bh+0xe0/0xe0 [ 798.008950] handle_ksmbd_work+0x30e/0x1020 [ksmbd] [ 798.008962] process_one_work+0x778/0x11c0 [ 798.008966] ? _raw_spin_lock_irq+0x8e/0xe0 [ 798.008970] worker_thread+0x544/0x1180 [ 798.008973] ? __cpuidle_text_end+0x4/0x4 [ 798.008977] kthread+0x282/0x320 [ 798.008982] ? process_one_work+0x11c0/0x11c0 [ 798.008985] ? kthread_complete_and_exit+0x30/0x30 [ 798.008989] ret_from_fork+0x1f/0x30 [ 798.008995] </TASK> Fixes: `e2f34481b2` ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17817 Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:54 +02:00
Hyunchul Lee	9ec5086d14	ksmbd: prevent out of bound read for SMB2_TREE_CONNNECT commit `824d4f64c2` upstream. if Status is not 0 and PathLength is long, smb_strndup_from_utf16 could make out of bound read in smb2_tree_connnect. This bug can lead an oops looking something like: [ 1553.882047] BUG: KASAN: slab-out-of-bounds in smb_strndup_from_utf16+0x469/0x4c0 [ksmbd] [ 1553.882064] Read of size 2 at addr ffff88802c4eda04 by task kworker/0:2/42805 ... [ 1553.882095] Call Trace: [ 1553.882098] <TASK> [ 1553.882101] dump_stack_lvl+0x49/0x5f [ 1553.882107] print_report.cold+0x5e/0x5cf [ 1553.882112] ? smb_strndup_from_utf16+0x469/0x4c0 [ksmbd] [ 1553.882122] kasan_report+0xaa/0x120 [ 1553.882128] ? smb_strndup_from_utf16+0x469/0x4c0 [ksmbd] [ 1553.882139] __asan_report_load_n_noabort+0xf/0x20 [ 1553.882143] smb_strndup_from_utf16+0x469/0x4c0 [ksmbd] [ 1553.882155] ? smb_strtoUTF16+0x3b0/0x3b0 [ksmbd] [ 1553.882166] ? __kmalloc_node+0x185/0x430 [ 1553.882171] smb2_tree_connect+0x140/0xab0 [ksmbd] [ 1553.882185] handle_ksmbd_work+0x30e/0x1020 [ksmbd] [ 1553.882197] process_one_work+0x778/0x11c0 [ 1553.882201] ? _raw_spin_lock_irq+0x8e/0xe0 [ 1553.882206] worker_thread+0x544/0x1180 [ 1553.882209] ? __cpuidle_text_end+0x4/0x4 [ 1553.882214] kthread+0x282/0x320 [ 1553.882218] ? process_one_work+0x11c0/0x11c0 [ 1553.882221] ? kthread_complete_and_exit+0x30/0x30 [ 1553.882225] ret_from_fork+0x1f/0x30 [ 1553.882231] </TASK> There is no need to check error request validation in server. This check allow invalid requests not to validate message. Fixes: `e2f34481b2` ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17818 Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:54 +02:00
Namjae Jeon	ff20f18758	ksmbd: fix memory leak in smb2_handle_negotiate commit `aa7253c239` upstream. The allocated memory didn't free under an error path in smb2_handle_negotiate(). Fixes: `e2f34481b2` ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17815 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:54 +02:00
Qu Wenruo	7e06e61527	btrfs: reject log replay if there is unsupported RO compat flag commit `dc4d316849` upstream. [BUG] If we have a btrfs image with dirty log, along with an unsupported RO compatible flag: log_root 30474240 ... compat_flags 0x0 compat_ro_flags 0x40000003 ( FREE_SPACE_TREE \| FREE_SPACE_TREE_VALID \| unknown flag: 0x40000000 ) Then even if we can only mount it RO, we will still cause metadata update for log replay: BTRFS info (device dm-1): flagging fs with big metadata feature BTRFS info (device dm-1): using free space tree BTRFS info (device dm-1): has skinny extents BTRFS info (device dm-1): start tree-log replay This is definitely against RO compact flag requirement. [CAUSE] RO compact flag only forces us to do RO mount, but we will still do log replay for plain RO mount. Thus this will result us to do log replay and update metadata. This can be very problematic for new RO compat flag, for example older kernel can not understand v2 cache, and if we allow metadata update on RO mount and invalidate/corrupt v2 cache. [FIX] Just reject the mount unless rescue=nologreplay is provided: BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead We don't want to set rescue=nologreply directly, as this would make the end user to read the old data, and cause confusion. Since the such case is really rare, we're mostly fine to just reject the mount with an error message, which also includes the proper workaround. CC: stable@vger.kernel.org #4.9+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:52 +02:00
Jiachen Zhang	247caf4a87	ovl: drop WARN_ON() dentry is NULL in ovl_encode_fh() commit `dd524b7f31` upstream. Some code paths cannot guarantee the inode have any dentry alias. So WARN_ON() all !dentry may flood the kernel logs. For example, when an overlayfs inode is watched by inotifywait (1), and someone is trying to read the /proc/$(pidof inotifywait)/fdinfo/INOTIFY_FD, at that time if the dentry has been reclaimed by kernel (such as echo 2 > /proc/sys/vm/drop_caches), there will be a WARN_ON(). The printed call stack would be like: ? show_mark_fhandle+0xf0/0xf0 show_mark_fhandle+0x4a/0xf0 ? show_mark_fhandle+0xf0/0xf0 ? seq_vprintf+0x30/0x50 ? seq_printf+0x53/0x70 ? show_mark_fhandle+0xf0/0xf0 inotify_fdinfo+0x70/0x90 show_fdinfo.isra.4+0x53/0x70 seq_show+0x130/0x170 seq_read+0x153/0x440 vfs_read+0x94/0x150 ksys_read+0x5f/0xe0 do_syscall_64+0x59/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 So let's drop WARN_ON() to avoid kernel log flooding. Reported-by: Hongbo Yin <yinhongbo@bytedance.com> Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> Signed-off-by: Tianci Zhang <zhangtianci.1997@bytedance.com> Fixes: `8ed5eec9d6` ("ovl: encode pure upper file handles") Cc: <stable@vger.kernel.org> # v4.16 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:48 +02:00
Yang Xu	d082cb154f	fs: Add missing umask strip in vfs_tmpfile commit `ac6800e279` upstream. All creation paths except for O_TMPFILE handle umask in the vfs directly if the filesystem doesn't support or enable POSIX ACLs. If the filesystem does then umask handling is deferred until posix_acl_create(). Because, O_TMPFILE misses umask handling in the vfs it will not honor umask settings. Fix this by adding the missing umask handling. Link: https://lore.kernel.org/r/1657779088-2242-2-git-send-email-xuyang2018.jy@fujitsu.com Fixes: `60545d0d46` ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...") Cc: <stable@vger.kernel.org> # 4.19+ Reported-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:47 +02:00
David Howells	5db80480c5	vfs: Check the truncate maximum size in inode_newsize_ok() commit `e2ebff9c57` upstream. If something manages to set the maximum file size to MAX_OFFSET+1, this can cause the xfs and ext4 filesystems at least to become corrupt. Ordinarily, the kernel protects against userspace trying this by checking the value early in the truncate() and ftruncate() system calls calls - but there are at least two places that this check is bypassed: (1) Cachefiles will round up the EOF of the backing file to DIO block size so as to allow DIO on the final block - but this might push the offset negative. It then calls notify_change(), but this inadvertently bypasses the checking. This can be triggered if someone puts an 8EiB-1 file on a server for someone else to try and access by, say, nfs. (2) ksmbd doesn't check the value it is given in set_end_of_file_info() and then calls vfs_truncate() directly - which also bypasses the check. In both cases, it is potentially possible for a network filesystem to cause a disk filesystem to be corrupted: cachefiles in the client's cache filesystem; ksmbd in the server's filesystem. nfsd is okay as it checks the value, but we can then remove this check too. Fix this by adding a check to inode_newsize_ok(), as called from setattr_prepare(), thereby catching the issue as filesystems set up to perform the truncate with minimal opportunity for bypassing the new check. Fixes: `1f08c925e7` ("cachefiles: Implement backing file wrangling") Fixes: `f441584858` ("cifsd: add file operations") Signed-off-by: David Howells <dhowells@redhat.com> Reported-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Cc: stable@kernel.org Acked-by: Alexander Viro <viro@zeniv.linux.org.uk> cc: Steve French <sfrench@samba.org> cc: Hyunchul Lee <hyc.lee@gmail.com> cc: Chuck Lever <chuck.lever@oracle.com> cc: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:46 +02:00
Jeff Layton	eb3ebbd268	lockd: detect and reject lock arguments that overflow commit `6930bcbfb6` upstream. lockd doesn't currently vet the start and length in nlm4 requests like it should, and can end up generating lock requests with arguments that overflow when passed to the filesystem. The NLM4 protocol uses unsigned 64-bit arguments for both start and length, whereas struct file_lock tracks the start and end as loff_t values. By the time we get around to calling nlm4svc_retrieve_args, we've lost the information that would allow us to determine if there was an overflow. Start tracking the actual start and len for NLM4 requests in the nlm_lock. In nlm4svc_retrieve_args, vet these values to ensure they won't cause an overflow, and return NLM4_FBIG if they do. Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=392 Reported-by: Jan Kasiak <j.kasiak@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> # 5.14+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:42 +02:00
Jeff Layton	755f49dd41	nfsd: eliminate the NFSD_FILE_BREAK_* flags commit `23ba98de6d` upstream. We had a report from the spring Bake-a-thon of data corruption in some nfstest_interop tests. Looking at the traces showed the NFS server allowing a v3 WRITE to proceed while a read delegation was still outstanding. Currently, we only set NFSD_FILE_BREAK_* flags if NFSD_MAY_NOT_BREAK_LEASE was set when we call nfsd_file_alloc. NFSD_MAY_NOT_BREAK_LEASE was intended to be set when finding files for COMMIT ops, where we need a writeable filehandle but don't need to break read leases. It doesn't make any sense to consult that flag when allocating a file since the file may be used on subsequent calls where we do want to break the lease (and the usage of it here seems to be reverse from what it should be anyway). Also, after calling nfsd_open_break_lease, we don't want to clear the BREAK_* bits. A lease could end up being set on it later (more than once) and we need to be able to break those leases as well. This means that the NFSD_FILE_BREAK_* flags now just mirror NFSD_MAY_{READ,WRITE} flags, so there's no need for them at all. Just drop those flags and unconditionally call nfsd_open_break_lease every time. Reported-by: Olga Kornieskaia <kolga@netapp.com> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2107360 Fixes: `65294c1f2c` (nfsd: add a new struct file caching facility to nfsd) Cc: <stable@vger.kernel.org> # 5.4.x : `bb283ca18d` NFSD: Clean up the show_nf_flags() macro Cc: <stable@vger.kernel.org> # 5.4.x Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:40 +02:00
Trond Myklebust	ed48237d6b	pNFS/flexfiles: Report RDMA connection errors to the server commit `7836d75467` upstream. The RPC/RDMA driver will return -EPROTO and -ENODEV as connection errors under certain circumstances. Make sure that we handle them and report them to the server. If not, we can end up cycling forever in a LAYOUTGET/LAYOUTRETURN loop. Fixes: `a12f996d34` ("NFSv4/pNFS: Use connections to a DS that are all of the same protocol family") Cc: stable@vger.kernel.org # 5.11.x Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:40 +02:00
Trond Myklebust	fb78ef830e	Revert "pNFS: nfs3_set_ds_client should set NFS_CS_NOPING" commit `9597152d98` upstream. This reverts commit `c6eb58435b`. If a transport is down, then we want to fail over to other transports if they are listed in the GETDEVICEINFO reply. Fixes: `c6eb58435b` ("pNFS: nfs3_set_ds_client should set NFS_CS_NOPING") Cc: stable@vger.kernel.org # 5.11.x Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-08-17 15:13:39 +02:00
Linus Torvalds	39c3c396f8	Merge tag 'mm-hotfixes-stable-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Thirteen hotfixes. Eight are cc:stable and the remainder are for post-5.18 issues or are too minor to warrant backporting" * tag 'mm-hotfixes-stable-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: update Gao Xiang's email addresses userfaultfd: provide properly masked address for huge-pages Revert "ocfs2: mount shared volume without ha stack" hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte fs: sendfile handles O_NONBLOCK of out_fd ntfs: fix use-after-free in ntfs_ucsncmp() secretmem: fix unhandled fault in truncate mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() mm: fix missing wake-up event for FSDAX pages mm: fix page leak with multiple threads mapping the same page mailmap: update Seth Forshee's email address tmpfs: fix the issue that the mount and remount results are inconsistent. mm: kfence: apply kmemleak_ignore_phys on early allocated pool	2022-07-26 19:38:46 -07:00
Nadav Amit	d172b1a3bd	userfaultfd: provide properly masked address for huge-pages Commit `824ddc601a` ("userfaultfd: provide unmasked address on page-fault") was introduced to fix an old bug, in which the offset in the address of a page-fault was masked. Concerns were raised - although were never backed by actual code - that some userspace code might break because the bug has been around for quite a while. To address these concerns a new flag was introduced, and only when this flag is set by the user, userfaultfd provides the exact address of the page-fault. The commit however had a bug, and if the flag is unset, the offset was always masked based on a base-page granularity. Yet, for huge-pages, the behavior prior to the commit was that the address is masked to the huge-page granulrity. While there are no reports on real breakage, fix this issue. If the flag is unset, use the address with the masking that was done before. Link: https://lkml.kernel.org/r/20220711165906.2682-1-namit@vmware.com Fixes: `824ddc601a` ("userfaultfd: provide unmasked address on page-fault") Signed-off-by: Nadav Amit <namit@vmware.com> Reported-by: James Houghton <jthoughton@google.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: James Houghton <jthoughton@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-26 18:25:01 -07:00
Linus Torvalds	a5235996e1	Merge tag 'io_uring-5.19-2022-07-21' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Fix for a bad kfree() introduced in this cycle, and a quick fix for disabling buffer recycling for IORING_OP_READV. The latter will get reworked for 5.20, but it gets the job done for 5.19" * tag 'io_uring-5.19-2022-07-21' of git://git.kernel.dk/linux-block: io_uring: do not recycle buffer in READV io_uring: fix free of unallocated buffer list	2022-07-22 12:47:09 -07:00
Dylan Yudaken	934447a603	io_uring: do not recycle buffer in READV READV cannot recycle buffers as it would lose some of the data required to reimport that buffer. Reported-by: Ammar Faizi <ammarfaizi2@gnuweeb.org> Fixes: `b66e65f414` ("io_uring: never call io_buffer_select() for a buffer re-select") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220721131325.624788-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-21 08:31:31 -06:00
Dylan Yudaken	ec8516f3b7	io_uring: fix free of unallocated buffer list in the error path of io_register_pbuf_ring, only free bl if it was allocated. Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Fixes: `c7fb19428d` ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Dylan Yudaken <dylany@fb.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/all/CANX2M5bXKw1NaHdHNVqssUUaBCs8aBpmzRNVEYEvV0n44P7ioA@mail.gmail.com/ Link: https://lore.kernel.org/all/CANX2M5YiZBXU3L6iwnaLs-HHJXRvrxM8mhPDiMDF9Y9sAvOHUA@mail.gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-21 08:29:01 -06:00
Junxiao Bi	c80af0c250	Revert "ocfs2: mount shared volume without ha stack" This reverts commit `912f655d78`. This commit introduced a regression that can cause mount hung. The changes in __ocfs2_find_empty_slot causes that any node with none-zero node number can grab the slot that was already taken by node 0, so node 1 will access the same journal with node 0, when it try to grab journal cluster lock, it will hung because it was already acquired by node 0. It's very easy to reproduce this, in one cluster, mount node 0 first, then node 1, you will see the following call trace from node 1. [13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds. [13148.739691] Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2 [13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [13148.745846] task:mount.ocfs2 state:D stack: 0 pid:53045 ppid: 53044 flags:0x00004000 [13148.749354] Call Trace: [13148.750718] <TASK> [13148.752019] ? usleep_range+0x90/0x89 [13148.753882] __schedule+0x210/0x567 [13148.755684] schedule+0x44/0xa8 [13148.757270] schedule_timeout+0x106/0x13c [13148.759273] ? __prepare_to_swait+0x53/0x78 [13148.761218] __wait_for_common+0xae/0x163 [13148.763144] __ocfs2_cluster_lock.constprop.0+0x1d6/0x870 [ocfs2] [13148.765780] ? ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2] [13148.768312] ocfs2_inode_lock_full_nested+0x18d/0x398 [ocfs2] [13148.770968] ocfs2_journal_init+0x91/0x340 [ocfs2] [13148.773202] ocfs2_check_volume+0x39/0x461 [ocfs2] [13148.775401] ? iput+0x69/0xba [13148.777047] ocfs2_mount_volume.isra.0.cold+0x40/0x1f5 [ocfs2] [13148.779646] ocfs2_fill_super+0x54b/0x853 [ocfs2] [13148.781756] mount_bdev+0x190/0x1b7 [13148.783443] ? ocfs2_remount+0x440/0x440 [ocfs2] [13148.785634] legacy_get_tree+0x27/0x48 [13148.787466] vfs_get_tree+0x25/0xd0 [13148.789270] do_new_mount+0x18c/0x2d9 [13148.791046] __x64_sys_mount+0x10e/0x142 [13148.792911] do_syscall_64+0x3b/0x89 [13148.794667] entry_SYSCALL_64_after_hwframe+0x170/0x0 [13148.797051] RIP: 0033:0x7f2309f6e26e [13148.798784] RSP: 002b:00007ffdcee7d408 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [13148.801974] RAX: ffffffffffffffda RBX: 00007ffdcee7d4a0 RCX: 00007f2309f6e26e [13148.804815] RDX: 0000559aa762a8ae RSI: 0000559aa939d340 RDI: 0000559aa93a22b0 [13148.807719] RBP: 00007ffdcee7d5b0 R08: 0000559aa93a2290 R09: 00007f230a0b4820 [13148.810659] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcee7d420 [13148.813609] R13: 0000000000000000 R14: 0000559aa939f000 R15: 0000000000000000 [13148.816564] </TASK> To fix it, we can just fix __ocfs2_find_empty_slot. But original commit introduced the feature to mount ocfs2 locally even it is cluster based, that is a very dangerous, it can easily cause serious data corruption, there is no way to stop other nodes mounting the fs and corrupting it. Setup ha or other cluster-aware stack is just the cost that we have to take for avoiding corruption, otherwise we have to do it in kernel. Link: https://lkml.kernel.org/r/20220603222801.42488-1-junxiao.bi@oracle.com Fixes: 912f655d78c5("ocfs2: mount shared volume without ha stack") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <heming.zhao@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-18 15:09:15 -07:00
Andrei Vagin	bdeb77bc2c	fs: sendfile handles O_NONBLOCK of out_fd sendfile has to return EAGAIN if out_fd is nonblocking and the write into it would block. Here is a small reproducer for the problem: #define _GNU_SOURCE /* See feature_test_macros(7) / #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <errno.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/sendfile.h> #define FILE_SIZE (1UL << 30) int main(int argc, char *argv) { int p[2], fd; if (pipe2(p, O_NONBLOCK)) return 1; fd = open(argv[1], O_RDWR \| O_TMPFILE, 0666); if (fd < 0) return 1; ftruncate(fd, FILE_SIZE); if (sendfile(p[1], fd, 0, FILE_SIZE) == -1) { fprintf(stderr, "FAIL\n"); } if (sendfile(p[1], fd, 0, FILE_SIZE) != -1 \|\| errno != EAGAIN) { fprintf(stderr, "FAIL\n"); } return 0; } It worked before `b964bf53e5`, it is stuck after `b964bf53e5`, and it works again with this fix. This regression occurred because do_splice_direct() calls pipe_write that handles O_NONBLOCK. Here is a trace log from the reproducer: 1) \| __x64_sys_sendfile64() { 1) \| do_sendfile() { 1) \| __fdget() 1) \| rw_verify_area() 1) \| __fdget() 1) \| rw_verify_area() 1) \| do_splice_direct() { 1) \| rw_verify_area() 1) \| splice_direct_to_actor() { 1) \| do_splice_to() { 1) \| rw_verify_area() 1) \| generic_file_splice_read() 1) + 74.153 us \| } 1) \| direct_splice_actor() { 1) \| iter_file_splice_write() { 1) \| __kmalloc() 1) 0.148 us \| pipe_lock(); 1) 0.153 us \| splice_from_pipe_next.part.0(); 1) 0.162 us \| page_cache_pipe_buf_confirm(); ... 16 times 1) 0.159 us \| page_cache_pipe_buf_confirm(); 1) \| vfs_iter_write() { 1) \| do_iter_write() { 1) \| rw_verify_area() 1) \| do_iter_readv_writev() { 1) \| pipe_write() { 1) \| mutex_lock() 1) 0.153 us \| mutex_unlock(); 1) 1.368 us \| } 1) 1.686 us \| } 1) 5.798 us \| } 1) 6.084 us \| } 1) 0.174 us \| kfree(); 1) 0.152 us \| pipe_unlock(); 1) + 14.461 us \| } 1) + 14.783 us \| } 1) 0.164 us \| page_cache_pipe_buf_release(); ... 16 times 1) 0.161 us \| page_cache_pipe_buf_release(); 1) \| touch_atime() 1) + 95.854 us \| } 1) + 99.784 us \| } 1) ! 107.393 us \| } 1) ! 107.699 us \| } Link: https://lkml.kernel.org/r/20220415005015.525191-1-avagin@gmail.com Fixes: `b964bf53e5` ("teach sendfile(2) to handle send-to-pipe directly") Signed-off-by: Andrei Vagin <avagin@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-18 15:07:52 -07:00
ChenXiaoSong	38c9c22a85	ntfs: fix use-after-free in ntfs_ucsncmp() Syzkaller reported use-after-free bug as follows: ================================================================== BUG: KASAN: use-after-free in ntfs_ucsncmp+0x123/0x130 Read of size 2 at addr ffff8880751acee8 by task a.out/879 CPU: 7 PID: 879 Comm: a.out Not tainted 5.19.0-rc4-next-20220630-00001-gcc5218c8bd2c-dirty #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x1c0/0x2b0 print_address_description.constprop.0.cold+0xd4/0x484 print_report.cold+0x55/0x232 kasan_report+0xbf/0xf0 ntfs_ucsncmp+0x123/0x130 ntfs_are_names_equal.cold+0x2b/0x41 ntfs_attr_find+0x43b/0xb90 ntfs_attr_lookup+0x16d/0x1e0 ntfs_read_locked_attr_inode+0x4aa/0x2360 ntfs_attr_iget+0x1af/0x220 ntfs_read_locked_inode+0x246c/0x5120 ntfs_iget+0x132/0x180 load_system_files+0x1cc6/0x3480 ntfs_fill_super+0xa66/0x1cf0 mount_bdev+0x38d/0x460 legacy_get_tree+0x10d/0x220 vfs_get_tree+0x93/0x300 do_new_mount+0x2da/0x6d0 path_mount+0x496/0x19d0 __x64_sys_mount+0x284/0x300 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f3f2118d9ea Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc269deac8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3f2118d9ea RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffc269dec00 RBP: 00007ffc269dec80 R08: 00007ffc269deb00 R09: 00007ffc269dec44 R10: 0000000000000000 R11: 0000000000000202 R12: 000055f81ab1d220 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> The buggy address belongs to the physical page: page:0000000085430378 refcount:1 mapcount:1 mapping:0000000000000000 index:0x555c6a81d pfn:0x751ac memcg:ffff888101f7e180 anon flags: 0xfffffc00a0014(uptodate\|lru\|mappedtodisk\|swapbacked\|node=0\|zone=1\|lastcpupid=0x1fffff) raw: 000fffffc00a0014 ffffea0001bf2988 ffffea0001de2448 ffff88801712e201 raw: 0000000555c6a81d 0000000000000000 0000000100000000 ffff888101f7e180 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880751acd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8880751ace00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff8880751ace80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ ffff8880751acf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff8880751acf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== The reason is that struct ATTR_RECORD->name_offset is 6485, end address of name string is out of bounds. Fix this by adding sanity check on end address of attribute name string. [akpm@linux-foundation.org: coding-style cleanups] [chenxiaosong2@huawei.com: cleanup suggested by Hawkins Jiawei] Link: https://lkml.kernel.org/r/20220709064511.3304299-1-chenxiaosong2@huawei.com Link: https://lkml.kernel.org/r/20220707105329.4020708-1-chenxiaosong2@huawei.com Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: ChenXiaoSong <chenxiaosong2@huawei.com> Cc: Yongqiang Liu <liuyongqiang13@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-18 15:07:52 -07:00
Linus Torvalds	972a278fe6	Merge tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs reverts from David Sterba: "Due to a recent report [1] we need to revert the radix tree to xarray conversion patches. There's a problem with sleeping under spinlock, when xa_insert could allocate memory under pressure. We use GFP_NOFS so this is a real problem that we unfortunately did not discover during review. I'm sorry to do such change at rc6 time but the revert is IMO the safer option, there are patches to use mutex instead of the spin locks but that would need more testing. The revert branch has been tested on a few setups, all seem ok. The conversion to xarray will be revisited in the future" Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1] * tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: turn delayed_nodes_tree into an XArray" Revert "btrfs: turn name_cache radix tree into XArray in send_ctx" Revert "btrfs: turn fs_info member buffer_radix into XArray" Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"	2022-07-16 13:48:55 -07:00
Linus Torvalds	1ce9d792e8	Merge tag 'ceph-for-5.19-rc7' of https://github.com/ceph/ceph-client Pull ceph fix from Ilya Dryomov: "A folio locking fixup that Xiubo and David cooperated on, marked for stable. Most of it is in netfs but I picked it up into ceph tree on agreement with David" * tag 'ceph-for-5.19-rc7' of https://github.com/ceph/ceph-client: netfs: do not unlock and put the folio twice	2022-07-15 10:27:28 -07:00
David Sterba	088aea3b97	Revert "btrfs: turn delayed_nodes_tree into an XArray" This reverts commit `253bf57555`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:15:19 +02:00
David Sterba	5b8418b843	Revert "btrfs: turn name_cache radix tree into XArray in send_ctx" This reverts commit `4076942021`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:14:58 +02:00
David Sterba	01cd390903	Revert "btrfs: turn fs_info member buffer_radix into XArray" This reverts commit `8ee922689d`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:14:33 +02:00
David Sterba	fc7cbcd489	Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray" This reverts commit `48b36a602a`. Revert the xarray conversion, there's a problem with potential sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS allocation. The radix tree used the preloading mechanism to avoid sleeping but this is not available in xarray. Conversion from spin lock to mutex is possible but at time of rc6 is riskier than a clean revert. [1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2022-07-15 19:14:28 +02:00
Linus Torvalds	b926f2adb0	Revert "vf/remap: return the amount of bytes actually deduplicated" This reverts commit `4a57a84000`. Dave Chinner reports: "As I suspected would occur, this change causes test failures. e.g generic/517 in fstests fails with: generic/517 1s ... - output mismatch [..] -deduped 131172/131172 bytes at offset 65536 +deduped 131072/131172 bytes at offset 65536" can you please revert this commit for the 5.19 series to give us more time to investigate and consider the impact of the the API change on userspace applications before we commit to changing the API" That changed return value seems to reflect reality, but with the fstest change, let's revert for now. Requested-by: Dave Chinner <david@fromorbit.com> Link: https://lore.kernel.org/all/20220714223238.GH3600936@dread.disaster.area/ Cc: Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-07-14 15:35:24 -07:00
Linus Torvalds	f41d5df5f1	Merge tag '5.19-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fixes from Steve French: "Three smb3 client fixes: - two multichannel fixes: fix a potential deadlock freeing a channel, and fix a race condition on failed creation of a new channel - mount failure fix: work around a server bug in some common older Samba servers by avoiding padding at the end of the negotiate protocol request" * tag '5.19-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: workaround negprot bug in some Samba servers cifs: remove unnecessary locking of chan_lock while freeing session cifs: fix race condition with delayed threads	2022-07-14 12:35:15 -07:00
Linus Torvalds	a24a6c05ff	Merge tag 'nfsd-5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Notable regression fixes: - Enable SETATTR(time_create) to fix regression with Mac OS clients - Fix a lockd crasher and broken NLM UNLCK behavior" * tag 'nfsd-5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: lockd: fix nlm_close_files lockd: set fl_owner when unlocking files NFSD: Decode NFSv4 birth time attribute	2022-07-14 12:29:43 -07:00
Xiubo Li	fac47b43c7	netfs: do not unlock and put the folio twice check_write_begin() will unlock and put the folio when return non-zero. So we should avoid unlocking and putting it twice in netfs layer. Change the way ->check_write_begin() works in the following two ways: (1) Pass it a pointer to the folio pointer, allowing it to unlock and put the folio prior to doing the stuff it wants to do, provided it clears the folio pointer. (2) Change the return values such that 0 with folio pointer set means continue, 0 with folio pointer cleared means re-get and all error codes indicating an error (no special treatment for -EAGAIN). [ bagasdotme: use Sphinx code text syntax for *foliop pointer ] Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/56423 Link: https://lore.kernel.org/r/cf169f43-8ee7-8697-25da-0204d1b4343e@redhat.com Co-developed-by: David Howells <dhowells@redhat.com> Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2022-07-14 10:10:12 +02:00

1 2 3 4 5 ...

77016 Commits