linux

iv/linux

Author	SHA1	Message	Date
Chung-Chiang Cheng	1213375077	fat: remove time truncations in vfat_create/vfat_mkdir All the timestamps in vfat_create() and vfat_mkdir() come from fat_time_fat2unix() which ensures time granularity. We don't need to truncate them to fit FAT's format. Moreover, fat_truncate_crtime() and fat_timespec64_trunc_10ms() are also removed because there is no caller anymore. Link: https://lkml.kernel.org/r/20220503152536.2503003-4-cccheng@synology.com Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-05-19 14:10:31 -07:00
Chung-Chiang Cheng	30abce053f	fat: report creation time in statx creation time is no longer mixed with change time. Add an in-memory field for it, and report it in statx if supported. Link: https://lkml.kernel.org/r/20220503152536.2503003-3-cccheng@synology.com Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-05-19 14:10:31 -07:00
Chung-Chiang Cheng	0f9d148167	fat: ignore ctime updates, and keep ctime identical to mtime in memory FAT supports creation time but not change time, and there was no corresponding timestamp for creation time in previous VFS. The original implementation took the compromise of saving the in-memory change time into the on-disk creation time field, but this would lead to compatibility issues with non-linux systems. To address this issue, this patch changes the behavior of ctime. It will no longer be loaded and stored from the creation time on disk. Instead of that, it'll be consistent with the in-memory mtime and share the same on-disk field. All updates to mtime will also be applied to ctime in memory, while all updates to ctime will be ignored. Link: https://lkml.kernel.org/r/20220503152536.2503003-2-cccheng@synology.com Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-05-19 14:10:31 -07:00
Chung-Chiang Cheng	4dcc3f96e7	fat: split fat_truncate_time() into separate functions Separate fat_truncate_time() to each timestamps for later creation time work. This patch does not introduce any functional changes, it's merely refactoring change. Link: https://lkml.kernel.org/r/20220503152536.2503003-1-cccheng@synology.com Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-05-19 14:10:31 -07:00
Johannes Weiner	f6498b776d	mm: zswap: add basic meminfo and vmstat coverage Currently it requires poking at debugfs to figure out the size and population of the zswap cache on a host. There are no counters for reads and writes against the cache. As a result, it's difficult to understand zswap behavior on production systems. Print zswap memory consumption and how many pages are zswapped out in /proc/meminfo. Count zswapouts and zswapins in /proc/vmstat. Link: https://lkml.kernel.org/r/20220510152847.230957-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-05-19 14:08:53 -07:00
Jakub Kicinski	d7e6f58360	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/ethernet/mellanox/mlx5/core/main.c b33886971dbc ("net/mlx5: Initialize flow steering during driver probe") 40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support") f2b41b32cde8 ("net/mlx5: Remove ipsec_ops function table") https://lore.kernel.org/all/20220519040345.6yrjromcdistu7vh@sx1/ 16d42d313350 ("net/mlx5: Drain fw_reset when removing device") 8324a02c342a ("net/mlx5: Add exit route when waiting for FW") https://lore.kernel.org/all/20220519114119.060ce014@canb.auug.org.au/ tools/testing/selftests/net/mptcp/mptcp_join.sh e274f7154008 ("selftests: mptcp: add subflow limits test-cases") b6e074e171bc ("selftests: mptcp: add infinite map testcase") 5ac1d2d63451 ("selftests: mptcp: Add tests for userspace PM type") https://lore.kernel.org/all/20220516111918.366d747f@canb.auug.org.au/ net/mptcp/options.c ba2c89e0ea74 ("mptcp: fix checksum byte order") 1e39e5a32ad7 ("mptcp: infinite mapping sending") ea66758c1795 ("tcp: allow MPTCP to update the announced window") https://lore.kernel.org/all/20220519115146.751c3a37@canb.auug.org.au/ net/mptcp/pm.c 95d686517884 ("mptcp: fix subflow accounting on close") 4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs") https://lore.kernel.org/all/20220516111435.72f35dca@canb.auug.org.au/ net/mptcp/subflow.c ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure") 0348c690ed37 ("mptcp: add the fallback check") f8d4bcacff3b ("mptcp: infinite mapping receiving") https://lore.kernel.org/all/20220519115837.380bb8d4@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-19 11:23:59 -07:00
Hao Luo	1a702dc88e	kernfs: Separate kernfs_pr_cont_buf and rename_lock. Previously the protection of kernfs_pr_cont_buf was piggy backed by rename_lock, which means that pr_cont() needs to be protected under rename_lock. This can cause potential circular lock dependencies. If there is an OOM, we have the following call hierarchy: -> cpuset_print_current_mems_allowed() -> pr_cont_cgroup_name() -> pr_cont_kernfs_name() pr_cont_kernfs_name() will grab rename_lock and call printk. So we have the following lock dependencies: kernfs_rename_lock -> console_sem Sometimes, printk does a wakeup before releasing console_sem, which has the dependence chain: console_sem -> p->pi_lock -> rq->lock Now, imagine one wants to read cgroup_name under rq->lock, for example, printing cgroup_name in a tracepoint in the scheduler code. They will be holding rq->lock and take rename_lock: rq->lock -> kernfs_rename_lock Now they will deadlock. A prevention to this circular lock dependency is to separate the protection of pr_cont_buf from rename_lock. In principle, rename_lock is to protect the integrity of cgroup name when copying to buf. Once pr_cont_buf has got its content, rename_lock can be dropped. So it's safe to drop rename_lock after kernfs_name_locked (and kernfs_path_from_node_locked) and rely on a dedicated pr_cont_lock to protect pr_cont_buf. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220516190951.3144144-1-haoluo@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-19 19:37:06 +02:00
Zhang Jianhua	e6af1bb077	fs-verity: Use struct_size() helper in enable_verity() Follow the best practice for allocating a variable-sized structure. Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com> [ebiggers: adjusted commit message] Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220519022450.2434483-1-chris.zjh@huawei.com	2022-05-19 09:53:33 -07:00
Dai Ngo	e9488d5ae1	NFSD: Show state of courtesy client in client info Update client_info_show to show state of courtesy client and seconds since last renew. Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-05-19 12:25:40 -04:00
Dai Ngo	27431affb0	NFSD: add support for lock conflict to courteous server This patch allows expired client with lock state to be in COURTESY state. Lock conflict with COURTESY client is resolved by the fs/lock code using the lm_lock_expirable and lm_expire_lock callback in the struct lock_manager_operations. If conflict client is in COURTESY state, set it to EXPIRABLE and schedule the laundromat to run immediately to expire the client. The callback lm_expire_lock waits for the laundromat to flush its work queue before returning to caller. Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-05-19 12:25:39 -04:00
Dai Ngo	2443da2259	fs/lock: add 2 callbacks to lock_manager_operations to resolve conflict Add 2 new callbacks, lm_lock_expirable and lm_expire_lock, to lock_manager_operations to allow the lock manager to take appropriate action to resolve the lock conflict if possible. A new field, lm_mod_owner, is also added to lock_manager_operations. The lm_mod_owner is used by the fs/lock code to make sure the lock manager module such as nfsd, is not freed while lock conflict is being resolved. lm_lock_expirable checks and returns true to indicate that the lock conflict can be resolved else return false. This callback must be called with the flc_lock held so it can not block. lm_expire_lock is called to resolve the lock conflict if the returned value from lm_lock_expirable is true. This callback is called without the flc_lock held since it's allowed to block. Upon returning from this callback, the lock conflict should be resolved and the caller is expected to restart the conflict check from the beginnning of the list. Lock manager, such as NFSv4 courteous server, uses this callback to resolve conflict by destroying lock owner, or the NFSv4 courtesy client (client that has expired but allowed to maintains its states) that owns the lock. Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-05-19 12:25:39 -04:00
Dai Ngo	591502c5cb	fs/lock: add helper locks_owner_has_blockers to check for blockers Add helper locks_owner_has_blockers to check if there is any blockers for a given lockowner. Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>	2022-05-19 12:25:39 -04:00
Dai Ngo	d76cc46b37	NFSD: move create/destroy of laundry_wq to init_nfsd and exit_nfsd This patch moves create/destroy of laundry_wq from nfs4_state_start and nfs4_state_shutdown_net to init_nfsd and exit_nfsd to prevent the laundromat from being freed while a thread is processing a conflicting lock. Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-05-19 12:25:39 -04:00
Dai Ngo	3d69427151	NFSD: add support for share reservation conflict to courteous server This patch allows expired client with open state to be in COURTESY state. Share/access conflict with COURTESY client is resolved by setting COURTESY client to EXPIRABLE state, schedule laundromat to run and returning nfserr_jukebox to the request client. Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-05-19 12:25:39 -04:00
Dai Ngo	66af257999	NFSD: add courteous server support for thread with only delegation This patch provides courteous server support for delegation only. Only expired client with delegation but no conflict and no open or lock state is allowed to be in COURTESY state. Delegation conflict with COURTESY/EXPIRABLE client is resolved by setting it to EXPIRABLE, queue work for the laundromat and return delay to the caller. Conflict is resolved when the laudromat runs and expires the EXIRABLE client while the NFS client retries the OPEN request. Local thread request that gets conflict is doing the retry in _break_lease. Client in COURTESY or EXPIRABLE state is allowed to reconnect and continues to have access to its state. Access to the nfs4_client by the reconnecting thread and the laundromat is serialized via the client_lock. Reviewed-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-05-19 12:25:39 -04:00
Chuck Lever	91e23b1c39	NFSD: Clean up nfsd_splice_actor() nfsd_splice_actor() checks that the page being spliced does not match the previous element in the svc_rqst::rq_pages array. We believe this is to prevent a double put_page() in cases where the READ payload is partially contained in the xdr_buf's head buffer. However, the NFSD READ proc functions no longer place any part of the READ payload in the head buffer, in order to properly support NFS/RDMA READ with Write chunks. Therefore, simplify the logic in nfsd_splice_actor() to remove this unnecessary check. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2022-05-19 12:25:38 -04:00
Paulo Alcantara	d80c69846d	cifs: fix signed integer overflow when fl_end is OFFSET_MAX This fixes the following when running xfstests generic/504: [ 134.394698] CIFS: Attempting to mount \\win16.vm.test\Share [ 134.420905] CIFS: VFS: generate_smb3signingkey: dumping generated AES session keys [ 134.420911] CIFS: VFS: Session Id 05 00 00 00 00 c4 00 00 [ 134.420914] CIFS: VFS: Cipher type 1 [ 134.420917] CIFS: VFS: Session Key ea 0b d9 22 2e af 01 69 30 1b 15 74 bf 87 41 11 [ 134.420920] CIFS: VFS: Signing Key 59 28 43 5c f0 b6 b1 6f f5 7b 65 f2 9f 9e 58 7d [ 134.420923] CIFS: VFS: ServerIn Key eb aa 58 c8 95 01 9a f7 91 98 e4 fa bc d8 74 f1 [ 134.420926] CIFS: VFS: ServerOut Key 08 5b 21 e5 2e 4e 86 f6 05 c2 58 e0 af 53 83 e7 [ 134.771946] ================================================================================ [ 134.771953] UBSAN: signed-integer-overflow in fs/cifs/file.c:1706:19 [ 134.771957] 9223372036854775807 + 1 cannot be represented in type 'long long int' [ 134.771960] CPU: 4 PID: 2773 Comm: flock Not tainted 5.11.22 #1 [ 134.771964] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 134.771966] Call Trace: [ 134.771970] dump_stack+0x8d/0xb5 [ 134.771981] ubsan_epilogue+0x5/0x50 [ 134.771988] handle_overflow+0xa3/0xb0 [ 134.771997] ? lockdep_hardirqs_on_prepare+0xe8/0x1b0 [ 134.772006] cifs_setlk+0x63c/0x680 [cifs] [ 134.772085] ? _get_xid+0x5f/0xa0 [cifs] [ 134.772085] cifs_flock+0x131/0x400 [cifs] [ 134.772085] __x64_sys_flock+0xfc/0x120 [ 134.772085] do_syscall_64+0x33/0x40 [ 134.772085] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 134.772085] RIP: 0033:0x7fea4f83b3fb [ 134.772085] Code: ff 48 8b 15 8f 1a 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb da e8 16 0b 02 00 66 0f 1f 44 00 00 f3 0f 1e fa b8 49 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 1a 0d 00 f7 d8 64 89 01 48 Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2022-05-19 10:54:41 -05:00
Zhihao Cheng	68f4c6eba7	fs-writeback: writeback_sb_inodes：Recalculate 'wrote' according skipped pages Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and writeback_inodes_wb()") has us holding a plug during wb_writeback, which may cause a potential ABBA dead lock: wb_writeback fat_file_fsync blk_start_plug(&plug) for (;;) { iter i-1: some reqs have been added into plug->mq_list // LOCK A iter i: progress = __writeback_inodes_wb(wb, work) . writeback_sb_inodes // fat's bdev . __writeback_single_inode . . generic_writepages . . __block_write_full_page . . . . __generic_file_fsync . . . . sync_inode_metadata . . . . writeback_single_inode . . . . __writeback_single_inode . . . . fat_write_inode . . . . __fat_write_inode . . . . sync_dirty_buffer // fat's bdev . . . . lock_buffer(bh) // LOCK B . . . . submit_bh . . . . blk_mq_get_tag // LOCK A . . . trylock_buffer(bh) // LOCK B . . . redirty_page_for_writepage . . . wbc->pages_skipped++ . . --wbc->nr_to_write . wrote += write_chunk - wbc.nr_to_write // wrote > 0 . requeue_inode . redirty_tail_locked if (progress) // progress > 0 continue; iter i+1: queue_io // similar process with iter i, infinite for-loop ! } blk_finish_plug(&plug) // flush plug won't be called Above process triggers a hungtask like: [ 399.044861] INFO: task bb:2607 blocked for more than 30 seconds. [ 399.046824] Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty [ 399.051539] task:bb state:D stack: 0 pid: 2607 ppid: 2426 flags:0x00004000 [ 399.051556] Call Trace: [ 399.051570] __schedule+0x480/0x1050 [ 399.051592] schedule+0x92/0x1a0 [ 399.051602] io_schedule+0x22/0x50 [ 399.051613] blk_mq_get_tag+0x1d3/0x3c0 [ 399.051640] __blk_mq_alloc_requests+0x21d/0x3f0 [ 399.051657] blk_mq_submit_bio+0x68d/0xca0 [ 399.051674] __submit_bio+0x1b5/0x2d0 [ 399.051708] submit_bio_noacct+0x34e/0x720 [ 399.051718] submit_bio+0x3b/0x150 [ 399.051725] submit_bh_wbc+0x161/0x230 [ 399.051734] __sync_dirty_buffer+0xd1/0x420 [ 399.051744] sync_dirty_buffer+0x17/0x20 [ 399.051750] __fat_write_inode+0x289/0x310 [ 399.051766] fat_write_inode+0x2a/0xa0 [ 399.051783] __writeback_single_inode+0x53c/0x6f0 [ 399.051795] writeback_single_inode+0x145/0x200 [ 399.051803] sync_inode_metadata+0x45/0x70 [ 399.051856] __generic_file_fsync+0xa3/0x150 [ 399.051880] fat_file_fsync+0x1d/0x80 [ 399.051895] vfs_fsync_range+0x40/0xb0 [ 399.051929] __x64_sys_fsync+0x18/0x30 In my test, 'need_resched()' (which is imported by 590dca3a71 "fs-writeback: unplug before cond_resched in writeback_sb_inodes") in function 'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted from write_cache_pages(). Fix it by correcting wrote number according number of skipped pages in writeback_sb_inodes(). Goto Link to find a reproducer. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837 Cc: stable@vger.kernel.org # v4.3 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220510133805.1988292-1-chengzhihao1@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-19 06:29:41 -06:00
Linus Torvalds	01464a73a6	io_uring-5.18-2022-05-18 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKFeygQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsXxD/9dnVf1LCfVR3wQpV/O7o1u5zRpjTSFetmS qGw4kMDUtfE63oCLzGW/WpmQLrHduoTa32Ms/v0vU1+1MixUHatzRzN5OF2Nv7Wv Ukc6JtH2PjcYFNouAFJ8r/H1uanrr9DOK7R/9n3t9CwOIoand+PUmZ0rIHVN9u0j KtpousBih66c7tajADqSxKb1uPlDrgTkzfAYVj5O4OwBAiIwY2CLJfvbLFTdoHP9 bdCrsvo9R08QRH9PedPqctdQURBxfzd0kXm3N8+d4r4GEQvbaGI+6bN88VYW2aEa VfBCZD4A8/VupcGWWhAnKu0rOwxgrh2jDEd5g+0g/f3SRTY2LBUJAlEMM7N043vX MtAuO05E3/Svr5doAwAJsp44h6SvS2s5ogvPIYjEUvY4JK2UDdWOcYrb7HqXrNBi 0jNsgUu4+N89i6jiewf0dPw6BKK7Cf9322fGy4X1dQ77gIkDWt4Z55iV46aSX3Y8 dQt2Jaoj0c/wRkOgZtrDi7FmdM0xcMzGMB/oOyI07d3ERjbyZuDTFowWh/kqd1xA 6vsrkyCQaGilmDu9xTywPSfarKoxIrX+TgZj5rnnJpv2tENJSx/o8yKWcu9wnO1L gnanzpwYW4qIOjkWKvIBLBHXXA6Qp8aXacif8rODUhIuABQpe85b+btpHeqO9lWL a3wS9Gc2nQ== =8Xxv -----END PGP SIGNATURE----- Merge tag 'io_uring-5.18-2022-05-18' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Two small changes fixing issues from the 5.18 merge window: - Fix wrong ordering of a tracepoint (Dylan) - Fix MSG_RING on IOPOLL rings (me)" * tag 'io_uring-5.18-2022-05-18' of git://git.kernel.dk/linux-block: io_uring: don't attempt to IOPOLL for MSG_RING requests io_uring: fix ordering of args in io_uring_queue_async_work	2022-05-18 14:21:30 -10:00
Chao Yu	677a82b44e	f2fs: fix to do sanity check for inline inode Yanming reported a kernel bug in Bugzilla kernel [1], which can be reproduced. The bug message is: The kernel message is shown below: kernel BUG at fs/inode.c:611! Call Trace: evict+0x282/0x4e0 __dentry_kill+0x2b2/0x4d0 dput+0x2dd/0x720 do_renameat2+0x596/0x970 __x64_sys_rename+0x78/0x90 do_syscall_64+0x3b/0x90 [1] https://bugzilla.kernel.org/show_bug.cgi?id=215895 The bug is due to fuzzed inode has both inline_data and encrypted flags. During f2fs_evict_inode(), as the inode was deleted by rename(), it will cause inline data conversion due to conflicting flags. The page cache will be polluted and the panic will be triggered in clear_inode(). Try fixing the bug by doing more sanity checks for inline data inode in sanity_check_inode(). Cc: stable@vger.kernel.org Reported-by: Ming Yan <yanming@tju.edu.cn> Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2022-05-18 15:36:11 -07:00
Chao Yu	958ed92922	f2fs: fix fallocate to use file_modified to update permissions consistently This patch tries to fix permission consistency issue as all other mainline filesystems. Since the initial introduction of (posix) fallocate back at the turn of the century, it has been possible to use this syscall to change the user-visible contents of files. This can happen by extending the file size during a preallocation, or through any of the newer modes (punch, zero, collapse, insert range). Because the call can be used to change file contents, we should treat it like we do any other modification to a file -- update the mtime, and drop set[ug]id privileges/capabilities. The VFS function file_modified() does all this for us if pass it a locked inode, so let's make fallocate drop permissions correctly. Cc: stable@kernel.org Signed-off-by: Chao Yu <chao.yu@oppo.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2022-05-18 15:36:09 -07:00
Jens Axboe	2fcabce2d7	io_uring: disallow mixed provided buffer group registrations It's nonsensical to register a provided buffer ring, if a classic provided buffer group with the same ID exists. Depending on the order of which we decide what type to pick, the other type will never get used. Explicitly disallow it and return an error if this is attempted. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 16:21:30 -06:00
Jens Axboe	1d0dbbfa28	io_uring: initialize io_buffer_list head when shared ring is unregistered We use ->buf_pages != 0 to tell if this is a shared buffer ring or a classic provided buffer group. If we unregister the shared ring and then attempt to use it, buf_pages is zero yet the classic list head isn't properly initialized. This causes io_buffer_select() to think that we have classic buffers available, but then we crash when we try and get one from the list. Just initialize the list if we unregister a shared buffer ring, leaving it in a sane state for either re-registration or for attempting to use it. And do the same for the initial setup from the classic path. Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 16:21:11 -06:00
Pavel Begunkov	0184f08e65	io_uring: add fully sparse buffer registration Honour IORING_RSRC_REGISTER_SPARSE not only for direct files but fixed buffers as well. It makes the rsrc API more consistent. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66f429e4912fe39fb3318217ff33a2853d4544be.1652879898.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 12:53:04 -06:00
Zhang Jianhua	b0487ede1f	fs-verity: remove unused parameter desc_size in fsverity_create_info() The parameter desc_size in fsverity_create_info() is useless and it is not referenced anywhere. The greatest meaning of desc_size here is to indecate the size of struct fsverity_descriptor and futher calculate the size of signature. However, the desc->sig_size can do it also and it is indeed, so remove it. Therefore, it is no need to acquire desc_size by fsverity_get_descriptor() in ensure_verity_info(), so remove the parameter desc_ret in fsverity_get_descriptor() too. Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220518132256.2297655-1-chris.zjh@huawei.com	2022-05-18 11:01:31 -07:00
Eric Biggers	c069db76ed	ext4: fix memory leak in parse_apply_sb_mount_options() If processing the on-disk mount options fails after any memory was allocated in the ext4_fs_context, e.g. s_qf_names, then this memory is leaked. Fix this by calling ext4_fc_free() instead of kfree() directly. Reproducer: mkfs.ext4 -F /dev/vdc tune2fs /dev/vdc -E mount_opts=usrjquota=file echo clear > /sys/kernel/debug/kmemleak mount /dev/vdc /vdc echo scan > /sys/kernel/debug/kmemleak sleep 5 echo scan > /sys/kernel/debug/kmemleak cat /sys/kernel/debug/kmemleak Fixes: 7edfd85b1ffd ("ext4: Completely separate options parsing and sb setup") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ritesh Harjani <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20220513231605.175121-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-05-18 11:24:22 -04:00
Eric Biggers	cb8435dc8b	ext4: reject the 'commit' option on ext2 filesystems The 'commit' option is only applicable for ext3 and ext4 filesystems, and has never been accepted by the ext2 filesystem driver, so the ext4 driver shouldn't allow it on ext2 filesystems. This fixes a failure in xfstest ext4/053. Fixes: 8dc0aa8cf0f7 ("ext4: check incompatible mount options while mounting ext2/3") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Link: https://lore.kernel.org/r/20220510183232.172615-1-ebiggers@kernel.org	2022-05-18 11:24:22 -04:00
Yang Li	b10b6278ae	ext4: remove duplicated #include of dax.h in inode.c Fix following includecheck warning: ./fs/ext4/inode.c: linux/dax.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20220504225025.44753-1-yang.lee@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-05-18 11:24:22 -04:00
Christoph Hellwig	0b1e987c56	freevxfs: relicense to GPLv2 only When I wrote the freevxfs driver I had some odd choice of licensing statements, the options are either GPL (without version) or an odd BSD-ish licensense with advertising clause. The GPL vs always meant to be the same as the kernel, that is version 2 only, and the odd BSD-ish license doesn't make much sense. Add a GPL2.0-only SPDX tag to make the GPL intentions clear and drop the bogus BSD license. Acked-by: Krzysztof Błaszkowski <kb@sysmikro.com.pl> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-05-18 15:30:17 +02:00
Amir Goldstein	e730558adf	fsnotify: consistent behavior for parent not watching children The logic for handling events on child in groups that have a mark on the parent inode, but without FS_EVENT_ON_CHILD flag in the mask is duplicated in several places and inconsistent. Move the logic into the preparation of mark type iterator, so that the parent mark type will be excluded from all mark type iterations in that case. This results in several subtle changes of behavior, hopefully all desired changes of behavior, for example: - Group A has a mount mark with FS_MODIFY in mask - Group A has a mark with ignore mask that does not survive FS_MODIFY and does not watch children on directory D. - Group B has a mark with FS_MODIFY in mask that does watch children on directory D. - FS_MODIFY event on file D/foo should not clear the ignore mask of group A, but before this change it does And if group A ignore mask was set to survive FS_MODIFY: - FS_MODIFY event on file D/foo should be reported to group A on account of the mount mark, but before this change it is wrongly ignored Fixes: 2f02fd3fa13e ("fanotify: fix ignore mask logic for events on child and on dir") Reported-by: Jan Kara <jack@suse.com> Link: https://lore.kernel.org/linux-fsdevel/20220314113337.j7slrb5srxukztje@quack3.lan/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220511190213.831646-3-amir73il@gmail.com	2022-05-18 15:08:09 +02:00
Amir Goldstein	14362a2541	fsnotify: introduce mark type iterator fsnotify_foreach_iter_mark_type() is used to reduce boilerplate code of iterating all marks of a specific group interested in an event by consulting the iterator report_mask. Use an open coded version of that iterator in fsnotify_iter_next() that collects all marks of the current iteration group without consulting the iterator report_mask. At the moment, the two iterator variants are the same, but this decoupling will allow us to exclude some of the group's marks from reporting the event, for example for event on child and inode marks on parent did not request to watch events on children. Fixes: 2f02fd3fa13e ("fanotify: fix ignore mask logic for events on child and on dir") Reported-by: Jan Kara <jack@suse.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220511190213.831646-2-amir73il@gmail.com	2022-05-18 15:07:43 +02:00
Christoph Hellwig	0bf1dbee9b	io_uring: use rcu_dereference in io_close Accessing the file table needs a rcu_dereference_protected(). Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:19:05 -06:00
Christoph Hellwig	a294bef57c	io_uring: consistently use the EPOLL* defines POLL* are unannotated values for the userspace ABI, while everything in-kernel should use EPOLL* and the __poll_t type. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:19:05 -06:00
Christoph Hellwig	58f5c8d39e	io_uring: make apoll_events a __poll_t apoll_events is fed to vfs_poll and the poll tables, so it should be a __poll_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:19:05 -06:00
Christoph Hellwig	ee67ba3b20	io_uring: drop a spurious inline on a forward declaration io_file_get_normal isn't marked inline, so don't claim it as such in the forward declaration. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:19:05 -06:00
Christoph Hellwig	984824db84	io_uring: don't use ERR_PTR for user pointers ERR_PTR abuses the high bits of a pointer to transport error information. This is only safe for kernel pointers and not user pointers. Fix io_buffer_select and its helpers to just return NULL for failure and get rid of this abuse. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:18:56 -06:00
Christoph Hellwig	20cbd21d89	io_uring: use a rwf_t for io_rw.flags Use the proper type. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220518084005.3255380-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:17:52 -06:00
Jens Axboe	c7fb19428d	io_uring: add support for ring mapped supplied buffers Provided buffers allow an application to supply io_uring with buffers that can then be grabbed for a read/receive request, when the data source is ready to deliver data. The existing scheme relies on using IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use in real world applications. It's pretty efficient if the application is able to supply back batches of provided buffers when they have been consumed and the application is ready to recycle them, but if fragmentation occurs in the buffer space, it can become difficult to supply enough buffers at the time. This hurts efficiency. Add a register op, IORING_REGISTER_PBUF_RING, which allows an application to setup a shared queue for each buffer group of provided buffers. The application can then supply buffers simply by adding them to this ring, and the kernel can consume then just as easily. The ring shares the head with the application, the tail remains private in the kernel. Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the ring, they must use the mapped ring. Mapped provided buffer rings can co-exist with normal provided buffers, just not within the same group ID. To gauge overhead of the existing scheme and evaluate the mapped ring approach, a simple NOP benchmark was written. It uses a ring of 128 entries, and submits/completes 32 at the time. 'Replenish' is how many buffers are provided back at the time after they have been consumed: Test Replenish NOPs/sec ================================================================ No provided buffers NA ~30M Provided buffers 32 ~16M Provided buffers 1 ~10M Ring buffers 32 ~27M Ring buffers 1 ~27M The ring mapped buffers perform almost as well as not using provided buffers at all, and they don't care if you provided 1 or more back at the same time. This means application can just replenish as they go, rather than need to batch and compact, further reducing overhead in the application. The NOP benchmark above doesn't need to do any compaction, so that overhead isn't even reflected in the above test. Co-developed-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:12:42 -06:00
Jens Axboe	d8c2237d0a	io_uring: add io_pin_pages() helper Abstract this out from io_sqe_buffer_register() so we can use it elsewhere too without duplicating this code. No intended functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:12:42 -06:00
Jens Axboe	3d200242a6	io_uring: add buffer selection support to IORING_OP_NOP Obviously not really useful since it's not transferring data, but it is helpful in benchmarking overhead of provided buffers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:12:42 -06:00
Jens Axboe	e7637a492b	io_uring: fix locking state for empty buffer group io_provided_buffer_select() must drop the submit lock, if needed, even in the error handling case. Failure to do so will leave us with the ctx->uring_lock held, causing spew like: ==================================== WARNING: iou-wrk-366/368 still has locks held! 5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted ------------------------------------ 1 lock held by iou-wrk-366/368: #0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48 stack backtrace: CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace.part.0+0xa4/0xd4 show_stack+0x14/0x5c dump_stack_lvl+0x88/0xb0 dump_stack+0x14/0x2c debug_check_no_locks_held+0x84/0x90 try_to_freeze.isra.0+0x18/0x44 get_signal+0x94/0x6ec io_wqe_worker+0x1d8/0x2b4 ret_from_fork+0x10/0x20 and triggering later hangs off get_signal() because we attempt to re-grab the lock. Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com Fixes: 149c69b04a90 ("io_uring: abstract out provided buffer list selection") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-18 06:12:41 -06:00
Dave Wysochanski	9c4a5c75a6	NFS: Pass i_size to fscache_unuse_cookie() when a file is released Pass updated i_size in fscache_unuse_cookie() when called from nfs_fscache_release_file(), which ensures the size of an fscache object gets written to the cache storage. Failing to do so results in unnessary reads from the NFS server, even when the data is cached, due to a cachefiles object coherency check failing with a trace similar to the following: cachefiles_coherency: o=0000000e BAD osiz B=afbb3 c=0 This problem can be reproduced as follows: #!/bin/bash v=4.2; NFS_SERVER=127.0.0.1 set -e; trap cleanup EXIT; rc=1 function cleanup { umount /mnt/nfs > /dev/null 2>&1 RC_STR="TEST PASS" [ $rc -eq 1 ] && RC_STR="TEST FAIL" echo "$RC_STR on $(uname -r) with NFSv$v and server $NFS_SERVER" } mount -o vers=$v,fsc $NFS_SERVER:/export /mnt/nfs rm -f /mnt/nfs/file1.bin > /dev/null 2>&1 dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1 > /dev/null 2>&1 echo 3 > /proc/sys/vm/drop_caches echo Read file 1st time from NFS server into fscache dd if=/mnt/nfs/file1.bin of=/dev/null > /dev/null 2>&1 umount /mnt/nfs && mount -o vers=$v,fsc $NFS_SERVER:/export /mnt/nfs echo 3 > /proc/sys/vm/drop_caches echo Read file 2nd time from fscache dd if=/mnt/nfs/file1.bin of=/dev/null > /dev/null 2>&1 echo Check mountstats for NFS read grep -q "READ: 0" /proc/self/mountstats # (1st number) == 0 [ $? -eq 0 ] && rc=0 Fixes: a6b5a28eb56c "nfs: Convert to new fscache volume/cookie API" Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Tested-by: Daire Byrne <daire@dneg.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2022-05-17 15:39:45 -04:00
NeilBrown	3e2910c7e2	NFS: Improve warning message when locks are lost. NFSv4 can lose locks if, for example there is a network partition for longer than the lease period. When this happens a warning message NFS: __nfs4_reclaim_open_state: Lock reclaim failed! is generated, possibly once for each lock (though rate limited). This is potentially misleading as is can be read as suggesting that lock reclaim was attempted. However the default behaviour is to not attempt to recover locks (except due to server report). This patch changes the reporting to produce at most one message for each attempt to recover all state from a given server. The message reports the server name and the number of locks lost if that number is non-zero. It reports that locks were lost and give no suggestion as to whether there was an attempt or not. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2022-05-17 15:27:17 -04:00
Steve French	0a55cf74ff	SMB3: EBADF/EIO errors in rename/open caused by race condition in smb2_compound_op There is a race condition in smb2_compound_op: after_close: num_rqst++; if (cfile) { cifsFileInfo_put(cfile); // sends SMB2_CLOSE to the server cfile = NULL; This is triggered by smb2_query_path_info operation that happens during revalidate_dentry. In smb2_query_path_info, get_readable_path is called to load the cfile, increasing the reference counter. If in the meantime, this reference becomes the very last, this call to cifsFileInfo_put(cfile) will trigger a SMB2_CLOSE request sent to the server just before sending this compound request – and so then the compound request fails either with EBADF/EIO depending on the timing at the server, because the handle is already closed. In the first scenario, the race seems to be happening between smb2_query_path_info triggered by the rename operation, and between “cleanup” of asynchronous writes – while fsync(fd) likely waits for the asynchronous writes to complete, releasing the writeback structures can happen after the close(fd) call. So the EBADF/EIO errors will pop up if the timing is such that: 1) There are still outstanding references after close(fd) in the writeback structures 2) smb2_query_path_info successfully fetches the cfile, increasing the refcounter by 1 3) All writeback structures release the same cfile, reducing refcounter to 1 4) smb2_compound_op is called with that cfile In the second scenario, the race seems to be similar – here open triggers the smb2_query_path_info operation, and if all other threads in the meantime decrease the refcounter to 1 similarly to the first scenario, again SMB2_CLOSE will be sent to the server just before issuing the compound request. This case is harder to reproduce. See https://bugzilla.samba.org/show_bug.cgi?id=15051 Cc: stable@vger.kernel.org Fixes: 8de9e86c67ba ("cifs: create a helper to find a writeable handle by path name") Signed-off-by: Ondrej Hubsch <ohubsch@purestorage.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>	2022-05-17 13:47:27 -05:00
Jens Axboe	aa184e8671	io_uring: don't attempt to IOPOLL for MSG_RING requests We gate whether to IOPOLL for a request on whether the opcode is allowed on a ring setup for IOPOLL and if it's got a file assigned. MSG_RING is the only one that allows a file yet isn't pollable, it's merely supported to allow communication on an IOPOLL ring, not because we can poll for completion of it. Put the assigned file early and clear it, so we don't attempt to poll for it. Reported-by: syzbot+1a0a53300ce782f8b3ad@syzkaller.appspotmail.com Fixes: 3f1d52abf098 ("io_uring: defer msg-ring file validity check until command issue") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-17 12:46:04 -06:00
Eric Biggers	b5639bb431	f2fs: don't use casefolded comparison for "." and ".." Tryng to rename a directory that has all following properties fails with EINVAL and triggers the 'WARN_ON_ONCE(!fscrypt_has_encryption_key(dir))' in f2fs_match_ci_name(): - The directory is casefolded - The directory is encrypted - The directory's encryption key is not yet set up - The parent directory is not encrypted The problem is incorrect handling of the lookup of ".." to get the parent reference to update. fscrypt_setup_filename() treats ".." (and ".") specially, as it's never encrypted. It's passed through as-is, and setting up the directory's key is not attempted. As the name isn't a no-key name, f2fs treats it as a "normal" name and attempts a casefolded comparison. That breaks the assumption of the WARN_ON_ONCE() in f2fs_match_ci_name() which assumes that for encrypted directories, casefolded comparisons only happen when the directory's key is set up. We could just remove this WARN_ON_ONCE(). However, since casefolding is always a no-op on "." and ".." anyway, let's instead just not casefold these names. This results in the standard bytewise comparison. Fixes: 7ad08a58bf67 ("f2fs: Handle casefolding with Encryption") Cc: <stable@vger.kernel.org> # v5.11+ Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2022-05-17 11:19:23 -07:00
Jaegeuk Kim	c81d5bae40	f2fs: do not stop GC when requiring a free section The f2fs_gc uses a bitmap to indicate pinned sections, but when disabling chckpoint, we call f2fs_gc() with NULL_SEGNO which selects the same dirty segment as a victim all the time, resulting in checkpoint=disable failure, for example. Let's pick another one, if we fail to collect it. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2022-05-17 11:19:19 -07:00
Baokun Li	f87c7a4b08	ext4: fix race condition between ext4_write and ext4_convert_inline_data Hulk Robot reported a BUG_ON: ================================================================== EXT4-fs error (device loop3): ext4_mb_generate_buddy:805: group 0, block bitmap and bg descriptor inconsistent: 25 vs 31513 free clusters kernel BUG at fs/ext4/ext4_jbd2.c:53! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 0 PID: 25371 Comm: syz-executor.3 Not tainted 5.10.0+ #1 RIP: 0010:ext4_put_nojournal fs/ext4/ext4_jbd2.c:53 [inline] RIP: 0010:__ext4_journal_stop+0x10e/0x110 fs/ext4/ext4_jbd2.c:116 [...] Call Trace: ext4_write_inline_data_end+0x59a/0x730 fs/ext4/inline.c:795 generic_perform_write+0x279/0x3c0 mm/filemap.c:3344 ext4_buffered_write_iter+0x2e3/0x3d0 fs/ext4/file.c:270 ext4_file_write_iter+0x30a/0x11c0 fs/ext4/file.c:520 do_iter_readv_writev+0x339/0x3c0 fs/read_write.c:732 do_iter_write+0x107/0x430 fs/read_write.c:861 vfs_writev fs/read_write.c:934 [inline] do_pwritev+0x1e5/0x380 fs/read_write.c:1031 [...] ================================================================== Above issue may happen as follows: cpu1 cpu2 __________________________\|__________________________ do_pwritev vfs_writev do_iter_write ext4_file_write_iter ext4_buffered_write_iter generic_perform_write ext4_da_write_begin vfs_fallocate ext4_fallocate ext4_convert_inline_data ext4_convert_inline_data_nolock ext4_destroy_inline_data_nolock clear EXT4_STATE_MAY_INLINE_DATA ext4_map_blocks ext4_ext_map_blocks ext4_mb_new_blocks ext4_mb_regular_allocator ext4_mb_good_group_nolock ext4_mb_init_group ext4_mb_init_cache ext4_mb_generate_buddy --> error ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) ext4_restore_inline_data set EXT4_STATE_MAY_INLINE_DATA ext4_block_write_begin ext4_da_write_end ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) ext4_write_inline_data_end handle=NULL ext4_journal_stop(handle) __ext4_journal_stop ext4_put_nojournal(handle) ref_cnt = (unsigned long)handle BUG_ON(ref_cnt == 0) ---> BUG_ON The lock held by ext4_convert_inline_data is xattr_sem, but the lock held by generic_perform_write is i_rwsem. Therefore, the two locks can be concurrent. To solve above issue, we add inode_lock() for ext4_convert_inline_data(). At the same time, move ext4_convert_inline_data() in front of ext4_punch_hole(), remove similar handling from ext4_punch_hole(). Fixes: 0c8d414f163f ("ext4: let fallocate handle inline data correctly") Cc: stable@vger.kernel.org Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220428134031.4153381-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2022-05-17 14:17:40 -04:00
Zhang Yi	6493792d32	ext4: convert symlink external data block mapping to bdev Symlink's external data block is one kind of metadata block, and now that almost all ext4 metadata block's page cache (e.g. directory blocks, quota blocks...) belongs to bdev backing inode except the symlink. It is essentially worked in data=journal mode like other regular file's data block because probably in order to make it simple for generic VFS code handling symlinks or some other historical reasons, but the logic of creating external data block in ext4_symlink() is complicated. and it also make things confused if user do not want to let the filesystem worked in data=journal mode. This patch convert the final exceptional case and make things clean, move the mapping of the symlink's external data block to bdev like any other metadata block does. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20220424140936.1898920-3-yi.zhang@huawei.com	2022-05-17 14:17:40 -04:00
Zhang Yi	9558cf14e8	ext4: add nowait mode for ext4_getblk() Current ext4_getblk() might sleep if some resources are not valid or could be race with a concurrent extents modifing procedure. So we cannot call ext4_getblk() and ext4_map_blocks() to get map blocks in the atomic context in some fast path (e.g. the upcoming procedure of getting symlink external block in the RCU context), even if the map extents have already been check and cached. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20220424140936.1898920-2-yi.zhang@huawei.com	2022-05-17 14:17:40 -04:00

... 4 5 6 7 8 ...

76795 Commits