linux

iv/linux

Author	SHA1	Message	Date
Josef Bacik	9a93b5a353	btrfs: abort the transaction if we get an error during snapshot drop We were seeing weird errors when we were testing our btrfs backports before we had the incorrect level check fix. These errors appeared to be improper error handling, but error injection testing uncovered that the errors were a result of corruption that occurred from improper error handling during snapshot delete. With snapshot delete if we encounter any errors during walk_down or walk_up we'll simply return an error, we won't abort the transaction. This is problematic because we will be dropping references for nodes and leaves along the way, and if we fail in the middle we will leave the file system corrupt because we don't know where we left off in the drop. Fix this by making sure we abort if we hit any errors during the walk down or walk up operations, as we have no idea what operations could have been left half done at this point. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Josef Bacik	4e19438400	btrfs: handle errors in walk_down_tree properly We can get errors in walk_down_proc as we try and lookup extent info for the snapshot dropping to act on. However if we get an error we simply return 1 which indicates we're done with walking down, which will lead us to improperly continue with the snapshot drop with the incorrect information. Instead break if we get any error from walk_down_proc or do_walk_down, and handle the case of ret == 1 by returning 0, otherwise return the ret value that we have. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Josef Bacik	6989627db0	btrfs: drop root refs properly when orphan cleanup fails When we mount the file system we do something like this: while (1) { lookup fs roots; for (i = 0; i < num_roots; i++) { ret = btrfs_orphan_cleanup(roots[i]); if (ret) break; btrfs_put_root(roots[i]); } } for (; i < num_roots; i++) btrfs_put_root(roots[i]); As you can see if we break in that inner loop we just go back to the outer loop and lose the fact that we have to drop references on the remaining roots we looked up. Fix this by making an out label and jumping to that on error so we don't leak a reference to the roots we looked up. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Josef Bacik	a13bb2c038	btrfs: add missing iputs on orphan cleanup failure We missed a couple of iput()s in the orphan cleanup failure paths, add them so we don't get refcount errors. The iput needs to be done in the check and not under a common label due to the way the code is structured. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Josef Bacik	9cf14029d5	btrfs: handle errors from btrfs_read_node_slot in split While investigating a problem with error injection I tripped over curious behavior in the node/leaf splitting code. If we get an EIO when trying to read either the left or right leaf/node for splitting we'll simply treat the node as if it were full and continue on. The end result of this isn't too bad, we simply end up allocating a block when we may have pushed items into the adjacent blocks. However this does essentially allow us to continue to modify a file system that we've gotten errors on, either from a bad disk or csum mismatch or other corruption. This isn't particularly safe, so instead handle these btrfs_read_node_slot() usages differently. We allow you to pass in any slot, the idea being that we save some code if the slot number is outside of the range of the parent. This means we treat all errors the same, when in reality we only want to ignore -ENOENT. Fix this by changing how we call btrfs_read_node_slot(), which is to only call it for slots we know are valid. This way if we get an error back from reading the block we can properly pass the error up the chain. This was validated with the error injection testing I was doing. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Josef Bacik	d469472844	btrfs: replace BUG_ON with ASSERT in btrfs_read_node_slot In btrfs_read_node_slot() we have a BUG_ON() that can be converted to an ASSERT(), it's from an extent buffer and the level is validated at the time it's read from disk. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:13 +02:00
Josef Bacik	13b98989c8	btrfs: use btrfs_handle_fs_error in btrfs_fill_super While trying to track down a lost EIO problem I hit the following assertion while doing my error injection testing BTRFS warning (device nvme1n1): transaction 1609 (with 180224 dirty metadata bytes) is not committed assertion failed: !found, in fs/btrfs/disk-io.c:4456 ------------[ cut here ]------------ kernel BUG at fs/btrfs/messages.h:169! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 1445 Comm: mount Tainted: G W 6.2.0-rc5+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 RIP: 0010:btrfs_assertfail.constprop.0+0x18/0x1a RSP: 0018:ffffb95fc3b0bc68 EFLAGS: 00010286 RAX: 0000000000000034 RBX: ffff9941c2ac2000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffb6741f7d RDI: 00000000ffffffff RBP: ffff9941c2ac2428 R08: 0000000000000000 R09: ffffb95fc3b0bb38 R10: 0000000000000003 R11: ffffffffb71438a8 R12: ffff9941c2ac2428 R13: ffff9941c2ac2450 R14: ffff9941c2ac2450 R15: 000000000002c000 FS: 00007fcea2d07800(0000) GS:ffff9941fbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f00cc7c83a8 CR3: 000000010c686000 CR4: 0000000000350ef0 Call Trace: <TASK> close_ctree+0x426/0x48f btrfs_mount_root.cold+0x7e/0xee ? legacy_parse_param+0x2b/0x220 legacy_get_tree+0x2b/0x50 vfs_get_tree+0x29/0xc0 vfs_kern_mount.part.0+0x73/0xb0 btrfs_mount+0x11d/0x3d0 ? legacy_parse_param+0x2b/0x220 legacy_get_tree+0x2b/0x50 vfs_get_tree+0x29/0xc0 path_mount+0x438/0xa40 __x64_sys_mount+0xe9/0x130 do_syscall_64+0x3e/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc This is because the error injection did an EIO for the root inode lookup and we simply jumped to closing the ctree. However because we didn't mark the file system as having an error we skipped all of the broken transaction cleanup stuff, and thus triggered this ASSERT(). Fix this by calling btrfs_handle_fs_error() in this case so we have the error set on the file system. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-04-17 18:01:12 +02:00
Paulo Alcantara	d5a863a153	cifs: avoid dup prefix path in dfs_get_automount_devname() @server->origin_fullpath already contains the tree name + optional prefix, so avoid calling __build_path_from_dentry_optional_prefix() as it might end up duplicating prefix path from @cifs_sb->prepath into final full path. Instead, generate DFS full path by simply merging @server->origin_fullpath with dentry's path. This fixes the following case mount.cifs //root/dfs/dir /mnt/ -o ... ls /mnt/link where cifs_dfs_do_automount() will call smb3_parse_devname() with @devname set to "//root/dfs/dir/link" instead of "//root/dfs/dir/dir/link". Fixes: 7ad54b98fc1f ("cifs: use origin fullpath for automounts") Cc: <stable@vger.kernel.org> # 6.2+ Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-04-16 18:04:36 -05:00
Peter Xu	2ff559f31a	Revert "userfaultfd: don't fail on unrecognized features" This is a proposal to revert commit 914eedcb9ba0ff53c33808. I found this when writing a simple UFFDIO_API test to be the first unit test in this set. Two things breaks with the commit: - UFFDIO_API check was lost and missing. According to man page, the kernel should reject ioctl(UFFDIO_API) if uffdio_api.api != 0xaa. This check is needed if the api version will be extended in the future, or user app won't be able to identify which is a new kernel. - Feature flags checks were removed, which means UFFDIO_API with a feature that does not exist will also succeed. According to the man page, we should (and it makes sense) to reject ioctl(UFFDIO_API) if unknown features passed in. Link: https://lore.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20230412163922.327282-2-peterx@redhat.com Fixes: 914eedcb9ba0 ("userfaultfd: don't fail on unrecognized features") Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-16 10:41:26 -07:00
Baokun Li	1ba1199ec5	writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs KASAN report null-ptr-deref: ================================================================== BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0 Write of size 8 at addr 0000000000000000 by task sync/943 CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461 Call Trace: <TASK> dump_stack_lvl+0x7f/0xc0 print_report+0x2ba/0x340 kasan_report+0xc4/0x120 kasan_check_range+0x1b7/0x2e0 __kasan_check_write+0x24/0x40 bdi_split_work_to_wbs+0x5c5/0x7b0 sync_inodes_sb+0x195/0x630 sync_inodes_one_sb+0x3a/0x50 iterate_supers+0x106/0x1b0 ksys_sync+0x98/0x160 [...] ================================================================== The race that causes the above issue is as follows: cpu1 cpu2 -------------------------\|------------------------- inode_switch_wbs INIT_WORK(&isw->work, inode_switch_wbs_work_fn) queue_rcu_work(isw_wq, &isw->work) // queue_work async inode_switch_wbs_work_fn wb_put_many(old_wb, nr_switched) percpu_ref_put_many ref->data->release(ref) cgwb_release queue_work(cgwb_release_wq, &wb->release_work) // queue_work async &wb->release_work cgwb_release_workfn ksys_sync iterate_supers sync_inodes_one_sb sync_inodes_sb bdi_split_work_to_wbs kmalloc(sizeof(*work), GFP_ATOMIC) // alloc memory failed percpu_ref_exit ref->data = NULL kfree(data) wb_get(wb) percpu_ref_get(&wb->refcnt) percpu_ref_get_many(ref, 1) atomic_long_add(nr, &ref->data->count) atomic64_add(i, v) // trigger null-ptr-deref bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs. If the allocation of new work fails, the on-stack fallback will be used and the reference count of the current wb is increased afterwards. If cgroup writeback membership switches occur before getting the reference count and the current wb is released as old_wd, then calling wb_get() or wb_put() will trigger the null pointer dereference above. This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb() and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger this issue. For scenarios called via sync_inodes_sb(), originally commit 7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback membership switches") reduced the possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes easily reproducible again. To solve this problem, percpu_ref_exit() is called under RCU protection to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(), and skip the current wb if wb_tryget() fails because the wb has already been shutdown. Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Brauner <brauner@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Hou Tao <houtao1@huawei.com> Cc: yangerkun <yangerkun@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-04-16 10:41:26 -07:00
Gao Xiang	745ed7d778	erofs: cleanup i_format-related stuffs Switch EROFS_I_{VERSION,DATALAYOUT}_BITS into EROFS_I_{VERSION,DATALAYOUT}_MASK. Also avoid erofs_bitrange() since its functionality is simple enough. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230414083027.12307-2-hsiangkao@linux.alibaba.com	2023-04-17 01:15:55 +08:00
Gao Xiang	10656f9ca6	erofs: sunset erofs_dbg() Such debug messages are rarely used now. Let's get rid of these, and revert locally if they are needed for debugging. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230414083027.12307-1-hsiangkao@linux.alibaba.com	2023-04-17 01:15:54 +08:00
Jingbo Xu	1b3567a196	erofs: fix potential overflow calculating xattr_isize Given on-disk i_xattr_icount is 16 bits and xattr_isize is calculated from i_xattr_icount multiplying 4, xattr_isize has a theoretical maximum of 256K (64K * 4). Thus declare xattr_isize as unsigned int to avoid the potential overflow. Fixes: bfb8674dc044 ("staging: erofs: add erofs in-memory stuffs") Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230414061810.6479-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:54 +08:00
Gao Xiang	4fdadd5b0f	erofs: get rid of z_erofs_fill_inode() Prior to big pclusters, non-compact compression indexes could have empty headers. Let's just avoid the legacy path since it can be handled properly as a specific compression header with z_erofs_fill_inode_lazy() too. Tested with erofs-utils exist versions. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230413092241.73829-1-hsiangkao@linux.alibaba.com	2023-04-17 01:15:53 +08:00
Jingbo Xu	6a318ccd7e	erofs: enable long extended attribute name prefixes Let's enable long xattr name prefix feature. Old kernels will just ignore / skip such extended attributes. In addition, in case you don't want to mount such images, add another incompatible feature as an option for this. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407222808.19670-1-jefflexu@linux.alibaba.com [ Gao Xiang: minor commit message fix. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:53 +08:00
Jingbo Xu	82bc1ef41d	erofs: handle long xattr name prefixes properly Make .{list,get}xattr routines adapted to long xattr name prefixes. When the bit 7 of erofs_xattr_entry.e_name_index is set, it indicates that it refers to a long xattr name prefix. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230411093537.127286-1-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:52 +08:00
Jingbo Xu	9e38291461	erofs: add helpers to load long xattr name prefixes Long xattr name prefixes will be scanned upon mounting and the in-memory long xattr name prefix array will be initialized accordingly. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-6-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:52 +08:00
Jingbo Xu	b3bfcb9dbf	erofs: introduce on-disk format for long xattr name prefixes Besides the predefined xattr name prefixes, introduces long xattr name prefixes, which work similarly as the predefined name prefixes, except that they are user specified. It is especially useful for use cases together with overlayfs like Composefs model, which introduces diverse xattr values with only a few common xattr names (trusted.overlay.redirect, trusted.overlay.digest, and maybe more in the future). That makes the existing predefined prefixes ineffective in both image size and runtime performance. When a user specified long xattr name prefix is used, only the trailing part of the xattr name apart from the long xattr name prefix will be stored in erofs_xattr_entry.e_name. e_name is empty if the xattr name matches exactly as the long xattr name prefix. All long xattr prefixes are stored in the packed or meta inode, which depends if fragments feature is enabled or not. For each long xattr name prefix, the on-disk format is kept as the same as the unique metadata format: ALIGN({__le16 len, data}, 4), where len represents the total size of struct erofs_xattr_long_prefix, followed by data of struct erofs_xattr_long_prefix itself. Each erofs_xattr_long_prefix keeps predefined prefixes (base_index) and the remaining prefix string without the trailing '\0'. Two fields are introduced to the on-disk superblock, where xattr_prefix_count represents the total number of the long xattr name prefixes recorded, and xattr_prefix_start represents the start offset of recorded name prefixes in the packed/meta inode divided by 4. When referring to a long xattr name prefix, the highest bit (bit 7) of erofs_xattr_entry.e_name_index is set, while the lower bits (bit 0-6) as a whole represents the index of the referred long name prefix among all long xattr name prefixes. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-5-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:51 +08:00
Jingbo Xu	a97a218b08	erofs: move packed inode out of the compression part packed inode could be used in more scenarios which are independent of compression in the future. For example, packed inode could be used to keep extra long xattr prefixes with the help of following patches. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-4-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:51 +08:00
Gao Xiang	eb2c5e41be	erofs: keep meta inode into erofs_buf So that erofs_read_metadata() can read metadata from other inodes (e.g. packed inode) as well. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-2-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:50 +08:00
Jingbo Xu	cb9bce7951	erofs: initialize packed inode after root inode is assigned As commit 8f7acdae2cd4 ("staging: erofs: kill all failure handling in fill_super()"), move the initialization of packed inode after root inode is assigned, so that the iput() in .put_super() is adequate as the failure handling. Otherwise, iput() is also needed in .kill_sb(), in case of the mounting fails halfway. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Fixes: b15b2e307c3a ("erofs: support on-disk compressed fragments data") Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230407141710.113882-3-jefflexu@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:49 +08:00
Gao Xiang	cc4efd3dd2	erofs: stop parsing non-compact HEAD index if clusterofs is invalid Syzbot generated a crafted image [1] with a non-compact HEAD index of clusterofs 33024 while valid numbers should be 0 ~ lclustersize-1, which causes the following unexpected behavior as below: BUG: unable to handle page fault for address: fffff52101a3fff9 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 23ffed067 P4D 23ffed067 PUD 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 4398 Comm: kworker/u5:1 Not tainted 6.3.0-rc6-syzkaller-g09a9639e56c0 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023 Workqueue: erofs_worker z_erofs_decompressqueue_work RIP: 0010:z_erofs_decompress_queue+0xb7e/0x2b40 ... Call Trace: <TASK> z_erofs_decompressqueue_work+0x99/0xe0 process_one_work+0x8f6/0x1170 worker_thread+0xa63/0x1210 kthread+0x270/0x300 ret_from_fork+0x1f/0x30 Note that normal images or images using compact indexes are not impacted. Let's fix this now. [1] https://lore.kernel.org/r/000000000000ec75b005ee97fbaa@google.com Reported-and-tested-by: syzbot+aafb3f37cfeb6534c4ac@syzkaller.appspotmail.com Fixes: 02827e1796b3 ("staging: erofs: add erofs_map_blocks_iter") Fixes: 152a333a5895 ("staging: erofs: add compacted compression indexes support") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230410173714.104604-1-hsiangkao@linux.alibaba.com	2023-04-17 01:15:49 +08:00
Yue Hu	12c2987e89	erofs: don't warn ztailpacking feature anymore The ztailpacking feature has been merged for a year, it has been mostly stable now. Signed-off-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230227084457.3510-1-zbestahu@gmail.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:48 +08:00
Jingbo Xu	b05f30bcf7	erofs: simplify erofs_xattr_generic_get() erofs_xattr_generic_get() won't be called from xattr handlers other than user/trusted/security xattr handler, and thus there's no need of extra checking. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Link: https://lore.kernel.org/r/20230330082910.125374-4-jefflexu@linux.alibaba.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:48 +08:00
Jingbo Xu	6020ffe76d	erofs: rename init_inode_xattrs with erofs_ prefix Rename init_inode_xattrs() to erofs_init_inode_xattrs() without logic change. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230330082910.125374-3-jefflexu@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:47 +08:00
Jingbo Xu	399f36d8c6	erofs: move several xattr helpers into xattr.c Move xattrblock_addr() and xattrblock_offset() helpers into xattr.c, as they are not used outside of xattr.c. inlinexattr_header_size() has only one caller, and thus make it inlined into the caller directly. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230330082910.125374-2-jefflexu@linux.alibaba.com Reviewed-by: Yue Hu <huyue2@coolpad.com> Acked-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:47 +08:00
Gao Xiang	1c7f49a767	erofs: tidy up EROFS on-disk naming - Get rid of all "vle" (variable-length extents) expressions since they only expand overall name lengths unnecessarily; - Rename COMPRESSION_LEGACY to COMPRESSED_FULL; - Move on-disk directory definitions ahead of compression; - Drop unused extended attribute definitions; - Move inode ondisk union `i_u` out as `union erofs_inode_i_u`. No actual logical change. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230331063149.25611-1-hsiangkao@linux.alibaba.com	2023-04-17 01:15:46 +08:00
Jia Zhu	8b465fecc3	erofs: support flattened block device for multi-blob images In order to support mounting multi-blobs container image as a single block device, add flattened block device feature for EROFS. In this mode, all meta/data contents will be mapped into one block space. User could compose a block device(by nbd/ublk/virtio-blk/ vhost-user-blk) from multiple sources and mount the block device by EROFS directly. It can reduce the number of block devices used, and it's also benefits in both VM file passthrough and distributed storage scenarios. You can test this using the method mentioned by: https://github.com/dragonflyoss/image-service/pull/1139 1. Compose a (nbd)block device from multi-blobs. 2. Mount EROFS on mntdir/. 3. Compare the md5sum between source dir and mntdir/. Later, we could also use it to refer original tar blobs. Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com> Signed-off-by: Xin Yin <yinxin.x@bytedance.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Acked-by: Chao Yu <chao@kernel.org> Tested-by: Jiang Liu <gerry@linux.alibaba.com> Link: https://lore.kernel.org/r/20230302071751.48425-1-zhujia.zj@bytedance.com [ Gao Xiang: refine commit message and use erofs_pos(). ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:46 +08:00
Jingbo Xu	d3c4bdcc75	erofs: set block size to the on-disk block size Set the block size to that specified in on-disk superblock. Also remove the hard constraint of PAGE_SIZE block size for the uncompressed device backend. This constraint is temporarily remained for compressed device and fscache backend, as there is more work needed to handle the condition where the block size is not equal to PAGE_SIZE. It is worth noting that the on-disk block size is read prior to erofs_superblock_csum_verify(), as the read block size is needed in the latter. Besides, later we are going to make erofs refer to tar data blobs (which is 512-byte aligned) for OCI containers, where the block size is 512 bytes. In this case, the 512-byte block size may not be adequate for a directory to contain enough dirents. To fix this, we are also going to introduce directory block size independent on the block size. Due to we have already supported block size smaller than PAGE_SIZE now, disable all these images with such separated directory block size until we supported this feature later. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230313135309.75269-3-jefflexu@linux.alibaba.com [ Gao Xiang: update documentation. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:45 +08:00
Jingbo Xu	3acea5fc33	erofs: avoid hardcoded blocksize for subpage block support As the first step of converting hardcoded blocksize to that specified in on-disk superblock, convert all call sites of hardcoded blocksize to sb->s_blocksize except for: 1) use sbi->blkszbits instead of sb->s_blocksize in erofs_superblock_csum_verify() since sb->s_blocksize has not been updated with the on-disk blocksize yet when the function is called. 2) use inode->i_blkbits instead of sb->s_blocksize in erofs_bread(), since the inode operated on may be an anonymous inode in fscache mode. Currently the anonymous inode is allocated from an anonymous mount maintained in erofs, while in the near future we may allocate anonymous inodes from a generic API directly and thus have no access to the anonymous inode's i_sb. Thus we keep the block size in i_blkbits for anonymous inodes in fscache mode. Be noted that this patch only gets rid of the hardcoded blocksize, in preparation for actually setting the on-disk block size in the following patch. The hard limit of constraining the block size to PAGE_SIZE still exists until the next patch. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230313135309.75269-2-jefflexu@linux.alibaba.com [ Gao Xiang: fold a patch to fix incorrect truncated offsets. ] Link: https://lore.kernel.org/r/20230413035734.15457-1-zhujia.zj@bytedance.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2023-04-17 01:15:44 +08:00
Linus Torvalds	6586c4d480	smb311 server preauth integrity negotiate context parsing fix -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmQ7bhwACgkQiiy9cAdy T1Fk3Av+NcTPMF6ZIhxXN4IwpsvE0KdXm+BB/+dCw82zi2mVAyZowLSFkM3TKqRh 6GOpSnKu2Vp7TCSNdN0ZtnOcC9q8H/SpFmLojBeoiyUr87tjngd7ktTkUd32FEaf jfOqS0+NSZPmhB7eKXJ75jOISMvga0x3t1KHbO7vTm12I5b6VY3r1hxiit0RP0fg 7QKWNwSR8erQMkg8+F+n5q9kAIi88ymrPTx8991JdENqzCjJ0dNMLX7ULwD8SiWa d9PnEFGyQeLoVF/FRQ4hYNRv67Os3xjEFdJtpZKlZ9CKfzgwA1kOYQQRfGb64bBP wQ0Syga8OudYMq6X1jMGsw0qaGxwC32jIA03M05oQ75A8SaXyb1jauHdwNFJqjmH JhSZ6qI77TduYK0v92Oa+Y76miW/RoI5sS8i0GrayjwN8NsBsrHH7JuLS/LSFpc/ vlv0fPqBTRpFP7Yv+JJr8lgY6a8aeAF5R4fYPeyGbOpxXm71Af95ZX5Q3JYNzdz4 ZuEpSVVn =LnMO -----END PGP SIGNATURE----- Merge tag '6.3-rc6-ksmbd-server-fix' of git://git.samba.org/ksmbd Pull ksmbd server fix from Steve French: "smb311 server preauth integrity negotiate context parsing fix (check for out of bounds access)" * tag '6.3-rc6-ksmbd-server-fix' of git://git.samba.org/ksmbd: ksmbd: avoid out of bounds access in decode_preauth_ctxt()	2023-04-16 09:39:55 -07:00
Linus Torvalds	3e7bb4f246	fix for smb311 client negotiate context overflow, also marked for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmQ7N54ACgkQiiy9cAdy T1GSeQwAuz5qds5hFwc7Z37ElcA6wEqVuNRV7xqgYL3QTAw9rUc56JyUIsJMscy9 /PXXKeZdejyWzZ2OShdpS9vciVzgXyVLVVKIDvy0BUif79fIlBwjPYEXBS39Cet0 vjdTNYl7hAXAK0UXA+II/2SoCjvh6ho8quN7hRQPHKr1PMrQWmqtipbyWXzgOUEU TO2MTNFakv2PiFYF0CvGsyHpvbgZWViSqQ7Pt1VpZwCwB3USImBTn8dl6lPJZHXq RWAoE7hoE075xxCUr4/+VTWCJBC2OFhprrhEEOj+y1lQCoMOPO4vDd73ctBkfL0o 01IlE5q5aOZILkg2EramoynDfOElJ8gbcfRi+8s/ErbrWLRiotyBNElK2ig8JNTH aHU/39KFgkVlsfeyXBPLY8sB/WPoVfwjL2k/P/YuUWGfFZoYqqoxs4x9+gS5p3cZ HuFwjiiJFllX3BRtyINysv5u2PfUoKaNs0eFnELSKS4dvWVXts6rjM4fYCNSeI00 nRHbtnYD =makV -----END PGP SIGNATURE----- Merge tag '6.3-rc6-smb311-client-negcontext-fix' of git://git.samba.org/sfrench/cifs-2.6 Pull cifs fix from Steve French: "Small client fix for better checking for smb311 negotiate context overflows, also marked for stable" * tag '6.3-rc6-smb311-client-negcontext-fix' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix negotiate context parsing	2023-04-15 18:37:51 -07:00
David Disseldorp	5105a7ffce	cifs: fix negotiate context parsing smb311_decode_neg_context() doesn't properly check against SMB packet boundaries prior to accessing individual negotiate context entries. This is due to the length check omitting the eight byte smb2_neg_context header, as well as incorrect decrementing of len_of_ctxts. Fixes: 5100d8a3fe03 ("SMB311: Improve checking of negotiate security contexts") Reported-by: Volker Lendecke <vl@samba.org> Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-04-15 18:26:56 -05:00
Jason Yan	54902099b1	ext4: move dax and encrypt checking into ext4_check_feature_compatibility() These checkings are also related with feature compatibility checkings. So move them into ext4_check_feature_compatibility(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20230323140517.1070239-9-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 23:08:03 -04:00
Jason Yan	107d2be901	ext4: factor out ext4_block_group_meta_init() Factor out ext4_block_group_meta_init(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20230323140517.1070239-8-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 23:08:03 -04:00
Jason Yan	269e9226c2	ext4: move s_reserved_gdt_blocks and addressable checking into ext4_check_geometry() These two checkings are more suitable to be put into ext4_check_geometry() rather than spreading outside. Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20230323140517.1070239-7-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 23:08:03 -04:00
Jason Yan	68e624398f	ext4: rename two functions with 'check' The naming styles are different for some functions with 'check' in their names. Some of them are like: ext4_check_quota_consistency ext4_check_test_dummy_encryption ext4_check_opt_consistency ext4_check_descriptors ext4_check_feature_compatibility While the others looks like below: ext4_geometry_check ext4_journal_data_mode_check This is not a big deal and boils down to personal preference. But I'd like to make them consistent. Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20230323140517.1070239-6-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 23:08:03 -04:00
Jason Yan	dcbf87589d	ext4: factor out ext4_flex_groups_free() Factor out ext4_flex_groups_free() and it can be used both in __ext4_fill_super() and ext4_put_super(). Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20230323140517.1070239-5-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 23:08:03 -04:00
Jason Yan	6ef6849888	ext4: use ext4_group_desc_free() in ext4_put_super() to save some duplicated code The only difference here is that ->s_group_desc and ->s_flex_groups share the same rcu read lock here but it is not necessary. In other places they do not share the lock at all. Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20230323140517.1070239-4-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 23:08:02 -04:00
Jason Yan	1f79467c8a	ext4: factor out ext4_percpu_param_init() and ext4_percpu_param_destroy() Factor out ext4_percpu_param_init() and ext4_percpu_param_destroy(). And also use ext4_percpu_param_destroy() in ext4_put_super() to avoid duplicated code. No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20230323140517.1070239-3-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 23:08:02 -04:00
Jason Yan	db9345d9e6	ext4: factor out ext4_hash_info_init() Factor out ext4_hash_info_init() to simplify __ext4_fill_super(). No functional change. Signed-off-by: Jason Yan <yanaijie@huawei.com> Link: https://lore.kernel.org/r/20230323140517.1070239-2-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 23:08:02 -04:00
Jan Kara	d0ab8368c1	Revert "ext4: Fix warnings when freezing filesystem with journaled data" After making ext4_writepages() properly clean all pages there is no need for special treatment of filesystem freezing. Revert commit e6c28a26b799c7640b77daff3e4a67808c74381c. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-13-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 20:02:37 -04:00
Jan Kara	ab382539ad	ext4: Update comment in mpage_prepare_extent_to_map() Since filemap_write_and_wait() is now enough to get journalled data to final location update the comment in mpage_prepare_extent_to_map(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-12-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 19:58:33 -04:00
Jan Kara	951cafa6b8	ext4: Simplify handling of journalled data in ext4_bmap() Now that ext4_writepages() gets journalled data into its final location we just use filemap_write_and_wait() instead of special handling of journalled data in ext4_bmap(). We can also drop EXT4_STATE_JDATA flag as it is not used anymore. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-11-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 19:58:33 -04:00
Jan Kara	7c375870fd	ext4: Drop special handling of journalled data from ext4_quota_on() Now that ext4_writepages() makes sure all journalled data is committed and checkpointed, sync_filesystem() call done by dquot_quota_on() is enough for quota IO to see uptodate data. So drop special handling of journalled data from ext4_quota_on(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-10-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 19:56:53 -04:00
Jan Kara	56c2a0e3d9	ext4: Drop special handling of journalled data from ext4_evict_inode() Now that ext4_writepages() makes sure journalled data is on stable storage, write_inode_now() call in iput_final() is enough to make pagecache pages with journalled data really clean (data committed and checkpointed). So we can drop special handling of journalled data in ext4_evict_inode(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 19:56:53 -04:00
Jan Kara	783ae448b7	ext4: Fix special handling of journalled data from extent zeroing The handling of journalled data in ext4_zero_range() is incomplete. We do not need to commit running transaction but we rather need to checkpoint pages with journalled data. If we don't, journal tail can be advanced beyond transaction containing the journalled data and if we then crash before committing the transaction doing the zeroing we will have inconsistent (too old) data in the file. Make sure file pages with journalled data are properly checkpointed before removing them from the page cache. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-8-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 19:56:53 -04:00
Jan Kara	c000dfec7e	ext4: Drop special handling of journalled data from extent shifting operations Now that filemap_write_and_wait() makes sure pages with journalled data are safely on disk, ext4_collapse_range() and ext4_insert_range() do not need special handling of journalled data. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-7-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 19:56:53 -04:00
Jan Kara	e360c6ed72	ext4: Drop special handling of journalled data from ext4_sync_file() Now that ext4_writepages() make sure all pages with journalled data are stable on disk, we don't need special handling of journalled data in ext4_sync_file(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 19:56:53 -04:00
Jan Kara	1f1a55f0bf	ext4: Commit transaction before writing back pages in data=journal mode When journalling data we currently just walk over pages, journal those that are marked for delayed dirtying (only pinned pages dirtied behing our back these days) and checkpoint other dirty pages. Because some pages may be part of running transaction the result is that after filemap_write_and_wait() we are not guaranteed pages are stable on disk. Thus places that want to flush current pagecache content need to jump through hoops to make sure journalled data is not lost. This is manageable in cases completely controlled by ext4 (such as extent shifting operations or inode eviction) but it gets ugly for stuff like fsverity. Furthermore it is rather error prone as people often do not realize journalled data needs special handling. So change ext4_writepages() to commit transaction with inode's data before going through the writeback loop in WB_SYNC_ALL mode. As a result filemap_write_and_wait() is now really getting pages to stable storage and makes pagecache pages safe to reclaim. Consequently we can remove the special handling of journalled data from several places in follow up patches. Note that this will make fsync(2) for journalled data more expensive as we will end up not only committing the transaction we need but also checkpointing the data (which we may have previously skipped if the data was part of the running transaction). If we really cared, we would need to introduce special VFS function for writing out & invalidating page cache for a range, use ->launder_page callback to perform checkpointing, and use it from all the places that need this functionality. But at this point I'm not convinced the complexity is worth it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-04-14 19:56:53 -04:00

... 3 4 5 6 7 ...

81399 Commits