We were seeing weird errors when we were testing our btrfs backports
before we had the incorrect level check fix. These errors appeared to
be improper error handling, but error injection testing uncovered that
the errors were a result of corruption that occurred from improper error
handling during snapshot delete.
During snapshot delete, if we encounter any errors in walk_down or
walk_up we simply return an error without aborting the transaction.
This is problematic because we will be dropping references for nodes and
leaves along the way, and if we fail in the middle we will leave the
file system corrupt because we don't know where we left off in the drop.
Fix this by making sure we abort if we hit any errors during the walk
down or walk up operations, as we have no idea what operations could
have been left half done at this point.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can get errors in walk_down_proc as we try to look up extent info for
the snapshot dropping to act on. However, if we get an error we simply
return 1, which indicates we're done with walking down, and that leads
us to improperly continue with the snapshot drop using incorrect
information. Instead, break if we get any error from walk_down_proc or
do_walk_down, and handle the case of ret == 1 by returning 0; otherwise
return the ret value that we have.
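The return-value handling described above can be illustrated with a small
userspace sketch. It is not the btrfs code: walk_step_fn, walk_down_sketch()
and the step_* helpers are hypothetical names, and the loop is simplified to
stop on any non-zero step result.

```c
#include <errno.h>

/* Hypothetical stand-in for one walk_down_proc()/do_walk_down() step:
 * 0 = keep descending, 1 = nothing more to walk, <0 = error. */
typedef int (*walk_step_fn)(int level);

/* Sketch of the return handling only: stop on any non-zero result,
 * then map the "done" value 1 to success and pass real errors up. */
static int walk_down_sketch(int start_level, walk_step_fn step)
{
	int ret = 0;

	for (int level = start_level; level >= 0; level--) {
		ret = step(level);
		if (ret)
			break;
	}
	return ret == 1 ? 0 : ret;
}

/* Example steps used to exercise the three outcomes. */
static int step_ok(int level) { (void)level; return 0; }
static int step_done_at_1(int level) { return level == 1 ? 1 : 0; }
static int step_eio_at_1(int level) { return level == 1 ? -EIO : 0; }
```

With these helpers, a walk that finishes early still reports success, while
an injected -EIO reaches the caller instead of being treated as "done".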
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we mount the file system we do something like this:
    while (1) {
        lookup fs roots;
        for (i = 0; i < num_roots; i++) {
            ret = btrfs_orphan_cleanup(roots[i]);
            if (ret)
                break;
            btrfs_put_root(roots[i]);
        }
    }
    for (; i < num_roots; i++)
        btrfs_put_root(roots[i]);
As you can see if we break in that inner loop we just go back to the
outer loop and lose the fact that we have to drop references on the
remaining roots we looked up. Fix this by making an out label and
jumping to that on error so we don't leak a reference to the roots we
looked up.
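A minimal userspace sketch of the out-label pattern described above, for one
batch of looked-up roots. The struct and the *_sketch helpers are
hypothetical stand-ins for illustration, not the btrfs functions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a looked-up root holding one reference. */
struct root_sketch {
	int refs;
	int cleanup_ret;	/* value the fake orphan cleanup returns */
};

static int orphan_cleanup_sketch(struct root_sketch *root)
{
	return root->cleanup_ret;
}

static void put_root_sketch(struct root_sketch *root)
{
	root->refs--;
}

/* Processes one batch of roots. On error, jump to the out label so the
 * references on the roots not yet processed are still dropped. */
static int cleanup_batch_sketch(struct root_sketch **roots, size_t num_roots)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < num_roots; i++) {
		ret = orphan_cleanup_sketch(roots[i]);
		if (ret)
			goto out;
		put_root_sketch(roots[i]);
	}
out:
	/* i indexes the first root whose reference is still held. */
	for (; i < num_roots; i++)
		put_root_sketch(roots[i]);
	return ret;
}
```

On success the trailing loop does nothing, because i already equals
num_roots; on failure it drops the reference of the failing root and of
every root after it.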
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We missed a couple of iput()s in the orphan cleanup failure paths, add
them so we don't get refcount errors. The iput needs to be done in the
check and not under a common label due to the way the code is
structured.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While investigating a problem with error injection I tripped over
curious behavior in the node/leaf splitting code. If we get an EIO when
trying to read either the left or right leaf/node for splitting we'll
simply treat the node as if it were full and continue on. The end
result of this isn't too bad: we simply end up allocating a block when
we could have pushed items into the adjacent blocks instead.
However this does essentially allow us to continue to modify a file
system that we've gotten errors on, either from a bad disk or csum
mismatch or other corruption. This isn't particularly safe, so instead
handle these btrfs_read_node_slot() usages differently. We allow you to
pass in any slot, the idea being that we save some code if the slot
number is outside of the range of the parent. This means we treat all
errors the same, when in reality we only want to ignore -ENOENT.
Fix this by changing how we call btrfs_read_node_slot(), which is to
only call it for slots we know are valid. This way if we get an error
back from reading the block we can properly pass the error up the chain.
This was validated with the error injection testing I was doing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_read_node_slot() we have a BUG_ON() that can be converted to an
ASSERT(): the level comes from an extent buffer and is validated at the
time it's read from disk.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While trying to track down a lost EIO problem I hit the following
assertion while doing my error injection testing:
BTRFS warning (device nvme1n1): transaction 1609 (with 180224 dirty metadata bytes) is not committed
assertion failed: !found, in fs/btrfs/disk-io.c:4456
------------[ cut here ]------------
kernel BUG at fs/btrfs/messages.h:169!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 1445 Comm: mount Tainted: G W 6.2.0-rc5+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
RIP: 0010:btrfs_assertfail.constprop.0+0x18/0x1a
RSP: 0018:ffffb95fc3b0bc68 EFLAGS: 00010286
RAX: 0000000000000034 RBX: ffff9941c2ac2000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffb6741f7d RDI: 00000000ffffffff
RBP: ffff9941c2ac2428 R08: 0000000000000000 R09: ffffb95fc3b0bb38
R10: 0000000000000003 R11: ffffffffb71438a8 R12: ffff9941c2ac2428
R13: ffff9941c2ac2450 R14: ffff9941c2ac2450 R15: 000000000002c000
FS: 00007fcea2d07800(0000) GS:ffff9941fbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f00cc7c83a8 CR3: 000000010c686000 CR4: 0000000000350ef0
Call Trace:
<TASK>
close_ctree+0x426/0x48f
btrfs_mount_root.cold+0x7e/0xee
? legacy_parse_param+0x2b/0x220
legacy_get_tree+0x2b/0x50
vfs_get_tree+0x29/0xc0
vfs_kern_mount.part.0+0x73/0xb0
btrfs_mount+0x11d/0x3d0
? legacy_parse_param+0x2b/0x220
legacy_get_tree+0x2b/0x50
vfs_get_tree+0x29/0xc0
path_mount+0x438/0xa40
__x64_sys_mount+0xe9/0x130
do_syscall_64+0x3e/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
This is because the error injection did an EIO for the root inode lookup
and we simply jumped to closing the ctree. However because we didn't
mark the file system as having an error we skipped all of the broken
transaction cleanup stuff, and thus triggered this ASSERT(). Fix this
by calling btrfs_handle_fs_error() in this case so we have the error set
on the file system.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@server->origin_fullpath already contains the tree name + optional
prefix, so avoid calling __build_path_from_dentry_optional_prefix() as
it might end up duplicating prefix path from @cifs_sb->prepath into
final full path.
Instead, generate DFS full path by simply merging
@server->origin_fullpath with dentry's path.
This fixes the following case
mount.cifs //root/dfs/dir /mnt/ -o ...
ls /mnt/link
where cifs_dfs_do_automount() will call smb3_parse_devname() with
@devname set to "//root/dfs/dir/link" instead of
"//root/dfs/dir/dir/link".
Fixes: 7ad54b98fc1f ("cifs: use origin fullpath for automounts")
Cc: <stable@vger.kernel.org> # 6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This is a proposal to revert commit 914eedcb9ba0ff53c33808.
I found this when writing a simple UFFDIO_API test to be the first unit
test in this set. Two things break with the commit:
- The UFFDIO_API check was lost. According to the man page, the
kernel should reject ioctl(UFFDIO_API) if uffdio_api.api != 0xaa. This
check is needed in case the api version is extended in the future;
otherwise a user app cannot tell whether it is running on a new kernel.
- Feature flag checks were removed, which means UFFDIO_API with a
feature that does not exist will also succeed. According to the man
page, we should (and it makes sense to) reject ioctl(UFFDIO_API) if
unknown features are passed in.
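The two checks listed above can be sketched as follows. This is an
illustration only, not the kernel's handler: SKETCH_UFFD_API,
uffdio_api_check_sketch() and the supported_features mask are assumed names
for this example.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_UFFD_API 0xAAULL	/* API version named in the text (0xaa) */

/* Returns 0 if the request is acceptable, -EINVAL otherwise.
 * supported_features is a hypothetical mask of the feature bits that
 * this "kernel" knows about. */
static int uffdio_api_check_sketch(uint64_t api, uint64_t features,
				   uint64_t supported_features)
{
	/* Reject an unexpected API version. */
	if (api != SKETCH_UFFD_API)
		return -EINVAL;
	/* Reject any requested feature bit that is not supported. */
	if (features & ~supported_features)
		return -EINVAL;
	return 0;
}
```

For example, with a supported mask of 0x3, a request for feature 0x1 passes
while a request for feature 0x4 is rejected.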
Link: https://lore.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20230412163922.327282-2-peterx@redhat.com
Fixes: 914eedcb9ba0 ("userfaultfd: don't fail on unrecognized features")
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
KASAN reports a null-ptr-deref:
==================================================================
BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
Write of size 8 at addr 0000000000000000 by task sync/943
CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0xc0
print_report+0x2ba/0x340
kasan_report+0xc4/0x120
kasan_check_range+0x1b7/0x2e0
__kasan_check_write+0x24/0x40
bdi_split_work_to_wbs+0x5c5/0x7b0
sync_inodes_sb+0x195/0x630
sync_inodes_one_sb+0x3a/0x50
iterate_supers+0x106/0x1b0
ksys_sync+0x98/0x160
[...]
==================================================================
The race that causes the above issue is as follows:
    cpu1                     cpu2
-------------------------|-------------------------
inode_switch_wbs
 INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
 queue_rcu_work(isw_wq, &isw->work)
 // queue_work async
  inode_switch_wbs_work_fn
   wb_put_many(old_wb, nr_switched)
    percpu_ref_put_many
     ref->data->release(ref)
     cgwb_release
      queue_work(cgwb_release_wq, &wb->release_work)
      // queue_work async
       &wb->release_work
       cgwb_release_workfn
                            ksys_sync
                             iterate_supers
                              sync_inodes_one_sb
                               sync_inodes_sb
                                bdi_split_work_to_wbs
                                 kmalloc(sizeof(*work), GFP_ATOMIC)
                                 // alloc memory failed
        percpu_ref_exit
         ref->data = NULL
         kfree(data)
                                 wb_get(wb)
                                  percpu_ref_get(&wb->refcnt)
                                   percpu_ref_get_many(ref, 1)
                                    atomic_long_add(nr, &ref->data->count)
                                     atomic64_add(i, v)
                                     // trigger null-ptr-deref
bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
wbs. If the allocation of new work fails, the on-stack fallback will be
used and the reference count of the current wb is increased afterwards.
If cgroup writeback membership switches occur before getting the reference
count and the current wb is released as old_wb, then calling wb_get() or
wb_put() will trigger the null pointer dereference above.
This issue was introduced in v4.3-rc7 (see fix tag1). Both
sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
bdi_split_work_to_wbs() can trigger this issue. For scenarios called via
sync_inodes_sb(), originally commit 7fc5854f8c6e ("writeback: synchronize
sync(2) against cgroup writeback membership switches") reduced the
possibility of the issue by adding wb_switch_rwsem, but a change in
v5.14-rc1 (see fix tag2) removed the
"inode_io_list_del_locked(inode, old_wb)" call from
inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
and the issue becomes easily reproducible again.
To solve this problem, percpu_ref_exit() is called under RCU protection to
avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs().
Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
and skip the current wb if wb_tryget() fails because the wb has already
been shut down.
Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Switch EROFS_I_{VERSION,DATALAYOUT}_BITS into
EROFS_I_{VERSION,DATALAYOUT}_MASK.
Also avoid erofs_bitrange() since its functionality is simple enough.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230414083027.12307-2-hsiangkao@linux.alibaba.com
Given on-disk i_xattr_icount is 16 bits and xattr_isize is calculated
from i_xattr_icount multiplying 4, xattr_isize has a theoretical maximum
of 256K (64K * 4).
Thus declare xattr_isize as unsigned int to avoid the potential overflow.
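The overflow can be illustrated with the simplified "count times 4" formula
from the text. The function names are hypothetical, the narrow variant
assumes a 16-bit field purely for illustration, and unsigned int is assumed
to be at least 32 bits wide.

```c
#include <stdint.h>

/* Simplified size formula from the text: 4 bytes per count unit,
 * kept in an unsigned int so the full range fits. */
static unsigned int xattr_isize_wide(uint16_t i_xattr_icount)
{
	return (unsigned int)i_xattr_icount * 4;
}

/* Same computation stored in a 16-bit field: the high bits are lost. */
static uint16_t xattr_isize_narrow(uint16_t i_xattr_icount)
{
	return (uint16_t)((unsigned int)i_xattr_icount * 4);
}
```

A count of 0x4000 gives 65536 in the wide variant but wraps to 0 in the
narrow one.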
Fixes: bfb8674dc044 ("staging: erofs: add erofs in-memory stuffs")
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230414061810.6479-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Prior to big pclusters, non-compact compression indexes could have
empty headers.
Let's just avoid the legacy path since it can be handled properly
as a specific compression header with z_erofs_fill_inode_lazy() too.
Tested with existing erofs-utils versions.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230413092241.73829-1-hsiangkao@linux.alibaba.com
Let's enable long xattr name prefix feature. Old kernels will just
ignore / skip such extended attributes. In addition, in case you
don't want to mount such images, add another incompatible feature as
an option for this.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407222808.19670-1-jefflexu@linux.alibaba.com
[ Gao Xiang: minor commit message fix. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Make .{list,get}xattr routines adapted to long xattr name prefixes.
When the bit 7 of erofs_xattr_entry.e_name_index is set, it indicates
that it refers to a long xattr name prefix.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230411093537.127286-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Long xattr name prefixes will be scanned upon mounting and the in-memory
long xattr name prefix array will be initialized accordingly.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407141710.113882-6-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Besides the predefined xattr name prefixes, introduce long xattr name
prefixes, which work similarly to the predefined name prefixes, except
that they are user specified.
It is especially useful for use cases together with overlayfs, like the
Composefs model, which introduces diverse xattr values with only a few
common xattr names (trusted.overlay.redirect, trusted.overlay.digest,
and maybe more in the future). That makes the existing predefined
prefixes ineffective in both image size and runtime performance.
When a user-specified long xattr name prefix is used, only the trailing
part of the xattr name after the long xattr name prefix will be
stored in erofs_xattr_entry.e_name. e_name is empty if the xattr name
exactly matches the long xattr name prefix. All long xattr prefixes
are stored in the packed or meta inode, depending on whether the
fragments feature is enabled.
For each long xattr name prefix, the on-disk format is kept the same
as the unique metadata format: ALIGN({__le16 len, data}, 4), where len
represents the total size of struct erofs_xattr_long_prefix, followed
by the data of struct erofs_xattr_long_prefix itself.
Each erofs_xattr_long_prefix keeps predefined prefixes (base_index)
and the remaining prefix string without the trailing '\0'.
Two fields are introduced to the on-disk superblock, where
xattr_prefix_count represents the total number of the long xattr name
prefixes recorded, and xattr_prefix_start represents the start offset of
recorded name prefixes in the packed/meta inode divided by 4.
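A short sketch of the arithmetic implied by the format above: each record
takes ALIGN(2 + len, 4) bytes, and the byte offset of the prefix area is
xattr_prefix_start multiplied by 4. The helper names are hypothetical and
the 32-bit width used for xattr_prefix_start is an assumption of this
example.

```c
#include <stdint.h>

/* Round x up to the next multiple of 4. */
static uint32_t align4(uint32_t x)
{
	return (x + 3u) & ~3u;
}

/* Bytes occupied by one record: a 2-byte length field followed by
 * len bytes of data, padded to a 4-byte boundary. */
static uint32_t prefix_record_size(uint16_t len)
{
	return align4(2u + len);
}

/* Byte offset of the prefix area, given the stored start value,
 * which the text describes as the offset divided by 4. */
static uint64_t prefix_area_offset(uint32_t xattr_prefix_start)
{
	return (uint64_t)xattr_prefix_start * 4;
}
```

For instance, a record with len = 5 occupies 8 bytes, and a stored start
value of 3 corresponds to byte offset 12.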
When referring to a long xattr name prefix, the highest bit (bit 7) of
erofs_xattr_entry.e_name_index is set, while the lower bits (bits 0-6)
as a whole represent the index of the referred long name prefix among
all long xattr name prefixes.
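Decoding e_name_index as described above could look like the following
sketch. The macro and function names are made up for this example; only the
bit layout (bit 7 as the flag, bits 0-6 as the index) comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_LONG_PREFIX_FLAG 0x80u	/* bit 7 */
#define SKETCH_LONG_PREFIX_MASK 0x7fu	/* bits 0-6 */

/* True if the entry refers to a long xattr name prefix. */
static bool is_long_prefix(uint8_t e_name_index)
{
	return (e_name_index & SKETCH_LONG_PREFIX_FLAG) != 0;
}

/* Index of the referred prefix among all long xattr name prefixes. */
static unsigned int long_prefix_index(uint8_t e_name_index)
{
	return e_name_index & SKETCH_LONG_PREFIX_MASK;
}
```

With this layout, an e_name_index of 0x83 refers to long prefix number 3,
while 0x01 is treated as a predefined prefix index.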
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407141710.113882-5-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The packed inode could be used in more scenarios which are independent
of compression in the future.
For example, the packed inode could be used to keep extra long xattr
prefixes with the help of the following patches.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407141710.113882-4-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
So that erofs_read_metadata() can read metadata from other inodes
(e.g. packed inode) as well.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407141710.113882-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Following commit 8f7acdae2cd4 ("staging: erofs: kill all failure
handling in fill_super()"), move the initialization of the packed inode
after the root inode is assigned, so that the iput() in .put_super() is
adequate as the failure handling.
Otherwise, iput() is also needed in .kill_sb(), in case mounting
fails halfway.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Fixes: b15b2e307c3a ("erofs: support on-disk compressed fragments data")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407141710.113882-3-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The ztailpacking feature was merged a year ago and has been mostly
stable since.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230227084457.3510-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
erofs_xattr_generic_get() won't be called from xattr handlers other than
the user/trusted/security xattr handlers, and thus there's no need for
extra checks.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230330082910.125374-4-jefflexu@linux.alibaba.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Move xattrblock_addr() and xattrblock_offset() helpers into xattr.c,
as they are not used outside of xattr.c.
inlinexattr_header_size() has only one caller, so inline it into the
caller directly.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230330082910.125374-2-jefflexu@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
- Get rid of all "vle" (variable-length extents) expressions
since they only expand overall name lengths unnecessarily;
- Rename COMPRESSION_LEGACY to COMPRESSED_FULL;
- Move on-disk directory definitions ahead of compression;
- Drop unused extended attribute definitions;
- Move inode ondisk union `i_u` out as `union erofs_inode_i_u`.
No actual logical change.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230331063149.25611-1-hsiangkao@linux.alibaba.com
In order to support mounting a multi-blob container image as a single
block device, add a flattened block device feature for EROFS.
In this mode, all meta/data contents will be mapped into one block
space. Users could compose a block device (by nbd/ublk/virtio-blk/
vhost-user-blk) from multiple sources and mount the block device by
EROFS directly. It can reduce the number of block devices used, and
it also benefits both VM file passthrough and distributed storage
scenarios.
You can test this using the method mentioned by:
https://github.com/dragonflyoss/image-service/pull/1139
1. Compose an (nbd) block device from multiple blobs.
2. Mount EROFS on mntdir/.
3. Compare the md5sum between source dir and mntdir/.
Later, we could also use it to refer to the original tar blobs.
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Tested-by: Jiang Liu <gerry@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230302071751.48425-1-zhujia.zj@bytedance.com
[ Gao Xiang: refine commit message and use erofs_pos(). ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Set the block size to that specified in on-disk superblock.
Also remove the hard constraint of PAGE_SIZE block size for the
uncompressed device backend. This constraint temporarily remains
for the compressed device and fscache backends, as there is more work
needed to handle the case where the block size is not equal to PAGE_SIZE.
It is worth noting that the on-disk block size is read prior to
erofs_superblock_csum_verify(), as the read block size is needed in the
latter.
Besides, later we are going to make erofs refer to tar data blobs (which
are 512-byte aligned) for OCI containers, where the block size is 512
bytes. In this case, the 512-byte block size may not be adequate for a
directory to contain enough dirents. To fix this, we are also going to
introduce a directory block size independent of the block size.
Since block sizes smaller than PAGE_SIZE are now supported, reject all
images with such a separate directory block size until that feature is
supported later.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230313135309.75269-3-jefflexu@linux.alibaba.com
[ Gao Xiang: update documentation. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
As the first step of converting hardcoded blocksize to that specified in
on-disk superblock, convert all call sites of hardcoded blocksize to
sb->s_blocksize except for:
1) use sbi->blkszbits instead of sb->s_blocksize in
erofs_superblock_csum_verify() since sb->s_blocksize has not been
updated with the on-disk blocksize yet when the function is called.
2) use inode->i_blkbits instead of sb->s_blocksize in erofs_bread(),
since the inode operated on may be an anonymous inode in fscache mode.
Currently the anonymous inode is allocated from an anonymous mount
maintained in erofs, while in the near future we may allocate anonymous
inodes from a generic API directly and thus have no access to the
anonymous inode's i_sb. Thus we keep the block size in i_blkbits for
anonymous inodes in fscache mode.
Note that this patch only gets rid of the hardcoded blocksize, in
preparation for actually setting the on-disk block size in the following
patch. The hard limit of constraining the block size to PAGE_SIZE still
exists until the next patch.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230313135309.75269-2-jefflexu@linux.alibaba.com
[ Gao Xiang: fold a patch to fix incorrect truncated offsets. ]
Link: https://lore.kernel.org/r/20230413035734.15457-1-zhujia.zj@bytedance.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Merge tag '6.3-rc6-ksmbd-server-fix' of git://git.samba.org/ksmbd
Pull ksmbd server fix from Steve French:
"smb311 server preauth integrity negotiate context parsing fix (check
for out of bounds access)"
* tag '6.3-rc6-ksmbd-server-fix' of git://git.samba.org/ksmbd:
ksmbd: avoid out of bounds access in decode_preauth_ctxt()
smb311_decode_neg_context() doesn't properly check against SMB packet
boundaries prior to accessing individual negotiate context entries. This
is due to the length check omitting the eight byte smb2_neg_context
header, as well as incorrect decrementing of len_of_ctxts.
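The kind of bounds checking at issue can be sketched in userspace C as
below. This is not the ksmbd code: the function and macro names are
hypothetical. The layout assumed is the SMB2 negotiate context header
(16-bit ContextType, 16-bit DataLength, 32-bit Reserved), with contexts
aligned to 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define NEG_CTXT_HDR_LEN 8u	/* eight-byte header mentioned in the text */

/* Read a little-endian 16-bit value. */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Walks ctxt_count contexts in buf[0..len). Returns the number of
 * contexts walked, or -1 if a header or payload would cross the end
 * of the buffer. */
static int walk_neg_contexts_sketch(const uint8_t *buf, size_t len,
				    unsigned int ctxt_count)
{
	size_t off = 0;
	unsigned int i;

	for (i = 0; i < ctxt_count; i++) {
		size_t data_len, total;

		/* The header itself must fit before any field is read. */
		if (len - off < NEG_CTXT_HDR_LEN)
			return -1;
		data_len = get_le16(buf + off + 2);	/* DataLength */
		/* The payload must fit in what remains after the header. */
		if (data_len > len - off - NEG_CTXT_HDR_LEN)
			return -1;
		/* Advance by header plus padded payload; the final
		 * context may lack trailing padding, so clamp. */
		total = NEG_CTXT_HDR_LEN + ((data_len + 7u) & ~(size_t)7u);
		if (total > len - off)
			total = len - off;
		off += total;
	}
	return (int)i;
}
```

The key point is that the remaining length is compared against the header
size before the header is read, and the payload length is compared against
what is left after the header.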
Fixes: 5100d8a3fe03 ("SMB311: Improve checking of negotiate security contexts")
Reported-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
These checks are also related to feature compatibility checks.
So move them into ext4_check_feature_compatibility(). No functional
change.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20230323140517.1070239-9-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The naming styles are different for some functions with 'check' in their
names. Some of them are like:
ext4_check_quota_consistency
ext4_check_test_dummy_encryption
ext4_check_opt_consistency
ext4_check_descriptors
ext4_check_feature_compatibility
while the others look like this:
ext4_geometry_check
ext4_journal_data_mode_check
This is not a big deal and boils down to personal preference. But I'd
like to make them consistent.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20230323140517.1070239-6-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The only difference here is that ->s_group_desc and ->s_flex_groups share
the same rcu read lock here but it is not necessary. In other places they
do not share the lock at all.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20230323140517.1070239-4-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Factor out ext4_percpu_param_init() and ext4_percpu_param_destroy(). And
also use ext4_percpu_param_destroy() in ext4_put_super() to avoid
duplicated code. No functional change.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20230323140517.1070239-3-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After making ext4_writepages() properly clean all pages there is no need
for special treatment of filesystem freezing. Revert commit
e6c28a26b799c7640b77daff3e4a67808c74381c.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-13-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since filemap_write_and_wait() is now enough to get journalled data to
its final location, update the comment in mpage_prepare_extent_to_map().
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-12-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that ext4_writepages() gets journalled data into its final location
we just use filemap_write_and_wait() instead of special handling of
journalled data in ext4_bmap(). We can also drop EXT4_STATE_JDATA flag
as it is not used anymore.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-11-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that ext4_writepages() makes sure all journalled data is committed
and checkpointed, sync_filesystem() call done by dquot_quota_on() is
enough for quota IO to see up-to-date data. So drop special handling of
journalled data from ext4_quota_on().
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-10-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that ext4_writepages() makes sure journalled data is on stable
storage, write_inode_now() call in iput_final() is enough to make
pagecache pages with journalled data really clean (data committed and
checkpointed). So we can drop special handling of journalled data in
ext4_evict_inode().
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-9-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The handling of journalled data in ext4_zero_range() is incomplete. We
do not need to commit running transaction but we rather need to
checkpoint pages with journalled data. If we don't, journal tail can be
advanced beyond transaction containing the journalled data and if we
then crash before committing the transaction doing the zeroing we will
have inconsistent (too old) data in the file. Make sure file pages with
journalled data are properly checkpointed before removing them from the
page cache.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that filemap_write_and_wait() makes sure pages with journalled data
are safely on disk, ext4_collapse_range() and ext4_insert_range() do
not need special handling of journalled data.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that ext4_writepages() makes sure all pages with journalled data are
stable on disk, we don't need special handling of journalled data in
ext4_sync_file().
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When journalling data we currently just walk over pages, journal those
that are marked for delayed dirtying (only pinned pages dirtied behind
our back these days) and checkpoint other dirty pages. Because some
pages may be part of running transaction the result is that after
filemap_write_and_wait() we are not guaranteed pages are stable on disk.
Thus places that want to flush current pagecache content need to jump
through hoops to make sure journalled data is not lost. This is
manageable in cases completely controlled by ext4 (such as extent
shifting operations or inode eviction) but it gets ugly for stuff like
fsverity. Furthermore it is rather error prone as people often do not
realize journalled data needs special handling.
So change ext4_writepages() to commit transaction with inode's data
before going through the writeback loop in WB_SYNC_ALL mode. As a result
filemap_write_and_wait() is now really getting pages to stable storage
and makes pagecache pages safe to reclaim. Consequently we can remove
the special handling of journalled data from several places in follow up
patches.
Note that this will make fsync(2) for journalled data more expensive as
we will end up not only committing the transaction we need but also
checkpointing the data (which we may have previously skipped if the data
was part of the running transaction). If we really cared, we would need
to introduce a special VFS function for writing out & invalidating page
cache for a range, use ->launder_page callback to perform checkpointing,
and use it from all the places that need this functionality. But at this
point I'm not convinced the complexity is worth it.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>