IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
All the timestamps in vfat_create() and vfat_mkdir() come from
fat_time_fat2unix() which ensures time granularity. We don't need to
truncate them to fit FAT's format.
Moreover, fat_truncate_crtime() and fat_timespec64_trunc_10ms() are also
removed because there is no caller anymore.
Link: https://lkml.kernel.org/r/20220503152536.2503003-4-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
creation time is no longer mixed with change time. Add an in-memory field
for it, and report it in statx if supported.
Link: https://lkml.kernel.org/r/20220503152536.2503003-3-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
FAT supports creation time but not change time, and there was no
corresponding timestamp for creation time in previous VFS. The original
implementation took the compromise of saving the in-memory change time
into the on-disk creation time field, but this would lead to compatibility
issues with non-linux systems.
To address this issue, this patch changes the behavior of ctime. It will
no longer be loaded and stored from the creation time on disk. Instead of
that, it'll be consistent with the in-memory mtime and share the same
on-disk field. All updates to mtime will also be applied to ctime in
memory, while all updates to ctime will be ignored.
Link: https://lkml.kernel.org/r/20220503152536.2503003-2-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Separate fat_truncate_time() to each timestamps for later creation time
work.
This patch does not introduce any functional changes, it's merely
refactoring change.
Link: https://lkml.kernel.org/r/20220503152536.2503003-1-cccheng@synology.com
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently it requires poking at debugfs to figure out the size and
population of the zswap cache on a host. There are no counters for reads
and writes against the cache. As a result, it's difficult to understand
zswap behavior on production systems.
Print zswap memory consumption and how many pages are zswapped out in
/proc/meminfo. Count zswapouts and zswapins in /proc/vmstat.
Link: https://lkml.kernel.org/r/20220510152847.230957-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Previously the protection of kernfs_pr_cont_buf was piggy backed by
rename_lock, which means that pr_cont() needs to be protected under
rename_lock. This can cause potential circular lock dependencies.
If there is an OOM, we have the following call hierarchy:
-> cpuset_print_current_mems_allowed()
-> pr_cont_cgroup_name()
-> pr_cont_kernfs_name()
pr_cont_kernfs_name() will grab rename_lock and call printk. So we have
the following lock dependencies:
kernfs_rename_lock -> console_sem
Sometimes, printk does a wakeup before releasing console_sem, which has
the dependence chain:
console_sem -> p->pi_lock -> rq->lock
Now, imagine one wants to read cgroup_name under rq->lock, for example,
printing cgroup_name in a tracepoint in the scheduler code. They will
be holding rq->lock and take rename_lock:
rq->lock -> kernfs_rename_lock
Now they will deadlock.
A prevention to this circular lock dependency is to separate the
protection of pr_cont_buf from rename_lock. In principle, rename_lock
is to protect the integrity of cgroup name when copying to buf. Once
pr_cont_buf has got its content, rename_lock can be dropped. So it's
safe to drop rename_lock after kernfs_name_locked (and
kernfs_path_from_node_locked) and rely on a dedicated pr_cont_lock
to protect pr_cont_buf.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20220516190951.3144144-1-haoluo@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Update client_info_show to show state of courtesy client
and seconds since last renew.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This patch allows expired client with lock state to be in COURTESY
state. Lock conflict with COURTESY client is resolved by the fs/lock
code using the lm_lock_expirable and lm_expire_lock callback in the
struct lock_manager_operations.
If conflict client is in COURTESY state, set it to EXPIRABLE and
schedule the laundromat to run immediately to expire the client. The
callback lm_expire_lock waits for the laundromat to flush its work
queue before returning to caller.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add 2 new callbacks, lm_lock_expirable and lm_expire_lock, to
lock_manager_operations to allow the lock manager to take appropriate
action to resolve the lock conflict if possible.
A new field, lm_mod_owner, is also added to lock_manager_operations.
The lm_mod_owner is used by the fs/lock code to make sure the lock
manager module such as nfsd, is not freed while lock conflict is being
resolved.
lm_lock_expirable checks and returns true to indicate that the lock
conflict can be resolved else return false. This callback must be
called with the flc_lock held so it can not block.
lm_expire_lock is called to resolve the lock conflict if the returned
value from lm_lock_expirable is true. This callback is called without
the flc_lock held since it's allowed to block. Upon returning from
this callback, the lock conflict should be resolved and the caller is
expected to restart the conflict check from the beginnning of the list.
Lock manager, such as NFSv4 courteous server, uses this callback to
resolve conflict by destroying lock owner, or the NFSv4 courtesy client
(client that has expired but allowed to maintains its states) that owns
the lock.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Add helper locks_owner_has_blockers to check if there is any blockers
for a given lockowner.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
This patch moves create/destroy of laundry_wq from nfs4_state_start
and nfs4_state_shutdown_net to init_nfsd and exit_nfsd to prevent
the laundromat from being freed while a thread is processing a
conflicting lock.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This patch allows expired client with open state to be in COURTESY
state. Share/access conflict with COURTESY client is resolved by
setting COURTESY client to EXPIRABLE state, schedule laundromat
to run and returning nfserr_jukebox to the request client.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This patch provides courteous server support for delegation only.
Only expired client with delegation but no conflict and no open
or lock state is allowed to be in COURTESY state.
Delegation conflict with COURTESY/EXPIRABLE client is resolved by
setting it to EXPIRABLE, queue work for the laundromat and return
delay to the caller. Conflict is resolved when the laudromat runs
and expires the EXIRABLE client while the NFS client retries the
OPEN request. Local thread request that gets conflict is doing the
retry in _break_lease.
Client in COURTESY or EXPIRABLE state is allowed to reconnect and
continues to have access to its state. Access to the nfs4_client by
the reconnecting thread and the laundromat is serialized via the
client_lock.
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
nfsd_splice_actor() checks that the page being spliced does not
match the previous element in the svc_rqst::rq_pages array. We
believe this is to prevent a double put_page() in cases where the
READ payload is partially contained in the xdr_buf's head buffer.
However, the NFSD READ proc functions no longer place any part of
the READ payload in the head buffer, in order to properly support
NFS/RDMA READ with Write chunks. Therefore, simplify the logic in
nfsd_splice_actor() to remove this unnecessary check.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Yanming reported a kernel bug in Bugzilla kernel [1], which can be
reproduced. The bug message is:
The kernel message is shown below:
kernel BUG at fs/inode.c:611!
Call Trace:
evict+0x282/0x4e0
__dentry_kill+0x2b2/0x4d0
dput+0x2dd/0x720
do_renameat2+0x596/0x970
__x64_sys_rename+0x78/0x90
do_syscall_64+0x3b/0x90
[1] https://bugzilla.kernel.org/show_bug.cgi?id=215895
The bug is due to fuzzed inode has both inline_data and encrypted flags.
During f2fs_evict_inode(), as the inode was deleted by rename(), it
will cause inline data conversion due to conflicting flags. The page
cache will be polluted and the panic will be triggered in clear_inode().
Try fixing the bug by doing more sanity checks for inline data inode in
sanity_check_inode().
Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to fix permission consistency issue as all other
mainline filesystems.
Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files. This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero, collapse, insert range). Because the call can be used to change
file contents, we should treat it like we do any other modification to a
file -- update the mtime, and drop set[ug]id privileges/capabilities.
The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.
Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It's nonsensical to register a provided buffer ring, if a classic
provided buffer group with the same ID exists. Depending on the order of
which we decide what type to pick, the other type will never get used.
Explicitly disallow it and return an error if this is attempted.
Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We use ->buf_pages != 0 to tell if this is a shared buffer ring or a
classic provided buffer group. If we unregister the shared ring and
then attempt to use it, buf_pages is zero yet the classic list head
isn't properly initialized. This causes io_buffer_select() to think
that we have classic buffers available, but then we crash when we try
and get one from the list.
Just initialize the list if we unregister a shared buffer ring, leaving
it in a sane state for either re-registration or for attempting to use
it. And do the same for the initial setup from the classic path.
Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The parameter desc_size in fsverity_create_info() is useless and it is
not referenced anywhere. The greatest meaning of desc_size here is to
indecate the size of struct fsverity_descriptor and futher calculate the
size of signature. However, the desc->sig_size can do it also and it is
indeed, so remove it.
Therefore, it is no need to acquire desc_size by fsverity_get_descriptor()
in ensure_verity_info(), so remove the parameter desc_ret in
fsverity_get_descriptor() too.
Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220518132256.2297655-1-chris.zjh@huawei.com
If processing the on-disk mount options fails after any memory was
allocated in the ext4_fs_context, e.g. s_qf_names, then this memory is
leaked. Fix this by calling ext4_fc_free() instead of kfree() directly.
Reproducer:
mkfs.ext4 -F /dev/vdc
tune2fs /dev/vdc -E mount_opts=usrjquota=file
echo clear > /sys/kernel/debug/kmemleak
mount /dev/vdc /vdc
echo scan > /sys/kernel/debug/kmemleak
sleep 5
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak
Fixes: 7edfd85b1ffd ("ext4: Completely separate options parsing and sb setup")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220513231605.175121-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The 'commit' option is only applicable for ext3 and ext4 filesystems,
and has never been accepted by the ext2 filesystem driver, so the ext4
driver shouldn't allow it on ext2 filesystems.
This fixes a failure in xfstest ext4/053.
Fixes: 8dc0aa8cf0f7 ("ext4: check incompatible mount options while mounting ext2/3")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Link: https://lore.kernel.org/r/20220510183232.172615-1-ebiggers@kernel.org
Fix following includecheck warning:
./fs/ext4/inode.c: linux/dax.h is included more than once.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220504225025.44753-1-yang.lee@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When I wrote the freevxfs driver I had some odd choice of licensing
statements, the options are either GPL (without version) or an odd
BSD-ish licensense with advertising clause.
The GPL vs always meant to be the same as the kernel, that is version
2 only, and the odd BSD-ish license doesn't make much sense. Add
a GPL2.0-only SPDX tag to make the GPL intentions clear and drop the
bogus BSD license.
Acked-by: Krzysztof Błaszkowski <kb@sysmikro.com.pl>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The logic for handling events on child in groups that have a mark on
the parent inode, but without FS_EVENT_ON_CHILD flag in the mask is
duplicated in several places and inconsistent.
Move the logic into the preparation of mark type iterator, so that the
parent mark type will be excluded from all mark type iterations in that
case.
This results in several subtle changes of behavior, hopefully all
desired changes of behavior, for example:
- Group A has a mount mark with FS_MODIFY in mask
- Group A has a mark with ignore mask that does not survive FS_MODIFY
and does not watch children on directory D.
- Group B has a mark with FS_MODIFY in mask that does watch children
on directory D.
- FS_MODIFY event on file D/foo should not clear the ignore mask of
group A, but before this change it does
And if group A ignore mask was set to survive FS_MODIFY:
- FS_MODIFY event on file D/foo should be reported to group A on account
of the mount mark, but before this change it is wrongly ignored
Fixes: 2f02fd3fa13e ("fanotify: fix ignore mask logic for events on child and on dir")
Reported-by: Jan Kara <jack@suse.com>
Link: https://lore.kernel.org/linux-fsdevel/20220314113337.j7slrb5srxukztje@quack3.lan/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220511190213.831646-3-amir73il@gmail.com
fsnotify_foreach_iter_mark_type() is used to reduce boilerplate code
of iterating all marks of a specific group interested in an event
by consulting the iterator report_mask.
Use an open coded version of that iterator in fsnotify_iter_next()
that collects all marks of the current iteration group without
consulting the iterator report_mask.
At the moment, the two iterator variants are the same, but this
decoupling will allow us to exclude some of the group's marks from
reporting the event, for example for event on child and inode marks
on parent did not request to watch events on children.
Fixes: 2f02fd3fa13e ("fanotify: fix ignore mask logic for events on child and on dir")
Reported-by: Jan Kara <jack@suse.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220511190213.831646-2-amir73il@gmail.com
POLL* are unannotated values for the userspace ABI, while everything
in-kernel should use EPOLL* and the __poll_t type.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_file_get_normal isn't marked inline, so don't claim it as such in the
forward declaration.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ERR_PTR abuses the high bits of a pointer to transport error information.
This is only safe for kernel pointers and not user pointers. Fix
io_buffer_select and its helpers to just return NULL for failure and get
rid of this abuse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Provided buffers allow an application to supply io_uring with buffers
that can then be grabbed for a read/receive request, when the data
source is ready to deliver data. The existing scheme relies on using
IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use
in real world applications. It's pretty efficient if the application
is able to supply back batches of provided buffers when they have been
consumed and the application is ready to recycle them, but if
fragmentation occurs in the buffer space, it can become difficult to
supply enough buffers at the time. This hurts efficiency.
Add a register op, IORING_REGISTER_PBUF_RING, which allows an application
to setup a shared queue for each buffer group of provided buffers. The
application can then supply buffers simply by adding them to this ring,
and the kernel can consume then just as easily. The ring shares the head
with the application, the tail remains private in the kernel.
Provided buffers setup with IORING_REGISTER_PBUF_RING cannot use
IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the
ring, they must use the mapped ring. Mapped provided buffer rings can
co-exist with normal provided buffers, just not within the same group ID.
To gauge overhead of the existing scheme and evaluate the mapped ring
approach, a simple NOP benchmark was written. It uses a ring of 128
entries, and submits/completes 32 at the time. 'Replenish' is how
many buffers are provided back at the time after they have been
consumed:
Test Replenish NOPs/sec
================================================================
No provided buffers NA ~30M
Provided buffers 32 ~16M
Provided buffers 1 ~10M
Ring buffers 32 ~27M
Ring buffers 1 ~27M
The ring mapped buffers perform almost as well as not using provided
buffers at all, and they don't care if you provided 1 or more back at
the same time. This means application can just replenish as they go,
rather than need to batch and compact, further reducing overhead in the
application. The NOP benchmark above doesn't need to do any compaction,
so that overhead isn't even reflected in the above test.
Co-developed-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Abstract this out from io_sqe_buffer_register() so we can use it
elsewhere too without duplicating this code.
No intended functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Obviously not really useful since it's not transferring data, but it
is helpful in benchmarking overhead of provided buffers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_provided_buffer_select() must drop the submit lock, if needed, even
in the error handling case. Failure to do so will leave us with the
ctx->uring_lock held, causing spew like:
====================================
WARNING: iou-wrk-366/368 still has locks held!
5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted
------------------------------------
1 lock held by iou-wrk-366/368:
#0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48
stack backtrace:
CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace.part.0+0xa4/0xd4
show_stack+0x14/0x5c
dump_stack_lvl+0x88/0xb0
dump_stack+0x14/0x2c
debug_check_no_locks_held+0x84/0x90
try_to_freeze.isra.0+0x18/0x44
get_signal+0x94/0x6ec
io_wqe_worker+0x1d8/0x2b4
ret_from_fork+0x10/0x20
and triggering later hangs off get_signal() because we attempt to
re-grab the lock.
Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com
Fixes: 149c69b04a90 ("io_uring: abstract out provided buffer list selection")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass updated i_size in fscache_unuse_cookie() when called
from nfs_fscache_release_file(), which ensures the size of
an fscache object gets written to the cache storage. Failing
to do so results in unnessary reads from the NFS server, even
when the data is cached, due to a cachefiles object coherency
check failing with a trace similar to the following:
cachefiles_coherency: o=0000000e BAD osiz B=afbb3 c=0
This problem can be reproduced as follows:
#!/bin/bash
v=4.2; NFS_SERVER=127.0.0.1
set -e; trap cleanup EXIT; rc=1
function cleanup {
umount /mnt/nfs > /dev/null 2>&1
RC_STR="TEST PASS"
[ $rc -eq 1 ] && RC_STR="TEST FAIL"
echo "$RC_STR on $(uname -r) with NFSv$v and server $NFS_SERVER"
}
mount -o vers=$v,fsc $NFS_SERVER:/export /mnt/nfs
rm -f /mnt/nfs/file1.bin > /dev/null 2>&1
dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1 > /dev/null 2>&1
echo 3 > /proc/sys/vm/drop_caches
echo Read file 1st time from NFS server into fscache
dd if=/mnt/nfs/file1.bin of=/dev/null > /dev/null 2>&1
umount /mnt/nfs && mount -o vers=$v,fsc $NFS_SERVER:/export /mnt/nfs
echo 3 > /proc/sys/vm/drop_caches
echo Read file 2nd time from fscache
dd if=/mnt/nfs/file1.bin of=/dev/null > /dev/null 2>&1
echo Check mountstats for NFS read
grep -q "READ: 0" /proc/self/mountstats # (1st number) == 0
[ $? -eq 0 ] && rc=0
Fixes: a6b5a28eb56c "nfs: Convert to new fscache volume/cookie API"
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
NFSv4 can lose locks if, for example there is a network partition for
longer than the lease period. When this happens a warning message
NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
is generated, possibly once for each lock (though rate limited).
This is potentially misleading as is can be read as suggesting that lock
reclaim was attempted. However the default behaviour is to not attempt
to recover locks (except due to server report).
This patch changes the reporting to produce at most one message for each
attempt to recover all state from a given server. The message reports
the server name and the number of locks lost if that number is non-zero.
It reports that locks were lost and give no suggestion as to whether
there was an attempt or not.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
There is a race condition in smb2_compound_op:
after_close:
num_rqst++;
if (cfile) {
cifsFileInfo_put(cfile); // sends SMB2_CLOSE to the server
cfile = NULL;
This is triggered by smb2_query_path_info operation that happens during
revalidate_dentry. In smb2_query_path_info, get_readable_path is called to
load the cfile, increasing the reference counter. If in the meantime, this
reference becomes the very last, this call to cifsFileInfo_put(cfile) will
trigger a SMB2_CLOSE request sent to the server just before sending this compound
request – and so then the compound request fails either with EBADF/EIO depending
on the timing at the server, because the handle is already closed.
In the first scenario, the race seems to be happening between smb2_query_path_info
triggered by the rename operation, and between “cleanup” of asynchronous writes – while
fsync(fd) likely waits for the asynchronous writes to complete, releasing the writeback
structures can happen after the close(fd) call. So the EBADF/EIO errors will pop up if
the timing is such that:
1) There are still outstanding references after close(fd) in the writeback structures
2) smb2_query_path_info successfully fetches the cfile, increasing the refcounter by 1
3) All writeback structures release the same cfile, reducing refcounter to 1
4) smb2_compound_op is called with that cfile
In the second scenario, the race seems to be similar – here open triggers the
smb2_query_path_info operation, and if all other threads in the meantime decrease the
refcounter to 1 similarly to the first scenario, again SMB2_CLOSE will be sent to the
server just before issuing the compound request. This case is harder to reproduce.
See https://bugzilla.samba.org/show_bug.cgi?id=15051
Cc: stable@vger.kernel.org
Fixes: 8de9e86c67ba ("cifs: create a helper to find a writeable handle by path name")
Signed-off-by: Ondrej Hubsch <ohubsch@purestorage.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
We gate whether to IOPOLL for a request on whether the opcode is allowed
on a ring setup for IOPOLL and if it's got a file assigned. MSG_RING
is the only one that allows a file yet isn't pollable, it's merely
supported to allow communication on an IOPOLL ring, not because we can
poll for completion of it.
Put the assigned file early and clear it, so we don't attempt to poll
for it.
Reported-by: syzbot+1a0a53300ce782f8b3ad@syzkaller.appspotmail.com
Fixes: 3f1d52abf098 ("io_uring: defer msg-ring file validity check until command issue")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tryng to rename a directory that has all following properties fails with
EINVAL and triggers the 'WARN_ON_ONCE(!fscrypt_has_encryption_key(dir))'
in f2fs_match_ci_name():
- The directory is casefolded
- The directory is encrypted
- The directory's encryption key is not yet set up
- The parent directory is *not* encrypted
The problem is incorrect handling of the lookup of ".." to get the
parent reference to update. fscrypt_setup_filename() treats ".." (and
".") specially, as it's never encrypted. It's passed through as-is, and
setting up the directory's key is not attempted. As the name isn't a
no-key name, f2fs treats it as a "normal" name and attempts a casefolded
comparison. That breaks the assumption of the WARN_ON_ONCE() in
f2fs_match_ci_name() which assumes that for encrypted directories,
casefolded comparisons only happen when the directory's key is set up.
We could just remove this WARN_ON_ONCE(). However, since casefolding is
always a no-op on "." and ".." anyway, let's instead just not casefold
these names. This results in the standard bytewise comparison.
Fixes: 7ad08a58bf67 ("f2fs: Handle casefolding with Encryption")
Cc: <stable@vger.kernel.org> # v5.11+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs_gc uses a bitmap to indicate pinned sections, but when disabling
chckpoint, we call f2fs_gc() with NULL_SEGNO which selects the same dirty
segment as a victim all the time, resulting in checkpoint=disable failure,
for example. Let's pick another one, if we fail to collect it.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hulk Robot reported a BUG_ON:
==================================================================
EXT4-fs error (device loop3): ext4_mb_generate_buddy:805: group 0,
block bitmap and bg descriptor inconsistent: 25 vs 31513 free clusters
kernel BUG at fs/ext4/ext4_jbd2.c:53!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 PID: 25371 Comm: syz-executor.3 Not tainted 5.10.0+ #1
RIP: 0010:ext4_put_nojournal fs/ext4/ext4_jbd2.c:53 [inline]
RIP: 0010:__ext4_journal_stop+0x10e/0x110 fs/ext4/ext4_jbd2.c:116
[...]
Call Trace:
ext4_write_inline_data_end+0x59a/0x730 fs/ext4/inline.c:795
generic_perform_write+0x279/0x3c0 mm/filemap.c:3344
ext4_buffered_write_iter+0x2e3/0x3d0 fs/ext4/file.c:270
ext4_file_write_iter+0x30a/0x11c0 fs/ext4/file.c:520
do_iter_readv_writev+0x339/0x3c0 fs/read_write.c:732
do_iter_write+0x107/0x430 fs/read_write.c:861
vfs_writev fs/read_write.c:934 [inline]
do_pwritev+0x1e5/0x380 fs/read_write.c:1031
[...]
==================================================================
Above issue may happen as follows:
cpu1 cpu2
__________________________|__________________________
do_pwritev
vfs_writev
do_iter_write
ext4_file_write_iter
ext4_buffered_write_iter
generic_perform_write
ext4_da_write_begin
vfs_fallocate
ext4_fallocate
ext4_convert_inline_data
ext4_convert_inline_data_nolock
ext4_destroy_inline_data_nolock
clear EXT4_STATE_MAY_INLINE_DATA
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
ext4_mb_regular_allocator
ext4_mb_good_group_nolock
ext4_mb_init_group
ext4_mb_init_cache
ext4_mb_generate_buddy --> error
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
ext4_restore_inline_data
set EXT4_STATE_MAY_INLINE_DATA
ext4_block_write_begin
ext4_da_write_end
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
ext4_write_inline_data_end
handle=NULL
ext4_journal_stop(handle)
__ext4_journal_stop
ext4_put_nojournal(handle)
ref_cnt = (unsigned long)handle
BUG_ON(ref_cnt == 0) ---> BUG_ON
The lock held by ext4_convert_inline_data is xattr_sem, but the lock
held by generic_perform_write is i_rwsem. Therefore, the two locks can
be concurrent.
To solve above issue, we add inode_lock() for ext4_convert_inline_data().
At the same time, move ext4_convert_inline_data() in front of
ext4_punch_hole(), remove similar handling from ext4_punch_hole().
Fixes: 0c8d414f163f ("ext4: let fallocate handle inline data correctly")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220428134031.4153381-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Symlink's external data block is one kind of metadata block, and now
that almost all ext4 metadata block's page cache (e.g. directory blocks,
quota blocks...) belongs to bdev backing inode except the symlink. It
is essentially worked in data=journal mode like other regular file's
data block because probably in order to make it simple for generic VFS
code handling symlinks or some other historical reasons, but the logic
of creating external data block in ext4_symlink() is complicated. and it
also make things confused if user do not want to let the filesystem
worked in data=journal mode. This patch convert the final exceptional
case and make things clean, move the mapping of the symlink's external
data block to bdev like any other metadata block does.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20220424140936.1898920-3-yi.zhang@huawei.com
Current ext4_getblk() might sleep if some resources are not valid or
could be race with a concurrent extents modifing procedure. So we
cannot call ext4_getblk() and ext4_map_blocks() to get map blocks in
the atomic context in some fast path (e.g. the upcoming procedure of
getting symlink external block in the RCU context), even if the map
extents have already been check and cached.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20220424140936.1898920-2-yi.zhang@huawei.com