IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The helper function nilfs_recovery_copy_block() of
nilfs_recovery_dsync_blocks(), which recovers data from logs created by
data sync writes during a mount after an unclean shutdown, incorrectly
calculates the on-page offset when copying repair data to the file's page
cache. In environments where the block size is smaller than the page
size, this flaw can cause data corruption and leak uninitialized memory
bytes during the recovery process.
Fix these issues by correcting this byte offset calculation on the page.
Link: https://lkml.kernel.org/r/20240124121936.10575-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lock_task_sighand() can trigger a hard lockup. If NR_CPUS threads call
do_task_stat() at the same time and the process has NR_THREADS, it will
spin with irqs disabled O(NR_CPUS * NR_THREADS) time.
Change do_task_stat() to use sig->stats_lock to gather the statistics
outside of ->siglock protected section, in the likely case this code will
run lockless.
Link: https://lkml.kernel.org/r/20240123153357.GA21857@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "fs/proc: do_task_stat: use sig->stats_".
do_task_stat() has the same problem as getrusage() had before "getrusage:
use sig->stats_lock rather than lock_task_sighand()": a hard lockup. If
NR_CPUS threads call lock_task_sighand() at the same time and the process
has NR_THREADS, spin_lock_irq will spin with irqs disabled O(NR_CPUS *
NR_THREADS) time.
This patch (of 3):
thread_group_cputime() does its own locking, we can safely shift
thread_group_cputime_adjusted() which does another for_each_thread loop
outside of ->siglock protected section.
Not only this removes for_each_thread() from the critical section with
irqs disabled, this removes another case when stats_lock is taken with
siglock held. We want to remove this dependency, then we can change the
users of stats_lock to not disable irqs.
Link: https://lkml.kernel.org/r/20240123153313.GA21832@redhat.com
Link: https://lkml.kernel.org/r/20240123153355.GA21854@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dylan Hatch <dylanbhatch@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For shared memory of type SHM_HUGETLB, hugetlb pages are reserved in
shmget() call. If SHM_NORESERVE flags is specified then the hugetlb pages
are not reserved. However when the shared memory is attached with the
shmat() call the hugetlb pages are getting reserved incorrectly for
SHM_HUGETLB shared memory created with SHM_NORESERVE which is a bug.
-------------------------------
Following test shows the issue.
$cat shmhtb.c
int main()
{
int shmflags = 0660 | IPC_CREAT | SHM_HUGETLB | SHM_NORESERVE;
int shmid;
shmid = shmget(SKEY, SHMSZ, shmflags);
if (shmid < 0)
{
printf("shmat: shmget() failed, %d\n", errno);
return 1;
}
printf("After shmget()\n");
system("cat /proc/meminfo | grep -i hugepages_");
shmat(shmid, NULL, 0);
printf("\nAfter shmat()\n");
system("cat /proc/meminfo | grep -i hugepages_");
shmctl(shmid, IPC_RMID, NULL);
return 0;
}
#sysctl -w vm.nr_hugepages=20
#./shmhtb
After shmget()
HugePages_Total: 20
HugePages_Free: 20
HugePages_Rsvd: 0
HugePages_Surp: 0
After shmat()
HugePages_Total: 20
HugePages_Free: 20
HugePages_Rsvd: 5 <--
HugePages_Surp: 0
--------------------------------
Fix is to ensure that hugetlb pages are not reserved for SHM_HUGETLB shared
memory in the shmat() call.
Link: https://lkml.kernel.org/r/1706040282-12388-1-git-send-email-prakash.sangappa@oracle.com
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ksmbd_iov_pin_rsp_read() doesn't free the provided aux buffer if it
fails. Seems to be the caller's responsibility to clear the buffer in
error case.
Found by Linux Verification Center (linuxtesting.org).
Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound")
Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The ksmbd_extract_sharename() function lacked a complete kernel-doc
comment. This patch adds parameter descriptions and detailed function
behavior to improve code readability and maintainability.
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When we added mount_setattr() I added additional checks compared to the
legacy do_reconfigure_mnt() and do_change_type() helpers used by regular
mount(2). If that mount had a parent then verify that the caller and the
mount namespace the mount is attached to match and if not make sure that
it's an anonymous mount.
The real rootfs falls into neither category. It is neither an anoymous
mount because it is obviously attached to the initial mount namespace
but it also obviously doesn't have a parent mount. So that means legacy
mount(2) allows changing mount properties on the real rootfs but
mount_setattr(2) blocks this. I never thought much about this but of
course someone on this planet of earth changes properties on the real
rootfs as can be seen in [1].
Since util-linux finally switched to the new mount api in 2.39 not so
long ago it also relies on mount_setattr() and that surfaced this issue
when Fedora 39 finally switched to it. Fix this.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2256843
Link: https://lore.kernel.org/r/20240206-vfs-mount-rootfs-v1-1-19b335eee133@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Karel Zak <kzak@redhat.com>
Cc: stable@vger.kernel.org # v5.12+
Signed-off-by: Christian Brauner <brauner@kernel.org>
- Address a deadlock regression in RELEASE_LOCKOWNER
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmXDkGoACgkQM2qzM29m
f5cD1xAAsqKTmrb2ABdFZadYfVl6lKxtNWEp8S9eRS6eBU6bcNr5sxoTi+eflHuB
a58TfUwj8ffVtmd0qGaWI8RpDuAYtxhJU6l3TQZCLheMlKvsP3u6lZDjk8CeYtIK
9bVDZV7MOh9C/p01p/21P2B3OukgzzF8Lz/AFPCxCtoK5nnaDT5F+8pX8yfZU5x0
bM1gBHzB80BoCdzlimN6QB8EfFkOgIF9Apnk53E676KmkOuADV+CIp4aQMsm8l3r
6lJAdsOCuw5BgSDJWh1vGmFKubfhP841QbglDnnZcc+WTBhvEO0PR8Kv0Ugn5Xek
8NkcnOlti+gjH0EKTHx5P4frV0BcWptLjCjVdruOvBszrZtEaX547Mp/d04GGO0U
8gUEhen/RT7l1E2RM4qII9q5nuaekjI2Da2FGZNmK/j66OPFibiOBMRn3u/haTr5
2axrrO9NOnfWcTQX08iKZygEUY7E4h3iImqkNw+c+avZ6SRiH548TRCrKdlBeuia
Pj3QFRfu/9PQnnl9qnROXtAP70AKmX1iSLZ4s9+KqTzQVCW7tLbZpOn9D1hHCJ1q
0JTVql5rSXCl5YpBsDiqr7qvZhVab/2+1X9wHkLPZd5aIBoUmOQOMebZCUxpu+TR
houx3NBGYNixk0khEkHH3+c0HydvnxVU3xClUgLxjHgEdeS7R6s=
=z3oC
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Address a deadlock regression in RELEASE_LOCKOWNER
* tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: don't take fi_lock in nfsd_break_deleg_cb()
The MDS will issue the 'Fr' caps for async dirop, while there is
buggy in kclient and it could miss releasing the async dirop caps,
which is 'Fsxr'. And then the MDS will complain with:
"[WRN] client.xxx isn't responding to mclientcaps(revoke) ..."
So when releasing the dirop async requests or when they fail we
should always make sure that being revoked caps could be released.
Link: https://tracker.ceph.com/issues/50223
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In fs/ceph/caps.c, in encode_cap_msg(), "use after free" error was
caught by KASAN at this line - 'ceph_buffer_get(arg->xattr_buf);'. This
implies before the refcount could be increment here, it was freed.
In same file, in "handle_cap_grant()" refcount is decremented by this
line - 'ceph_buffer_put(ci->i_xattrs.blob);'. It appears that a race
occurred and resource was freed by the latter line before the former
line could increment it.
encode_cap_msg() is called by __send_cap() and __send_cap() is called by
ceph_check_caps() after calling __prep_cap(). __prep_cap() is where
arg->xattr_buf is assigned to ci->i_xattrs.blob. This is the spot where
the refcount must be increased to prevent "use after free" error.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/59259
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The fscrypt code will use i_blkbits to setup ci_data_unit_bits when
allocating the new inode, but ceph will initiate i_blkbits ater when
filling the inode, which is too late. Since ci_data_unit_bits will only
be used by the fscrypt framework so initiating i_blkbits with
CEPH_FSCRYPT_BLOCK_SHIFT is safe.
Link: https://tracker.ceph.com/issues/64035
Fixes: 5b1188847180 ("fscrypt: support crypto data unit size less than filesystem block size")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXDNuAACgkQxWXV+ddt
WDvBGg/9FuCJm/GkBxgeVKNxdF28fIzYkYHjSYzHSo5A5GFMNENHUDfXcSNjZjUM
ZFWHCXENcnNa7pKONPaW5QIIQuecBPqcXK+lPJXlqFlC22CGSVD7MZ7/Fm7uKJ5W
mhGGuq7NTuTN1MYm480WVa+5DkfVbFkPeZgWOVTQ0tXGxTEKU9pXvwmflx8rbmRG
VPhT0iZO/KmkRSp91BwAJxitw8v76WG9JpGemiFcNOISCdE/HENxrxj8rE6beZoc
g0Kx8YQDTlf119bdwlCdJkvRVEzjIEZIUE2g8J0oKzPE6CmY2a+8+Iv/S0nCCc6V
2nFVHdhLnUH5oIuFEoo026tvu3tMKR1K30EAQyFslsjPE74Hye7MAjr8sEvAF7E/
J4Sbn3NIILkKu1Ozn/RqhPh+XsSyU9tXeO1+BcdmrGY9vDGq18lVbruOrde14fqZ
xHFJloXKsJCw7AcMzNfsa6arRQ7YGa8sGudMLpriUemUUn0MK8OdY6zCq20p43ON
8eUigP3WHOdPfCJXNfgqlJdyjmYdHCWvn4wKpPDMQuU5rMyUloOJDqtR6fxVCatO
0Pjg0zVyLu/CF6+vrL6wP4qT9sRj1Jy2YEh8fFe4fWc9+JOmQZYBm/Eyaw4oU0rg
lOmqE1/TEgl0ra9IHvxcgJo5l7zx2dbHAgMEmScCgIwrLpkh14A=
=nMOd
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- two fixes preventing deletion and manual creation of subvolume qgroup
- unify error code returned for unknown send flags
- fix assertion during subvolume creation when anonymous device could
be allocated by other thread (e.g. due to backref walk)
* tag 'for-6.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: do not ASSERT() if the newly created subvolume already got read
btrfs: forbid deleting live subvol qgroup
btrfs: forbid creating subvol qgroups
btrfs: send: return EOPNOTSUPP on unknown flags
commit dfad37051ade ("remap_range: move permission hooks out of
do_clone_file_range()") moved the permission hooks from
do_clone_file_range() out to its caller vfs_clone_file_range(),
but left all the fast sanity checks in do_clone_file_range().
This makes the expensive security hooks be called in situations
that they would not have been called before (e.g. fs does not support
clone).
The only reason for the do_clone_file_range() helper was that overlayfs
did not use to be able to call vfs_clone_file_range() from copy up
context with sb_writers lock held. However, since commit c63e56a4a652
("ovl: do not open/llseek lower file with upper sb_writers held"),
overlayfs just uses an open coded version of vfs_clone_file_range().
Merge_clone_file_range() into vfs_clone_file_range(), restoring the
original order of checks as it was before the regressing commit and adapt
the overlayfs code to call vfs_clone_file_range() before the permission
hooks that were added by commit ca7ab482401c ("ovl: add permission hooks
outside of do_splice_direct()").
Note that in the merge of do_clone_file_range(), the file_start_write()
context was reduced to cover ->remap_file_range() without holding it
over the permission hooks, which was the reason for doing the regressing
commit in the first place.
Reported-and-tested-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401312229.eddeb9a6-oliver.sang@intel.com
Fixes: dfad37051ade ("remap_range: move permission hooks out of do_clone_file_range()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Link: https://lore.kernel.org/r/20240202102258.1582671-1-amir73il@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Two serious ones here that we'll want to backport to stable: a fix for a
race in the thread_with_file code, and another locking fixup in the
subvolume deletion path.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmXBeaYACgkQE6szbY3K
bnb4iA//dByHSjorgEDU/8EZwq6vKSJ5F+M4JqEYTYHP3IG9Fci6dLwCbx/SBFFn
NaYx1mPxSnVCjr1WJEikdUPFW9SAT5Qv/xqqbAKZZ8F6AW1tGM+kK9isV24tpbQx
RYpYZaYObmyHUgd2Fm3hOUltLVkRB2y2wI1SQKbq+31Zk9rYEITIdhe7mXOsD40f
Uo6KJGVATDC8gg6S3bG3ZNG7AUqOx8HErcsrErQCxJpwbbYWzpz+3QbVZNRaID04
AGRrasJcjk6RZ1LM9jasjN+i4mPOrZBS4gyXHg3YcH7+Ft4uGz//afVBU8AcnaX+
zMP12LwFRLZul1Q5lNVcI3t5lJLd2JlWiMssruZcY6tCFmyWBS8DOA5F1PJwy171
8pEcKkNhjgs/LMcpc/GPL09jRjBXYI6yM14nm1HbRE9oMm3HamyZJ0ThSQn5m63x
4wLpjlCfgBI96qPTXLUR4FQzqWe0TNI/pMkMLl0L4wcvqCQSxJ2UwTEjkaDF7hif
G98XPG8G2wgDqWKXD8u/su57E/X1vj893Y11QsMHcp0XwmYRiebsP+kR6sKU8m44
CggyC7CUEhfZM7XmmQFnhOXMMvF9f7BCS4MpvAUSuPzwYWRE8CpUpbBUjQFFGZ9w
h6er30CjOhjkc0JOT416mNvXusO3h9P4x/pyWZ33iQ1fS58ckXU=
=iour
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-02-05' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Two serious ones here that we'll want to backport to stable: a fix for
a race in the thread_with_file code, and another locking fixup in the
subvolume deletion path"
* tag 'bcachefs-2024-02-05' of https://evilpiepirate.org/git/bcachefs:
bcachefs: time_stats: Check for last_event == 0 when updating freq stats
bcachefs: install fd later to avoid race with close
bcachefs: unlock parent dir if entry is not found in subvolume deletion
bcachefs: Fix build on parisc by avoiding __multi3()
A recent change to check_for_locks() changed it to take ->flc_lock while
holding ->fi_lock. This creates a lock inversion (reported by lockdep)
because there is a case where ->fi_lock is taken while holding
->flc_lock.
->flc_lock is held across ->fl_lmops callbacks, and
nfsd_break_deleg_cb() is one of those and does take ->fi_lock. However
it doesn't need to.
Prior to v4.17-rc1~110^2~22 ("nfsd: create a separate lease for each
delegation") nfsd_break_deleg_cb() would walk the ->fi_delegations list
and so needed the lock. Since then it doesn't walk the list and doesn't
need the lock.
Two actions are performed under the lock. One is to call
nfsd_break_one_deleg which calls nfsd4_run_cb(). These doesn't act on
the nfs4_file at all, so don't need the lock.
The other is to set ->fi_had_conflict which is in the nfs4_file.
This field is only ever set here (except when initialised to false)
so there is no possible problem will multiple threads racing when
setting it.
The field is tested twice in nfs4_set_delegation(). The first test does
not hold a lock and is documented as an opportunistic optimisation, so
it doesn't impose any need to hold ->fi_lock while setting
->fi_had_conflict.
The second test in nfs4_set_delegation() *is* make under ->fi_lock, so
removing the locking when ->fi_had_conflict is set could make a change.
The change could only be interesting if ->fi_had_conflict tested as
false even though nfsd_break_one_deleg() ran before ->fi_lock was
unlocked. i.e. while hash_delegation_locked() was running.
As hash_delegation_lock() doesn't interact in any way with nfs4_run_cb()
there can be no importance to this interaction.
So this patch removes the locking from nfsd_break_one_deleg() and moves
the final test on ->fi_had_conflict out of the locked region to make it
clear that locking isn't important to the test. It is still tested
*after* vfs_setlease() has succeeded. This might be significant and as
vfs_setlease() takes ->flc_lock, and nfsd_break_one_deleg() is called
under ->flc_lock this "after" is a true ordering provided by a spinlock.
Fixes: edcf9725150e ("nfsd: fix RELEASE_LOCKOWNER")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Calling fd_install() makes a file reachable for userland, including the
possibility to close the file descriptor, which leads to calling its
'release' hook. If that happens before the code had a chance to bump the
reference of the newly created task struct, the release callback will
call put_task_struct() too early, leading to the premature destruction
of the kernel thread.
Avoid that race by calling fd_install() later, after all the setup is
done.
Fixes: 1c6fdbd8f246 ("bcachefs: Initial commit")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
and extent handling code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmW/G4YACgkQ8vlZVpUN
gaPTpwf/c/Fk27GV8ge9PQtR6gmir/lyw2qkvK3Z+12aEsblZRmyvElyZWjAuNQG
bciQyltabIPOA4XxfsZOrdgYI42n0rTTFG7bmI0lr+BJM/HRw0tGGMN91FZla0FP
EXv/AiHKCqlT5OFZbD+8n1TzfdOgWotjug1VLteXve3YKjkDgt5IQm/0Gx9hKBld
IR8SrQlD/rYe+VPvaHz5G4u09Ne5pUE5fDj3xE23wxfU5cloVzuVRCSOGWUCTnCW
T0v6sHeKrmiLC8tIOZkBjer4nXC0MOu0p5+geAjwOArc9VJ1Lh2eAkH+GgDOVprx
ahdl2qmbIbacBYECIeQ/+1i78+O1yw==
=CmYr
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Miscellaneous bug fixes and cleanups in ext4's multi-block allocator
and extent handling code"
* tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type
ext4: make ext4_map_blocks() distinguish delalloc only extent
ext4: add a hole extent entry in cache after punch
ext4: correct the hole length returned by ext4_map_blocks()
ext4: convert to exclusive lock while inserting delalloc extents
ext4: refactor ext4_da_map_blocks()
ext4: remove 'needed' in trace_ext4_discard_preallocations
ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations
ext4: remove unused return value of ext4_mb_release_group_pa
ext4: remove unused return value of ext4_mb_release_inode_pa
ext4: remove unused return value of ext4_mb_release
ext4: remove unused ext4_allocation_context::ac_groups_considered
ext4: remove unneeded return value of ext4_mb_release_context
ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_*()
ext4: remove unused return value of __mb_check_buddy
ext4: mark the group block bitmap as corrupted before reporting an error
ext4: avoid allocating blocks from corrupted group in ext4_mb_find_by_goal()
ext4: avoid allocating blocks from corrupted group in ext4_mb_try_best_found()
ext4: avoid dividing by 0 in mb_update_avg_fragment_size() when block bitmap corrupt
ext4: avoid bb_free and bb_fragments inconsistency in mb_free_blocks()
...
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmW+oQoACgkQiiy9cAdy
T1E6GwwAk9PQo1N8h6AI0PWCPoiHps8YzpISTgBfchDs+sIvo1sZEVMzTmSlWm4F
cON4nb3QBkD4aqWbCdb1tjby2VAb0aHuRbCQPAczW4+R7He8ELlGYow7IPBWmyI8
CTQRirYrJuIePh1aT2UG4mykNyQzp5i5ycsdcWFiIGliXrTp0De0rvE/60KGLuJ/
lnDZp7+FHytIfpwVzl4bGax4odQh14whxdQme4C9a8kVfYPQGKFyADbuxy0KmWmu
8kkEcGpY/bPdqLE1tGzNoVci9cEdYK6yUzkc8dj2ddsZ7YDHPz0NtsdWlC517qt+
ekDADHRQZiKwyJzMGHibK8E4WoP0+JjB4j6OTc369ICzgjuIutWCCIexbS0IV/po
IuyRQTvLSTfYpE8eoQavAKaty6sDnJcP7FepPtH2k5yGr4+ILZV4ozoH+hqzn2/K
sf/q9ZfkyfaEi2JkMD9TTv7k2EWLMntpxklbpg59VKHY5MGvlQqoRl9b+jzfgKEZ
xS/myr3e
=q5wf
-----END PGP SIGNATURE-----
Merge tag 'v6.8-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
"Five smb3 client fixes, mostly multichannel related:
- four multichannel fixes including fix for channel allocation when
multiple inactive channels, fix for unneeded race in channel
deallocation, correct redundant channel scaling, and redundant
multichannel disabling scenarios
- add warning if max compound requests reached"
* tag 'v6.8-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: increase number of PDUs allowed in a compound request
cifs: failure to add channel on iface should bump up weight
cifs: do not search for channel if server is terminating
cifs: avoid redundant calls to disable multichannel
cifs: make sure that channel scaling is done only once
* Clear XFS_ATTR_INCOMPLETE filter on removing xattr from a node format
attribute fork.
* Remove conditional compilation of realtime geometry validator functions to
prevent confusing error messages from being printed on the console during the
mount operation.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZbo7FAAKCRAH7y4RirJu
9Cn+APsFEbHA8YQpCSxGDM+Xelez9X7wroi6QkyOxRP6Lqq6ogD6A96uuV86TxkQ
Hkse9IAKkFoLmyzohT9u7Bv46M/X4w8=
=Ez8Z
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.8-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Clear XFS_ATTR_INCOMPLETE filter on removing xattr from a node format
attribute fork
- Remove conditional compilation of realtime geometry validator
functions to prevent confusing error messages from being printed on
the console during the mount operation
* tag 'xfs-6.8-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: remove conditional building of rt geometry validator functions
xfs: reset XFS_ATTR_INCOMPLETE filter on node removal
- Fix the return code for ring_buffer_poll_wait()
It was returing a -EINVAL instead of EPOLLERR.
- Zero out the tracefs_inode so that all fields are initialized.
The ti->private could have had stale data, but instead of
just initializing it to NULL, clear out the entire structure
when it is allocated.
- Fix a crash in timerlat
The hrtimer was initialized at read and not open, but is
canceled at close. If the file was opened and never read
the close will pass a NULL pointer to hrtime_cancel().
- Rewrite of eventfs.
Linus wrote a patch series to remove the dentry references in the
eventfs_inode and to use ref counting and more of proper VFS
interfaces to make it work.
- Add warning to put_ei() if ei is not set to free. That means
something is about to free it when it shouldn't.
- Restructure the eventfs_inode to make it more compact, and remove
the unused llist field.
- Remove the fsnotify*() funtions for when the inodes were being created
in the lookup code. It doesn't make sense to notify about creation
just because something is being looked up.
- The inode hard link count was not accurate. It was being updated
when a file was looked up. The inodes of directories were updating
their parent inode hard link count every time the inode was created.
That means if memory reclaim cleaned a stale directory inode and
the inode was lookup up again, it would increment the parent inode
again as well. Al Viro said to just have all eventfs directories
have a hard link count of 1. That tells user space not to trust it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZb1l/RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qk6jAQDmecDOnx+j/Rm5krbX/meVPYXFj2CU
1wO7w1HBzopsBwEA5AjTKm9IGrl/eVG/+jViS165b+sJfwEcblHEFPWcIwo=
=uUzb
-----END PGP SIGNATURE-----
Merge tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing and eventfs fixes from Steven Rostedt:
- Fix the return code for ring_buffer_poll_wait()
It was returing a -EINVAL instead of EPOLLERR.
- Zero out the tracefs_inode so that all fields are initialized.
The ti->private could have had stale data, but instead of just
initializing it to NULL, clear out the entire structure when it is
allocated.
- Fix a crash in timerlat
The hrtimer was initialized at read and not open, but is canceled at
close. If the file was opened and never read the close will pass a
NULL pointer to hrtime_cancel().
- Rewrite of eventfs.
Linus wrote a patch series to remove the dentry references in the
eventfs_inode and to use ref counting and more of proper VFS
interfaces to make it work.
- Add warning to put_ei() if ei is not set to free. That means
something is about to free it when it shouldn't.
- Restructure the eventfs_inode to make it more compact, and remove the
unused llist field.
- Remove the fsnotify*() funtions for when the inodes were being
created in the lookup code. It doesn't make sense to notify about
creation just because something is being looked up.
- The inode hard link count was not accurate.
It was being updated when a file was looked up. The inodes of
directories were updating their parent inode hard link count every
time the inode was created. That means if memory reclaim cleaned a
stale directory inode and the inode was lookup up again, it would
increment the parent inode again as well. Al Viro said to just have
all eventfs directories have a hard link count of 1. That tells user
space not to trust it.
* tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
eventfs: Keep all directory links at 1
eventfs: Remove fsnotify*() functions from lookup()
eventfs: Restructure eventfs_inode structure to be more condensed
eventfs: Warn if an eventfs_inode is freed without is_freed being set
tracing/timerlat: Move hrtimer_init to timerlat_fd open()
eventfs: Get rid of dentry pointers without refcounts
eventfs: Clean up dentry ops and add revalidate function
eventfs: Remove unused d_parent pointer field
tracefs: dentry lookup crapectomy
tracefs: Avoid using the ei->dentry pointer unnecessarily
eventfs: Initialize the tracefs inode properly
tracefs: Zero out the tracefs_inode when allocating it
ring-buffer: Clean ring_buffer_poll_wait() error return
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmW9FuAUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpX2A//ZE1Tb/YHSqiVHvUSH0YFyKc50H24
Q3KqNH7LHRKBBIyIUJxZ2QaYsmmGPQqaB3RDAAQoItYb6mXaJQR0qIAxfKmcaUec
genFrJNuaICIq/U+cQeAypnRSvOhnMfsY9c/jFPCZI72bS8mMDcGC1vvJW+R1N1V
c5VAGuI/W1hkYOaVba3CkQ7sKcV/P+qbCI6dHB7DuCMr8L0foftItz0NIQ7tL9VM
PMRtcUSehgsiDSH7HthvmHvC2bfBXgjgXTde9f5aqMoBkJE/QuCl4W3FCChcAFLg
bxy1Ke6z4SF6N70BiKj+enSBDUVljOZOd/5IZOUPd6BuDcKDK/vKX6Vva5xYjnJ2
iZOBdGFoPVAmevt9iXPehMdoWcEZOZeHzBV82+k9My5wjWdLWp6ZtFQ/zhvt42YN
Z0EWPIKqxN4xSpQFEjczlKA1IWPv6YJTJJBPaCS86rBduU/cMvSwVFGnk3XnES2o
k4fyBF6UUYUe3tCD0E6NfBJCgHiq4V1ZjMMXiXKAUUe6L5zSIucjws27l3VZofFo
abjE/wxR5ad+3T0rt0sx+gFD8aiCzqKYaHgHHYAbofUXiTpEt7gMNhtNGlzUudO9
KpkAiWGHtYUlDP1qIWR8lHTdf8Vj6KpCrcuyPVDIjWb9im6rC+6fh63PE9ddrLfI
CqjwSoFOXRGEjm0=
=Qv1t
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v6.8-rc2-revert' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 revert from Andreas Gruenbacher:
"It turns out that the commit to use GL_NOBLOCK flag for non-blocking
lookups has several issues, and not all of them have a simple fix"
* tag 'gfs2-v6.8-rc2-revert' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
Revert "gfs2: Use GL_NOBLOCK flag for non-blocking lookups"
Commit "gfs2: Use GL_NOBLOCK flag for non-blocking lookups" has several
issues, some of which are non-trivial to fix, so revert it for now:
https://lore.kernel.org/gfs2/20240202050230.GA875515@ZenIV/T/
This reverts commit dd00aaeb343255a8a30de671bd27bde79a47c8e5.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Since ext4_map_blocks() can recognize a delayed allocated only extent,
make ext4_set_iomap() can also recognize it, and remove the useless
separate check in ext4_iomap_begin_report().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add a new map flag EXT4_MAP_DELAYED to indicate the mapping range is a
delayed allocated only (not unwritten) one, and making
ext4_map_blocks() can distinguish it, no longer mixing it with holes.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In order to cache hole extents in the extent status tree and keep the
hole length as long as possible, re-add a hole entry to the cache just
after punching a hole.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In ext4_map_blocks(), if we can't find a range of mapping in the
extents cache, we are calling ext4_ext_map_blocks() to search the real
path and ext4_ext_determine_hole() to determine the hole range. But if
the querying range was partially or completely overlaped by a delalloc
extent, we can't find it in the real extent path, so the returned hole
length could be incorrect.
Fortunately, ext4_ext_put_gap_in_cache() have already handle delalloc
extent, but it searches start from the expanded hole_start, doesn't
start from the querying range, so the delalloc extent found could not be
the one that overlaped the querying range, plus, it also didn't adjust
the hole length. Let's just remove ext4_ext_put_gap_in_cache(), handle
delalloc and insert adjusted hole extent in ext4_ext_determine_hole().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
ext4_da_map_blocks() only hold i_data_sem in shared mode and i_rwsem
when inserting delalloc extents, it could be raced by another querying
path of ext4_map_blocks() without i_rwsem, .e.g buffered read path.
Suppose we buffered read a file containing just a hole, and without any
cached extents tree, then it is raced by another delayed buffered write
to the same area or the near area belongs to the same hole, and the new
delalloc extent could be overwritten to a hole extent.
pread() pwrite()
filemap_read_folio()
ext4_mpage_readpages()
ext4_map_blocks()
down_read(i_data_sem)
ext4_ext_determine_hole()
//find hole
ext4_ext_put_gap_in_cache()
ext4_es_find_extent_range()
//no delalloc extent
ext4_da_map_blocks()
down_read(i_data_sem)
ext4_insert_delayed_block()
//insert delalloc extent
ext4_es_insert_extent()
//overwrite delalloc extent to hole
This race could lead to inconsistent delalloc extents tree and
incorrect reserved space counter. Fix this by converting to hold
i_data_sem in exclusive mode when adding a new delalloc extent in
ext4_da_map_blocks().
Cc: stable@vger.kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Refactor and cleanup ext4_da_map_blocks(), reduce some unnecessary
parameters and branches, no logic changes.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240127015825.1608160-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
- Fix BUG in iov_iter_revert reported from syzbot.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmW7jWYWHGxpbmtpbmpl
b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCJvdEACPLj/xOrjTCzOPam6iKaf4NXGF
8LlAtHv9nDwKIXoA3lqMFGkc74cx2JOk/rnuO4l2SE75eWdlTHjo27XKFyDoakcF
kij+1bw7BUy4AwgANc4jIUTFYCx8mbzJF18eTuJdrs2h0+3rFHhcs9PpPmLfjDbZ
BGY6Xhc/a/1/56lA6IMsioLHBiOn7FCVl+I+joj+I0etMfqKg3mkro2LJ2AaiCjv
kvsiZ8A3HHgHM8X3JEOqqzokhTrbtQ6Thrit1NSH7e/JdjKSZvkAIv81otU6Fect
vLQSz4mfRdfzAhwrrX+cTSLljkgv7OwD9AfeyI2Fhhu0lQjzjIFknTghZsIreQm4
1YnAfKukWJ8AfzVmAMQVbqtWfP3WAs2gHYgoIibm8VIkjHEOmPHnK83cJCqFlq92
OvQZsqZK2u3hXj2L3fl+1sQlSt+bcnpSE/vEDxpEwxVN3YifY9pJoyhP09RjH1GE
/VnSHgX1laVeaNXinwlMKL1uTTVCjuosm4zFzByAQwwka6YCG4nLoHVZ6u2M26Bb
a/tw9RTeGnlQiMy/rb5QdxKTBVCuZM8eMbcaVc5yNnD9P7yRqbRNaI0SnmBjOU7I
V6bNqJUZwlPjQMeFJ0cUB5uaYopioA7lCjA0Hv8Eh+FX6v9g7oJvIXiGFPb8o4Pq
adUVsyGvxbVBabdAjg==
=CryD
-----END PGP SIGNATURE-----
Merge tag 'exfat-for-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat fix from Namjae Jeon:
- Fix BUG in iov_iter_revert reported from syzbot
* tag 'exfat-for-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: fix zero the unwritten part for dio read
With the introduction of SMB2_OP_QUERY_WSL_EA, the client may now send
5 commands in a single compound request in order to query xattrs from
potential WSL reparse points, which should be fine as we currently
allow up to 5 PDUs in a single compound request. However, if
encryption is enabled (e.g. 'seal' mount option) or enforced by the
server, current MAX_COMPOUND(5) won't be enough as we require an extra
PDU for the transform header.
Fix this by increasing MAX_COMPOUND to 7 and, while we're at it, add
an WARN_ON_ONCE() and return -EIO instead of -ENOMEM in case we
attempt to send a compound request that couldn't include the extra
transform header.
Signed-off-by: Paulo Alcantara <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
After the interface selection policy change to do a weighted
round robin, each iface maintains a weight_fulfilled. When the
weight_fulfilled reaches the total weight for the iface, we know
that the weights can be reset and ifaces can be allocated from
scratch again.
During channel allocation failures on a particular channel,
weight_fulfilled is not incremented. If a few interfaces are
inactive, we could end up in a situation where the active
interfaces are all allocated for the total_weight, and inactive
ones are all that remain. This can cause a situation where
no more channels can be allocated further.
This change fixes it by increasing weight_fulfilled, even when
channel allocation failure happens. This could mean that if
there are temporary failures in channel allocation, the iface
weights may not strictly be adhered to. But that's still okay.
Fixes: a6d8fb54a515 ("cifs: distribute channels across interfaces based on speed")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
In order to scale down the channels, the following sequence
of operations happen:
1. server struct is marked for terminate
2. the channel is deallocated in the ses->chans array
3. at a later point the cifsd thread actually terminates the server
Between 2 and 3, there can be calls to find the channel for
a server struct. When that happens, there can be an ugly warning
that's logged. But this is expected.
So this change does two things:
1. in cifs_ses_get_chan_index, if server->terminate is set, return
2. always make sure server->terminate is set with chan_lock held
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When the server reports query network interface info call
as unsupported following a tree connect, it means that
multichannel is unsupported, even if the server capabilities
report otherwise.
When this happens, cifs_chan_skip_or_disable is called to
disable multichannel on the client. However, we only need
to call this when multichannel is currently setup.
Fixes: f591062bdbf4 ("cifs: handle servers that still advertise multichannel after disabling")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The directory link count in eventfs was somewhat bogus. It was only being
updated when a directory child was being looked up and not on creation.
One solution would be to update in get_attr() the link count by iterating
the ei->children list and then adding 2. But that could slow down simple
stat() calls, especially if it's done on all directories in eventfs.
Another solution would be to add a parent pointer to the eventfs_inode
and keep track of the number of sub directories it has on creation. But
this adds overhead for something not really worthwhile.
The solution decided upon is to keep all directory links in eventfs as 1.
This tells user space not to rely on the hard links of directories. Which
in this case it shouldn't.
Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/
Link: https://lore.kernel.org/linux-trace-kernel/20240201161617.339968298@goodmis.org
Cc: stable@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Fixes: c1504e510238 ("eventfs: Implement eventfs dir creation functions")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The dentries and inodes are created when referenced in the lookup code.
There's no reason to call fsnotify_*() functions when they are created by
a reference. It doesn't make any sense.
Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/
Link: https://lore.kernel.org/linux-trace-kernel/20240201161617.166973329@goodmis.org
Cc: stable@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Fixes: a376007917776 ("eventfs: Implement functions to create files and dirs when accessed");
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Some of the eventfs_inode structure has holes in it. Rework the structure
to be a bit more condensed, and also remove the no longer used llist
field.
Link: https://lore.kernel.org/linux-trace-kernel/20240201161617.002321438@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
There should never be a case where an evenfs_inode is being freed without
is_freed being set. Add a WARN_ON_ONCE() if it ever happens. That would
mean there was one too many put_ei()s.
Link: https://lore.kernel.org/linux-trace-kernel/20240201161616.843551963@goodmis.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The eventfs inode had pointers to dentries (and child dentries) without
actually holding a refcount on said pointer. That is fundamentally
broken, and while eventfs tried to then maintain coherence with dentries
going away by hooking into the '.d_iput' callback, that doesn't actually
work since it's not ordered wrt lookups.
There were two reasonms why eventfs tried to keep a pointer to a dentry:
- the creation of a 'events' directory would actually have a stable
dentry pointer that it created with tracefs_start_creating().
And it needed that dentry when tearing it all down again in
eventfs_remove_events_dir().
This use is actually ok, because the special top-level events
directory dentries are actually stable, not just a temporary cache of
the eventfs data structures.
- the 'eventfs_inode' (aka ei) needs to stay around as long as there
are dentries that refer to it.
It then used these dentry pointers as a replacement for doing
reference counting: it would try to make sure that there was only
ever one dentry associated with an event_inode, and keep a child
dentry array around to see which dentries might still refer to the
parent ei.
This gets rid of the invalid dentry pointer use, and renames the one
valid case to a different name to make it clear that it's not just any
random dentry.
The magic child dentry array that is kind of a "reverse reference list"
is simply replaced by having child dentries take a ref to the ei. As
does the directory dentries. That makes the broken use case go away.
Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240131185513.280463000@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: c1504e510238 ("eventfs: Implement eventfs dir creation functions")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In order for the dentries to stay up-to-date with the eventfs changes,
just add a 'd_revalidate' function that checks the 'is_freed' bit.
Also, clean up the dentry release to actually use d_release() rather
than the slightly odd d_iput() function. We don't care about the inode,
all we want to do is to get rid of the refcount to the eventfs data
added by dentry->d_fsdata.
It would probably be cleaner to make eventfs its own filesystem, or at
least set its own dentry ops when looking up eventfs files. But as it
is, only eventfs dentries use d_fsdata, so we don't really need to split
these things up by use.
Another thing that might be worth doing is to make all eventfs lookups
mark their dentries as not worth caching. We could do that with
d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.
As it is, the dentries are all freeable, but they only tend to get freed
at memory pressure rather than more proactively. But that's a separate
issue.
Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240131185513.124644253@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: c1504e510238 ("eventfs: Implement eventfs dir creation functions")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The dentry lookup for eventfs files was very broken, and had lots of
signs of the old situation where the filesystem names were all created
statically in the dentry tree, rather than being looked up dynamically
based on the eventfs data structures.
You could see it in the naming - how it claimed to "create" dentries
rather than just look up the dentries that were given it.
You could see it in various nonsensical and very incorrect operations,
like using "simple_lookup()" on the dentries that were passed in, which
only results in those dentries becoming negative dentries. Which meant
that any other lookup would possibly return ENOENT if it saw that
negative dentry before the data was then later filled in.
You could see it in the immense amount of nonsensical code that didn't
actually just do lookups.
Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240131233227.73db55e1@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: c1504e510238 ("eventfs: Implement eventfs dir creation functions")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Following a successful cifs_tree_connect, we have the code
to scale up/down the number of channels in the session.
However, it is not protected by a lock today.
As a result, this code can be executed by several processes
that select the same channel. The core functions handle this
well, as they pick chan_lock. However, we've seen cases where
smb2_reconnect throws some warnings.
To fix that, this change introduces a flags bitmap inside the
cifs_ses structure. A new flag type is used to ensure that
only one process enters this section at any time.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The eventfs_find_events() code tries to walk up the tree to find the
event directory that a dentry belongs to, in order to then find the
eventfs inode that is associated with that event directory.
However, it uses an odd combination of walking the dentry parent,
looking up the eventfs inode associated with that, and then looking up
the dentry from there. Repeat.
But the code shouldn't have back-pointers to dentries in the first
place, and it should just walk the dentry parenthood chain directly.
Similarly, 'set_top_events_ownership()' looks up the dentry from the
eventfs inode, but the only reason it wants a dentry is to look up the
superblock in order to look up the root dentry.
But it already has the real filesystem inode, which has that same
superblock pointer. So just pass in the superblock pointer using the
information that's already there, instead of looking up extraneous data
that is irrelevant.
Link: https://lore.kernel.org/linux-trace-kernel/202401291043.e62e89dc-oliver.sang@intel.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240131185512.638645365@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: c1504e510238 ("eventfs: Implement eventfs dir creation functions")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The tracefs-specific fields in the inode were not initialized before the
inode was exposed to others through the dentry with 'd_instantiate()'.
Move the field initializations up to before the d_instantiate.
Link: https://lore.kernel.org/linux-trace-kernel/20240131185512.478449628@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401291043.e62e89dc-oliver.sang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
eventfs uses the tracefs_inode and assumes that it's already initialized
to zero. That is, it doesn't set fields to zero (like ti->private) after
getting its tracefs_inode. This causes bugs due to stale values.
Just initialize the entire structure to zero on allocation so there isn't
any more surprises.
This is a partial fix to access to ti->private. The assignment still needs
to be made before the dentry is instantiated.
Link: https://lore.kernel.org/linux-trace-kernel/20240131185512.315825944@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202401291043.e62e89dc-oliver.sang@intel.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
[BUG]
There is a syzbot crash, triggered by the ASSERT() during subvolume
creation:
assertion failed: !anon_dev, in fs/btrfs/disk-io.c:1319
------------[ cut here ]------------
kernel BUG at fs/btrfs/disk-io.c:1319!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:btrfs_get_root_ref.part.0+0x9aa/0xa60
<TASK>
btrfs_get_new_fs_root+0xd3/0xf0
create_subvol+0xd02/0x1650
btrfs_mksubvol+0xe95/0x12b0
__btrfs_ioctl_snap_create+0x2f9/0x4f0
btrfs_ioctl_snap_create+0x16b/0x200
btrfs_ioctl+0x35f0/0x5cf0
__x64_sys_ioctl+0x19d/0x210
do_syscall_64+0x3f/0xe0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
---[ end trace 0000000000000000 ]---
[CAUSE]
During create_subvol(), after inserting root item for the newly created
subvolume, we would trigger btrfs_get_new_fs_root() to get the
btrfs_root of that subvolume.
The idea here is, we have preallocated an anonymous device number for
the subvolume, thus we can assign it to the new subvolume.
But there is really nothing preventing things like backref walk to read
the new subvolume.
If that happens before we call btrfs_get_new_fs_root(), the subvolume
would be read out, with a new anonymous device number assigned already.
In that case, we would trigger ASSERT(), as we really expect no one to
read out that subvolume (which is not yet accessible from the fs).
But things like backref walk is still possible to trigger the read on
the subvolume.
Thus our assumption on the ASSERT() is not correct in the first place.
[FIX]
Fix it by removing the ASSERT(), and just free the @anon_dev, reset it
to 0, and continue.
If the subvolume tree is read out by something else, it should have
already get a new anon_dev assigned thus we only need to free the
preallocated one.
Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
Fixes: 2dfb1e43f57d ("btrfs: preallocate anon block device at first phase of snapshot creation")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If a subvolume still exists, forbid deleting its qgroup 0/subvolid.
This behavior generally leads to incorrect behavior in squotas and
doesn't have a legitimate purpose.
Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Creating a qgroup 0/subvolid leads to various races and it isn't
helpful, because you can't specify a subvol id when creating a subvol,
so you can't be sure it will be the right one. Any requirements on the
automatic subvol can be gratified by using a higher level qgroup and the
inheritance parameters of subvol creation.
Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When some ioctl flags are checked we return EOPNOTSUPP, like for
BTRFS_SCRUB_SUPPORTED_FLAGS, BTRFS_SUBVOL_CREATE_ARGS_MASK or fallocate
modes. The EINVAL is supposed to be for a supported but invalid
values or combination of options. Fix that when checking send flags so
it's consistent with the rest.
CC: stable@vger.kernel.org # 4.14+
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5rryOLzp3EKq8RTbjMHMHeaJubfpsVLF6H4qJnKCUR1w@mail.gmail.com/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>