IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Before this patch, if function __gfs2_ail_flush detected an error
syncing the ail list, it call gfs2_ail_error which called gfs2_withdraw.
Since __gfs2_ail_flush deals with a specific glock, we shouldn't withdraw
immediately because the withdraw code (signal_our_withdraw) uses glocks
in its processing.
This patch changes the call from gfs2_withdraw to gfs2_withdraw_delayed
which defers the withdraw until a more appropriate context, such as the
logd daemon, discovers the intent to withdraw.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In the gfs2 withdraw sequence, the dlm protocol is unmounted with a call
to lm_unmount. After a withdraw, users are allowed to unmount the
withdrawn file system. But at that point we may still have glocks left
over that we need to free via unmount's call to gfs2_gl_hash_clear.
These glocks may have never been completed because of whatever problem
caused the withdraw (IO errors or whatever).
Before this patch, function gdlm_put_lock would still try to call into
dlm to unlock these leftover glocks, which resulted in dlm returning
-EINVAL because the lock space was abandoned. These glocks were never
freed because there was no mechanism after that to free them.
This patch adds a check to gdlm_put_lock to see if the locking protocol
was inactive (DFL_UNMOUNT flag) and if so, free the glock and not
make the invalid call into dlm.
I could have combined this "if" with the one that follows, related to
leftover glock LVBs, but I felt the code was more readable with its own
if clause.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
When gfs2 withdraws a file system, it calls signal_our_withdraw which
triggers another node to replay the withdrawing node's journal. Then it
waits until it knows the journal has been replayed. Part of this wait is
to repeatedly call check_journal_clean which calls gfs2_jdesc_check,
which checks to see if the journal is sane. As part of its sanity checks
it needs to re-read its journal's metadata. But with today's code, any
attempt to re-read the metadata results in -EIO because of a check for
the file system withdraw in function gfs2_meta_wait.
This patch adds an additional check for SDF_WITHDRAW_IN_PROG, to tell
if the read is done while the withdraw is in progress. In that case
we allow the metadata read to not be rejected. Therefore the metadata
check is done properly, so the withdraw sequence can finish normally.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, journal inodes were considered regular inodes,
which meant that instead of evicting them, function iput_final would
just put them on the lru for later processing. If the file system
withdrew for whatever reason, the withdraw would never be seen until
the inode was evicted, which could be indefinitely.
This patch marks all journal inodes as "don't cache" which means
function iput_final will evict them immediately, allowing us to
properly recover the journal on other cluster nodes.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Today, gfs2_drop_inode can return "false" for an int value.
I'm sure this was just an oversight. Change to int value.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, withdraws could cause an error that looked like:
Journal recovery skipped for 0 until next mount.
This patch changes it to a more readable:
Journal recovery skipped for jid 0 until next mount.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Before this patch, several functions in gfs2 related to the updating
of the statfs file used a newly acquired/read buffer_head for the
local statfs file. This is completely unnecessary, because other nodes
should never update it. Recreating the buffer is a waste of time.
This patch allows gfs2 to read in the local statefs buffer_head at
mount time and keep it around until unmount time.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Patch 96b1454f2e ("gfs2: move freeze glock outside the make_fs_rw and _ro
functions") changed the gfs2 mount sequence so that it holds the freeze
lock before calling gfs2_make_fs_rw. Before this patch, gfs2_make_fs_rw
called init_threads to initialize the quotad and logd threads. That is a
problem if the system needs to withdraw due to IO errors early in the
mount sequence, for example, while initializing the system statfs inode:
1. An IO error causes the statfs glock to not sync properly after
recovery, and leaves items on the ail list.
2. The leftover items on the ail list causes its do_xmote call to fail,
which makes it want to withdraw. But since the glock code cannot
withdraw (because the withdraw sequence uses glocks) it relies upon
the logd daemon to initiate the withdraw.
3. The withdraw can never be performed by the logd daemon because all
this takes place before the logd daemon is started.
This patch moves function init_threads from super.c to ops_fstype.c
and it changes gfs2_fill_super to start its threads before holding the
freeze lock, and if there's an error, stop its threads after releasing
it. This allows the logd to run unblocked by the freeze lock. Thus,
the logd daemon can perform its withdraw sequence properly.
Fixes: 96b1454f2e ("gfs2: move freeze glock outside the make_fs_rw and _ro functions")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Function gfs2_log_reserve was setting revoke_blks to 0. There's no
need because it calculates it shortly thereafter. This patch removes
the unnecessary set.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch does not change function. It adds variable sdp to clean up
function gfs2_ail_error and make it more readable.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This patch adds some crucial information when journal replay detects a
replay of an obsolete rgrp block. For example, it wasn't printing the
journal id or the generation number played. This just supplements what
is logged in this unusual case.
The function that actually complains about the replaying of an obsolete
rgrp block has been split off to avoid long lines and sparse warnings.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
These aren't actually used by the only instance implementing the methods.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The permission check in gfs2_setattr is an old and outdated version of
may_setattr(). Switch to the updated version.
Fixes fstest generic/079.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We must not call gfs2_consist (which does a file system withdraw) from
the freeze glock's freeze_go_xmote_bh function because the withdraw
will try to use the freeze glock, thus causing a glock recursion error.
This patch changes freeze_go_xmote_bh to call function
gfs2_assert_withdraw_delayed instead of gfs2_consist to avoid recursion.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
In the case where IS_ERR(lsi->si_sc_inode) is true the error exit path
to free_local does not kfree the allocated object lsi leading to a memory
leak. Fix this by kfree'ing lst before taking the error exit path.
Addresses-Coverity: ("Resource leak")
Fixes: 97fd734ba1 ("gfs2: lookup local statfs inodes prior to journal recovery")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmDa9GoUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTr2Wg//VPDf8gD07ixZ9jOujwCB/0q5n403
doTtonF+26VVN0e3aZY5M949cW/65ogKYiZgzC6JlCkqsGg+8Y8xQ2mxKtUPQoY7
lGCU1uQVnoZMWpRzcigH3iGFz+yyW6PFH7lgr+mcyqBHNPc1zPK5oioV3KKQTtZH
XUP5FGZ9A3noty/feMCcicjAfh3iqwBSa7+QqPe9OlZ08AeDbaHatFzSnOPerpEh
eIOzv795q9MVKOghq+JHgELbnX58IrI3FVWDAWS9+KBfgdb23DV/78mmA9rjjP91
UXD5IJ7hYDQiPkKHX6uHrLyQaIkZlXISEAYiEmiNzfVysopl88P1JAV6n6MfrCVx
MX3ZEvOFAe5aSXjUMXPdhgYgf7dg1gV5L6zmDSTngQqegGxpp8tSXiUY1f5U/JU3
hVSwN0H1xOd6w9GbZM8GJq316gM8DHZGS/0UyZjHNYrH1rt4m7C0K0aukbelJ0EM
YdYceC6zONSk+ltM90Cvfy0WPO7nVm9hrG8nJcyxzyuIptrKqa9BCzLV7FvlObVE
Ia3MQf0j7wF+33Ro3VcyQTV2xadqFIzeJ2yLxJZmJ8pf/Pyu8pB0/+8Ug3Xrve+w
i1Q4y4LikJNNhyhBYt2Qev+myIx7VE8WoQ7tFwOmMHhKs9DbTP4WpyJL/x7Wy9wg
34KFHHVp+jIeUwI=
=47Ww
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
"Various minor gfs2 cleanups and fixes"
* tag 'gfs2-v5.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Clean up gfs2_unstuff_dinode
gfs2: Unstuff before locking page in gfs2_page_mkwrite
gfs2: Clean up the error handling in gfs2_page_mkwrite
gfs2: Fix error handling in init_statfs
gfs2: Fix underflow in gfs2_page_mkwrite
gfs2: Use list_move_tail instead of list_del/list_add_tail
gfs2: Fix do_gfs2_set_flags description
The only difference between iomap_set_page_dirty() and
__set_page_dirty_nobuffers() is that the latter includes a debugging check
that a !Uptodate page has private data.
Link: https://lkml.kernel.org/r/20210615162342.1669332-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the CONFIG_BLOCK default to __set_page_dirty_buffers and just wire
that method up for the missing instances.
[hch@lst.de: ecryptfs: add a ->set_page_dirty cludge]
Link: https://lkml.kernel.org/r/20210624125250.536369-1-hch@lst.de
Link: https://lkml.kernel.org/r/20210614061512.3966143-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split __gfs2_unstuff_inode off from gfs2_unstuff_dinode and clean up the
code a little. All remaining callers now pass NULL as the page argument
of gfs2_unstuff_dinode, so remove that argument.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_page_mkwrite, unstuff inodes before locking the page. That
way, we won't have to pass in the locked page to gfs2_unstuff_inode,
and gfs2_unstuff_inode can look up and lock the page itself.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
We're setting an error number so that block_page_mkwrite_return
translates it into the corresponding VM_FAULT_* code in several places,
but this is getting confusing, so set the VM_FAULT_* codes directly
instead. (No change in functionality.)
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
On an error path, init_statfs calls iput(pn) after pn has already been put.
Fix that by setting pn to NULL after the initial iput.
Fixes: 97fd734ba1 ("gfs2: lookup local statfs inodes prior to journal recovery")
Cc: stable@vger.kernel.org # v5.10+
Reported-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
On filesystems with a block size smaller than PAGE_SIZE and non-empty
files smaller then PAGE_SIZE, gfs2_page_mkwrite could end up allocating
excess blocks beyond the end of the file, similar to fallocate. This
doesn't make sense; fix it.
Reported-by: Bob Peterson <rpeterso@redhat.com>
Fixes: 184b4e6085 ("gfs2: Fix end-of-file handling in gfs2_page_mkwrite")
Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Using list_move_tail() instead of list_del() + list_add_tail().
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Commit 88b631cbfb ("gfs2: convert to fileattr") changed the argument list
without updating the description.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This reverts commit b7f55d928e.
As explained by Linus in [*], write faults on a mmap region are reads
from a filesysten point of view, so taking the inode glock exclusively
on write faults is incorrect.
Instead, when a page is marked writable, the .page_mkwrite vm operation
will be called, which is where the exclusive lock taking needs to
happen. I got this wrong because of a broken test case that made me
believe .page_mkwrite isn't getting called when it actually is.
[*] https://lore.kernel.org/lkml/CAHk-=wj8EWr_D65i4oRSj2FTbrc6RdNydNNCGxeabRnwtoU=3Q@mail.gmail.com/
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The GLF_LRU flag is checked under lru_lock in gfs2_glock_remove_from_lru() to
remove the glock from the lru list in __gfs2_glock_put().
On the shrink scan path, the same flag is cleared under lru_lock but because
of cond_resched_lock(&lru_lock) in gfs2_dispose_glock_lru(), progress on the
put side can be made without deleting the glock from the lru list.
Keep GLF_LRU across the race window opened by cond_resched_lock(&lru_lock) to
ensure correct behavior on both sides - clear GLF_LRU after list_del under
lru_lock.
Reported-by: syzbot <syzbot+34ba7ddbf3021981a228@syzkaller.appspotmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a write fault occurs, we need to take the inode glock of the underlying
inode in exclusive mode. Otherwise, there's no guarantee that the dirty page
will be written back to disk.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, the system ail lists were cleaned up if the logd
process withdrew, but on other withdraws, they were not cleaned up.
This included the cleaning up of the revokes as well.
This patch reorganizes things a bit so that all withdraws (not just logd)
clean up the ail lists, including any pending revokes.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, gfs2 would deadlock because of the following
sequence during mount:
mount
gfs2_fill_super
gfs2_make_fs_rw <--- Detects IO error with glock
kthread_stop(sdp->sd_quotad_process);
<--- Blocked waiting for quotad to finish
logd
Detects IO error and the need to withdraw
calls gfs2_withdraw
gfs2_make_fs_ro
kthread_stop(sdp->sd_quotad_process);
<--- Blocked waiting for quotad to finish
gfs2_quotad
gfs2_statfs_sync
gfs2_glock_wait <---- Blocked waiting for statfs glock to be granted
glock_work_func
do_xmote <---Detects IO error, can't release glock: blocked on withdraw
glops->go_inval
glock_blocked_by_withdraw
requeue glock work & exit <--- work requeued, blocked by withdraw
This patch makes a special exception for the statfs system inode glock,
which allows the statfs glock UNLOCK to proceed normally. That allows the
quotad daemon to exit during the withdraw, which allows the logd daemon
to exit during the withdraw, which allows the mount to exit.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, in the unlikely event that gfs2_glock_dq encountered
a withdraw, it would do a wait_on_bit to wait for its journal to be
recovered, but it never released the glock's spin_lock, which caused a
scheduling-while-atomic error.
This patch unlocks the lockref spin_lock before waiting for recovery.
Fixes: 601ef0d52e ("gfs2: Force withdraw to replay journals and wait for it to finish")
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Patch 4a378d8a0d added a new check for I_NEW inodes, but unfortunately
it used the wrong variable, i_flags. This caused GFS2 to withdraw when
gfs2_lookup_by_inum needed to refresh an I_NEW inode. This patch switches
to use the correct variable, i_state.
Fixes: 4a378d8a0d ("gfs2: be careful with inode refresh")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When a direct I/O write falls entirely and falls back to buffered I/O and the
buffered I/O fails, the write failed with return value 0 instead of the error
number reported by the buffered I/O. Fix that.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Patch series "Remove nrexceptional tracking", v2.
We actually use nrexceptional for very little these days. It's a minor
pain to keep in sync with nrpages, but the pain becomes much bigger with
the THP patches because we don't know how many indices a shadow entry
occupies. It's easier to just remove it than keep it accurate.
Also, we save 8 bytes per inode which is nothing to sneeze at; on my
laptop, it would improve shmem_inode_cache from 22 to 23 objects per
16kB, and inode_cache from 26 to 27 objects. Combined, that saves
a megabyte of memory from a combined usage of 25MB for both caches.
Unfortunately, ext4 doesn't cross a magic boundary, so it doesn't save
any memory for ext4.
This patch (of 4):
Instead of checking the two counters (nrpages and nrexceptional), we can
just check whether i_pages is empty.
Link: https://lkml.kernel.org/r/20201026151849.24232-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20201026151849.24232-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Fix some compiler and kernel-doc warnings.
- Various minor cleanups and optimizations.
- Add a new sysfs gfs2 status file with some filesystem wide
information.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmCKiWIUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqXOA//cgEMi+WZ0pQ1m4Z7Yk58ArAGXOW4
L+efdMjk2zoqgixF502tQzaa2ctz6XpukF4oZbM+Jc+yZxrbZ5CLjUIOWc9RH+Id
WQwj5+5GLbMAPnn5ksHUCTK9V+1oAlpgoY4fMtLdKq234Y6xqWj5qBjvtGUTLFAl
ACvy8FUZplFOkaOSBqgh221LT4Oh0Wthe/Elq5qvqwBfdAiE/p1sHSi2FWxktIlU
wV3PKL96rFsnWN8E6jqyJR1RNJ5d5MYA+PDkTHKcoqcXZrzw4mfu2tCh88Bh9wFb
MEyjtLxE09G1+3Li/T/Tb7qbRKWvxEmkLZXaFAjRUp7zYPvM6twKSg8nihcBDtLi
UgvTrc208CYvYj7QpRQ1dU9lEg47A46rB8dgLz+ymlpNNk/G0gqgvWLevMKnBfaX
AkZviI6qm1iNCBd6wWWPUKqR0qrCWqoe9N8F7cWyZBki7dKkoj29Gt1X1SeIQMjd
n8Mkv6Btd39kBt3DydXlCEaREMQYeDrxBJHxur234hEfFLFraFj5tYFjoeSODZdg
Uxsn5X5dgLy/hHjps8YcuBoEgRMR/aKovK5G1FXDcQR5O6UByqtJuJTJBT8jxAld
vLYHqO6vxdgGATaYAkuGLSLrJjwES+7tEXjtrdarswGo55dPitwtOB1NRWuOor/Z
uTnzJbykMdIzEHs=
=VblJ
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Fix some compiler and kernel-doc warnings
- Various minor cleanups and optimizations
- Add a new sysfs gfs2 status file with some filesystem wide
information
* tag 'gfs2-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix fall-through warnings for Clang
gfs2: Fix a number of kernel-doc warnings
gfs2: Make gfs2_setattr_simple static
gfs2: Add new sysfs file for gfs2 status
gfs2: Silence possible null pointer dereference warning
gfs2: Turn gfs2_meta_indirect_buffer into gfs2_meta_buffer
gfs2: Replace gfs2_lblk_to_dblk with gfs2_get_extent
gfs2: Turn gfs2_extent_map into gfs2_{get,alloc}_extent
gfs2: Add new gfs2_iomap_get helper
gfs2: Remove unused variable sb_format
gfs2: Fix dir.c function parameter descriptions
gfs2: Eliminate gh parameter from go_xmote_bh func
gfs2: don't create empty buffers for NO_CREATE
Pull fileattr conversion updates from Miklos Szeredi via Al Viro:
"This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a
separate method.
The interface is reasonably uniform across the filesystems that
support it and gives nice boilerplate removal"
* 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits)
ovl: remove unneeded ioctls
fuse: convert to fileattr
fuse: add internal open/release helpers
fuse: unsigned open flags
fuse: move ioctl to separate source file
vfs: remove unused ioctl helpers
ubifs: convert to fileattr
reiserfs: convert to fileattr
ocfs2: convert to fileattr
nilfs2: convert to fileattr
jfs: convert to fileattr
hfsplus: convert to fileattr
efivars: convert to fileattr
xfs: convert to fileattr
orangefs: convert to fileattr
gfs2: convert to fileattr
f2fs: convert to fileattr
ext4: convert to fileattr
ext2: convert to fileattr
btrfs: convert to fileattr
...
Pull vfs inode type handling updates from Al Viro:
"We should never change the type bits of ->i_mode or the method tables
(->i_op and ->i_fop) of a live inode.
Unfortunately, not all filesystems took care to prevent that"
* 'work.inode-type-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
spufs: fix bogosity in S_ISGID handling
9p: missing chunk of "fs/9p: Don't update file type when updating file attributes"
openpromfs: don't do unlock_new_inode() until the new inode is set up
hostfs_mknod(): don't bother with init_special_inode()
cifs: have cifs_fattr_to_inode() refuse to change type on live inode
cifs: have ->mkdir() handle race with another client sanely
do_cifs_create(): don't set ->i_mode of something we had not created
gfs2: be careful with inode refresh
ocfs2_inode_lock_update(): make sure we don't change the type bits of i_mode
orangefs_inode_is_stale(): i_mode type bits do *not* form a bitmap...
vboxsf: don't allow to change the inode type
afs: Fix updating of i_mode due to 3rd party change
ceph: don't allow type or device number to change on non-I_NEW inodes
ceph: fix up error handling with snapdirs
new helper: inode_wrong_type()
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple
warnings by explicitly adding multiple goto statements instead of just
letting the code fall through to the next case.
Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Building the kernel with W=1 results in a number of kernel-doc warnings
like incorrect function names and parameter descriptions. Fix those,
mostly by adding missing parameter descriptions, removing left-over
descriptions, and demoting some less important kernel-doc comments into
regular comments.
Originally proposed by Lee Jones; improved and combined into a single
patch by Andreas.
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.
Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com
This patch adds a new file: /sys/fs/gfs2/*/status which will report
the status of the file system. Catting this file dumps the current
status of the file system according to various superblock variables.
For example:
Journal Checked: 1
Journal Live: 1
Journal ID: 0
Spectator: 0
Withdrawn: 0
No barriers: 0
No recovery: 0
Demote: 0
No Journal ID: 1
Mounted RO: 0
RO Recovery: 0
Skip DLM Unlock: 0
Force AIL Flush: 0
FS Frozen: 0
Withdrawing: 0
Withdraw In Prog: 0
Remote Withdraw: 0
Withdraw Recovery: 0
sd_log_error: 0
sd_log_flush_lock: 0
sd_log_num_revoke: 0
sd_log_in_flight: 0
sd_log_blks_needed: 0
sd_log_blks_free: 32768
sd_log_flush_head: 0
sd_log_flush_tail: 5384
sd_log_blks_reserved: 0
sd_log_revokes_available: 503
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_rbm_find, rs is always NULL when minext is NULL, so
gfs2_reservation_check_and_update will never be called on a NULL minext.
This isn't innediately obvious though, so also check for a NULL minext
for better code readability.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Instead of only supporting GFS2_METATYPE_DI and GFS2_METATYPE_IN blocks,
make the block type a parameter of gfs2_meta_indirect_buffer and rename
the function to gfs2_meta_buffer.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Convert gfs2_extent_map to iomap and split it into gfs2_get_extent and
gfs2_alloc_extent. Instead of hardcoding the extent size, pass it in
via the extlen parameter.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Rename the current gfs2_iomap_get and gfs2_iomap_alloc functions to __*.
Add a new gfs2_iomap_get helper that doesn't expose struct metapath.
Rename gfs2_iomap_get_alloc to gfs2_iomap_alloc. Use the new helpers
where they make sense.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch simply fixes a bunch of function parameter comments
in dir.c that were reported by the kernel test robot.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The only glock that uses go_xmote_bh glops function is the freeze glock
which uses freeze_go_xmote_bh. It does not use its gh parameter, so
this patch eliminates the unneeded parameter.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_getbuf would create empty buffers when
it was given the NO_CREATE directive from gfs2_journal_wipe. This is a
waste of time: the buffer_head is only used by gfs2_remove_from_journal
to determine if the buffer is pinned (which it won't be if it's newly
created) and if there's an associated bd element (same story).
This patch removes the useless buffer assignment.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, gfs2's freeze function failed to report an error
when the target file system was already frozen as it should (and as
generic vfs function freeze_super does. Similarly, gfs2's thaw function
failed to report an error when trying to thaw a file system that is not
frozen, as vfs function thaw_super does. The errors were checked, but
it always returned a 0 return code.
This patch adds the missing error return codes to gfs2 freeze and thaw.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Interrupting mount with ^C quickly enough can cause the kthread_run()
calls in gfs2's init_threads() to fail and the error path leads to a
deadlock on the s_umount rwsem. The abridged chain of events is:
[mount path]
get_tree_bdev()
sget_fc()
alloc_super()
down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING); [acquired]
gfs2_fill_super()
gfs2_make_fs_rw()
init_threads()
kthread_run()
( Interrupted )
[Error path]
gfs2_gl_hash_clear()
flush_workqueue(glock_workqueue)
wait_for_completion()
[workqueue context]
glock_work_func()
run_queue()
do_xmote()
freeze_go_sync()
freeze_super()
down_write(&sb->s_umount) [deadlock]
In freeze_go_sync() there is a gfs2_withdrawn() check that we can use to
make sure freeze_super() is not called in the error path, so add a
gfs2_withdraw_delayed() call when init_threads() fails.
Ref: https://bugzilla.kernel.org/show_bug.cgi?id=212231
Reported-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
1) gfs2_dinode_in() should *not* touch ->i_rdev on live inodes; even
"zero and immediately reread the same value from dinode" is broken -
have it overlap with ->release() of char device and you can get all
kinds of bogus behaviour.
2) mismatch on inode type on live inodes should be treated as fs
corruption rather than blindly setting ->i_mode.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmBLzKsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpi0ID/9djN1db0OrAjQgWdOQsKwzcPG4fmVRHJAu
Zi8SPRj0ByonWGaPWjiSi297/j00dfYFFIXaB1Pfo4j0wX0IK8bJINl0G8SN6Dag
WYBBrT/5rCQgD8fjQ1XhuzuqLwxwcZfYXAnCAlqABG18nPk532D4dX2CMEasl8F7
XWTTj5PqHDN4bCcriH1GEA5S+2nmoz5YXjNZEDcY3/pQMdyb8Jo9mRfZubkrnRxK
c9fz2LjUz0IRaSb+9PILY5qDLOSIh+vHOIk/3BKW9DoqU/S3kTTr4twqnOclfVPH
VgJM9b+sHveVCztCJ9bnNGkW7HWjUQa8gb/B40NBxKEhw7w/HCjykhhxd+QTUQTM
GJVMRGYWhzuUEuU1M1hArPua0GLmPKSvC0CRgbKRmgPNjshTquZPJnBBFwv2wZKQ
GkrwktdK9ihE1ya4gu20MupST3PIpT3jtc6NAizr6DCy0wJ0Z1X5KYnFdbtS79No
I9qPC8lu3AcZq6NXdBfTO9ngIdiUwi9AfSYj7koS/4dmnVccVJmaj0/NNmVp2Ro3
HtaObanBnTi9v8YHl8WgX6lq5RjuQ204fXmd0No4mHFvgxsl7YaX+JBts7S3A2Nf
PoQLqmulcLmzT3EVuEg279aXw2rbnyWHARbF/5/tIr4JcugtLJhwFnBA5YgFreq9
lSbqgoKSHw==
=qHyO
-----END PGP SIGNATURE-----
Merge tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Mostly just random fixes all over the map.
The only odd-one-out change is finally getting the rename of
BIO_MAX_PAGES to BIO_MAX_VECS done. This should've been done with the
multipage bvec change, but it's been left.
Do it now to avoid hassles around changes piling up for the next merge
window.
Summary:
- NVMe pull request:
- one more quirk (Dmitry Monakhov)
- fix max_zone_append_sectors initialization (Chaitanya Kulkarni)
- nvme-fc reset/create race fix (James Smart)
- fix status code on aborts/resets (Hannes Reinecke)
- fix the CSS check for ZNS namespaces (Chaitanya Kulkarni)
- fix a use after free in a debug printk in nvme-rdma (Lv Yunlong)
- Follow-up NVMe error fix for NULL 'id' (Christoph)
- Fixup for the bd_size_lock being IRQ safe, now that the offending
driver has been dropped (Damien).
- rsxx probe failure error return (Jia-Ju)
- umem probe failure error return (Wei)
- s390/dasd unbind fixes (Stefan)
- blk-cgroup stats summing fix (Xunlei)
- zone reset handling fix (Damien)
- Rename BIO_MAX_PAGES to BIO_MAX_VECS (Christoph)
- Suppress uevent trigger for hidden devices (Daniel)
- Fix handling of discard on busy device (Jan)
- Fix stale cache issue with zone reset (Shin'ichiro)"
* tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block:
nvme: fix the nsid value to print in nvme_validate_or_alloc_ns
block: Discard page cache of zone reset target range
block: Suppress uevent for hidden device when removed
block: rename BIO_MAX_PAGES to BIO_MAX_VECS
nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a
nvme-rdma: Fix a use after free in nvmet_rdma_write_data_done
nvme-core: check ctrl css before setting up zns
nvme-fc: fix racing controller reset and create association
nvme-fc: return NVME_SC_HOST_ABORTED_CMD when a command has been aborted
nvme-fc: set NVME_REQ_CANCELLED in nvme_fc_terminate_exchange()
nvme: add NVME_REQ_CANCELLED flag in nvme_cancel_request()
nvme: simplify error logic in nvme_validate_ns()
nvme: set max_zone_append_sectors nvme_revalidate_zones
block: rsxx: fix error return code of rsxx_pci_probe()
block: Fix REQ_OP_ZONE_RESET_ALL handling
umem: fix error return code in mm_pci_probe()
blk-cgroup: Fix the recursive blkg rwstat
s390/dasd: fix hanging IO request during DASD driver unbind
s390/dasd: fix hanging DASD driver unbind
block: Try to handle busy underlying device on discard
Patch fe3e397668 ("gfs2: Rework the log space allocation logic")
changed gfs2_log_flush to reserve a set of journal blocks in case no
transaction is active. However, gfs2_log_flush also gets called in
cases where we don't have an active journal, for example, for spectator
mounts. In that case, trying to reserve blocks would sleep forever, but
we want gfs2_log_flush to be a no-op instead.
Fixes: fe3e397668 ("gfs2: Rework the log space allocation logic")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function signal_our_withdraw referenced the journal
inode immediately. But corrupt file systems may have some invalid
journals, in which case our attempt to read it in will withdraw and the
resulting signal_our_withdraw would dereference the NULL value.
This patch adds a check to signal_our_withdraw so that if the journal
has not yet been initialized, it simply returns and does the old-style
withdraw.
Thanks, Andy Price, for his analysis.
Reported-by: syzbot+50a8a9cf8127f2c6f5df@syzkaller.appspotmail.com
Fixes: 601ef0d52e ("gfs2: Force withdraw to replay journals and wait for it to finish")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch adds code to function trans_drain to remove drained
bd elements from the ail lists, if queued, before freeing the bd.
If we don't remove the bd from the ail, function ail_drain will
try to reference the bd after it has been freed by trans_drain.
Thanks to Andy Price for his analysis of the problem.
Reported-by: Andy Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
It fixes the following warning detected by coccinelle:
./fs/gfs2/super.c:592:5-10: Unneeded variable: "error". Return "0" on
line 628
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pull misc vfs updates from Al Viro:
"Assorted stuff pile - no common topic here"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
whack-a-mole: don't open-code iminor/imajor
9p: fix misuse of sscanf() in v9fs_stat2inode()
audit_alloc_mark(): don't open-code ERR_CAST()
fs/inode.c: make inode_init_always() initialize i_ino to 0
vfs: don't unnecessarily clone write access for writable fds
* Log space and revoke accounting rework to fix some failed asserts.
* Local resource group glock sharing for better local performance.
* Add support for version 1802 filesystems: trusted xattr support and
'-o rgrplvb' mounts by default.
* Actually synchronize on the inode glock's FREEING bit during withdraw
("gfs2: fix glock confusion in function signal_our_withdraw").
* Fix parallel recovery of multiple journals ("gfs2: keep bios separate
for each journal").
* Various other bug fixes.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmA1TmwUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpDZhAArnFj5AhWMI2+DD5o05EILdgDSpwh
JWYT1pfRqR1OZrs7ZZ7tGZB4H6oytYfJ+4mg9Kk7CE7oJKcBh695IPZoIWv8+BCC
WIgQGJytCFp4tuDNw11HZ0ahgW4zXPyJTt6jidZ5jVkux31JrUS7fVqSsD2vIPqA
iQMcJIH+NLTlYbNt4d5T/ngaoRcx7m18RWkcxf6Y+/DBnnwIe4ZDpZmkWVykuncv
OFSvXK8vKyLWGnvH/MIsywfYeU5rj/0AIu66JhVILQ4v5kGYIigwY3quXP2SoITM
Z0+N5Gj/N4OWSscRS86zyqhnRucrjDkNP2+oGSzJWgtSXE/KplyfInAmQWzhIPRM
n7T0boTp+gOTzGq7ELCzj44KICLG76WgDwaR2bLHuQ2/ppVrHNltZqncP2iwynN6
glfST/eHBUBu1qTYLaOAfkUBlhpKDXu0YPcXX7lH6M0JqyvkRUFfuBAU9dic9D9K
zsxplHGJrZnE9QFWWbS3aOviPlSHaXfkZF0Xv7QCLyuPRhu+e/qfcAoeVhxSd4+e
I0grs/TxM61jyju9SmqnM7P+8qYS55naYH1V+6iNCU5dax8MvdxNZuneBQIa07U+
Y84JPQvTBZDUE0gZ8fUzZtnYS7RqyiG7BL+T4W5Ph7LgxXbgQD7CWerYpg7fBm/j
HEpjKqrS96zfTyk=
=45VG
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Log space and revoke accounting rework to fix some failed asserts.
- Local resource group glock sharing for better local performance.
- Add support for version 1802 filesystems: trusted xattr support and
'-o rgrplvb' mounts by default.
- Actually synchronize on the inode glock's FREEING bit during withdraw
("gfs2: fix glock confusion in function signal_our_withdraw").
- Fix parallel recovery of multiple journals ("gfs2: keep bios separate
for each journal").
- Various other bug fixes.
* tag 'gfs2-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (49 commits)
gfs2: Don't get stuck with I/O plugged in gfs2_ail1_flush
gfs2: Per-revoke accounting in transactions
gfs2: Rework the log space allocation logic
gfs2: Minor calc_reserved cleanup
gfs2: Use resource group glock sharing
gfs2: Allow node-wide exclusive glock sharing
gfs2: Add local resource group locking
gfs2: Add per-reservation reserved block accounting
gfs2: Rename rs_{free -> requested} and rd_{reserved -> requested}
gfs2: Check for active reservation in gfs2_release
gfs2: Don't search for unreserved space twice
gfs2: Only pass reservation down to gfs2_rbm_find
gfs2: Also reflect single-block allocations in rgd->rd_extfail_pt
gfs2: Recursive gfs2_quota_hold in gfs2_iomap_end
gfs2: Add trusted xattr support
gfs2: Enable rgrplvb for sb_fs_format 1802
gfs2: Don't skip dlm unlock if glock has an lvb
gfs2: Lock imbalance on error path in gfs2_recover_one
gfs2: Move function gfs2_ail_empty_tr
gfs2: Get rid of current_tail()
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdfhttps://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
1d7b902e28
In order to support idmapped mounts, filesystems need to be changed
and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
patches to convert individual filesystem are not very large or
complicated overall as can be seen from the included fat, ext4, and
xfs ports. Patches for other filesystems are actively worked on and
will be sent out separately. The xfstestsuite can be used to verify
that port has been done correctly.
The mount_setattr() syscall is motivated independent of the idmapped
mounts patches and it's been around since July 2019. One of the most
valuable features of the new mount api is the ability to perform
mounts based on file descriptors only.
Together with the lookup restrictions available in the openat2()
RESOLVE_* flag namespace which we added in v5.6 this is the first time
we are close to hardened and race-free (e.g. symlinks) mounting and
path resolution.
While userspace has started porting to the new mount api to mount
proper filesystems and create new bind-mounts it is currently not
possible to change mount options of an already existing bind mount in
the new mount api since the mount_setattr() syscall is missing.
With the addition of the mount_setattr() syscall we remove this last
restriction and userspace can now fully port to the new mount api,
covering every use-case the old mount api could. We also add the
crucial ability to recursively change mount options for a whole mount
tree, both removing and adding mount options at the same time. This
syscall has been requested multiple times by various people and
projects.
There is a simple tool available at
https://github.com/brauner/mount-idmapped
that allows to create idmapped mounts so people can play with this
patch series. I'll add support for the regular mount binary should you
decide to pull this in the following weeks:
Here's an example to a simple idmapped mount of another user's home
directory:
u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--"
* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
xfs: support idmapped mounts
ext4: support idmapped mounts
fat: handle idmapped mounts
tests: add mount_setattr() selftests
fs: introduce MOUNT_ATTR_IDMAP
fs: add mount_setattr()
fs: add attr_flags_to_mnt_flags helper
fs: split out functions to hold writers
namespace: only take read lock in do_reconfigure_mnt()
mount: make {lock,unlock}_mount_hash() static
namespace: take lock_mount_hash() directly when changing flags
nfs: do not export idmapped mounts
overlayfs: do not mount on top of idmapped mounts
ecryptfs: do not mount on top of idmapped mounts
ima: handle idmapped mounts
apparmor: handle idmapped mounts
fs: make helpers idmap mount aware
exec: handle idmapped mounts
would_dump: handle idmapped mounts
...
In gfs2_ail1_flush, we're using I/O plugging to give the block layer a
better chance of merging I/O requests. If we're too aggressive here, we
can end up waiting on I/O to complete while still plugged. Fix that in
a way similar to writeback_sb_inodes, except that we can't use
blk_flush_plug because blk_flush_plug_list is not exported.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmAzoWUACgkQnJ2qBz9k
QNnFgQgAlng0JOzeCQvLpwweqFl0FCxYbOsZXC1xDyvfX3TiA6A6oiOR4tx3uhQN
cOQmJXaiMn4oCXjD1j6WZwGfy23yx0XchaoFK9jy2IqodaB/zUjkiWYYqt0G3XIX
ud35mxjLAGS12BCD0c+vHy2RMsUFl5ep+5aBHRHZJJhCcYbl7e5ctXZ3xB1Q0mgI
r639gD8JhH3ICdu9W0NaMvqOrVhJFNmhSGATKL/N96+oKub2x2ycYE4L2OXegxy3
mnFf26LjA8jt7K+KfHloTvkC6D4HVnnvKFvKiIbGKafiWhAE7q57ZO6BPCMajGue
3UHIhWGmwKXRU72+nW6N+089GbcO/g==
=1e+z
-----END PGP SIGNATURE-----
Merge tag 'lazytime_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull lazytime updates from Jan Kara:
"Cleanups of the lazytime handling in the writeback code making rules
for calling ->dirty_inode() filesystem handlers saner"
* tag 'lazytime_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext4: simplify i_state checks in __ext4_update_other_inode_time()
gfs2: don't worry about I_DIRTY_TIME in gfs2_fsync()
fs: improve comments for writeback_single_inode()
fs: drop redundant check from __writeback_single_inode()
fs: clean up __mark_inode_dirty() a bit
fs: pass only I_DIRTY_INODE flags to ->dirty_inode
fs: don't call ->dirty_inode for lazytime timestamp updates
fat: only specify I_DIRTY_TIME when needed in fat_update_time()
fs: only specify I_DIRTY_TIME when needed in generic_update_time()
fs: correctly document the inode dirty flags
In the log, revokes are stored as a revoke descriptor (struct
gfs2_log_descriptor), followed by zero or more additional revoke blocks
(struct gfs2_meta_header). On filesystems with a blocksize of 4k, the
revoke descriptor contains up to 503 revokes, and the metadata blocks
contain up to 509 revokes each. We've so far been reserving space for
revokes in transactions in block granularity, so a lot more space than
necessary was being allocated and then released again.
This patch switches to assigning revokes to transactions individually
instead. Initially, space for the revoke descriptor is reserved and
handed out to transactions. When more revokes than that are reserved,
additional revoke blocks are added. When the log is flushed, the space
for the additional revoke blocks is released, but we keep the space for
the revoke descriptor block allocated.
Transactions may still reserve more revokes than they will actually need
in the end, but now we won't overshoot the target as much, and by only
returning the space for excess revokes at log flush time, we further
reduce the amount of contention between processes.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The current log space allocation logic is hard to understand or extend.
The principle it that when the log is flushed, we may or may not have a
transaction active that has space allocated in the log. To deal with
that, we set aside a magical number of blocks to be used in case we
don't have an active transaction. It isn't clear that the pool will
always be big enough. In addition, we can't return unused log space at
the end of a transaction, so the number of blocks allocated must exactly
match the number of blocks used.
Simplify this as follows:
* When transactions are allocated or merged, always reserve enough
blocks to flush the transaction (err on the safe side).
* In gfs2_log_flush, return any allocated blocks that haven't been used.
* Maintain a pool of spare blocks big enough to do one log flush, as
before.
* In gfs2_log_flush, when we have no active transaction, allocate a
suitable number of blocks. For that, use the spare pool when
called from logd, and leave the pool alone otherwise. This means
that when the log is almost full, logd will still be able to do one
more log flush, which will result in more log space becoming
available.
This will make the log space allocator code easier to work with in
the future.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch takes advantage of the new glock holder sharing feature for
resource groups. We have already introduced local resource group
locking in a previous patch, so competing accesses of local processes
are already under control.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Introduce a new LM_FLAG_NODE_SCOPE glock holder flag: when taking a
glock in LM_ST_EXCLUSIVE (EX) mode and with the LM_FLAG_NODE_SCOPE flag
set, the exclusive lock is shared among all local processes who are
holding the glock in EX mode and have the LM_FLAG_NODE_SCOPE flag set.
From the point of view of other nodes, the lock is still held
exclusively.
A future patch will start using this flag to improve performance with
rgrp sharing.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Prepare for treating resource group glocks as exclusive among nodes but
shared among all tasks running on a node: introduce another layer of
node-specific locking that the local tasks can use to coordinate their
accesses.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add a rs_reserved field to struct gfs2_blkreserv to keep track of the number of
blocks reserved by this particular reservation, and a rd_reserved field to
struct gfs2_rgrpd to keep track of the total number of reserved blocks in the
resource group. Those blocks are exclusively reserved, as opposed to the
rs_requested / rd_requested blocks which are tracked in the reservation tree
(rd_rstree) and which can be stolen if necessary.
When making a reservation with gfs2_inplace_reserve, rs_reserved is set to
somewhere between ap->min_target and ap->target depending on the number of free
blocks in the resource group. When allocating blocks with gfs2_alloc_blocks,
rs_reserved is decremented accordingly. Eventually, any reserved but not
consumed blocks are returned to the resource group by gfs2_inplace_release.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
We keep track of what we've so far been referring to as reservations in
rd_rstree: the nodes in that tree indicate where in a resource group we'd
like to allocate the next couple of blocks for a particular inode. Local
processes take those as hints, but they may still "steal" blocks from those
extents, so when actually allocating a block, we must double check in the
bitmap whether that block is actually still free. Likewise, other cluster
nodes may "steal" such blocks as well.
One of the following patches introduces resource group glock sharing, i.e.,
sharing of an exclusively locked resource group glock among local processes to
speed up allocations. To make that work, we'll need to keep track of how many
blocks we've actually reserved for each inode, so we end up with two different
kinds of reservations.
Distinguish these two kinds by referring to blocks which are reserved but may
still be "stolen" as "requested". This rename also makes it more obvious that
rs_requested and rd_requested are strongly related.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_release, check if the inode has an active reservation to avoid
unnecessary lock taking.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
If gfs2_inplace_reserve has chosen a resource group but it couldn't make a
reservation there, there are too many other reservations in that resource
group. In that case, don't even try to respect existing reservations in
gfs2_alloc_blocks.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Only pass the current reservation down to gfs2_rbm_find rather than the entire
inode; we don't need any of the other information.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pass a non-NULL minext to gfs2_rbm_find even for single-block allocations. In
gfs2_rbm_find, also set rgd->rd_extfail_pt when a single-block allocation
fails in a resource group: there is no reason for treating that case
differently. In gfs2_reservation_check_and_update, only check how many free
blocks we have if more than one block is requested; we already know there's at
least one free block.
In addition, when allocating N blocks fails in gfs2_rbm_find, we need to set
rd_extfail_pt to N - 1 rather than N: rd_extfail_pt defines the biggest
allocation that might still succeed.
Finally, reset rd_extfail_pt when updating the resource group statistics in
update_rgrp_lvb, as we already do in gfs2_rgrp_bh_get.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When starting an iomap write, gfs2_quota_lock_check -> gfs2_quota_lock
-> gfs2_quota_hold is called from gfs2_iomap_begin. At the end of the
write, before unlocking the quotas, punch_hole -> gfs2_quota_hold can be
called again in gfs2_iomap_end, which is incorrect and leads to a failed
assertion. Instead, move the call to gfs2_quota_unlock before the call
to punch_hole to fix that.
Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add support for an additional filesystem version (sb_fs_format = 1802).
When a filesystem with the new version is mounted, the filesystem
supports "trusted.*" xattrs.
In addition, version 1802 filesystems implement a form of forward
compatibility for xattrs: when xattrs with an unknown prefix (ea_type)
are found on a version 1802 filesystem, those attributes are not shown
by listxattr, and they are not accessible by getxattr, setxattr, or
removexattr.
This mechanism might turn out to be what we need in the future, but if
not, we can always bump the filesystem version and break compatibility
instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Price <anprice@redhat.com>
Turn on rgrplvb by default for sb_fs_format > 1801.
Mount options still have to override this so a new args field to
differentiate between 'off' and 'not specified' is added, and the new
default is applied only when it's not specified.
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Patch fb6791d100 was designed to allow gfs2 to unmount quicker by
skipping the step where it tells dlm to unlock glocks in EX with lvbs.
This was done because when gfs2 unmounts a file system, it destroys the
dlm lockspace shortly after it destroys the glocks so it doesn't need to
unlock them all: the unlock is implied when the lockspace is destroyed
by dlm.
However, that patch introduced a use-after-free in dlm: as part of its
normal dlm_recoverd process, it can call ls_recovery to recover dead
locks. In so doing, it can call recover_rsbs which calls recover_lvb for
any mastered rsbs. Func recover_lvb runs through the list of lkbs queued
to the given rsb (if the glock is cached but unlocked, it will still be
queued to the lkb, but in NL--Unlocked--mode) and if it has an lvb,
copies it to the rsb, thus trying to preserve the lkb. However, when
gfs2 skips the dlm unlock step, it frees the glock and its lvb, which
means dlm's function recover_lvb references the now freed lvb pointer,
copying the freed lvb memory to the rsb.
This patch changes the check in gdlm_put_lock so that it calls
dlm_unlock for all glocks that contain an lvb pointer.
Fixes: fb6791d100 ("GFS2: skip dlm_unlock calls in unmount")
Cc: stable@vger.kernel.org # v3.8+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_recover_one, fix a sd_log_flush_lock imbalance when a recovery
pass fails.
Fixes: c9ebc4b737 ("gfs2: allow journal replay to hold sd_log_flush_lock")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Keep the current value of the updated log tail in the super block as
sb_log_flush_tail instead of computing it on the fly. This avoids
unnecessary sd_ail_lock taking and cleans up the code.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Use a tighter bound for the number of blocks required by transactions in
gfs2_trans_begin: in the worst case, we'll have mixed data and metadata,
so we'll need a log desciptor for each type.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Wake up log waiters in gfs2_log_release when log space has actually become
available. This is a much better place for the wakeup than gfs2_logd.
Check if enough log space is immeditely available before anything else. If
there isn't, use io_wait_event to wait instead of open-coding it.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Commit 588bff95c9 added gfs2_write_log_header() and started using it in
clean_journal(), with an additional call to log_flush_wait() at the end of
gfs2_write_log_header() which is unnecessary for clean_journal(). Move
that call out of gfs2_write_log_header() to restore the previous behavior.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Move the read locking of sd_log_flush_lock from gfs2_log_reserve to
gfs2_trans_begin, and its unlocking from gfs2_log_release to
gfs2_trans_end. Use gfs2_log_release in two places in which it was open
coded before.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This counter and the associated wait queue are only used so that
gfs2_make_fs_ro can efficiently wait for all pending log space
allocations to fail after setting the filesystem to read-only. This
comes at the cost of waking up that wait queue very frequently.
Instead, when gfs2_log_reserve fails because the filesystem has become
read-only, Wake up sd_log_waitq. In gfs2_make_fs_ro, set the file
system read-only and then wait until all the log space has been
released. Give up and report the problem after a while. With that,
sd_reserving_log and sd_reserving_log_wait can be removed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Replace the TR_ALLOCED flag by its inverse, TR_ONSTACK: that way, the flag only
needs to be set in the exceptional case of on-stack transactions. Split off
__gfs2_trans_begin from gfs2_trans_begin and use it to replace the open-coded
version in gfs2_ail_empty_gl.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Commit 2e60d7683c ("GFS2: update freeze code to use freeze/thaw_super
on all nodes") optimized away the sb_start_intwrite ... sb_end_intwrite
protection for the on-stack transactions in gfs2_ail_empty_gl with no
explanation. I can't think of a valid reason for doing that, so revert
that change. This simplifies the next commit.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The recovery func can recover multiple journals, but they were all using
the same bio. This resulted in use-after-free related to sdp->sd_log_bio.
This patch moves the variable to the journal descriptor, jd, so that
every recovery can operate on its own bio. And hopefully we never run out.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
If go_free is defined, function signal_our_withdraw is supposed to
synchronize on the GLF_FREEING flag of the inode glock, but it
accidentally does that on the live glock. Fix that and disambiguate
the glock variables.
Fixes: 601ef0d52e ("gfs2: Force withdraw to replay journals and wait for it to finish")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
This reverts commit 428fd95d85.
Patch 428fd95d85b2 added a call to log_flush_wait to function
gfs2_log_flush. Then gfs2_log_flush calls log_write_header which submits
a write request with the REQ_PREFLUSH flag which also forces it to wait.
This patch removes the unnecessary call to log_flush_wait.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.
As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.
Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The generic_fillattr() helper fills in the basic attributes associated
with an inode. Enable it to handle idmapped mounts. If the inode is
accessed through an idmapped mount map it into the mount's user
namespace before we store the uid and gid. If the initial user namespace
is passed nothing changes so non-idmapped mounts will see identical
behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The posix acl permission checking helpers determine whether a caller is
privileged over an inode according to the acls associated with the
inode. Add helpers that make it possible to handle acls on idmapped
mounts.
The vfs and the filesystems targeted by this first iteration make use of
posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
translate basic posix access and default permissions such as the
ACL_USER and ACL_GROUP type according to the initial user namespace (or
the superblock's user namespace) to and from the caller's current user
namespace. Adapt these two helpers to handle idmapped mounts whereby we
either map from or into the mount's user namespace depending on in which
direction we're translating.
Similarly, cap_convert_nscap() is used by the vfs to translate user
namespace and non-user namespace aware filesystem capabilities from the
superblock's user namespace to the caller's user namespace. Enable it to
handle idmapped mounts by accounting for the mount's user namespace.
In addition the fileystems targeted in the first iteration of this patch
series make use of the posix_acl_chmod() and, posix_acl_update_mode()
helpers. Both helpers perform permission checks on the target inode. Let
them handle idmapped mounts. These two helpers are called when posix
acls are set by the respective filesystems to handle this case we extend
the ->set() method to take an additional user namespace argument to pass
the mount's user namespace down.
Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
When file attributes are changed most filesystems rely on the
setattr_prepare(), setattr_copy(), and notify_change() helpers for
initialization and permission checking. Let them handle idmapped mounts.
If the inode is accessed through an idmapped mount map it into the
mount's user namespace. Afterwards the checks are identical to
non-idmapped mounts. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.
Helpers that perform checks on the ia_uid and ia_gid fields in struct
iattr assume that ia_uid and ia_gid are intended values and have already
been mapped correctly at the userspace-kernelspace boundary as we
already do today. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The inode_owner_or_capable() helper determines whether the caller is the
owner of the inode or is capable with respect to that inode. Allow it to
handle idmapped mounts. If the inode is accessed through an idmapped
mount it according to the mount's user namespace. Afterwards the checks
are identical to non-idmapped mounts. If the initial user namespace is
passed nothing changes so non-idmapped mounts will see identical
behavior as before.
Similarly, allow the inode_init_owner() helper to handle idmapped
mounts. It initializes a new inode on idmapped mounts by mapping the
fsuid and fsgid of the caller from the mount's user namespace. If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then peform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.
Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Pass a set of flags to iomap_dio_rw instead of the boolean
wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag
replaces the wait_for_completion, but only needs to be passed
when the iocb isn't synchronous to start with to simplify the
callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
[djwong: rework xfs_file.c so that we can push iomap changes separately]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
As gfs2_quotad_cachep and gfs2_glock_cachep have registered
shrinkers, amending SLAB_RECLAIM_ACCOUNT when creating them,
which improves slab accounting.
Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function gfs2_write_revokes doesn't actually write any revokes; instead, it
adds revokes to the system transaction during a flush.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The calc_reserved description claims that buf_limit is 502 (on 4k
filesystems), but it is actually 503. Fix / clarify the entire
description.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The BUF_OFFSET and DATABUF_OFFSET definitions are only used in buf_limit
and databuf_limit, respectively, and the rounding done in those
definitions is immediately wiped out by dividing by the element size.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
When reading a resource group from disk or when receiving the resource group
statistics from a Lock Value Block (LVB), set/clear the GBF_FULL flags of all
bitmaps in that resource group according to whether or not the resource group
is full.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Removing a reservation doesn't make any actual space available, so don't clear
the GBF_FULL flags in that case. Otherwise, we'll only spend more time
scanning the bitmaps unnecessarily.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This reverts commit e79e0e1428.
It turns out that we're only setting the GBF_FULL flag of a bitmap if we've
been scanning from the beginning of the bitmap until the end and we haven't
found a single free block, and we're not skipping reservations in that process,
either. This means that in gfs2_rbm_find, we can always skip bitmaps with the
GBF_FULL flag set.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Variable ndata is only used inside "if (!dinode)", so it can be replaced
entirely with *nblocks.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
GFS2 uses struct gfs2_rbm to represent a filesystem block number as a
bit position within a resource group. This representation is used in
the bitmap manipulation code to prevent excessive conversions between
block numbers and bit positions, but also in struct gfs2_blkreserv which
is part of struct gfs2_inode, to mark the start of a reservation. In
the inode, the bit position representation makes less sense: first, the
start position is used as a block number about as often as a bit
position; second, the bit position representation makes the code
unnecessarily complicated and difficult to read.
Therefore, change struct gfs2_blkreserv to represent the start of a
reservation as a block number instead of a bit position. (This requires
keeping track of the resource group in gfs2_blkreserv separately.) With
that change, various things can be slightly simplified, and struct
gfs2_rbm can be moved to rgrp.c.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Change gfs2_rbm_incr to advance an rbm by a given number of blocks. Use that
in gfs2_reservation_check_and_update to save a gfs2_rbm_to_block ->
gfs2_rbm_from_block round trip.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
The I_DIRTY_TIME flag is primary used within the VFS, and there's no
reason for ->fsync() implementations to do anything with it. This is
because when !datasync, the VFS will expire dirty timestamps before
calling ->fsync(). (See vfs_fsync_range().) This turns I_DIRTY_TIME
into I_DIRTY_SYNC.
Therefore, change gfs2_fsync() to not check for I_DIRTY_TIME.
Link: https://lore.kernel.org/r/20210112190253.64307-11-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
There is no need to call ->dirty_inode for lazytime timestamp updates
(i.e. for __mark_inode_dirty(I_DIRTY_TIME)), since by the definition of
lazytime, filesystems must ignore these updates. Filesystems only need
to care about the updated timestamps when they expire.
Therefore, only call ->dirty_inode when I_DIRTY_INODE is set.
Based on a patch from Christoph Hellwig:
https://lore.kernel.org/r/20200325122825.1086872-4-hch@lst.de
Link: https://lore.kernel.org/r/20210112190253.64307-6-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Function gfs2_log_write_page is only used in lops.c, so make it static.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, sister functions gfs2_make_fs_rw and gfs2_make_fs_ro locked
(held) the freeze glock by calling gfs2_freeze_lock and gfs2_freeze_unlock.
The problem is, not all the callers of gfs2_make_fs_ro should be doing this.
The three callers of gfs2_make_fs_ro are: remount (gfs2_reconfigure),
signal_our_withdraw, and unmount (gfs2_put_super). But when unmounting the
file system we can get into the following circular lock dependency:
deactivate_super
down_write(&s->s_umount); <-------------------------------------- s_umount
deactivate_locked_super
gfs2_kill_sb
kill_block_super
generic_shutdown_super
gfs2_put_super
gfs2_make_fs_ro
gfs2_glock_nq_init sd_freeze_gl
freeze_go_sync
if (freeze glock in SH)
freeze_super (vfs)
down_write(&sb->s_umount); <------- s_umount
This patch moves the hold of the freeze glock outside the two sister rw/ro
functions to their callers, but it doesn't request the glock from
gfs2_put_super, thus eliminating the circular dependency.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Many places in the gfs2 code queued and dequeued the freeze glock.
Almost all of them acquire it in SHARED mode, and need to specify the
same LM_FLAG_NOEXP and GL_EXACT flags.
This patch adds common helper functions gfs2_freeze_lock and gfs2_freeze_unlock
to make the code more readable, and to prepare for the next patch.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Function signal_our_withdraw needs to work on file systems that have been
partially frozen. To do this, it called flush_workqueue(gfs2_freeze_wq).
This this wrong because it waits for *ALL* file systems to be unfrozen, not
just the one we're withdrawing from. It should only wait for the targetted
file system to be unfrozen. Otherwise it would wait until ALL file systems
are thawed before signaling the withdraw.
This patch changes signal_our_withdraw so it calls flush_work() for the target
file system's freeze work (only) to be completed.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_statfs_sync called sb_start_write and
sb_end_write. This is completely unnecessary because, aside from grabbing
glocks, gfs2_statfs_sync does all its updates to statfs with a transaction:
gfs2_trans_begin and _end. And transactions always do sb_start_intwrite in
gfs2_trans_begin and sb_end_intwrite in gfs2_trans_end.
This patch simply removes the call to sb_start_write.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Since commit a0e3cc65fa ("gfs2: Turn gl_delete into a delayed work"), we're
cancelling any pending delete work of an iopen glock before attaching a new
inode to that glock in gfs2_create_inode. This means that delete_work_func can
no longer be queued or running when attaching the iopen glock to the new inode,
and we can revert commit a4923865ea ("GFS2: Prevent delete work from
occurring on glocks used for create"), which tried to achieve the same but in a
racy way.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_create_inode and gfs2_inode_lookup, make sure to cancel any pending
delete work before taking the inode glock. Otherwise, gfs2_cancel_delete_work
may block waiting for delete_work_func to complete, and delete_work_func may
block trying to acquire the inode glock in gfs2_inode_lookup.
Reported-by: Alexander Aring <aahringo@redhat.com>
Fixes: a0e3cc65fa ("gfs2: Turn gl_delete into a delayed work")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Commit 20f829999c ("gfs2: Rework read and page fault locking") lifted
the glock lock taking from the low-level ->readpage and ->readahead
address space operations to the higher-level ->read_iter file and
->fault vm operations. The glocks are still taken in LM_ST_SHARED mode
only. On filesystems mounted without the noatime option, ->read_iter
sometimes needs to update the atime as well, though. Right now, this
leads to a failed locking mode assertion in gfs2_dirty_inode.
Fix that by introducing a new update_time inode operation. There, if
the glock is held non-exclusively, upgrade it to an exclusive lock.
Reported-by: Alexander Aring <aahringo@redhat.com>
Fixes: 20f829999c ("gfs2: Rework read and page fault locking")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
GFS2's freeze/thaw mechanism uses a special freeze glock to control its
operation. It does this with a sync glock operation (glops.c) called
freeze_go_sync. When the freeze glock is demoted (glock's do_xmote) the
glops function causes the file system to be frozen. This is intended. However,
GFS2's mount and unmount processes also hold the freeze glock to prevent other
processes, perhaps on different cluster nodes, from mounting the frozen file
system in read-write mode.
Before this patch, there was no check in freeze_go_sync for whether a freeze
in intended or whether the glock demote was caused by a normal unmount.
So it was trying to freeze the file system it's trying to unmount, which
ends up in a deadlock.
This patch adds an additional check to freeze_go_sync so that demotes of the
freeze glock are ignored if they come from the unmount process.
Fixes: 20b3291290 ("gfs2: Fix regression in freeze_go_sync")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
If gfs2 tries to mount a (corrupt) file system that has no resource
groups it still tries to set preferences on the first one, which causes
a kernel null pointer dereference. This patch adds a check to function
gfs2_ri_update so this condition is detected and reported back as an
error.
Reported-by: syzbot+e3f23ce40269a4c9053a@syzkaller.appspotmail.com
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This patch introduce a new globs attribute to define the subclass of the
glock lockref spinlock. This avoid the following lockdep warning, which
occurs when we lock an inode lock while an iopen lock is held:
============================================
WARNING: possible recursive locking detected
5.10.0-rc3+ #4990 Not tainted
--------------------------------------------
kworker/0:1/12 is trying to acquire lock:
ffff9067d45672d8 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: lockref_get+0x9/0x20
but task is already holding lock:
ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&gl->gl_lockref.lock);
lock(&gl->gl_lockref.lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by kworker/0:1/12:
#0: ffff9067c1bfdd38 ((wq_completion)delete_workqueue){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540
#1: ffffac594006be70 ((work_completion)(&(&gl->gl_delete)->work)){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540
#2: ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260
stack backtrace:
CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.10.0-rc3+ #4990
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: delete_workqueue delete_work_func
Call Trace:
dump_stack+0x8b/0xb0
__lock_acquire.cold+0x19e/0x2e3
lock_acquire+0x150/0x410
? lockref_get+0x9/0x20
_raw_spin_lock+0x27/0x40
? lockref_get+0x9/0x20
lockref_get+0x9/0x20
delete_work_func+0x188/0x260
process_one_work+0x237/0x540
worker_thread+0x4d/0x3b0
? process_one_work+0x540/0x540
kthread+0x127/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x22/0x30
Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Commit 0e539ca1bb ("gfs2: Fix NULL pointer dereference in gfs2_rgrp_dump")
introduced additional locking in gfs2_rgrp_go_dump, which is also used for
dumping resource group glocks via debugfs. However, on that code path, the
glock spin lock is already taken in dump_glock, and taking it again in
gfs2_glock2rgrp leads to deadlock. This can be reproduced with:
$ mkfs.gfs2 -O -p lock_nolock /dev/FOO
$ mount /dev/FOO /mnt/foo
$ touch /mnt/foo/bar
$ cat /sys/kernel/debug/gfs2/FOO/glocks
Fix that by not taking the glock spin lock inside the go_dump callback.
Fixes: 0e539ca1bb ("gfs2: Fix NULL pointer dereference in gfs2_rgrp_dump")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Patch 541656d3a5 ("gfs2: freeze should work on read-only mounts") changed
the check for glock state in function freeze_go_sync() from "gl->gl_state
== LM_ST_SHARED" to "gl->gl_req == LM_ST_EXCLUSIVE". That's wrong and it
regressed gfs2's freeze/thaw mechanism because it caused only the freezing
node (which requests the glock in EX) to queue freeze work.
All nodes go through this go_sync code path during the freeze to drop their
SHared hold on the freeze glock, allowing the freezing node to acquire it
in EXclusive mode. But all the nodes must freeze access to the file system
locally, so they ALL must queue freeze work. The freeze_work calls
freeze_func, which makes a request to reacquire the freeze glock in SH,
effectively blocking until the thaw from the EX holder. Once thawed, the
freezing node drops its EX hold on the freeze glock, then the (blocked)
freeze_func reacquires the freeze glock in SH again (on all nodes, including
the freezer) so all nodes go back to a thawed state.
This patch changes the check back to gl_state == LM_ST_SHARED like it was
prior to 541656d3a5.
Fixes: 541656d3a5 ("gfs2: freeze should work on read-only mounts")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Patch b2a846dbef ("gfs2: Ignore journal log writes for jdata holes")
tried (unsuccessfully) to fix a case in which writes were done to jdata
blocks, the blocks are sent to the ail list, then a punch_hole or truncate
operation caused the blocks to be freed. In other words, the ail items
are for jdata holes. Before b2a846dbef, the jdata hole caused function
gfs2_block_map to return -EIO, which was eventually interpreted as an
IO error to the journal, and then withdraw.
This patch changes function gfs2_get_block_noalloc, which is only used
for jdata writes, so it returns -ENODATA rather than -EIO, and when
-ENODATA is returned to gfs2_ail1_start_one, the error is ignored.
We can safely ignore it because gfs2_ail1_start_one is only called
when the jdata pages have already been written and truncated, so the
ail1 content no longer applies.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
This reverts commit b2a846dbef.
That commit changed the behavior of function gfs2_block_map to return
-ENODATA in cases where a hole (IOMAP_HOLE) is encountered and create is
false. While that fixed the intended problem for jdata, it also broke
other callers of gfs2_block_map such as some jdata block reads. Before
the patch, an encountered hole would be skipped and the buffer seen as
unmapped by the caller. The patch changed the behavior to return
-ENODATA, which is interpreted as an error by the caller.
The -ENODATA return code should be restricted to the specific case where
jdata holes are encountered during ail1 writes. That will be done in a
later patch.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In the fail path of gfs2_check_blk_type, forgetting to call
gfs2_glock_dq_uninit will result in rgd_gh reference leak.
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Commit fc0e38dae6 ("GFS2: Fix glock deallocation race") fixed a
sd_glock_disposal accounting bug by adding a missing atomic_dec
statement, but it failed to wake up sd_glock_wait when that decrement
causes sd_glock_disposal to reach zero. As a consequence,
gfs2_gl_hash_clear can now run into a 10-minute timeout instead of
being woken up. Add the missing wakeup.
Fixes: fc0e38dae6 ("GFS2: Fix glock deallocation race")
Cc: stable@vger.kernel.org # v2.6.39+
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Right now, we can end up calling cancel_delayed_work_sync from within
delete_work_func via gfs2_lookup_by_inum -> gfs2_inode_lookup ->
gfs2_cancel_delete_work. When that happens, it will result in a
deadlock. Instead, gfs2_inode_lookup should skip the call to
gfs2_cancel_delete_work when called from delete_work_func (blktype ==
GFS2_BLKST_UNLINKED).
Reported-by: Alexander Ahring Oder Aring <aahringo@redhat.com>
Fixes: a0e3cc65fa ("gfs2: Turn gl_delete into a delayed work")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, gfs2_fitrim was not properly checking for a "live" file
system. If the file system had something to trim and the file system
was read-only (or spectator) it would start the trim, but when it starts
the transaction, gfs2_trans_begin returns -EROFS (read-only file system)
and it errors out. However, if the file system was already trimmed so
there's no work to do, it never called gfs2_trans_begin. That code is
bypassed so it never returns the error. Instead, it returns a good
return code with 0 work. All this makes for inconsistent behavior:
The same fstrim command can return -EROFS in one case and 0 in another.
This tripped up xfstests generic/537 which reports the error as:
+fstrim with unrecovered metadata just ate your filesystem
This patch adds a check for a "live" (iow, active journal, iow, RW)
file system, and if not, returns the error properly.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before commit 97fd734ba1, the local statfs_changeX inode was never
initialized for spectator mounts. However, it still checks for
spectator mounts when unmounting everything. There's no good reason to
lookup the statfs_changeX files because spectators cannot perform recovery.
It still, however, needs the master statfs file for statfs calls.
This patch adds the check for spectator mounts to init_statfs.
Fixes: 97fd734ba1 ("gfs2: lookup local statfs inodes prior to journal recovery")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Before this patch, function gfs2_meta_sync called filemap_fdatawrite to write
the address space for the metadata being synced. That's great for inodes, but
resource groups all point to the same superblock-address space, sdp->sd_aspace.
Each rgrp has its own range of blocks on which it should operate. That meant
every time an rgrp's metadata was synced, it would write all of them instead
of just the range.
This patch eliminates function gfs2_meta_sync and tailors specific metasync
functions for inodes and rgrps.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Hi,
Before this patch, function init_journal's "undo" directive jumped to label
fail_jinode_gh. But now that it does statfs initialization, it needs to
jump to fail_statfs instead. Failure to do so means that mount failures
after init_journal is successful will neglect to let go of the proper
statfs information, stranding the statfs_changeX inodes. This makes it
impossible to free its glocks, and results in:
gfs2: fsid=sda.s: G: s:EX n:2/805f f:Dqob t:EX d:UN/603701000 a:0 v:0 r:4 m:200 p:1
gfs2: fsid=sda.s: H: s:EX f:H e:0 p:1397947 [(ended)] init_journal+0x548/0x890 [gfs2]
gfs2: fsid=sda.s: I: n:6/32863 t:8 f:0x00 d:0x00000201 s:24 p:0
gfs2: fsid=sda.s: G: s:SH n:5/805f f:Dqob t:SH d:UN/603712000 a:0 v:0 r:3 m:200 p:0
gfs2: fsid=sda.s: H: s:SH f:EH e:0 p:1397947 [(ended)] gfs2_inode_lookup+0x1fb/0x410 [gfs2]
VFS: Busy inodes after unmount of sda. Self-destruct in 5 seconds. Have a nice day...
The next time the file system is mounted, it then reuses the same glocks,
which ends in a kernel NULL pointer dereference when trying to dump the
reused glock.
This patch makes the "undo" function of init_journal jump to fail_statfs
so the statfs files are properly deconstructed upon failure.
Fixes: 97fd734ba1 ("gfs2: lookup local statfs inodes prior to journal recovery")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Gfs2 creates an address space for its rgrps called sd_aspace, but it never
called truncate_inode_pages_final on it. This confused vfs greatly which
tried to reference the address space after gfs2 had freed the superblock
that contained it.
This patch adds a call to truncate_inode_pages_final for sd_aspace, thus
avoiding the use-after-free.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>