IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
commit 0a31bd5f2bbb ("KMEM_CACHE(): simplify slab cache creation")
introduces a new macro.
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
It is possible for free_blocked_lock() to be called twice concurrently,
once from nfsd4_lock() and once from nfsd4_release_lockowner() calling
remove_blocked_locks(). This is why a kref was added.
It is perfectly safe for locks_delete_block() and kref_put() to be
called in parallel as they use locking or atomicity respectively as
protection. However locks_release_private() has no locking. It is
safe for it to be called twice sequentially, but not concurrently.
This patch moves that call from free_blocked_lock() where it could race
with itself, to free_nbl() where it cannot. This will slightly delay
the freeing of private info or release of the owner - but not by much.
It is arguably more natural for this freeing to happen in free_nbl()
where the structure itself is freed.
This bug was found by code inspection - it has not been seen in practice.
Fixes: 47446d74f170 ("nfsd4: add refcount for nfsd4_blocked_lock")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
When there is layout state on a filesystem that is being "unlocked" that
is now revoked, which involves closing the nfsd_file and releasing the
vfs lease.
To avoid races, ->ls_file can now be accessed either:
- under ->fi_lock for the state's sc_file or
- under rcu_read_lock() if nfsd_file_get() is used.
To support this, ->fence_client and nfsd4_cb_layout_fail() now take a
second argument being the nfsd_file.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Revoking state through 'unlock_filesystem' now revokes any delegation
states found. When the stateids are then freed by the client, the
revoked stateids will be cleaned up correctly.
As there is already support for revoking delegations, we build on that
for admin-revoking.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Revoking state through 'unlock_filesystem' now revokes any open states
found. When the stateids are then freed by the client, the revoked
stateids will be cleaned up correctly.
Possibly the related lock states should be revoked too, but a
subsequent patch will do that for all lock state on the superblock.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Revoking state through 'unlock_filesystem' now revokes any lock states
found. When the stateids are then freed by the client, the revoked
stateids will be cleaned up correctly.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
For NFSv4.1 and later the client easily discovers if there is any
admin-revoked state and will then find and explicitly free it.
For NFSv4.0 there is no such mechanism. The client can only find that
state is admin-revoked if it tries to use that state, and there is no
way for it to explicitly free the state. So the server must hold on to
the stateid (at least) for an indefinite amount of time. A
RELEASE_LOCKOWNER request might justify forgetting some of these
stateids, as would the whole clients lease lapsing, but these are not
reliable.
This patch takes two approaches.
Whenever a client uses an revoked stateid, that stateid is then
discarded and will not be recognised again. This might confuse a client
which expect to get NFS4ERR_ADMIN_REVOKED consistently once it get it at
all, but should mostly work. Hopefully one error will lead to other
resources being closed (e.g. process exits), which will result in more
stateid being freed when a CLOSE attempt gets NFS4ERR_ADMIN_REVOKED.
Also, any admin-revoked stateids that have been that way for more than
one lease time are periodically revoke.
No actual freeing of state happens in this patch. That will come in
future patches which handle the different sorts of revoked state.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add "admin-revoked" to the status information for any states that have
been admin-revoked. This can be useful for confirming correct
behaviour.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Change the "show" functions to show some content even if a file cannot
be found. This is the case for admin-revoked state.
This is primarily useful for debugging - to ensure states are being
removed eventually.
So change several seq_printf() to seq_puts(). Some of these are needed
to keep checkpatch happy. Others were done for consistency.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The NFSv4 protocol allows state to be revoked by the admin and has error
codes which allow this to be communicated to the client.
This patch
- introduces a new state-id status SC_STATUS_ADMIN_REVOKED
which can be set on open, lock, or delegation state.
- reports NFS4ERR_ADMIN_REVOKED when these are accessed
- introduces a per-client counter of these states and returns
SEQ4_STATUS_ADMIN_STATE_REVOKED when the counter is not zero.
Decrements this when freeing any admin-revoked state.
- introduces stub code to find all interesting states for a given
superblock so they can be revoked via the 'unlock_filesystem'
file in /proc/fs/nfsd/
No actual states are handled yet.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
sc_type identifies the type of a state - open, lock, deleg, layout - and
also the status of a state - closed or revoked.
This is a bit untidy and could get worse when "admin-revoked" states are
added. So clean it up.
With this patch, the type is now all that is stored in sc_type. This is
zero when the state is first added to ->cl_stateids (causing it to be
ignored), and is then set appropriately once it is fully initialised.
It is set under ->cl_lock to ensure atomicity w.r.t lookup. It is now
never cleared.
sc_type is still a bit-set even though at most one bit is set. This allows
lookup functions to be given a bitmap of acceptable types.
sc_type is now an unsigned short rather than char. There is no value in
restricting to just 8 bits.
All the constants now start SC_TYPE_ matching the field in which they
are stored. Keeping the existing names and ensuring clear separation
from non-type flags would have required something like
NFS4_STID_TYPE_CLOSED which is cumbersome. The "NFS4" prefix is
redundant was they only appear in NFS4 code, so remove that and change
STID to SC to match the field.
The status is stored in a separate unsigned short named "sc_status". It
has two flags: SC_STATUS_CLOSED and SC_STATUS_REVOKED.
CLOSED combines NFS4_CLOSED_STID, NFS4_CLOSED_DELEG_STID, and is used
for SC_TYPE_LOCK and SC_TYPE_LAYOUT instead of setting the sc_type to zero.
These flags are only ever set, never cleared.
For deleg stateids they are set under the global state_lock.
For open and lock stateids they are set under ->cl_lock.
For layout stateids they are set under ->ls_lock
nfs4_unhash_stid() has been removed, and we never set sc_type = 0. This
was only used for LOCK and LAYOUT stids and they now use
SC_STATUS_CLOSED.
Also TRACE_DEFINE_NUM() calls for the various STID #define have been
removed because these things are not enums, and so that call is
incorrect.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
NFS4_CLOSED_DELEG_STID and NFS4_REVOKED_DELEG_STID are similar in
purpose.
REVOKED is used for NFSv4.1 states which have been revoked because the
lease has expired. CLOSED is used in other cases.
The difference has two practical effects.
1/ REVOKED states are on the ->cl_revoked list
2/ REVOKED states result in nfserr_deleg_revoked from
nfsd4_verify_open_stid() and nfsd4_validate_stateid while
CLOSED states result in nfserr_bad_stid.
Currently a state that is being revoked is first set to "CLOSED" in
unhash_delegation_locked(), then possibly to "REVOKED" in
revoke_delegation(), at which point it is added to the cl_revoked list.
It is possible that a stateid test could see the CLOSED state
which really should be REVOKED, and so return the wrong error code. So
it is safest to remove this window of inconsistency.
With this patch, unhash_delegation_locked() always sets the state
correctly, and revoke_delegation() no longer changes the state.
Also remove a redundant test on minorversion when
NFS4_REVOKED_DELEG_STID is seen - it can only be seen when minorversion
is non-zero.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Code like:
WARN_ON(foo())
looks like an assertion and might not be expected to have any side
effects.
When testing if a function with side-effects fails a construct like
if (foo())
WARN_ON(1);
makes the intent more obvious.
nfsd has several WARN_ON calls where the test has side effects, so it
would be good to change them. These cases don't really need the
WARN_ON. They have never failed in 8 years of usage so let's just
remove the WARN_ON wrapper.
Suggested-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The protocol for creating a new state in nfsd is to allocate the state
leaving it largely uninitialised, add that state to the ->cl_stateids
idr so as to reserve a state-id, then complete initialisation of the
state and only set ->sc_type to non-zero once the state is fully
initialised.
If a state is found in the idr with ->sc_type == 0, it is ignored.
The ->cl_lock lock is used to avoid races - it is held while checking
sc_type during lookup, and held when a non-zero value is stored in
->sc_type.
... except... hash_delegation_locked() finalises the initialisation of a
delegation state, but does NOT hold ->cl_lock.
So this patch takes ->cl_lock at the appropriate time w.r.t other locks,
and so ensures there are no races (which are extremely unlikely in any
case).
As ->fi_lock is often taken when ->cl_lock is held, we need to take
->cl_lock first of those two.
Currently ->cl_lock and state_lock are never both taken at the same time.
We need both for this patch so an arbitrary choice is needed concerning
which to take first. As state_lock is more global, it might be more
contended, so take it first.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As we do now support write delegations, this comment is unhelpful and
misleading.
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As far as I can see, setting cb_seq_status in nfsd4_init_cb() is
superfluous because it is set again in nfsd4_cb_prepare().
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Don't kill the kworker thread, and don't panic while cl_lock is
held. There's no need for scorching the earth here.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Convert a code comment into a real assertion.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Improve observability of backchannel session operation.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add a trace point that records SEQ4_STATUS flags returned in an
NFSv4.1 SEQUENCE response. SEQ4_STATUS flags report backchannel
issues and changes to lease state to clients. Knowing what the
server is reporting to clients is useful for debugging both
configuration and operational issues in real time.
For example, upcoming patches will enable server administrators to
revoke parts of a client's lease; that revocation is indicated to
the client when a subsequent SEQUENCE operation has one or more
SEQ4_STATUS flags that are set.
Sample trace records:
nfsd-927 [006] 615.581821: nfsd_seq4_status: xid=0x095ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT
nfsd-927 [006] 615.588043: nfsd_seq4_status: xid=0x0a5ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT
nfsd-928 [003] 615.588448: nfsd_seq4_status: xid=0x0b5ded07 sessionid=65a032c3:b7845faf:00000001:00000000 status_flags=BACKCHANNEL_FAULT
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
NFSv4.1 clients assume that if they disconnect, that will force the
server to resend pending callback operations once a fresh connection
has been established.
Turns out NFSD has not been resending after reconnect.
Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
As part of managing a client disconnect, NFSD closes down and
replaces the backchannel rpc_clnt.
If a callback operation is pending when the backchannel rpc_clnt is
shut down, currently nfsd4_run_cb_work() just discards that
callback. But there are multiple cases to deal with here:
o The client's lease is getting destroyed. Throw the CB away.
o The client disconnected. It might be forcing a retransmit of
CB operations, or it could have disconnected for other reasons.
Reschedule the CB so it is retransmitted when the client
reconnects.
Since callback operations can now be rescheduled, ensure that
cb_ops->prepare can be called only once by moving the
cb_ops->prepare paragraph down to just before the rpc_call_async()
call.
Fixes: 2bbfed98a4d8 ("nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Normally, NFSv4 callback operations are supposed to be sent to the
client as soon as they are queued up.
In a moment, I will introduce a recovery path where the server has
to wait for the client to reconnect. We don't want a hard busy wait
here -- the callback should be requeued to try again in several
milliseconds.
For now, convert nfsd4_callback from struct work_struct to struct
delayed_work, and queue with a zero delay argument. This should
avoid behavior changes for current operation.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
I noticed that once an NFSv4.1 callback operation gets a
NFS4ERR_DELAY status on CB_SEQUENCE and then the connection is lost,
the callback client loops, resending it indefinitely.
The switch arm in nfsd4_cb_sequence_done() that handles
NFS4ERR_DELAY uses rpc_restart_call() to rearm the RPC state machine
for the retransmit, but that path does not call the rpc_prepare_call
callback again. Thus cb_seq_status is set to -10008 by the first
NFS4ERR_DELAY result, but is never set back to 1 for the retransmits.
nfsd4_cb_sequence_done() thinks it's getting nothing but a
long series of CB_SEQUENCE NFS4ERR_DELAY replies.
Fixes: 7ba6cad6c88f ("nfsd: New helper nfsd4_cb_sequence_done() for processing more cb errors")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The final bit of stats that is global is the rpc svc_stat. Move this
into the nfsd_net struct and use that everywhere instead of the global
struct. Remove the unused global struct.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This is the last global stat, take it out of the nfsd_stats struct and
make it a global part of nfsd, report it the same as always.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We have a global set of counters that we modify for all of the nfsd
operations, but now that we're exposing these stats across all network
namespaces we need to make the stats also be per-network namespace. We
already have some caching stats that are per-network namespace, so move
these definitions into the same counter and then adjust all the helpers
and users of these stats to provide the appropriate nfsd_net struct so
that the stats are maintained for the per-network namespace objects.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We are running nfsd servers inside of containers with their own network
namespace, and we want to monitor these services using the stats found
in /proc. However these are not exposed in the proc inside of the
container, so we have to bind mount the host /proc into our containers
to get at this information.
Separate out the stat counters init and the proc registration, and move
the proc registration into the pernet operations entry and exit points
so that these stats can be exposed inside of network namespaces.
This is an intermediate step, this just exposes the global counters in
the network namespace. Subsequent patches will move these counters into
the per-network namespace container.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
We're going to merge the stats all into per network namespace in
subsequent patches, rename these nn counters to be consistent with the
rest of the stats.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that this isn't used anywhere, remove it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since only one service actually reports the rpc stats there's not much
of a reason to have a pointer to it in the svc_program struct. Adjust
the svc_create_pooled function to take the sv_stats as an argument and
pass the struct through there as desired instead of getting it from the
svc_program->pg_stats.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A lot of places are setting a blank svc_stats in ->pg_stats and never
utilizing these stats. Remove all of these extra structs as we're not
reporting these stats anywhere.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The maxcount is the maximum number of bytes for the LISTXATTRS4resok
result. This includes the cookie and the count for the name array,
thus subtract 12 bytes from the maxcount: 8 (cookie) + 4 (array count)
when filling up the name array.
Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If the XDR buffer is not large enough to fit all attributes
and the remaining bytes left in the XDR buffer (xdrleft) is
equal to the number of bytes for the current attribute, then
the loop will prematurely exit without setting eof to FALSE.
Also in this case, adding the eof flag to the buffer will
make the reply 4 bytes larger than lsxa_maxcount.
Need to check if there are enough bytes to fit not only the
next attribute name but also the eof as well.
Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Function nfsd4_listxattr_validate_cookie() expects the cookie
as an offset to the list thus it needs to be encoded in big-endian.
Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
If LISTXATTRS is sent with a correct cookie but a small maxcount,
this could lead function nfsd4_listxattr_validate_cookie to
return NFS4ERR_BAD_COOKIE. If maxcount = 20, then second check
on function gives RHS = 3 thus any cookie larger than 3 returns
NFS4ERR_BAD_COOKIE.
There is no need to validate the cookie on the return XDR buffer
since attribute referenced by cookie will be the first in the
return buffer.
Fixes: 23e50fe3a5e6 ("nfsd: implement the xattr functions and en/decode logic")
Signed-off-by: Jorge Mora <mora@netapp.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Calling fput() directly or though filp_close() from a kernel thread like
nfsd causes the final __fput() (if necessary) to be called from a
workqueue. This means that nfsd is not forced to wait for any work to
complete. If the ->release or ->destroy_inode function is slow for any
reason, this can result in nfsd closing files more quickly than the
workqueue can complete the close and the queue of pending closes can
grow without bounces (30 million has been seen at one customer site,
though this was in part due to a slowness in xfs which has since been
fixed).
nfsd does not need this. It is quite appropriate and safe for nfsd to
do its own close work. There is no reason that close should ever wait
for nfsd, so no deadlock can occur.
It should be safe and sensible to change all fput() calls to
__fput_sync(). However in the interests of caution this patch only
changes two - the two that can be most directly affected by client
behaviour and could occur at high frequency.
- the fput() implicitly in flip_close() is changed to __fput_sync()
by calling get_file() first to ensure filp_close() doesn't do
the final fput() itself. If is where files opened for IO are closed.
- the fput() in nfsd_read() is also changed. This is where directories
opened for readdir are closed.
This ensure that minimal fput work is queued to the workqueue.
This removes the need for the flush_delayed_fput() call in
nfsd_file_close_inode_sync()
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The work of closing a file can have non-trivial cost. Doing it in a
separate work queue thread means that cost isn't imposed on the nfsd
threads and an imbalance can be created. This can result in files being
queued for the work queue more quickly that the work queue can process
them, resulting in unbounded growth of the queue and memory exhaustion.
To avoid this work imbalance that exhausts memory, this patch moves all
closing of files into the nfsd threads. This means that when the work
imposes a cost, that cost appears where it would be expected - in the
work of the nfsd thread. A subsequent patch will ensure the final
__fput() is called in the same (nfsd) thread which calls filp_close().
Files opened for NFSv3 are never explicitly closed by the client and are
kept open by the server in the "filecache", which responds to memory
pressure, is garbage collected even when there is no pressure, and
sometimes closes files when there is particular need such as for rename.
These files currently have filp_close() called in a dedicated work
queue, so their __fput() can have no effect on nfsd threads.
This patch discards the work queue and instead has each nfsd thread call
flip_close() on as many as 8 files from the filecache each time it acts
on a client request (or finds there are no pending client requests). If
there are more to be closed, more threads are woken. This spreads the
work of __fput() over multiple threads and imposes any cost on those
threads.
The number 8 is somewhat arbitrary. It needs to be greater than 1 to
ensure that files are closed more quickly than they can be added to the
cache. It needs to be small enough to limit the per-request delays that
will be imposed on clients when all threads are busy closing files.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
- Address a deadlock regression in RELEASE_LOCKOWNER
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmXDkGoACgkQM2qzM29m
f5cD1xAAsqKTmrb2ABdFZadYfVl6lKxtNWEp8S9eRS6eBU6bcNr5sxoTi+eflHuB
a58TfUwj8ffVtmd0qGaWI8RpDuAYtxhJU6l3TQZCLheMlKvsP3u6lZDjk8CeYtIK
9bVDZV7MOh9C/p01p/21P2B3OukgzzF8Lz/AFPCxCtoK5nnaDT5F+8pX8yfZU5x0
bM1gBHzB80BoCdzlimN6QB8EfFkOgIF9Apnk53E676KmkOuADV+CIp4aQMsm8l3r
6lJAdsOCuw5BgSDJWh1vGmFKubfhP841QbglDnnZcc+WTBhvEO0PR8Kv0Ugn5Xek
8NkcnOlti+gjH0EKTHx5P4frV0BcWptLjCjVdruOvBszrZtEaX547Mp/d04GGO0U
8gUEhen/RT7l1E2RM4qII9q5nuaekjI2Da2FGZNmK/j66OPFibiOBMRn3u/haTr5
2axrrO9NOnfWcTQX08iKZygEUY7E4h3iImqkNw+c+avZ6SRiH548TRCrKdlBeuia
Pj3QFRfu/9PQnnl9qnROXtAP70AKmX1iSLZ4s9+KqTzQVCW7tLbZpOn9D1hHCJ1q
0JTVql5rSXCl5YpBsDiqr7qvZhVab/2+1X9wHkLPZd5aIBoUmOQOMebZCUxpu+TR
houx3NBGYNixk0khEkHH3+c0HydvnxVU3xClUgLxjHgEdeS7R6s=
=z3oC
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Address a deadlock regression in RELEASE_LOCKOWNER
* tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: don't take fi_lock in nfsd_break_deleg_cb()
A recent change to check_for_locks() changed it to take ->flc_lock while
holding ->fi_lock. This creates a lock inversion (reported by lockdep)
because there is a case where ->fi_lock is taken while holding
->flc_lock.
->flc_lock is held across ->fl_lmops callbacks, and
nfsd_break_deleg_cb() is one of those and does take ->fi_lock. However
it doesn't need to.
Prior to v4.17-rc1~110^2~22 ("nfsd: create a separate lease for each
delegation") nfsd_break_deleg_cb() would walk the ->fi_delegations list
and so needed the lock. Since then it doesn't walk the list and doesn't
need the lock.
Two actions are performed under the lock. One is to call
nfsd_break_one_deleg which calls nfsd4_run_cb(). These doesn't act on
the nfs4_file at all, so don't need the lock.
The other is to set ->fi_had_conflict which is in the nfs4_file.
This field is only ever set here (except when initialised to false)
so there is no possible problem will multiple threads racing when
setting it.
The field is tested twice in nfs4_set_delegation(). The first test does
not hold a lock and is documented as an opportunistic optimisation, so
it doesn't impose any need to hold ->fi_lock while setting
->fi_had_conflict.
The second test in nfs4_set_delegation() *is* make under ->fi_lock, so
removing the locking when ->fi_had_conflict is set could make a change.
The change could only be interesting if ->fi_had_conflict tested as
false even though nfsd_break_one_deleg() ran before ->fi_lock was
unlocked. i.e. while hash_delegation_locked() was running.
As hash_delegation_lock() doesn't interact in any way with nfs4_run_cb()
there can be no importance to this interaction.
So this patch removes the locking from nfsd_break_one_deleg() and moves
the final test on ->fi_had_conflict out of the locked region to make it
clear that locking isn't important to the test. It is still tested
*after* vfs_setlease() has succeeded. This might be significant and as
vfs_setlease() takes ->flc_lock, and nfsd_break_one_deleg() is called
under ->flc_lock this "after" is a true ordering provided by a spinlock.
Fixes: edcf9725150e ("nfsd: fix RELEASE_LOCKOWNER")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The test on so_count in nfsd4_release_lockowner() is nonsense and
harmful. Revert to using check_for_locks(), changing that to not sleep.
First: harmful.
As is documented in the kdoc comment for nfsd4_release_lockowner(), the
test on so_count can transiently return a false positive resulting in a
return of NFS4ERR_LOCKS_HELD when in fact no locks are held. This is
clearly a protocol violation and with the Linux NFS client it can cause
incorrect behaviour.
If RELEASE_LOCKOWNER is sent while some other thread is still
processing a LOCK request which failed because, at the time that request
was received, the given owner held a conflicting lock, then the nfsd
thread processing that LOCK request can hold a reference (conflock) to
the lock owner that causes nfsd4_release_lockowner() to return an
incorrect error.
The Linux NFS client ignores that NFS4ERR_LOCKS_HELD error because it
never sends NFS4_RELEASE_LOCKOWNER without first releasing any locks, so
it knows that the error is impossible. It assumes the lock owner was in
fact released so it feels free to use the same lock owner identifier in
some later locking request.
When it does reuse a lock owner identifier for which a previous RELEASE
failed, it will naturally use a lock_seqid of zero. However the server,
which didn't release the lock owner, will expect a larger lock_seqid and
so will respond with NFS4ERR_BAD_SEQID.
So clearly it is harmful to allow a false positive, which testing
so_count allows.
The test is nonsense because ... well... it doesn't mean anything.
so_count is the sum of three different counts.
1/ the set of states listed on so_stateids
2/ the set of active vfs locks owned by any of those states
3/ various transient counts such as for conflicting locks.
When it is tested against '2' it is clear that one of these is the
transient reference obtained by find_lockowner_str_locked(). It is not
clear what the other one is expected to be.
In practice, the count is often 2 because there is precisely one state
on so_stateids. If there were more, this would fail.
In my testing I see two circumstances when RELEASE_LOCKOWNER is called.
In one case, CLOSE is called before RELEASE_LOCKOWNER. That results in
all the lock states being removed, and so the lockowner being discarded
(it is removed when there are no more references which usually happens
when the lock state is discarded). When nfsd4_release_lockowner() finds
that the lock owner doesn't exist, it returns success.
The other case shows an so_count of '2' and precisely one state listed
in so_stateid. It appears that the Linux client uses a separate lock
owner for each file resulting in one lock state per lock owner, so this
test on '2' is safe. For another client it might not be safe.
So this patch changes check_for_locks() to use the (newish)
find_any_file_locked() so that it doesn't take a reference on the
nfs4_file and so never calls nfsd_file_put(), and so never sleeps. With
this check is it safe to restore the use of check_for_locks() rather
than testing so_count against the mysterious '2'.
Fixes: ce3c4ad7f4ce ("NFSD: Fix possible sleep during nfsd4_release_lockowner()")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ/BCAAKCRBZ7Krx/gZQ
68qqAQD6LtfYLDJGdJM+lNpyiG4BA7coYpPlJtmH7mzL+MbFPgEAnM7XsK6zyvza
3+rEggLM0UFWjg9Ln7Nlq035TeYtFwo=
=w1mD
-----END PGP SIGNATURE-----
Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc filesystem updates from Al Viro:
"Misc cleanups (the part that hadn't been picked by individual fs
trees)"
* tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
apparmorfs: don't duplicate kfree_link()
orangefs: saner arguments passing in readdir guts
ocfs2_find_match(): there's no such thing as NULL or negative ->d_parent
reiserfs_add_entry(): get rid of pointless namelen checks
__ocfs2_add_entry(), ocfs2_prepare_dir_for_insert(): namelen checks
ext4_add_entry(): ->d_name.len is never 0
befs: d_obtain_alias(ERR_PTR(...)) will do the right thing
affs: d_obtain_alias(ERR_PTR(...)) will do the right thing
/proc/sys: use d_splice_alias() calling conventions to simplify failure exits
hostfs: use d_splice_alias() calling conventions to simplify failure exits
udf_fiiter_add_entry(): check for zero ->d_name.len is bogus...
udf: d_obtain_alias(ERR_PTR(...)) will do the right thing...
udf: d_splice_alias() will do the right thing on ERR_PTR() inode
nfsd: kill stale comment about simple_fill_super() requirements
bfs_add_entry(): get rid of pointless ->d_name.len checks
nilfs2: d_obtain_alias(ERR_PTR(...)) will do the right thing...
zonefs: d_splice_alias() will do the right thing on ERR_PTR() inode
change of locking rules for __dentry_kill(), regularized refcounting
rules in that area, assorted cleanups and removal of weird corner
cases (e.g. now ->d_iput() on child is always called before the parent
might hit __dentry_kill(), etc.)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+sQQAKCRBZ7Krx/gZQ
6ybjAQDM5jiS93IUzfHjCWq0nVBX5YGbDAkZOeqxbmIdQb+2UAEA6elP5r0fBBcA
seo3bry4DirQMDaA/Cjh4+8r71YSOQs=
=7+Hk
-----END PGP SIGNATURE-----
Merge tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
"Change of locking rules for __dentry_kill(), regularized refcounting
rules in that area, assorted cleanups and removal of weird corner
cases (e.g. now ->d_iput() on child is always called before the parent
might hit __dentry_kill(), etc)"
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
dcache: remove unnecessary NULL check in dget_dlock()
kill DCACHE_MAY_FREE
__d_unalias() doesn't use inode argument
d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant
get rid of DCACHE_GENOCIDE
d_genocide(): move the extern into fs/internal.h
simple_fill_super(): don't bother with d_genocide() on failure
nsfs: use d_make_root()
d_alloc_pseudo(): move setting ->d_op there from the (sole) caller
kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller
retain_dentry(): introduce a trimmed-down lockless variant
__dentry_kill(): new locking scheme
d_prune_aliases(): use a shrink list
switch select_collect{,2}() to use of to_shrink_list()
to_shrink_list(): call only if refcount is 0
fold dentry_kill() into dput()
don't try to cut corners in shrink_lock_dentry()
fold the call of retain_dentry() into fast_dput()
Call retain_dentry() with refcount 0
dentry_kill(): don't bother with retain_dentry() on slow path
...
broken in 6.5; we really can't lock two unrelated directories
without holding ->s_vfs_rename_mutex first and in case of
same-parent rename of a subdirectory 6.5 ends up doing just
that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZZ+lyQAKCRBZ7Krx/gZQ
60MWAP94hTqeMIpjhsUIkrTnylrIFaiw4UCWFJzIRG1QQYKqCgD/XUaWI9np7dL6
0wR/j4CQSdJjiEFKUFE2pD3QoSuJYAQ=
=+x0+
-----END PGP SIGNATURE-----
Merge tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull rename updates from Al Viro:
"Fix directory locking scheme on rename
This was broken in 6.5; we really can't lock two unrelated directories
without holding ->s_vfs_rename_mutex first and in case of same-parent
rename of a subdirectory 6.5 ends up doing just that"
* tag 'pull-rename' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
rename(): avoid a deadlock in the case of parents having no common ancestor
kill lock_two_inodes()
rename(): fix the locking of subdirectories
f2fs: Avoid reading renamed directory if parent does not change
ext4: don't access the source subdirectory content on same-directory rename
ext2: Avoid reading renamed directory if parent does not change
udf_rename(): only access the child content on cross-directory rename
ocfs2: Avoid touching renamed directory if parent does not change
reiserfs: Avoid touching renamed directory if parent does not change
The bulk of the patches for this release are clean-ups and minor bug
fixes.
There is one significant revert to mention: support for RDMA Read
operations in the server's RPC-over-RDMA transport implementation
has been fixed so it waits for Read completion in a way that avoids
tying up an nfsd thread. This prevents a possible DoS vector if an
RPC-over-RDMA client should become unresponsive during RDMA Read
operations.
As always I am grateful to NFSD contributors, reviewers, and
testers.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmWdW34ACgkQM2qzM29m
f5fKmw/+PcjoNDWR55kTmOo8j0h4HF8rhunvP2C50svnnsX63y1WKkLaxyAFN/Hl
UFucJDQBjJvwi+PEbGOXcjkizuG5mhRBFvFIYDJYGWsE1s7B/v3E/Servvt1wSek
UjoTjknYrqH6R3YfA8zBaWRJUXwvVQW3Bzo4mShrQK7He9/7nBHdUe0aWbAA9oW3
QgzKH/FzqCS03MvuxQv74KgBcl3diIrDaj041A3CtSnXzSKqwc3LaUAd5B4BL+oq
GnxpV1rtZla50M4Ntddi+vSjUvHWZySQ1GEJj7rKLTwpGXkxM2NuMkGx676WR4Iv
sYDX0fsica2elKbqJem8pk68qi6XEdZVAdoOHdgNJRClmYHby8xkrL/TYKiQZf42
IN9FogoVSZ+vSdI158Weim9+0Jqf+ffIh57ZtOyQQQAGZkdhB6GhcbdHJhQ9eOgB
LAiAL7bsoWvDmBh5m9KnBmQYGpZoDUa6AT0bIvGD2O4/MdpHBkyT8Xwt+210nPOK
mpBtxe5O8cUcg7A5/TwnVRg5jKp4CF8VWh2R8sGDhcYV8UfRthB38h4rHNhv4vxt
l6ZUgmtTxrs1rCeh6aoiWTKXeQmI8meWlcet7cxw/axAsaTXkYPi5mslxF9f4O8u
nQ8q7LuZQy2CKZO/t98STwx7s9OJcDOwcy51rnKK85TlCwnxFWg=
=mIKg
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The bulk of the patches for this release are clean-ups and minor bug
fixes.
There is one significant revert to mention: support for RDMA Read
operations in the server's RPC-over-RDMA transport implementation has
been fixed so it waits for Read completion in a way that avoids tying
up an nfsd thread. This prevents a possible DoS vector if an
RPC-over-RDMA client should become unresponsive during RDMA Read
operations.
As always I am grateful to NFSD contributors, reviewers, and testers"
* tag 'nfsd-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits)
nfsd: rename nfsd_last_thread() to nfsd_destroy_serv()
SUNRPC: discard sv_refcnt, and svc_get/svc_put
svc: don't hold reference for poolstats, only mutex.
SUNRPC: remove printk when back channel request not found
svcrdma: Implement multi-stage Read completion again
svcrdma: Copy construction of svc_rqst::rq_arg to rdma_read_complete()
svcrdma: Add back svcxprt_rdma::sc_read_complete_q
svcrdma: Add back svc_rdma_recv_ctxt::rc_pages
svcrdma: Clean up comment in svc_rdma_accept()
svcrdma: Remove queue-shortening warnings
svcrdma: Remove pointer addresses shown in dprintk()
svcrdma: Optimize svc_rdma_cc_init()
svcrdma: De-duplicate completion ID initialization helpers
svcrdma: Move the svc_rdma_cc_init() call
svcrdma: Remove struct svc_rdma_read_info
svcrdma: Update the synopsis of svc_rdma_read_special()
svcrdma: Update the synopsis of svc_rdma_read_call_chunk()
svcrdma: Update synopsis of svc_rdma_read_multiple_chunks()
svcrdma: Update synopsis of svc_rdma_copy_inline_range()
svcrdma: Update the synopsis of svc_rdma_read_data_item()
...