2966 Commits

Author SHA1 Message Date
David Howells
d64f4554dd fscache: Provide a means to begin an operation
Provide a function to begin a read operation:

	int fscache_begin_read_operation(
		struct netfs_cache_resources *cres,
		struct fscache_cookie *cookie)

This is primarily intended to be called by network filesystems on behalf of
netfslib, but may also be called to use the I/O access functions directly.
It attaches the resources required by the cache to cres struct from the
supplied cookie.

This holds access to the cache behind the cookie for the duration of the
operation and forces cache withdrawal and cookie invalidation to perform
synchronisation on the operation.  cres->inval_counter is set from the
cookie at this point so that it can be compared at the end of the
operation.

Note that this does not guarantee that the cache state is fully set up and
able to perform I/O immediately; looking up and creation may be left in
progress in the background.  The operations intended to be called by the
network filesystem, such as reading and writing, are expected to wait for
the cookie to move to the correct state.

This will, however, potentially sleep, waiting for a certain minimum state
to be set or for operations such as invalidate to advance far enough that
I/O can resume.


Also provide a function for the cache to call to wait for the cache object
to get to a state where it can be used for certain things:

	bool fscache_wait_for_operation(struct netfs_cache_resources *cres,
					enum fscache_want_stage stage);

This looks at the cache resources provided by the begin function and waits
for them to get to an appropriate stage.  There's a choice of wanting just
some parameters (FSCACHE_WANT_PARAM) or the ability to do I/O
(FSCACHE_WANT_READ or FSCACHE_WANT_WRITE).

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819603692.215744.146724961588817028.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906910672.143852.13856103384424986357.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967110245.1823006.2239170567540431836.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021513617.640689.16627329360866150606.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
d24af13e2e fscache: Implement cookie invalidation
Add a function to invalidate the cache behind a cookie:

	void fscache_invalidate(struct fscache_cookie *cookie,
				const void *aux_data,
				loff_t size,
				unsigned int flags)

This causes any cached data for the specified cookie to be discarded.  If
the cookie is marked as being in use, a new cache object will be created if
possible and future I/O will use that instead.  In-flight I/O should be
abandoned (writes) or reconsidered (reads).  Each time it is called
cookie->inval_counter is incremented and this can be used to detect
invalidation at the end of an I/O operation.

The coherency data attached to the cookie can be updated and the cookie
size should be reset.  One flag is available, FSCACHE_INVAL_DIO_WRITE,
which should be used to indicate invalidation due to a DIO write on a
file.  This will temporarily disable caching for this cookie.

Changes
=======
ver #2:
 - Should only change to inval state if can get access to cache.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819602231.215744.11206598147269491575.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906909707.143852.18056070560477964891.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967107447.1823006.5945029409592119962.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021512640.640689.11418616313147754172.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
12bb21a29c fscache: Implement cookie user counting and resource pinning
Provide a pair of functions to count the number of users of a cookie (open
files, writeback, invalidation, resizing, reads, writes), to obtain and pin
resources for the cookie and to prevent culling for the whilst there are
users.

The first function marks a cookie as being in use:

	void fscache_use_cookie(struct fscache_cookie *cookie,
				bool will_modify);

The caller should indicate the cookie to use and whether or not the caller
is in a context that may modify the cookie (e.g. a file open O_RDWR).

If the cookie is not already resourced, fscache will ask the cache backend
in the background to do whatever it needs to look up, create or otherwise
obtain the resources necessary to access data.  This is pinned to the
cookie and may not be culled, though it may be withdrawn if the cache as a
whole is withdrawn.

The second function removes the in-use mark from a cookie and, optionally,
updates the coherency data:

	void fscache_unuse_cookie(struct fscache_cookie *cookie,
				  const void *aux_data,
				  const loff_t *object_size);

If non-NULL, the aux_data buffer and/or the object_size will be saved into
the cookie and will be set on the backing store when the object is
committed.

If this removes the last usage on a cookie, the cookie is placed onto an
LRU list from which it will be removed and closed after a couple of seconds
if it doesn't get reused.  This prevents resource overload in the cache -
in particular it prevents it from holding too many files open.

Changes
=======
ver #2:
 - Fix fscache_unuse_cookie() to use atomic_dec_and_lock() to avoid a
   potential race if the cookie gets reused before it completes the
   unusement.
 - Added missing transition to LRU_DISCARDING state.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819600612.215744.13678350304176542741.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906907567.143852.16979631199380722019.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967106467.1823006.6790864931048582667.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021511674.640689.10084988363699111860.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
5d00e426f9 fscache: Implement simple cookie state machine
Implement a very simple cookie state machine to handle lookup,
invalidation, withdrawal, relinquishment and, to be added later, commit on
LRU discard.

Three cache methods are provided: ->lookup_cookie() to look up and, if
necessary, create a data storage object; ->withdraw_cookie() to free the
resources associated with that object and potentially delete it; and
->prepare_to_write(), to do prepare for changes to the cached data to be
modified locally.

Changes
=======
ver #3:
 - Fix a race between LRU discard and relinquishment whereby the former
   would override the latter and thus the latter would never happen[1].

ver #2:
 - Don't hold n_accesses elevated whilst cache is bound to a cookie, but
   rather add a flag that prevents the state machine from being queued when
   n_accesses reaches 0.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/599331.1639410068@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/163819599657.215744.15799615296912341745.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906903925.143852.1805855338154353867.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967105456.1823006.14730395299835841776.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021510706.640689.7961423370243272583.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
bfa22da3ed fscache: Provide and use cache methods to lookup/create/free a volume
Add cache methods to lookup, create and remove a volume.

Looking up or creating the volume requires the cache pinning for access;
freeing the volume requires the volume pinning for access.  The
->acquire_volume() method is used to ask the cache backend to lookup and,
if necessary, create a volume; the ->free_volume() method is used to free
the resources for a volume.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819597821.215744.5225318658134989949.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906898645.143852.8537799955945956818.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967099771.1823006.1455197910571061835.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021507345.640689.4073511598838843040.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
a7733fb632 fscache: Implement cookie-level access helpers
Add a number of helper functions to manage access to a cookie, pinning the
cache object in place for the duration to prevent cache withdrawal from
removing it:

 (1) void fscache_init_access_gate(struct fscache_cookie *cookie);

     This function initialises the access count when a cache binds to a
     cookie.  An extra ref is taken on the access count to prevent wakeups
     while the cache is active.  We're only interested in the wakeup when a
     cookie is being withdrawn and we're waiting for it to quiesce - at
     which point the counter will be decremented before the wait.

     The FSCACHE_COOKIE_NACC_ELEVATED flag is set on the cookie to keep
     track of the extra ref in order to handle a race between
     relinquishment and withdrawal both trying to drop the extra ref.

 (2) bool fscache_begin_cookie_access(struct fscache_cookie *cookie,
				      enum fscache_access_trace why);

     This function attempts to begin access upon a cookie, pinning it in
     place if it's cached.  If successful, it returns true and leaves a the
     access count incremented.

 (3) void fscache_end_cookie_access(struct fscache_cookie *cookie,
				    enum fscache_access_trace why);

     This function drops the access count obtained by (2), permitting
     object withdrawal to take place when it reaches zero.

A tracepoint is provided to track changes to the access counter on a
cookie.

Changes
=======
ver #2:
 - Don't hold n_accesses elevated whilst cache is bound to a cookie, but
   rather add a flag that prevents the state machine from being queued when
   n_accesses reaches 0.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819595085.215744.1706073049250505427.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906895313.143852.10141619544149102193.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095980.1823006.1133648159424418877.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021503063.640689.8870918985269528670.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
e6acd3299b fscache: Implement volume-level access helpers
Add a pair of helper functions to manage access to a volume, pinning the
volume in place for the duration to prevent cache withdrawal from removing
it:

	bool fscache_begin_volume_access(struct fscache_volume *volume,
					 enum fscache_access_trace why);
	void fscache_end_volume_access(struct fscache_volume *volume,
				       enum fscache_access_trace why);

The way the access gate on the volume works/will work is:

  (1) If the cache tests as not live (state is not FSCACHE_CACHE_IS_ACTIVE),
      then we return false to indicate access was not permitted.

  (2) If the cache tests as live, then we increment the volume's n_accesses
      count and then recheck the cache liveness, ending the access if it
      ceased to be live.

  (3) When we end the access, we decrement the volume's n_accesses and wake
      up the any waiters if it reaches 0.

  (4) Whilst the cache is caching, the volume's n_accesses is kept
      artificially incremented to prevent wakeups from happening.

  (5) When the cache is taken offline, the state is changed to prevent new
      accesses, the volume's n_accesses is decremented and we wait for it to
      become 0.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819594158.215744.8285859817391683254.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906894315.143852.5454793807544710479.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967095028.1823006.9173132503876627466.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021501546.640689.9631510472149608443.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
23e12e285a fscache: Implement cache-level access helpers
Add a pair of functions to pin/unpin a cache that we're wanting to do a
high-level access to (such as creating or removing a volume):

	bool fscache_begin_cache_access(struct fscache_cache *cache,
					enum fscache_access_trace why);
	void fscache_end_cache_access(struct fscache_cache *cache,
				      enum fscache_access_trace why);

The way the access gate works/will work is:

 (1) If the cache tests as not live (state is not FSCACHE_CACHE_IS_ACTIVE),
     then we return false to indicate access was not permitted.

 (2) If the cache tests as live, then we increment the n_accesses count and
     then recheck the liveness, ending the access if it ceased to be live.

 (3) When we end the access, we decrement n_accesses and wake up the any
     waiters if it reaches 0.

 (4) Whilst the cache is caching, n_accesses is kept artificially
     incremented to prevent wakeups from happening.

 (5) When the cache is taken offline, the state is changed to prevent new
     accesses, n_accesses is decremented and we wait for n_accesses to
     become 0.

Note that some of this is implemented in a later patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819593239.215744.7537428720603638088.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906893368.143852.14164004598465617981.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967093977.1823006.6967886507023056409.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021499995.640689.18286203753480287850.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
7f3283aba3 fscache: Implement cookie registration
Add functions to the fscache API to allow data file cookies to be acquired
and relinquished by the network filesystem.  It is intended that the
filesystem will create such cookies per-inode under a volume.

To request a cookie, the filesystem should call:

	struct fscache_cookie *
	fscache_acquire_cookie(struct fscache_volume *volume,
			       u8 advice,
			       const void *index_key,
			       size_t index_key_len,
			       const void *aux_data,
			       size_t aux_data_len,
			       loff_t object_size)


The filesystem must first have created a volume cookie, which is passed in
here.  If it passes in NULL then the function will just return a NULL
cookie.

A binary key should be passed in index_key and is of size index_key_len.
This is saved in the cookie and is used to locate the associated data in
the cache.

A coherency data buffer of size aux_data_len will be allocated and
initialised from the buffer pointed to by aux_data.  This is used to
validate cache objects when they're opened and is stored on disk with them
when they're committed.  The data is stored in the cookie and will be
updateable by various functions in later patches.

The object_size must also be given.  This is also used to perform a
coherency check and to size the backing storage appropriately.

This function disallows a cookie from being acquired twice in parallel,
though it will cause the second user to wait if the first is busy
relinquishing its cookie.


When a network filesystem has finished with a cookie, it should call:

	void
	fscache_relinquish_cookie(struct fscache_volume *volume,
				  bool retire)

If retire is true, any backing data will be discarded immediately.

Changes
=======
ver #3:
 - fscache_hash()'s size parameter is now in bytes.  Use __le32 as the unit
   to round up to.
 - When comparing cookies, simply see if the attributes are the same rather
   than subtracting them to produce a strcmp-style return[1].
 - Add a check to see if the cookie is still hashed at the point of
   freeing.

ver #2:
 - Don't hold n_accesses elevated whilst cache is bound to a cookie, but
   rather add a flag that prevents the state machine from being queued when
   n_accesses reaches 0.
 - Remove the unused cookie pointer field from the fscache_acquire
   tracepoint.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/163819590658.215744.14934902514281054323.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906891983.143852.6219772337558577395.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967088507.1823006.12659006350221417165.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021498432.640689.12743483856927722772.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
62ab633523 fscache: Implement volume registration
Add functions to the fscache API to allow volumes to be acquired and
relinquished by the network filesystem.  A volume is an index of data
storage cache objects.  A volume is represented by a volume cookie in the
API.  A filesystem would typically create a volume for a superblock and
then create per-inode cookies within it.

To request a volume, the filesystem calls:

	struct fscache_volume *
	fscache_acquire_volume(const char *volume_key,
			       const char *cache_name,
			       const void *coherency_data,
			       size_t coherency_len)

The volume_key is a printable string used to match the volume in the cache.
It should not contain any '/' characters.  For AFS, for example, this would
be "afs,<cellname>,<volume_id>", e.g. "afs,example.com,523001".

The cache_name can be NULL, but if not it should be a string indicating the
name of the cache to use if there's more than one available.

The coherency data, if given, is an arbitrarily-sized blob that's attached
to the volume and is compared when the volume is looked up.  If it doesn't
match, the old volume is judged to be out of date and it and everything
within it is discarded.

Acquiring a volume twice concurrently is disallowed, though the function
will wait if an old volume cookie is being relinquishing.


When a network filesystem has finished with a volume, it should return the
volume cookie by calling:

	void
	fscache_relinquish_volume(struct fscache_volume *volume,
				  const void *coherency_data,
				  bool invalidate)

If invalidate is true, the entire volume will be discarded; if false, the
volume will be synced and the coherency data will be updated.

Changes
=======
ver #4:
 - Removed an extraneous param from kdoc on fscache_relinquish_volume()[3].

ver #3:
 - fscache_hash()'s size parameter is now in bytes.  Use __le32 as the unit
   to round up to.
 - When comparing cookies, simply see if the attributes are the same rather
   than subtracting them to produce a strcmp-style return[2].
 - Make the coherency data an arbitrary blob rather than a u64, but don't
   store it for the moment.

ver #2:
 - Fix error check[1].
 - Make a fscache_acquire_volume() return errors, including EBUSY if a
   conflicting volume cookie already exists.  No error is printed now -
   that's left to the netfs.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/20211203095608.GC2480@kili/ [1]
Link: https://lore.kernel.org/r/CAHk-=whtkzB446+hX0zdLsdcUJsJ=8_-0S1mE_R+YurThfUbLA@mail.gmail.com/ [2]
Link: https://lore.kernel.org/r/20211220224646.30e8205c@canb.auug.org.au/ [3]
Link: https://lore.kernel.org/r/163819588944.215744.1629085755564865996.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906890630.143852.13972180614535611154.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967086836.1823006.8191672796841981763.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021495816.640689.4403156093668590217.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
9549332df4 fscache: Implement cache registration
Implement a register of caches and provide functions to manage it.

Two functions are provided for the cache backend to use:

 (1) Acquire a cache cookie:

	struct fscache_cache *fscache_acquire_cache(const char *name)

     This gets the cache cookie for a cache of the specified name and moves
     it to the preparation state.  If a nameless cache cookie exists, that
     will be given this name and used.

 (2) Relinquish a cache cookie:

	void fscache_relinquish_cache(struct fscache_cache *cache);

     This relinquishes a cache cookie, cleans it and makes it available if
     it's still referenced by a network filesystem.

Note that network filesystems don't deal with cache cookies directly, but
rather go straight to the volume registration.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819587157.215744.13523139317322503286.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906889665.143852.10378009165231294456.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967085081.1823006.2218944206363626210.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021494847.640689.10109692261640524343.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
1e1236b841 fscache: Introduce new driver
Introduce basic skeleton of the new, rewritten fscache driver.

Changes
=======
ver #3:
 - Use remove_proc_subtree(), not remove_proc_entry() to remove a populated
   dir.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819584034.215744.4290533472390439030.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906887770.143852.3577888294989185666.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967080039.1823006.5702921801104057922.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021491014.640689.4292699878317589512.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
9e1aa6b8f4 netfs: Display the netfs inode number in the netfs_read tracepoint
Display the netfs inode number in the netfs_read tracepoint so that this
can be used to correlate with the cachefiles_prep_read tracepoint.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819581097.215744.17476611915583897051.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906885903.143852.12229407815154182247.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967078164.1823006.15286989199782861123.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021487412.640689.7544388469390936443.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
2cee6fbb7f fscache: Remove the contents of the fscache driver, pending rewrite
Remove the code that comprises the fscache driver as it's going to be
substantially rewritten, with the majority of the code being erased in the
rewrite.

A small piece of linux/fscache.h is left as that is #included by a bunch of
network filesystems.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819578724.215744.18210619052245724238.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906884814.143852.6727245089843862889.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967077097.1823006.1377665951499979089.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021485548.640689.13876080567388696162.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
David Howells
850cba069c cachefiles: Delete the cachefiles driver pending rewrite
Delete the code from the cachefiles driver to make it easier to rewrite and
resubmit in a logical manner.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819577641.215744.12718114397770666596.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906883770.143852.4149714614981373410.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967076066.1823006.7175712134577687753.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021483619.640689.7586546280515844702.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
Pierguido Lambri
4b0c359b81 SUNRPC: Add source address/port to rpc_socket* traces
The rpc_socket* traces now show also the source address
and port. An example is:

kworker/u17:1-951   [005] 134218.925343: rpc_socket_close:
   socket:[46913] srcaddr=192.168.100.187:793 dstaddr=192.168.100.129:2049
   state=4 (DISCONNECTING) sk_state=7 (CLOSE)
kworker/u17:0-242   [006] 134360.841370: rpc_socket_connect:
   error=-115 socket:[56322] srcaddr=192.168.100.187:769
   dstaddr=192.168.100.129:2049 state=2 (CONNECTING) sk_state=2 (SYN_SENT)
       <idle>-0     [006] 134360.841859: rpc_socket_state_change: socket:[56322]
   srcaddr=192.168.100.187:769 dstaddr=192.168.100.129:2049 state=2 (CONNECTING)
   sk_state=1 (ESTABLISHED)

Signed-off-by: Pierguido Lambri <plambri@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-01-06 14:00:20 -05:00
Hannes Reinecke
c318458c93 ata: libata: add tracepoints for ATA error handling
Add tracepoints for ATA error handling.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-01-05 19:33:01 +09:00
Hannes Reinecke
7fad6ad6a3 ata: libata-sff: tracepoints for HSM state machine
Add tracepoints for the HSM state machine and drop DPRINTK calls

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-01-05 19:33:01 +09:00
Hannes Reinecke
c206a389c9 ata: libata: tracepoints for bus-master DMA
Add tracepoints for bus-master DMA and taskfile related functions.
That allows us to drop the relevant DPRINTK() calls.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-01-05 19:32:51 +09:00
Matthew Wilcox (Oracle)
a0580c6f9b filemap: Convert tracing of page cache operations to folio
Pass the folio instead of a page.  The page was already implicitly a
folio as it accessed page->mapping directly.  Add the order of the folio
to the tracepoint, as this is important information.  Also drop printing
the address of the struct page as the pfn provides better information
than the struct page address.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-04 13:15:33 -05:00
Hannes Reinecke
fc914faad6 ata: libata: add qc_prep tracepoint
Convert the existing ata_qc_issue() tracepoint into a template,
and add tracepoints for ata_qc_prep() and ata_qc_issue() based
on that template.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-01-04 20:01:01 +09:00
Hannes Reinecke
f8ec26d0f5 ata: libata: add reset tracepoints
To follow the flow of control we should be using tracepoints, as
they will tie in with the actual I/O flow and deliver a better
overview about what it happening.
This patch adds tracepoints for hard reset, soft reset, and postreset
and adds them in the libata-eh control flow.
With that we can drop the reset DPRINTK calls in the various drivers.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
2022-01-04 20:00:57 +09:00
Joerg Roedel
66dc1b791c Merge branches 'arm/smmu', 'virtio', 'x86/amd', 'x86/vt-d' and 'core' into next 2022-01-04 10:33:45 +01:00
Josef Bacik
2e4e97abac btrfs: pass fs_info to trace_btrfs_transaction_commit
The root on the trans->root can be anything, and generally we're
committing from the transaction kthread so it's usually the tree_root.
Change this to just take an fs_info, and to maintain compatibility
simply put the ROOT_TREE_OBJECTID as the root objectid for the
tracepoint.  This will allow use to remove trans->root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:47 +01:00
Mel Gorman
1b4e3f26f9 mm: vmscan: Reduce throttling due to a failure to make progress
Mike Galbraith, Alexey Avramov and Darrick Wong all reported similar
problems due to reclaim throttling for excessive lengths of time.  In
Alexey's case, a memory hog that should go OOM quickly stalls for
several minutes before stalling.  In Mike and Darrick's cases, a small
memcg environment stalled excessively even though the system had enough
memory overall.

Commit 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is
being made") introduced the problem although commit a19594ca4a8b
("mm/vmscan: increase the timeout if page reclaim is not making
progress") made it worse.  Systems at or near an OOM state that cannot
be recovered must reach OOM quickly and memcg should kill tasks if a
memcg is near OOM.

To address this, only stall for the first zone in the zonelist, reduce
the timeout to 1 tick for VMSCAN_THROTTLE_NOPROGRESS and only stall if
the scan control nr_reclaimed is 0, kswapd is still active and there
were excessive pages pending for writeback.  If kswapd has stopped
reclaiming due to excessive failures, do not stall at all so that OOM
triggers relatively quickly.  Similarly, if an LRU is simply congested,
only lightly throttle similar to NOPROGRESS.

Alexey's original case was the most straight forward

	for i in {1..3}; do tail /dev/zero; done

On vanilla 5.16-rc1, this test stalled heavily, after the patch the test
completes in a few seconds similar to 5.15.

Alexey's second test case added watching a youtube video while tail runs
10 times.  On 5.15, playback only jitters slightly, 5.16-rc1 stalls a
lot with lots of frames missing and numerous audio glitches.  With this
patch applies, the video plays similarly to 5.15.

[lkp@intel.com: Fix W=1 build warning]

Link: https://lore.kernel.org/r/99e779783d6c7fce96448a3402061b9dc1b3b602.camel@gmx.de
Link: https://lore.kernel.org/r/20211124011954.7cab9bb4@mail.inbox.lv
Link: https://lore.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lore.kernel.org/r/20211202150614.22440-1-mgorman@techsingularity.net
Link: https://linux-regtracking.leemhuis.info/regzbot/regression/20211124011954.7cab9bb4@mail.inbox.lv/
Reported-and-tested-by: Alexey Avramov <hakavlad@inbox.lv>
Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Reported-and-tested-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Tracked-by: Thorsten Leemhuis <regressions@leemhuis.info>
Fixes: 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-12-31 11:17:07 -08:00
Chuck Lever
5089f3d975 SUNRPC: Remove low signal-to-noise tracepoints
I'm about to add more information to the server-side SUNRPC
tracepoints, so I'm going to offset the increased trace log
consumption by getting rid of some tracepoints that fire frequently
but don't offer much value.

trace_svc_xprt_received() was useful for debugging, perhaps, but
is not generally informative.

trace_svc_handle_xprt() reports largely the same information as
trace_svc_xdr_recvfrom().

As a clean-up, rename trace_svc_xprt_do_enqueue() to match
svc_xprt_dequeue().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-12-13 13:43:00 -05:00
Jaegeuk Kim
bd984c0309 f2fs: show more DIO information in tracepoint
This prints more information of DIO in tracepoint.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:31 -08:00
Gao Xiang
469407a3b5 erofs: clean up erofs_map_blocks tracepoints
Since the new type of chunk-based files is introduced, there is no
need to leave flatmode tracepoints.

Rename to erofs_map_blocks instead.

Link: https://lore.kernel.org/r/20211209012918.30337-1-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@yulong.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-12-09 10:02:10 +08:00
Masami Hiramatsu
55de2c0b56 tracing: Add '__rel_loc' using trace event macros
Add '__rel_loc' using trace event macros. These macros are usually
not used in the kernel, except for testing purpose.
This also add "rel_" variant of macros for dynamic_array string,
and bitmask.

Link: https://lkml.kernel.org/r/163757342119.510314.816029622439099016.stgit@devnote2

Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-06 15:37:21 -05:00
Dafna Hirschfeld
94aedac49d iommu: Log iova range in map/unmap trace events
In case of an iommu page fault, the faulting iova is logged
in trace_io_page_fault. It is therefore convenient to log
the iova range in mapping/unmapping trace events so that it
is easier to see if the faulting iova was recently in any of
those ranges.

Signed-off-by: Dafna Hirschfeld <dafna.hirschfeld@collabora.com>
Link: https://lore.kernel.org/r/20211104071620.27290-1-dafna.hirschfeld@collabora.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-12-06 11:59:31 +01:00
Eric Biggers
ccf7cf9237 f2fs: fix the f2fs_file_write_iter tracepoint
Pass in the original position and count rather than the position and
count that were updated by the write.  Also use the correct types for
all arguments, in particular the file offset which was being truncated
to 32 bits on 32-bit platforms.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-04 10:53:35 -08:00
William Kucharski
e14da77113 cgroup: Trace event cgroup id fields should be u64
Various trace event fields that store cgroup IDs were declared as
ints, but cgroup_id(() returns a u64 and the structures and associated
TP_printk() calls were not updated to reflect this.

Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID")
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-12-01 07:23:35 -10:00
Christoph Hellwig
f3fa33acca block: remove the ->rq_disk field in struct request
Just use the disk attached to the request_queue instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Linus Torvalds
7413927713 NFS client bugfixes for Linux 5.16
Highlights include:
 
 Stable fixes:
 - NFSv42: Fix pagecache invalidation after COPY/CLONE
 
 Bugfixes:
 - NFSv42: Don't fail clone() just because the server failed to return
   post-op attributes
 - SUNRPC: use different lockdep keys for INET6 and LOCAL
 - NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION
 - SUNRPC: fix header include guard in trace header
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGiTnMACgkQZwvnipYK
 APJoJQ//VZYSCx/mGaTIj5oUwjBKE/n/9rz2EUGS1cfYjZjPpb5Xgm1tn+1A4c01
 Ztu9/hKgwrDqknqkmtKvP1GsX5vYUqgfqAlc880Q2nXAqaLJBBZgB6BFMmTtcoQx
 C24L0tgxlZVD9Vw0DJEVDVgDxXA/9VmdSQK6uptQRQhcYf4VtR1wAzELHWdkdkfq
 1WrREeAwGWw1BaPTrlPn9XwW9qTaMlBH05XRHh6dM7gFmoIe3td7kq7BOnFxFsnA
 AQZ/nCgeMTE04kQQMzYqUc4YqZvnzUHxueZ6q8s0K1RKJBpNIQNgdWUMa285Qo4a
 JA9oBCPPo0JjmsEge2Km12zyBJoA7lLQDfc6UQJON50ADF0sTu3wszgGuC63KkhE
 V+kUogK2WOlnGky2yYrHmv43mcCcyoJ/g+g+38GXNYGorsFi/XUhvctEpFFFPF71
 0umQwWhA6Dhc52hMj5DN4nfspp//hEuV9o7/zJzlPi0elC+xtVaBWZCorNvznlXW
 C/O5yobJVd89PuIE17Stg+c0Rq3k7RVPPoyS2IaMkZ5cs7DtT5Tz3nKClPAA8Bur
 mPLAMkHSOSLO26cy30SVZCIx1JDMJU9dx/PkFemCexkzYXQxgp9px/8wmM2Xe6Oc
 /hDqi8V7ayJUuYpYuJ6sA8oUqj3j+NkoP4w5HhlnmZDBhFMEBhM=
 =jeHm
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:

   - NFSv42: Fix pagecache invalidation after COPY/CLONE

  Bugfixes:

   - NFSv42: Don't fail clone() just because the server failed to return
     post-op attributes

   - SUNRPC: use different lockdep keys for INET6 and LOCAL

   - NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION

   - SUNRPC: fix header include guard in trace header"

* tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: use different lock keys for INET6 and LOCAL
  sunrpc: fix header include guard in trace header
  NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION
  NFSv42: Fix pagecache invalidation after COPY/CLONE
  NFS: Add a tracepoint to show the results of nfs_set_cache_invalid()
  NFSv42: Don't fail clone() unless the OP_CLONE operation failed
2021-11-27 10:33:55 -08:00
Thiago Rafael Becker
268bb03856 sunrpc: fix header include guard in trace header
rpcgss.h include protection was protecting against the define for
rpcrdma.h.

Signed-off-by: Thiago Rafael Becker <trbecker@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-17 18:27:32 -05:00
Linus Torvalds
5664896ba2 f2fs-for-5.16-rc1
In this cycle, we've applied relatively small number of patches which fix subtle
 corner cases mainly, while introducing a new mount option to be able to fragment
 the disk intentionally for performance tests.
 
 Enhancement:
  - add a mount option to fragmente on-disk layout to understand the performance
  - support direct IO for multi-partitions
  - add a fault injection of dquot_initialize
 
 Bug fix:
  - address some lockdep complaints
  - fix a deadlock issue with quota
  - fix a memory tuning condition
  - fix compression condition to improve the ratio
  - fix disabling compression on the non-empty compressed file
  - invalidate cached pages before IPU/DIO writes
 
 And, we've added some minor clean-ups as usual.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmGMILwACgkQQBSofoJI
 UNJDRA/+KPyCXdY0OqL26BuGKj+z7hW6bz7tlh6h3wdnPdsR/W3ehbqQEr3GBb+q
 yokmD75/in7vZwGsDHGowFWMAfWOHYEqHz5UAq91sHjhfZzLDNUgLFWJedBX2XJb
 UoEAa7KzRt9M9K2p/5vSTs07RN3okUiRkFhVBBQJIaL7xi6MpadN/XAqpyoBqsiP
 pAV6J3GF6WNF19P/hkN1CJI8rV+PFrvY6C23lMkP7mnsWh03jMSgDDuhLHMQpAba
 EJYq7QbSatsLDRdR+jUQwIfMucvvzN7M6ja9+NTGlbeACvND8vXKYXOwngCq9+je
 2PIU4J8zNqnEkLsPn8STm4zwZHCA7VFdeCobCZcaVZCZFBzVqCkVYE9wqFVaQmr1
 bCrRFvEb+D1pkHYFujVXwCAfPlO6twiAInFNMa3WQ3FduJq2nhc8OLCJJ46D1KT2
 ZzzLv2EIIlncxPvgLIhiEE9DgPOyV56PQAO3OTsBZcvycU32aHo4hyexju1ubKiD
 CZFEHLnPbxX8Ulh3NX4uUxqPAEVhM/aw4l4e8xhmVRY3uj75geY7M6rt1vD+Y5Et
 EwbUE8XbLy+GhqbbO/SX9G38pftOiIquH1J0RuhuVNNmkIDkQvnSNp8WqHTdjEJE
 NiHZ5bkRkii34Wfrax9UccqGDswh/gjHAXEfGD8nFfcQZwLP1n8=
 =KGQ3
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this cycle, we've applied relatively small number of patches which
  fix subtle corner cases mainly, while introducing a new mount option
  to be able to fragment the disk intentionally for performance tests.

  Enhancements:

   - add a mount option to fragmente on-disk layout to understand the
     performance

   - support direct IO for multi-partitions

   - add a fault injection of dquot_initialize

  Bug fixes:

   - address some lockdep complaints

   - fix a deadlock issue with quota

   - fix a memory tuning condition

   - fix compression condition to improve the ratio

   - fix disabling compression on the non-empty compressed file

   - invalidate cached pages before IPU/DIO writes

  And, we've added some minor clean-ups as usual"

* tag 'f2fs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: fix UAF in f2fs_available_free_memory
  f2fs: invalidate META_MAPPING before IPU/DIO write
  f2fs: support fault injection for dquot_initialize()
  f2fs: fix incorrect return value in f2fs_sanity_check_ckpt()
  f2fs: compress: disallow disabling compress on non-empty compressed file
  f2fs: compress: fix overwrite may reduce compress ratio unproperly
  f2fs: multidevice: support direct IO
  f2fs: introduce fragment allocation mode mount option
  f2fs: replace snprintf in show functions with sysfs_emit
  f2fs: include non-compressed blocks in compr_written_block
  f2fs: fix wrong condition to trigger background checkpoint correctly
  f2fs: fix to use WHINT_MODE
  f2fs: fix up f2fs_lookup tracepoints
  f2fs: set SBI_NEED_FSCK flag when inconsistent node block found
  f2fs: introduce excess_dirty_threshold()
  f2fs: avoid attaching SB_ACTIVE flag during mount
  f2fs: quota: fix potential deadlock
  f2fs: should use GFP_NOFS for directory inodes
2021-11-13 11:20:22 -08:00
Linus Torvalds
0f7ddea622 netfs, 9p, afs and ceph (partial) foliation
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmGNO7sACgkQ+7dXa6fL
 C2sDTA//SLJBMoY719Z/RvZcZb3PUkuwYtqu8j/F6C1n231yg5TrlkchslV635Ph
 4lUuy/+pEVtsWot4JBxBsziBsi5WCjucn6opFAO2UOyT0aiE9ucY2MG+fNo6/b5k
 RRGqbEODKHScC7pm00AJlqd8gJUvHz6Zy08KetHvkSI4bqBz5VpKxDSNxcWvsbx1
 T6FMY+E61vSd0bEOp1/sZ1gRKK5nG9BJrba9V/MuOzj94MqVd9Ajdarr2vfQ7IyS
 Qe16rOBM8u6oCRJqRcwdz2Ma0Zf/0Bm6JpoP7LsEdbzXLKPiHPWM6kNd9WZFmMt1
 KFGGC3xG3Yxufasdpf5KCa6wlw1U5hWVITqRubHGxg49IdnrwxNK2zLqpJNr/lZC
 vsOg31PkAWmJiMCAxhwL+u++Qar27jlXcdiO6tyqIDYHWyzwGWqF5zlzZ46NR26W
 SX7oB36drIzH+UMDqxKGti2hRYTPKKwjCJ6p0EfhEYQ0oGa/bNFJA3bPxupJCkJe
 PD0pnXdEmhKwvY0fLH6Ghr/PAckQttstTFpaHZ40XFzglgd3Sm5DEcg2Xm38LQ+n
 4elMUA+c807ZTDLSDkGTL9QKPmgoM3AFoAsVxV9eqtEYAzYdnYM1GJ0CGOWM1lvs
 vDrB7rqn/CFB6ks8k1BOalq2DTmPVU1I1f1GwnkyOiVtlxyrslk=
 =K23c
 -----END PGP SIGNATURE-----

Merge tag 'netfs-folio-20211111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull netfs, 9p, afs and ceph (partial) foliation from David Howells:
 "This converts netfslib, 9p and afs to use folios. It also partially
  converts ceph so that it uses folios on the boundaries with netfslib.

  To help with this, a couple of folio helper functions are added in the
  first two patches.

  These patches don't touch fscache and cachefiles as I intend to remove
  all the code that deals with pages directly from there. Only nfs and
  cifs are using the old fscache I/O API now. The new API uses iov_iter
  instead.

  Thanks to Jeff Layton, Dominique Martinet and AuriStor for testing and
  retesting the patches"

* tag 'netfs-folio-20211111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Use folios in directory handling
  netfs, 9p, afs, ceph: Use folios
  folio: Add a function to get the host inode for a folio
  folio: Add a function to change the private data attached to a folio
2021-11-13 11:15:15 -08:00
Linus Torvalds
38764c7340 A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
support for a filehandle format deprecated 20 years ago, and further
 xdr-related cleanup from Chuck.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAmGMPYkVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+JVwQAKbrpgbzl91u+T6W9MUGgQVzDpeP
 XIy3NxCu/4pZ8SToWF3trz71sskokmkPPaZyuISD2C8e4DxO5LQ3fJLhtS9CjRFB
 x4iZUxH7V2BoWrb5SY6TDWBEqaq4MY9f7tIbvUu5xpa0FIupLqJjYh2CP8vqtsbm
 lblQKXz4ao0jwDzSVimNnPcTccpB25VIzwHsSOszRhN4rTjMgyHoETx2cqJne5IU
 Tx/hH0UlpnwuQ7aVpcjMoKqIyUWDTMejx51pyZhHB47DVKL7HsnZvg59mTpXFcBx
 29edvWT9yy1+w3nGkTYSkOgO9DyHvCbmQzIsvoYlmbZ2sdmTKK8Wuv2Ehcw3OfvL
 MXGmy2EXIhzvTZXyN6pL1bBwwNSxdqJhVSxvrPLz1EymIkxf/IDI8eyUicVXd3Vq
 K2xOn+CXyIbXWCU85ru8UA77r1+x//gSwqcJvtKUavbNJUwNt935CE2n3+o/0OL/
 pToZ89nhcaRyDP1jJKA37K48VLNtBXzZZQlRovyLelNojam/kzZkXX8dI6oV9VD1
 Ymjm0mbdZzwhE3C1HxKlxwZqhN+7YoyxMQuWjFMp28wxH+dkz/USCulKZ3/H+neD
 0YBSgvwe92JqkZTW2AOjipL+beAuKJ4zsfCCl2XZig/rHGutiwOf2GfgdRmJM6AD
 6aiufVWKNNRQef9y
 =yKBl
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "A slow cycle for nfsd: mainly cleanup, including Neil's patch dropping
  support for a filehandle format deprecated 20 years ago, and further
  xdr-related cleanup from Chuck"

* tag 'nfsd-5.16' of git://linux-nfs.org/~bfields/linux: (26 commits)
  nfsd4: remove obselete comment
  nfsd: document server-to-server-copy parameters
  NFSD:fix boolreturn.cocci warning
  nfsd: update create verifier comment
  SUNRPC: Change return value type of .pc_encode
  SUNRPC: Replace the "__be32 *p" parameter to .pc_encode
  NFSD: Save location of NFSv4 COMPOUND status
  SUNRPC: Change return value type of .pc_decode
  SUNRPC: Replace the "__be32 *p" parameter to .pc_decode
  SUNRPC: De-duplicate .pc_release() call sites
  SUNRPC: Simplify the SVC dispatch code path
  SUNRPC: Capture value of xdr_buf::page_base
  SUNRPC: Add trace event when alloc_pages_bulk() makes no progress
  svcrdma: Split svcrmda_wc_{read,write} tracepoints
  svcrdma: Split the svcrdma_wc_send() tracepoint
  svcrdma: Split the svcrdma_wc_receive() tracepoint
  NFSD: Have legacy NFSD WRITE decoders use xdr_stream_subsegment()
  SUNRPC: xdr_stream_subsegment() must handle non-zero page_bases
  NFSD: Initialize pointer ni with NULL and not plain integer 0
  NFSD: simplify struct nfsfh
  ...
2021-11-10 16:45:54 -08:00
Linus Torvalds
2ec20f4895 NFS client updates for Linux 5.16
Highlights include:
 
 Features:
 - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
 - Optimisations for READDIR and the 'ls -l' style workload
 - Further replacements of dprintk() with tracepoints and other tracing
   improvements
 - Ensure we re-probe NFSv4 server capabilities when the user does a
   "mount -o remount"
 
 Bugfixes:
 - Fix an Oops in pnfs_mark_request_commit()
 - Fix up deadlocks in the commit code
 - Fix regressions in NFSv2/v3 attribute revalidation due to the
   change_attr_type optimisations
 - Fix some dentry verifier races
 - Fix some missing dentry verifier settings
 - Fix a performance regression in nfs_set_open_stateid_locked()
 - SUNRPC was sending multiple SYN calls when re-establishing a TCP
   connection.
 - Fix multiple NFSv4 issues due to missing sanity checking of server
   return values
 - Fix a potential Oops when FREE_STATEID races with an unmount
 
 Cleanups:
 - Clean up the labelled NFS code
 - Remove unused header <linux/pnfs_osd_xdr.h>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmGL5c4ACgkQZwvnipYK
 APLFyQ//endoc1HYNpTNpcvlWiAgombBQumjBLrk73Qr+M2Vq9uK6+WmaqYTCHhU
 SfX6kbptiyGrd+f/pdIXCjIfPCnCRPRZYpRx8BxHwNr5vqOQIr9rvT/1Mvg2G9Oi
 IkdwVDmrN3ZjK/dbvyYSxhsLwuwrnaNm0oHkHxDO/EFghqEsesU1Aj1yywbFIZZA
 onRXVXh8r1T9pqL25HyHzZjD1kxvEiKuAMFis2NCKHexSmsvGF4Xs71J3AiCKuc2
 XXLged3ng7WRhNCvvrZmfA0AVkZ+iklpVJQzBeXzxuYB81pRZr99yXuv3FKE5aEl
 UIPv73b2uTq2SlXtZe2ggsVOdB0JDIRx+9jIH0iV3tOOjapfaTGdTwDx8JR1qHza
 wVxB24evk3rW6EFrZNPogaf3JiZmwlVCSUlSZZ3T5c+5l36yZV+WuoSTOe4ajttm
 y/uUkA1p2iFpYb9qNoO6kQ1ue3YO34TCqYPrUipzXWvTG1ZjJ5yGV5LZR0VvB4QT
 bYpInua7SC/t9RwJ1/HWBrk1G9/xufC4WI7xJf6dJzSDSEo8n6x24nxY0OwUIClb
 YzoVWv+bwTHgqkVlTO52XH3VX9E3XBgt5GLtxstQT3hXIndIEoitBqPms0buP/Af
 RveTtV1pNCqhmGrmZJGInH3veIELn3l/pTywqITuhIBNCG3Rj5g=
 =n8lj
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:
   - NFSv4.1 can always retrieve and cache the ACCESS mode on OPEN
   - Optimisations for READDIR and the 'ls -l' style workload
   - Further replacements of dprintk() with tracepoints and other
     tracing improvements
   - Ensure we re-probe NFSv4 server capabilities when the user does a
     "mount -o remount"

  Bugfixes:
   - Fix an Oops in pnfs_mark_request_commit()
   - Fix up deadlocks in the commit code
   - Fix regressions in NFSv2/v3 attribute revalidation due to the
     change_attr_type optimisations
   - Fix some dentry verifier races
   - Fix some missing dentry verifier settings
   - Fix a performance regression in nfs_set_open_stateid_locked()
   - SUNRPC was sending multiple SYN calls when re-establishing a TCP
     connection.
   - Fix multiple NFSv4 issues due to missing sanity checking of server
     return values
   - Fix a potential Oops when FREE_STATEID races with an unmount

  Cleanups:
   - Clean up the labelled NFS code
   - Remove unused header <linux/pnfs_osd_xdr.h>"

* tag 'nfs-for-5.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (84 commits)
  NFSv4: Sanity check the parameters in nfs41_update_target_slotid()
  NFS: Remove the nfs4_label argument from decode_getattr_*() functions
  NFS: Remove the nfs4_label argument from nfs_setsecurity
  NFS: Remove the nfs4_label argument from nfs_fhget()
  NFS: Remove the nfs4_label argument from nfs_add_or_obtain()
  NFS: Remove the nfs4_label argument from nfs_instantiate()
  NFS: Remove the nfs4_label from the nfs_setattrres
  NFS: Remove the nfs4_label from the nfs4_getattr_res
  NFS: Remove the f_label from the nfs4_opendata and nfs_openres
  NFS: Remove the nfs4_label from the nfs4_lookupp_res struct
  NFS: Remove the label from the nfs4_lookup_res struct
  NFS: Remove the nfs4_label from the nfs4_link_res struct
  NFS: Remove the nfs4_label from the nfs4_create_res struct
  NFS: Remove the nfs4_label from the nfs_entry struct
  NFS: Create a new nfs_alloc_fattr_with_label() function
  NFS: Always initialise fattr->label in nfs_fattr_alloc()
  NFSv4.2: alloc_file_pseudo() takes an open flag, not an f_mode
  NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
  NFSv4: Remove unnecessary 'minor version' check
  NFSv4: Fix potential Oops in decode_op_map()
  ...
2021-11-10 16:32:46 -08:00
David Howells
78525c74d9 netfs, 9p, afs, ceph: Use folios
Convert the netfs helper library to use folios throughout, convert the 9p
and afs filesystems to use folios in their file I/O paths and convert the
ceph filesystem to use just enough folios to compile.

With these changes, afs passes -g quick xfstests.

Changes
=======
ver #5:
 - Got rid of folio_end{io,_read,_write}() and inlined the stuff it does
   instead (Willy decided he didn't want this after all).

ver #4:
 - Fixed a bug in afs_redirty_page() whereby it didn't set the next page
   index in the loop and returned too early.
 - Simplified a check in v9fs_vfs_write_folio_locked()[1].
 - Undid a change to afs_symlink_readpage()[1].
 - Used offset_in_folio() in afs_write_end()[1].
 - Changed from using page_endio() to folio_end{io,_read,_write}()[1].

ver #2:
 - Add 9p foliation.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>
Tested-by: kafs-testing@auristor.com
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/YYKa3bfQZxK5/wDN@casper.infradead.org/ [1]
Link: https://lore.kernel.org/r/2408234.1628687271@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162877311459.3085614.10601478228012245108.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162981153551.1901565.3124454657133703341.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/163005745264.2472992.9852048135392188995.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163584187452.4023316.500389675405550116.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/163649328026.309189.1124218109373941936.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/163657852454.834781.9265101983152100556.stgit@warthog.procyon.org.uk/ # v5
2021-11-10 21:16:56 +00:00
Linus Torvalds
512b7931ad Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "257 patches.

  Subsystems affected by this patch series: scripts, ocfs2, vfs, and
  mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
  gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
  pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
  memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
  vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
  cleanups, kfence, and damon)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
  mm/damon: remove return value from before_terminate callback
  mm/damon: fix a few spelling mistakes in comments and a pr_debug message
  mm/damon: simplify stop mechanism
  Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
  Docs/admin-guide/mm/damon/start: simplify the content
  Docs/admin-guide/mm/damon/start: fix a wrong link
  Docs/admin-guide/mm/damon/start: fix wrong example commands
  mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
  mm/damon: remove unnecessary variable initialization
  Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
  mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
  selftests/damon: support watermarks
  mm/damon/dbgfs: support watermarks
  mm/damon/schemes: activate schemes based on a watermarks mechanism
  tools/selftests/damon: update for regions prioritization of schemes
  mm/damon/dbgfs: support prioritization weights
  mm/damon/vaddr,paddr: support pageout prioritization
  mm/damon/schemes: prioritize regions within the quotas
  mm/damon/selftests: support schemes quotas
  mm/damon/dbgfs: support quotas of schemes
  ...
2021-11-06 14:08:17 -07:00
Mel Gorman
69392a403f mm/vmscan: throttle reclaim when no progress is being made
Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense, it might be due to writeback or a host of other
factors.

For !memcg reclaim, it's messy.  Direct reclaim primarily is throttled
in the page allocator if it is failing to make progress.  Kswapd
throttles if too many pages are under writeback and marked for immediate
reclaim.

This patch explicitly throttles if reclaim is failing to make progress.

[vbabka@suse.cz: Remove redundant code]

Link: https://lkml.kernel.org/r/20211022144651.19914-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:40 -07:00
Mel Gorman
d818fca1ca mm/vmscan: throttle reclaim and compaction when too may pages are isolated
Page reclaim throttles on congestion if too many parallel reclaim
instances have isolated too many pages.  This makes no sense, excessive
parallelisation has nothing to do with writeback or congestion.

This patch creates an additional workqueue to sleep on when too many
pages are isolated.  The throttled tasks are woken when the number of
isolated pages is reduced or a timeout occurs.  There may be some false
positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will
throttle again if necessary.

[shy828301@gmail.com: Wake up from compaction context]
[vbabka@suse.cz: Account number of throttled tasks only for writeback]

Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:40 -07:00
Mel Gorman
8cd7c588de mm/vmscan: throttle reclaim until some writeback completes if congested
Patch series "Remove dependency on congestion_wait in mm/", v5.

This series that removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested.  It's not a clever implementation but
congestion_wait has been broken for a long time [1].

Even if congestion throttling worked, it was never a great idea.  While
excessive dirty/writeback pages at the tail of the LRU is one
possibility that reclaim may be slow, there is also the problem of too
many pages being isolated and reclaim failing for other reasons
(elevated references, too many pages isolated, excessive LRU contention
etc).

This series replaces the "congestion" throttling with 3 different types.

 - If there are too many dirty/writeback pages, sleep until a timeout or
   enough pages get cleaned

 - If too many pages are isolated, sleep until enough isolated pages are
   either reclaimed or put back on the LRU

 - If no progress is being made, direct reclaim tasks sleep until
   another task makes progress with acceptable efficiency.

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work.  A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem.  Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may timeout.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase.  It has
four types of worker.

 - One "anon latency" worker creates small mappings with mmap() and
   times how long it takes to fault the mapping reading it 4K at a time

 - X file writers which is fio randomly writing X files where the total
   size of the files add up to the allowed dirty_ratio. fio is allowed
   to run for a warmup period to allow some file-backed pages to
   accumulate. The duration of the warmup is based on the best-case
   linear write speed of the storage.

 - Y file readers which is fio randomly reading small files

 - Z anon memory hogs which continually map (100-dirty_ratio)% of memory

 - Total estimated WSS = (100+dirty_ration) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages.  The results below are based on having zero
readers.

The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results.  The other two machines had much better latencies for example.

First the results of the "anon latency" latency

  stutterp
                                5.15.0-rc1             5.15.0-rc1
                                   vanilla mm-reclaimcongest-v5r4
  Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
  Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
  Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
  Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
  Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
  Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
  Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
  Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
  Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
  Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
  Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
  Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
  Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
  Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
  Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
  Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
  Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
  Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
  Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
  Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
  Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
  Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
  Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
  Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)

For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU.  There is no prioritisation of reclaim
tasks making progress based on each tasks rate of page allocation versus
progress of reclaim.  The variance is also impacted for high worker
counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers.  Max-90
shows that 90% of the stalls are comparable but the Max results show the
massive outliers which are increased to to stalling.

It is expected that this will be very machine dependant.  Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not.  The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger.  The warmup period calculation is not ideal as it's based
on linear writes where as fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable.  For example,
these are the latencies on a single-socket machine that had more memory

  Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
  Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
  Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
  Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
  Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)

The overall system CPU usage and elapsed time is as follows

                    5.15.0-rc3  5.15.0-rc3
                       vanilla mm-reclaimcongest-v5r4
  Duration User        6989.03      983.42
  Duration System      7308.12      799.68
  Duration Elapsed     2277.67     2092.98

The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling.

The high-level /proc/vmstats show

                                       5.15.0-rc1     5.15.0-rc1
                                          vanilla mm-reclaimcongest-v5r2
  Ops Direct pages scanned          1056608451.00   503594991.00
  Ops Kswapd pages scanned           109795048.00   147289810.00
  Ops Kswapd pages reclaimed          63269243.00    31036005.00
  Ops Direct pages reclaimed          10803973.00     6328887.00
  Ops Kswapd efficiency %                   57.62          21.07
  Ops Kswapd velocity                    48204.98       57572.86
  Ops Direct efficiency %                    1.02           1.26
  Ops Direct velocity                   463898.83      196845.97

Kswapd scanned less pages but the detailed pattern is different.  The
vanilla kernel scans slowly over time where as the patches exhibits
burst patterns of scan activity.  Direct reclaim scanning is reduced by
52% due to stalling.

The pattern for stealing pages is also slightly different.  Both kernels
exhibit spikes but the vanilla kernel when reclaiming shows pages being
reclaimed over a period of time where as the patches tend to reclaim in
spikes.  The difference is that vanilla is not throttling and instead
scanning constantly finding some pages over time where as the patched
kernel throttles and reclaims in spikes.

  Ops Percentage direct scans               90.59          77.37

For direct reclaim, vanilla scanned 90.59% of pages where as with the
patches, 77.37% were direct reclaim due to throttling

  Ops Page writes by reclaim           2613590.00     1687131.00

Page writes from reclaim context are reduced.

  Ops Page writes anon                 2932752.00     1917048.00

And there is less swapping.

  Ops Page reclaim immediate         996248528.00   107664764.00

The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.

  Ops Slabs scanned                     164284.00      153608.00

Slab scan activity is similar.

ftrace was used to gather stall activity

  Vanilla
  -------
      1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
      2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
      8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
     29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
  82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The fast majority of wait_iff_congested calls do not stall at all.  What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).

      1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
      2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
      4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
    380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
    778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

      1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
      4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
     94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
    112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout before expiry.  For direct
reclaim, the number of times stalled for each reason were

   6624 reason=VMSCAN_THROTTLE_ISOLATED
  93246 reason=VMSCAN_THROTTLE_NOPROGRESS
  96934 reason=VMSCAN_THROTTLE_WRITEBACK

The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward.  A relatively small number were due to too many pages isolated
from the LRU by parallel threads

For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

      9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
     12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
     83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
   6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

Most did not stall at all.  A small number reached the timeout.

For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
the map

      1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
      6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
     11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
     13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
     13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
     16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
     18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
     21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
     23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
     23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
     25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
     25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
     26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
     27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
     28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
     29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
     30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
     30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
     31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
     32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
     33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
     35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
     35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
     36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
     36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
     37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
     38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
     40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
     43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
     55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
     56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
     58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
     59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
     61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
     71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
     71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
     79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
     82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
     82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
     85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
     85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
     88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
     90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
     90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
     94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
    118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
    119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
    126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
    146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
    148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
    148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
    159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
    178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
    183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
    237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
    266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
    313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
    347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
    470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
    559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
    964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
   2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
   2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
   7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
  22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
  51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

The full timeout is often hit but a large number also do not stall at
all.  The remainder slept a little allowing other reclaim tasks to make
progress.

While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

      1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
      2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
      3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
      7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
     12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
     16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
     24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
     28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
     30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
     30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
     32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
     42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
     77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
     99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
    137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
    190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
    339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
    518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
    852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
   3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
   7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
  83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all.  This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.

Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls.  There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage.  A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.

This patch (of 5):

Page reclaim throttles on wait_iff_congested under the following
conditions:

 - kswapd is encountering pages under writeback and marked for immediate
   reclaim implying that pages are cycling through the LRU faster than
   pages can be cleaned.

 - Direct reclaim will stall if all dirty pages are backed by congested
   inodes.

wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started.  If
enough pages belonging to the node are written back then the throttled
tasks will wake early.  If not, the throttled tasks sleeps until the
timeout expires.

[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]

Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:40 -07:00
Gang Li
627ae8284f mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN
By using DECLARE_EVENT_CLASS and TRACE_EVENT_FN, we can save a lot of
space from duplicate code.

Link: https://lkml.kernel.org/r/20211009071243.70286-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:36 -07:00
Gang Li
f595e3411d mm: mmap_lock: remove redundant newline in TP_printk
Ftrace core will add newline automatically on printing, so using it in
TP_printkcreates a blank line.

Link: https://lkml.kernel.org/r/20211009071105.69544-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:36 -07:00
Alexander Aring
92732376fd fs: dlm: trace socket handling
This patch adds tracepoints for dlm socket receive and send
functionality. We can use it to track how much data was send or received
to or from a specific nodeid.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-11-02 14:39:20 -05:00
Alexander Aring
f1d3b8f91d fs: dlm: initial support for tracepoints
This patch adds initial support for dlm tracepoints. It will introduce
tracepoints to dlm main functionality dlm_lock()/dlm_unlock() and their
complete ast() callback or blocking bast() callback.

The lock/unlock functionality has a start and end tracepoint, this is
because there exists a race in case if would have a tracepoint at the
end position only the complete/blocking callbacks could occur before. To
work with eBPF tracing and using their lookup hash functionality there
could be problems that an entry was not inserted yet. However use the
start functionality for hash insert and check again in end functionality
if there was an dlm internal error so there is no ast callback. In further
it might also that locks with local masters will occur those callbacks
immediately so we must have such functionality.

I did not make everything accessible yet, although it seems eBPF can be
used to access a lot of internal datastructures if it's aware of the
struct definitions of the running kernel instance. We still can change
it, if you do eBPF experiments e.g. time measurements between lock and
callback functionality you can simple use the local lkb_id field as hash
value in combination with the lockspace id if you have multiple
lockspaces. Otherwise you can simple use trace-cmd for some functionality,
e.g. `trace-cmd record -e dlm` and `trace-cmd report` afterwards.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2021-11-02 14:39:20 -05:00
Chuck Lever
8791545eda NFS: Move NFS protocol display macros to global header
Refactor: surface useful show_ macros so they can be shared between
the client and server trace code.

Additional clean up:
- Housekeeping: ensure the correct #include files are pulled in
  and add proper TRACE_DEFINE_ENUM where they are missing
- Use a consistent naming scheme for the helpers
- Store values to be displayed symbolically as unsigned long, as
  that is the type that the __print_yada() functions take

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-02 12:31:23 -04:00
Chuck Lever
9d2d48bbbd NFS: Move generic FS show macros to global header
Refactor: Surface useful show_ macros for use by other trace
subsystems.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-11-02 12:31:23 -04:00