120 Commits

Author SHA1 Message Date
Linus Torvalds
f008b1d6e1 Netfs prep for write helpers
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmI1HOwACgkQ+7dXa6fL
 C2u9mA/+LUdXHqlvET/PAtFTg75bUPeOFGLnuDnYl1Ng2FCKMSodAohpbVtENxsK
 E/gTVS7uiVZFQgC+YmNA00z6eIQkAaDVyvKyEcUbKREBbUgONfJ/HLeaK/NvVKxx
 TY5gx/POdG6yHRQXL6JGBqSJUB8bZrGKwnJm8ebzeKOji9n7GSJBYiMlYBA7EAhs
 Aut/P7Y39ISHLw3y+y5czBeRoubljmTyznbP20xUZEzrRwhTpNwpJVzBGUZU635T
 93Sqcp//0U5LIdn6Pg6DUGHBMBTNDNJChb21ZoBusF/HHswXsOOnf/mcRUBSJUTI
 M1WSpNLk8PRBgajMdIymQpGU1sCZZzJ3krrSA3RcXdN6GPHwZg8kKjoroHsLDL6l
 igPbDSMJ5wfiwA2A2gXbY1CkAl3ik5ccb7ZqhTwS0WBk0vOnHmAsE9cs/bBo7Xii
 GTiWXEFOgtJiXANPMS2P9DiOS3ZQNf+wxotCYdkGPOXuX9wnIo1Kmy8XfujQ1bXf
 pJsEZKfeyROKrzyKWgqLI64/Kg5xNueoFQZfDpOlZYzF1uDstynADPUt0eQD706q
 jcuKaXLN3rn5gSPun5mWOYbRtXVgOLdFL/7zptMVJwFKBFguQENhjG4UMNZcjkVA
 3Mr0kGocsgoCSk1oDBkFlrw1wIsXxWbkRBL1Pww6kovivuGUwoo=
 =j0yx
 -----END PGP SIGNATURE-----

Merge tag 'netfs-prep-20220318' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull netfs updates from David Howells:
 "Netfs prep for write helpers.

  Having had a go at implementing write helpers and content encryption
  support in netfslib, it seems that the netfs_read_{,sub}request
  structs and the equivalent write request structs were almost the same
  and so should be merged, thereby requiring only one set of
  alloc/get/put functions and a common set of tracepoints.

  Merging the structs also has the advantage that if a bounce buffer is
  added to the request struct, a read operation can be performed to fill
  the bounce buffer, the contents of the buffer can be modified and then
  a write operation can be performed on it to send the data wherever it
  needs to go using the same request structure all the way through. The
  I/O handlers would then transparently perform any required crypto.
  This should make it easier to perform RMW cycles if needed.

  The potentially common functions and structs, however, by their names
  all proclaim themselves to be associated with the read side of things.

  The bulk of these changes alter this in the following ways:

   - Rename struct netfs_read_{,sub}request to netfs_io_{,sub}request.

   - Rename some enums, members and flags to make them more appropriate.

   - Adjust some comments to match.

   - Drop "read"/"rreq" from the names of common functions. For
     instance, netfs_get_read_request() becomes netfs_get_request().

   - The ->init_rreq() and ->issue_op() methods become ->init_request()
     and ->issue_read(). I've kept the latter as a read-specific
     function and in another branch added an ->issue_write() method.

  The driver source is then reorganised into a number of files:

        fs/netfs/buffered_read.c        Create read reqs to the pagecache
        fs/netfs/io.c                   Dispatchers for read and write reqs
        fs/netfs/main.c                 Some general miscellaneous bits
        fs/netfs/objects.c              Alloc, get and put functions
        fs/netfs/stats.c                Optional procfs statistics.

  and future development can be fitted into this scheme, e.g.:

        fs/netfs/buffered_write.c       Modify the pagecache
        fs/netfs/buffered_flush.c       Writeback from the pagecache
        fs/netfs/direct_read.c          DIO read support
        fs/netfs/direct_write.c         DIO write support
        fs/netfs/unbuffered_write.c     Write modifications directly back

  Beyond the above changes, there are also some changes that affect how
  things work:

   - Make fscache_end_operation() generally available.

   - In the netfs tracing header, generate enums from the symbol ->
     string mapping tables rather than manually coding them.

   - Add a struct for filesystems that uses netfslib to put into their
     inode wrapper structs to hold extra state that netfslib is
     interested in, such as the fscache cookie. This allows netfslib
     functions to be set in filesystem operation tables and jumped to
     directly without having to have a filesystem wrapper.

   - Add a member to the struct added above to track the remote inode
     length as that may differ if local modifications are buffered. We
     may need to supply an appropriate EOF pointer when storing data (in
     AFS for example).

   - Pass extra information to netfs_alloc_request() so that the
     ->init_request() hook can access it and retain information to
     indicate the origin of the operation.

   - Make the ->init_request() hook return an error, thereby allowing a
     filesystem that isn't allowed to cache an inode (ceph or cifs, for
     example) to skip readahead.

   - Switch to using refcount_t for subrequests and add tracepoints to
     log refcount changes for the request and subrequest structs.

   - Add a function to consolidate dispatching a read request. Similar
     code is used in three places and another couple are likely to be
     added in the future"

Link: https://lore.kernel.org/all/2639515.1648483225@warthog.procyon.org.uk/

* tag 'netfs-prep-20220318' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Maintain netfs_i_context::remote_i_size
  netfs: Keep track of the actual remote file size
  netfs: Split some core bits out into their own file
  netfs: Split fs/netfs/read_helper.c
  netfs: Rename read_helper.c to io.c
  netfs: Prepare to split read_helper.c
  netfs: Add a function to consolidate beginning a read
  netfs: Add a netfs inode context
  ceph: Make ceph_init_request() check caps on readahead
  netfs: Change ->init_request() to return an error code
  netfs: Refactor arguments for netfs_alloc_read_request
  netfs: Adjust the netfs_failure tracepoint to indicate non-subreq lines
  netfs: Trace refcounting on the netfs_io_subrequest struct
  netfs: Trace refcounting on the netfs_io_request struct
  netfs: Adjust the netfs_rreq tracepoint slightly
  netfs: Split netfs_io_* object handling out
  netfs: Finish off rename of netfs_read_request to netfs_io_request
  netfs: Rename netfs_read_*request to netfs_io_*request
  netfs: Generate enums from trace symbol mapping lists
  fscache: export fscache_end_operation()
2022-03-31 15:49:36 -07:00
Linus Torvalds
6b1f86f8e9 Filesystem folio changes for 5.18
Primarily this series converts some of the address_space operations
 to take a folio instead of a page.
 
 ->is_partially_uptodate() takes a folio instead of a page and changes the
 type of the 'from' and 'count' arguments to make it obvious they're bytes.
 ->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
 ->launder_page() becomes ->launder_folio()
 ->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
 an argument.
 
 There are a couple of other misc changes up front that weren't worth
 separating into their own pull request.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
 gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
 BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
 SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
 EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
 omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
 Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
 =cOiq
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache

Pull filesystem folio updates from Matthew Wilcox:
 "Primarily this series converts some of the address_space operations to
  take a folio instead of a page.

  Notably:

   - a_ops->is_partially_uptodate() takes a folio instead of a page and
     changes the type of the 'from' and 'count' arguments to make it
     obvious they're bytes.

   - a_ops->invalidatepage() becomes ->invalidate_folio() and has a
     similar type change.

   - a_ops->launder_page() becomes ->launder_folio()

   - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
     address_space as an argument.

  There are a couple of other misc changes up front that weren't worth
  separating into their own pull request"

* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
  fs: Remove aops ->set_page_dirty
  fb_defio: Use noop_dirty_folio()
  fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
  fs: Convert __set_page_dirty_buffers to block_dirty_folio
  nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
  mm: Convert swap_set_page_dirty() to swap_dirty_folio()
  ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
  f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
  f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
  f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
  afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
  btrfs: Convert extent_range_redirty_for_io() to use folios
  fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
  btrfs: Convert from set_page_dirty to dirty_folio
  fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
  fs: Add aops->dirty_folio
  fs: Remove aops->launder_page
  orangefs: Convert launder_page to launder_folio
  nfs: Convert from launder_page to launder_folio
  fuse: Convert from launder_page to launder_folio
  ...
2022-03-22 18:26:56 -07:00
David Howells
ab487a4cdf afs: Maintain netfs_i_context::remote_i_size
Make afs use netfslib's tracking for the server's idea of what the current
inode size is independently of inode->i_size.  We really want to use this
value when calculating the new vnode size when initiating a StoreData RPC
op rather than the size stat() presents to the user (ie. inode->i_size) as
the latter is affected by as-yet uncommitted writes.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org

Link: https://lore.kernel.org/r/164623014626.3564931.8375344024648265358.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678220204.1200972.17408022517463940584.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692923592.2099075.5466132542956550401.stgit@warthog.procyon.org.uk/ # v3
2022-03-18 09:29:05 +00:00
David Howells
bc899ee1c8 netfs: Add a netfs inode context
Add a netfs_i_context struct that should be included in the network
filesystem's own inode struct wrapper, directly after the VFS's inode
struct, e.g.:

	struct my_inode {
		struct {
			/* These must be contiguous */
			struct inode		vfs_inode;
			struct netfs_i_context	netfs_ctx;
		};
	};

The netfs_i_context struct so far contains a single field for the network
filesystem to use - the cache cookie:

	struct netfs_i_context {
		...
		struct fscache_cookie	*cache;
	};

Three functions are provided to help with this:

 (1) void netfs_i_context_init(struct inode *inode,
			       const struct netfs_request_ops *ops);

     Initialise the netfs context and set the operations.

 (2) struct netfs_i_context *netfs_i_context(struct inode *inode);

     Find the netfs context from the VFS inode.

 (3) struct inode *netfs_inode(struct netfs_i_context *ctx);

     Find the VFS inode from the netfs context.

Changes
=======
ver #4)
 - Fix netfs_is_cache_enabled() to check cookie->cache_priv to see if a
   cache is present[3].
 - Fix netfs_skip_folio_read() to zero out all of the page, not just some
   of it[3].

ver #3)
 - Split out the bit to move ceph cap-getting on readahead into
   ceph_init_request()[1].
 - Stick in a comment to the netfs inode structs indicating the contiguity
   requirements[2].

ver #2)
 - Adjust documentation to match.
 - Use "#if IS_ENABLED()" in netfs_i_cookie(), not "#ifdef".
 - Move the cap check from ceph_readahead() to ceph_init_request() to be
   called from netfslib.
 - Remove ceph_readahead() and use  netfs_readahead() directly instead.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com

Link: https://lore.kernel.org/r/8af0d47f17d89c06bbf602496dd845f2b0bf25b3.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/beaf4f6a6c2575ed489adb14b257253c868f9a5c.camel@kernel.org/ [2]
Link: https://lore.kernel.org/r/3536452.1647421585@warthog.procyon.org.uk/ [3]
Link: https://lore.kernel.org/r/164622984545.3564931.15691742939278418580.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678213320.1200972.16807551936267647470.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692909854.2099075.9535537286264248057.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/306388.1647595110@warthog.procyon.org.uk/ # v4
2022-03-18 09:29:05 +00:00
Matthew Wilcox (Oracle)
8fb72b4a76 fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
Convert all users of fscache_set_page_dirty to use fscache_dirty_folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:34:36 -04:00
Matthew Wilcox (Oracle)
a42442dd73 afs: Convert from launder_page to launder_folio
Straightforward conversion.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:23:30 -04:00
David Howells
173ce1ca47 afs: Fix potential thrashing in afs writeback
In afs_writepages_region(), if the dirty page we find is undergoing
writeback or write to cache, but the sync_mode is WB_SYNC_NONE, we go
round the loop trying the same page again and again with no pausing or
waiting unless and until another thread manages to clear the writeback
and fscache flags.

Fix this with three measures:

 (1) Advance start to after the page we found.

 (2) Break out of the loop and return if rescheduling is requested.

 (3) Arbitrarily give up after a maximum of 5 skips.

Fixes: 31143d5d515e ("AFS: implement basic file write support")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Acked-by: Marc Dionne <marc.dionne@auristor.com>
Link: https://lore.kernel.org/r/164692725757.2097000.2060513769492301854.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-11 10:24:37 -08:00
David Howells
c7f75ef33b afs: Copy local writes to the cache when writing to the server
When writing to the server from afs_writepage() or afs_writepages(), copy
the data to the cache object too.

To make this possible, the cookie must have its active users count
incremented when the page is dirtied and kept incremented until we manage
to clean up all the pages.  This allows the writeback to take place after
the last file struct is released.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kafs-testing@auristor.com
Acked-by: Jeff Layton <jlayton@kernel.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819662333.215744.7531373404219224438.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970998.143852.674420788614608063.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967176564.1823006.16666056085593949570.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021570208.640689.9193494979708031862.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:44:52 +00:00
David Howells
523d27cda1 afs: Convert afs to use the new fscache API
Change the afs filesystem to support the new afs driver.

The following changes have been made:

 (1) The fscache_netfs struct is no more, and there's no need to register
     the filesystem as a whole.  There's also no longer a cell cookie.

 (2) The volume cookie is now an fscache_volume cookie, allocated with
     fscache_acquire_volume().  This function takes three parameters: a
     string representing the "volume" in the index, a string naming the
     cache to use (or NULL) and a u64 that conveys coherency metadata for
     the volume.

     For afs, I've made it render the volume name string as:

        "afs,<cell>,<volume_id>"

     and the coherency data is currently 0.

 (3) The fscache_cookie_def is no more and needed information is passed
     directly to fscache_acquire_cookie().  The cache no longer calls back
     into the filesystem, but rather metadata changes are indicated at
     other times.

     fscache_acquire_cookie() is passed the same keying and coherency
     information as before, except that these are now stored in big endian
     form instead of cpu endian.  This makes the cache more copyable.

 (4) fscache_use_cookie() and fscache_unuse_cookie() are called when a file
     is opened or closed to prevent a cache file from being culled and to
     keep resources to hand that are needed to do I/O.

     fscache_use_cookie() is given an indication if the cache is likely to
     be modified locally (e.g. the file is open for writing).

     fscache_unuse_cookie() is given a coherency update if we had the file
     open for writing and will update that.

 (5) fscache_invalidate() is now given uptodate auxiliary data and a file
     size.  It can also take a flag to indicate if this was due to a DIO
     write.  This is wrapped into afs_fscache_invalidate() now for
     convenience.

 (6) fscache_resize() now gets called from the finalisation of
     afs_setattr(), and afs_setattr() does use/unuse of the cookie around
     the call to support this.

 (7) fscache_note_page_release() is called from afs_release_page().

 (8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
     PG_fscache to be cleared.

Render the parts of the cookie key for an afs inode cookie as big endian.

Changes
=======
ver #2:
 - Use gfpflags_allow_blocking() rather than using flag directly.
 - fscache_acquire_volume() now returns errors.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: kafs-testing@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819661382.215744.1485608824741611837.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970002.143852.17678518584089878259.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967174665.1823006.1301789965454084220.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021568841.640689.6684240152253400380.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 13:44:47 +00:00
David Howells
78525c74d9 netfs, 9p, afs, ceph: Use folios
Convert the netfs helper library to use folios throughout, convert the 9p
and afs filesystems to use folios in their file I/O paths and convert the
ceph filesystem to use just enough folios to compile.

With these changes, afs passes -g quick xfstests.

Changes
=======
ver #5:
 - Got rid of folio_end{io,_read,_write}() and inlined the stuff it does
   instead (Willy decided he didn't want this after all).

ver #4:
 - Fixed a bug in afs_redirty_page() whereby it didn't set the next page
   index in the loop and returned too early.
 - Simplified a check in v9fs_vfs_write_folio_locked()[1].
 - Undid a change to afs_symlink_readpage()[1].
 - Used offset_in_folio() in afs_write_end()[1].
 - Changed from using page_endio() to folio_end{io,_read,_write}()[1].

ver #2:
 - Add 9p foliation.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>
Tested-by: kafs-testing@auristor.com
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/YYKa3bfQZxK5/wDN@casper.infradead.org/ [1]
Link: https://lore.kernel.org/r/2408234.1628687271@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/162877311459.3085614.10601478228012245108.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162981153551.1901565.3124454657133703341.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/163005745264.2472992.9852048135392188995.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163584187452.4023316.500389675405550116.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/163649328026.309189.1124218109373941936.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/163657852454.834781.9265101983152100556.stgit@warthog.procyon.org.uk/ # v5
2021-11-10 21:16:56 +00:00
Linus Torvalds
49f8275c7d Memory folios
Add memory folios, a new type to represent either order-0 pages or
 the head page of a compound page.  This should be enough infrastructure
 to support filesystems converting from pages to folios.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmF9uI0ACgkQDpNsjXcp
 gj7MUAf/R7LCZ+xFiIedw7SAgb/DGK0C9uVjuBEIZgAw21ZUw/GuPI6cuKBMFGGf
 rRcdtlvMpwi7yZJcoNXxaqU/xPaaJMjf2XxscIvYJP1mjlZVuwmP9dOx0neNvWOc
 T+8lqR6c1TLl82lpqIjGFLwvj2eVowq2d3J5jsaIJFd4odmmYVInrhJXOzC/LQ54
 Niloj5ksehf+KUIRLDz7ycppvIHhlVsoAl0eM2dWBAtL0mvT7Nyn/3y+vnMfV2v3
 Flb4opwJUgTJleYc16oxTn9svT2yS8q2uuUemRDLW8ABghoAtH3fUUk43RN+5Krd
 LYCtbeawtkikPVXZMfWybsx5vn0c3Q==
 =7SBe
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache

Pull memory folios from Matthew Wilcox:
 "Add memory folios, a new type to represent either order-0 pages or the
  head page of a compound page. This should be enough infrastructure to
  support filesystems converting from pages to folios.

  The point of all this churn is to allow filesystems and the page cache
  to manage memory in larger chunks than PAGE_SIZE. The original plan
  was to use compound pages like THP does, but I ran into problems with
  some functions expecting only a head page while others expect the
  precise page containing a particular byte.

  The folio type allows a function to declare that it's expecting only a
  head page. Almost incidentally, this allows us to remove various calls
  to VM_BUG_ON(PageTail(page)) and compound_head().

  This converts just parts of the core MM and the page cache. For 5.17,
  we intend to convert various filesystems (XFS and AFS are ready; other
  filesystems may make it) and also convert more of the MM and page
  cache to folios. For 5.18, multi-page folios should be ready.

  The multi-page folios offer some improvement to some workloads. The
  80% win is real, but appears to be an artificial benchmark (postgres
  startup, which isn't a serious workload). Real workloads (eg building
  the kernel, running postgres in a steady state, etc) seem to benefit
  between 0-10%. I haven't heard of any performance losses as a result
  of this series. Nobody has done any serious performance tuning; I
  imagine that tweaking the readahead algorithm could provide some more
  interesting wins. There are also other places where we could choose to
  create large folios and currently do not, such as writes that are
  larger than PAGE_SIZE.

  I'd like to thank all my reviewers who've offered review/ack tags:
  Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes
  Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil
  Babka, William Kucharski, Yu Zhao and Zi Yan.

  I'd also like to thank those who gave feedback I incorporated but
  haven't offered up review tags for this part of the series: Nick
  Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard,
  Hugh Dickins, and probably a few others who I forget"

* tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits)
  mm/writeback: Add folio_write_one
  mm/filemap: Add FGP_STABLE
  mm/filemap: Add filemap_get_folio
  mm/filemap: Convert mapping_get_entry to return a folio
  mm/filemap: Add filemap_add_folio()
  mm/filemap: Add filemap_alloc_folio
  mm/page_alloc: Add folio allocation functions
  mm/lru: Add folio_add_lru()
  mm/lru: Convert __pagevec_lru_add_fn to take a folio
  mm: Add folio_evictable()
  mm/workingset: Convert workingset_refault() to take a folio
  mm/filemap: Add readahead_folio()
  mm/filemap: Add folio_mkwrite_check_truncate()
  mm/filemap: Add i_blocks_per_folio()
  mm/writeback: Add folio_redirty_for_writepage()
  mm/writeback: Add folio_account_redirty()
  mm/writeback: Add folio_clear_dirty_for_io()
  mm/writeback: Add folio_cancel_dirty()
  mm/writeback: Add folio_account_cleaned()
  mm/writeback: Add filemap_dirty_folio()
  ...
2021-11-01 08:47:59 -07:00
David Howells
5c0522484e afs: Fix afs_launder_page() to set correct start file position
Fix afs_launder_page() to set the starting position of the StoreData RPC at
the offset into the page at which the modified data starts instead of at
the beginning of the page (the iov_iter is correctly offset).

The offset got lost during the conversion to passing an iov_iter into
afs_store_data().

Changes:
ver #2:
 - Use page_offset() rather than manually calculating it[1].

Fixes: bd80d8a80e12 ("afs: Use ITER_XARRAY for writing")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YST/0e92OdSH0zjg@casper.infradead.org/ [1]
Link: https://lore.kernel.org/r/162880783179.3421678.7795105718190440134.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162937512409.1449272.18441473411207824084.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162981148752.1901565.3663780601682206026.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163005741670.2472992.2073548908229887941.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163221839087.3143591.14278359695763025231.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163292980654.4004896.7134735179887998551.stgit@warthog.procyon.org.uk/ # v2
2021-10-05 11:22:06 +01:00
Matthew Wilcox (Oracle)
490e016f22 mm/writeback: Add folio_wait_writeback()
wait_on_page_writeback_killable() only has one caller, so convert it to
call folio_wait_writeback_killable().  For the wait_on_page_writeback()
callers, add a compatibility wrapper around folio_wait_writeback().

Turning PageWriteback() into folio_test_writeback() eliminates a call
to compound_head() which saves 8 bytes and 15 bytes in the two
functions.  Unfortunately, that is more than offset by adding the
wait_on_page_writeback compatibility wrapper for a net increase in text
of 7 bytes.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Howells <dhowells@redhat.com>
2021-09-27 09:27:30 -04:00
David Howells
9d37e1cab2 afs: Fix updating of i_blocks on file/dir extension
When an afs file or directory is modified locally such that the total file
size is extended, i_blocks needs to be recalculated too.

Fix this by making afs_write_end() and afs_edit_dir_add() call
afs_set_i_size() rather than setting inode->i_size directly as that also
recalculates inode->i_blocks.

This can be tested by creating and writing into directories and files and
then examining them with du.  Without this change, directories show a 4
blocks (they start out at 2048 bytes) and files show 0 blocks; with this
change, they should show a number of blocks proportional to the file size
rounded up to 1024.

Fixes: 31143d5d515e ("AFS: implement basic file write support")
Fixes: 63a4681ff39c ("afs: Locally edit directory data for mkdir/create/unlink/...")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163113612442.352844.11162345591911691150.stgit@warthog.procyon.org.uk/
2021-09-13 09:14:21 +01:00
David Howells
3978d81652 afs: Add missing vnode validation checks
afs_d_revalidate() should only be validating the directory entry it is
given and the directory to which that belongs; it shouldn't be validating
the inode/vnode to which that dentry points.  Besides, validation need to
be done even if we don't call afs_d_revalidate() - which might be the case
if we're starting from a file descriptor.

In order for afs_d_revalidate() to be fixed, validation points must be
added in some other places.  Certain directory operations, such as
afs_unlink(), already check this, but not all and not all file operations
either.

Note that the validation of a vnode not only checks to see if the
attributes we have are correct, but also gets a promise from the server to
notify us if that file gets changed by a third party.

Add the following checks:

 - Check the vnode we're going to make a hard link to.
 - Check the vnode we're going to move/rename.
 - Check the vnode we're going to read from.
 - Check the vnode we're going to write to.
 - Check the vnode we're going to sync.
 - Check the vnode we're going to make a mapped page writable for.

Some of these aren't strictly necessary as we're going to perform a server
operation that might get the attributes anyway from which we can determine
if something changed - though it might not get us a callback promise.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111667354.283156.12720698333342917516.stgit@warthog.procyon.org.uk/
2021-09-13 09:10:39 +01:00
David Howells
581b2027af afs: Fix page leak
There's a loop in afs_extend_writeback() that adds extra pages to a write
we want to make to improve the efficiency of the writeback by making it
larger.  This loop stops, however, if we hit a page we can't write back
from immediately, but it doesn't get rid of the page ref we speculatively
acquired.

This was caused by the removal of the cleanup loop when the code switched
from using find_get_pages_contig() to xarray scanning as the latter only
gets a single page at a time, not a batch.

Fix this by putting the page on a ref on an early break from the loop.
Unfortunately, we can't just add that page to the pagevec we're employing
as we'll go through that and add those pages to the RPC call.

This was found by the generic/074 test.  It leaks ~4GiB of RAM each time it
is run - which can be observed with "top".

Fixes: e87b03f5830e ("afs: Prepare for use of THPs")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111666635.283156.177701903478910460.stgit@warthog.procyon.org.uk/
2021-09-10 22:14:51 +01:00
David Howells
5a972474cf afs: Fix setting of writeback_index
Fix afs_writepages() to always set mapping->writeback_index to a page index
and not a byte position[1].

Fixes: 31143d5d515e ("AFS: implement basic file write support")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/CAB9dFdvHsLsw7CMnB+4cgciWDSqVjuij4mH3TaXnHQB8sz5rHw@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/162610728339.3408253.4604750166391496546.stgit@warthog.procyon.org.uk/ # v2 (no v1)
2021-07-21 15:10:23 +01:00
Tom Rix
afe6949862 afs: check function return
Static analysis reports this problem

write.c:773:29: warning: Assigned value is garbage or undefined
  mapping->writeback_index = next;
                           ^ ~~~~
The call to afs_writepages_region() can return without setting
next.  So check the function return before using next.

Changes:
 ver #2:
   - Need to fix the range_cyclic case also[1].

Fixes: e87b03f5830e ("afs: Prepare for use of THPs")
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210430155031.3287870-1-trix@redhat.com
Link: https://lore.kernel.org/r/CAB9dFdvHsLsw7CMnB+4cgciWDSqVjuij4mH3TaXnHQB8sz5rHw@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/162609464716.3133237.10354897554363093252.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162610727640.3408253.8687445613469681311.stgit@warthog.procyon.org.uk/ # v2
2021-07-21 15:10:23 +01:00
Linus Torvalds
9e736cf7d6 netfslib fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmDQ9Z4ACgkQ+7dXa6fL
 C2sVfg/9H3OlL4Vwv5CERgb5kbDoALAh5epYMFyvNVy9l5iTvYekxG3qna6ee0+G
 pTRHSU5tZUCUslpafv8LUz1RE/iM127y+BP+qq2joG9jkT4q3QfeV4oDr+dgZWCL
 obp+rjQSTKkaGh3eqjDx7gCSp5mqQI/M8MXe1VOXxUzAnsf2nH4iJLv/A9NN17xW
 l7sQfZJGHD1BPqsxxFiSr+UCkCLuLDybUDYL6+PZcFu0jSON0h4yEtIwsAA7a1Zw
 TvgUpequrbyTORI3sHJ8eIixWosh4yLTJ4pDqs8qDqq3Wm5eTZjss0wxEaVfpaTu
 cg/CcNUBiHJ/6Q8r+JiuPbEnfnQ9woL8951/CNi+cOGhTy9LNoG70orSZKKynmW5
 QmpYBK5BkM57EPj7DYZJRI1Bwy41pJapFj8tXbMjObU+ZyaXMysmBDanIRK/cLoy
 fr4Sz+1D8yJPQ0GDgC4051CxrhynOEnRo8JbESg8CD4PnqFeM7EoCh48H2oVvR4N
 9v2xuaRyBvi2KTmKSktRe+s1DS80acVMUYP33nT2zAthvL91VVgMY3Hz2/QrlAp8
 h1hREME8aRcN4LrBIgp/ET150hUh44U2A07PqnYYe0653MH7aHHFfk134ZuTmbub
 Pfc1MHtWgPAWN4c140ILBRTidJeShszoJGgpD6tflkj10VI2s34=
 =2c6l
 -----END PGP SIGNATURE-----

Merge tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull netfs fixes from David Howells:
 "This contains patches to fix netfs_write_begin() and afs_write_end()
  in the following ways:

  (1) In netfs_write_begin(), extract the decision about whether to skip
      a page out to its own helper and have that clear around the region
      to be written, but not clear that region. This requires the
      filesystem to patch it up afterwards if the hole doesn't get
      completely filled.

  (2) Use offset_in_thp() in (1) rather than manually calculating the
      offset into the page.

  (3) Due to (1), afs_write_end() now needs to handle short data write
      into the page by generic_perform_write(). I've adopted an
      analogous approach to ceph of just returning 0 in this case and
      letting the caller go round again.

  It also adds a note that (in the future) the len parameter may extend
  beyond the page allocated. This is because the page allocation is
  deferred to write_begin() and that gets to decide what size of THP to
  allocate."

Jeff Layton points out:
 "The netfs fix in particular fixes a data corruption bug in cephfs"

* tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  netfs: fix test for whether we can skip read when writing beyond EOF
  afs: Fix afs_write_end() to handle short writes
2021-06-25 09:41:29 -07:00
David Howells
66e9c6a86b afs: Fix afs_write_end() to handle short writes
Fix afs_write_end() to correctly handle a short copy into the intended
write region of the page.  Two things are necessary:

 (1) If the page is not up to date, then we should just return 0
     (ie. indicating a zero-length copy).  The loop in
     generic_perform_write() will go around again, possibly breaking up the
     iterator into discrete chunks[1].

     This is analogous to commit b9de313cf05fe08fa59efaf19756ec5283af672a
     for ceph.

 (2) The page should not have been set uptodate if it wasn't completely set
     up by netfs_write_begin() (this will be fixed in the next patch), so
     we need to set uptodate here in such a case.

Also remove the assertion that was checking that the page was set uptodate
since it's now set uptodate if it wasn't already a few lines above.  The
assertion was from when uptodate was set elsewhere.

Changes:
v3: Remove the handling of len exceeding the end of the page.

Fixes: 3003bbd0697b ("afs: Use the netfs_write_begin() helper")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YMwVp268KTzTf8cN@zeniv-ca.linux.org.uk/ [1]
Link: https://lore.kernel.org/r/162367682522.460125.5652091227576721609.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162391825688.1173366.3437507255136307904.stgit@warthog.procyon.org.uk/ # v2
2021-06-21 21:23:36 +01:00
Matthew Wilcox (Oracle)
9620ad86d0 afs: Re-enable freezing once a page fault is interrupted
If a task is killed during a page fault, it does not currently call
sb_end_pagefault(), which means that the filesystem cannot be frozen
at any time thereafter.  This may be reported by lockdep like this:

====================================
WARNING: fsstress/10757 still has locks held!
5.13.0-rc4-build4+ #91 Not tainted
------------------------------------
1 lock held by fsstress/10757:
 #0: ffff888104eac530
 (
sb_pagefaults

as filesystem freezing is modelled as a lock.

Fix this by removing all the direct returns from within the function,
and using 'ret' to indicate whether we were interrupted or successful.

Fixes: 1cf7a1518aef ("afs: Implement shared-writeable mmap")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210616154900.1958373-1-willy@infradead.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-18 13:49:07 -07:00
Marc Dionne
dc2557308e afs: Fix partial writeback of large files on fsync and close
In commit e87b03f5830e ("afs: Prepare for use of THPs"), the return
value for afs_write_back_from_locked_page was changed from a number
of pages to a length in bytes.  The loop in afs_writepages_region uses
the return value to compute the index that will be used to find dirty
pages in the next iteration, but treats it as a number of pages and
wrongly multiplies it by PAGE_SIZE.  This gives a very large index value,
potentially skipping any dirty data that was not covered in the first
pass, which is limited to 256M.

This causes fsync(), and indirectly close(), to only do a partial
writeback of a large file's dirty data.  The rest is eventually written
back by background threads after dirty_expire_centisecs.

Fixes: e87b03f5830e ("afs: Prepare for use of THPs")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210604175504.4055-1-marc.c.dionne@gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-07 12:56:05 -07:00
David Howells
22650f1481 afs: Fix speculative status fetches
The generic/464 xfstest causes kAFS to emit occasional warnings of the
form:

        kAFS: vnode modified {100055:8a} 30->31 YFS.StoreData64 (c=6015)

This indicates that the data version received back from the server did not
match the expected value (the DV should be incremented monotonically for
each individual modification op committed to a vnode).

What is happening is that a lookup call is doing a bulk status fetch
speculatively on a bunch of vnodes in a directory besides getting the
status of the vnode it's actually interested in.  This is racing with a
StoreData operation (though it could also occur with, say, a MakeDir op).

On the client, a modification operation locks the vnode, but the bulk
status fetch only locks the parent directory, so no ordering is imposed
there (thereby avoiding an avenue to deadlock).

On the server, the StoreData op handler doesn't lock the vnode until it's
received all the request data, and downgrades the lock after committing the
data until it has finished sending change notifications to other clients -
which allows the status fetch to occur before it has finished.

This means that:

 - a status fetch can access the target vnode either side of the exclusive
   section of the modification

 - the status fetch could start before the modification, yet finish after,
   and vice-versa.

 - the status fetch and the modification RPCs can complete in either order.

 - the status fetch can return either the before or the after DV from the
   modification.

 - the status fetch might regress the locally cached DV.

Some of these are handled by the previous fix[1], but that's not sufficient
because it checks the DV it received against the DV it cached at the start
of the op, but the DV might've been updated in the meantime by a locally
generated modification op.

Fix this by the following means:

 (1) Keep track of when we're performing a modification operation on a
     vnode.  This is done by marking vnode parameters with a 'modification'
     note that causes the AFS_VNODE_MODIFYING flag to be set on the vnode
     for the duration.

 (2) Alter the speculation race detection to ignore speculative status
     fetches if either the vnode is marked as being modified or the data
     version number is not what we expected.

Note that whilst the "vnode modified" warning does get recovered from as it
causes the client to refetch the status at the next opportunity, it will
also invalidate the pagecache, so changes might get lost.

Fixes: a9e5c87ca744 ("afs: Fix speculative status fetch going out of order wrt to modifications")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-and-reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/160605082531.252452.14708077925602709042.stgit@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/linux-fsdevel/161961335926.39335.2552653972195467566.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-01 11:55:36 -07:00
David Howells
3003bbd069 afs: Use the netfs_write_begin() helper
Make AFS use the new netfs_write_begin() helper to do the pre-reading
required before the write.  If successful, the helper returns with the
required page filled in and locked.  It may read more than just one page,
expanding the read to meet cache granularity requirements as necessary.

Note: A more advanced version of this could be made that does
generic_perform_write() for a whole cache granule.  This would make it
easier to avoid doing the download/read for the data to be overwritten.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/160588546422.3465195.1546354372589291098.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161539563244.286939.16537296241609909980.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653819291.2770958.406013201547420544.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789102743.6155.17396591236631761195.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:28 +01:00
David Howells
5cbf03985c afs: Use new netfs lib read helper API
Make AFS use the new netfs read helpers to implement the VM read
operations:

 - afs_readpage() now hands off responsibility to netfs_readpage().

 - afs_readpages() is gone and replaced with afs_readahead().

 - afs_readahead() just hands off responsibility to netfs_readahead().

These make use of the cache if a cookie is supplied, otherwise just call
the ->issue_op() method a sufficient number of times to complete the entire
request.

Changes:
v5:
- Use proper wait function for PG_fscache in afs_page_mkwrite()[1].
- Use killable wait for PG_writeback in afs_page_mkwrite()[1].

v4:
- Folded in error handling fixes to afs_req_issue_op().
- Added flag to netfs_subreq_terminated() to indicate that the caller may
  have been running async and stuff that might sleep needs punting to a
  workqueue.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/2499407.1616505440@warthog.procyon.org.uk [1]
Link: https://lore.kernel.org/r/160588542733.3465195.7526541422073350302.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118158436.1232039.3884845981224091996.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161053540.2537118.14904446369309535330.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340418739.1303470.5908092911600241280.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539561926.286939.5729036262354802339.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653817977.2770958.17696456811587237197.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789101258.6155.3879271028895121537.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:28 +01:00
David Howells
e87b03f583 afs: Prepare for use of THPs
As a prelude to supporting transparent huge pages, use thp_size() and
similar rather than PAGE_SIZE/SHIFT.

Further, try and frame everything in terms of file positions and lengths
rather than page indices and numbers of pages.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/160588540227.3465195.4752143929716269062.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118155821.1232039.540445038028845740.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161051439.2537118.15577827510426326534.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340415869.1303470.6040191748634322355.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539559365.286939.18344613540296085269.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653815142.2770958.454490670311230206.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789098713.6155.16394227991842480300.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:27 +01:00
David Howells
810caa3e67 afs: Extract writeback extension into its own function
Extract writeback extension into its own function to break up the writeback
function a bit.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/160588538471.3465195.782513375683399583.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118154610.1232039.1765365632920504822.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161050546.2537118.2202554806419189453.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340414102.1303470.9078891484034668985.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539558417.286939.2879469588895925399.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653813972.2770958.12671731209438112378.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789097132.6155.4916609419912731964.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:27 +01:00
David Howells
630f5dda84 afs: Wait on PG_fscache before modifying/releasing a page
PG_fscache is going to be used to indicate that a page is being written to
the cache, and that the page should not be modified or released until it's
finished.

Make afs_invalidatepage() and afs_releasepage() wait for it.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/158861253957.340223.7465334678444521655.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465832417.1377938.3571599385208729791.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588536286.3465195.13231895135369807920.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118153708.1232039.3535103645871176749.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161049369.2537118.11591934943429117060.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340412903.1303470.6424701655031380012.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539556890.286939.5873470593519458598.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653812726.2770958.18167145829938766503.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789096241.6155.5907241930823579235.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:27 +01:00
David Howells
bd80d8a80e afs: Use ITER_XARRAY for writing
Use a single ITER_XARRAY iterator to describe the portion of a file to be
transmitted to the server rather than generating a series of small
ITER_BVEC iterators on the fly.  This will make it easier to implement AIO
in afs.

In theory we could maybe use one giant ITER_BVEC, but that means
potentially allocating a huge array of bio_vec structs (max 256 per page)
when in fact the pagecache already has a structure listing all the relevant
pages (radix_tree/xarray) that can be walked over.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/153685395197.14766.16289516750731233933.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/158861251312.340223.17924900795425422532.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465828607.1377938.6903132788463419368.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588535018.3465195.14509994354240338307.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118152415.1232039.6452879415814850025.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161048194.2537118.13763612220937637316.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340411602.1303470.4661108879482218408.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539555629.286939.5241869986617154517.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653811456.2770958.7017388543246759245.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789095005.6155.6789055030327407928.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:27 +01:00
David Howells
c450846461 afs: Set up the iov_iter before calling afs_extract_data()
afs_extract_data() sets up a temporary iov_iter and passes it to AF_RXRPC
each time it is called to describe the remaining buffer to be filled.

Instead:

 (1) Put an iterator in the afs_call struct.

 (2) Set the iterator for each marshalling stage to load data into the
     appropriate places.  A number of convenience functions are provided to
     this end (eg. afs_extract_to_buf()).

     This iterator is then passed to afs_extract_data().

 (3) Use the new ITER_XARRAY iterator when reading data to load directly
     into the inode's pages without needing to create a list of them.

This will allow O_DIRECT calls to be supported in future patches.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/152898380012.11616.12094591785228251717.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/153685394431.14766.3178466345696987059.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/153999787395.866.11218209749223643998.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/154033911195.12041.3882700371848894587.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/158861250059.340223.1248231474865140653.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465827399.1377938.11181327349704960046.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588533776.3465195.3612752083351956948.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118151238.1232039.17015723405750601161.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161047240.2537118.14721975104810564022.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340410333.1303470.16260122230371140878.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539554187.286939.15305559004905459852.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653810525.2770958.4630666029125411789.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789093719.6155.7877160739235087723.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:27 +01:00
David Howells
c69bf479ba afs: Move key to afs_read struct
Stash the key used to authenticate read operations in the afs_read struct.
This will be necessary to reissue the operation against the server if a
read from the cache fails in upcoming cache changes.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/158861248336.340223.1851189950710196001.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465823899.1377938.11925978022348532049.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588529557.3465195.7303323479305254243.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118147693.1232039.13780672951838643842.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161043340.2537118.511899217704140722.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340406678.1303470.12676824086429446370.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539550819.286939.1268332875889175195.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653806683.2770958.11300984379283401542.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789089556.6155.14603302893431820997.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:26 +01:00
David Howells
67d78a6f6e afs: Pass page into dirty region helpers to provide THP size
Pass a pointer to the page being accessed into the dirty region helpers so
that the size of the page can be determined in case it's a transparent huge
page.

This also required the page to be passed into the afs_page_dirty trace
point - so there's no need to specifically pass in the index or private
data as these can be retrieved directly from the page struct.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/160588527183.3465195.16107942526481976308.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118144921.1232039.11377711180492625929.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161040747.2537118.11435394902674511430.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340404553.1303470.11414163641767769882.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539548385.286939.8864598314493255313.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653804285.2770958.3497360004849598038.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789087043.6155.16922142208140170528.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:26 +01:00
David Howells
03ffae9092 afs: Disable use of the fscache I/O routines
Disable use of the fscache I/O routined by the AFS filesystem.  It's about
to transition to passing iov_iters down and fscache is about to have its
I/O path to use iov_iter, so all that needs to change.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/158861209824.340223.1864211542341758994.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/159465768717.1376105.2229314852486665807.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/160588457929.3465195.1730097418904945578.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161118143744.1232039.2727898205333669064.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/161161039077.2537118.7986870854927176905.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/161340403323.1303470.8159439948319423431.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/161539547167.286939.3536238932531122332.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/161653802797.2770958.547311814861545911.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/161789085806.6155.2596146255056027428.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 10:17:25 +01:00
Matthew Wilcox (Oracle)
75b6979961 afs: Use wait_on_page_writeback_killable
Open-coding this function meant it missed out on the recent bugfix
for waiters being woken by a delayed wake event from a previous
instantiation of the page[1].

[DH: Changed the patch to use vmf->page rather than variable page which
 doesn't exist yet upstream]

Fixes: 1cf7a1518aef ("afs: Implement shared-writeable mmap")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kafs-testing@auristor.com
cc: linux-afs@lists.infradead.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20210320054104.1300774-4-willy@infradead.org
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 [1]
2021-03-23 20:54:37 +00:00
David Howells
3ad216ee73 afs: Fix afs_write_end() when called with copied == 0 [ver #3]
When afs_write_end() is called with copied == 0, it tries to set the
dirty region, but there's no way to actually encode a 0-length region in
the encoding in page->private.

"0,0", for example, indicates a 1-byte region at offset 0.  The maths
miscalculates this and sets it incorrectly.

Fix it to just do nothing but unlock and put the page in this case.  We
don't actually need to mark the page dirty as nothing presumably
changed.

Fixes: 65dd2d6072d3 ("afs: Alter dirty range encoding in page->private")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-14 11:51:18 -08:00
David Howells
2d9900f26a afs: Fix dirty-region encoding on ppc32 with 64K pages
The dirty region bounds stored in page->private on an afs page are 15 bits
on a 32-bit box and can, at most, represent a range of up to 32K within a
32K page with a resolution of 1 byte.  This is a problem for powerpc32 with
64K pages enabled.

Further, transparent huge pages may get up to 2M, which will be a problem
for the afs filesystem on all 32-bit arches in the future.

Fix this by decreasing the resolution.  For the moment, a 64K page will
have a resolution determined from PAGE_SIZE.  In the future, the page will
need to be passed in to the helper functions so that the page size can be
assessed and the resolution determined dynamically.

Note that this might not be the ideal way to handle this, since it may
allow some leakage of undirtied zero bytes to the server's copy in the case
of a 3rd-party conflict.  Fixing that would require a separately allocated
record and is a more complicated fix.

Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2020-10-29 13:53:04 +00:00
David Howells
f86726a69d afs: Fix afs_invalidatepage to adjust the dirty region
Fix afs_invalidatepage() to adjust the dirty region recorded in
page->private when truncating a page.  If the dirty region is entirely
removed, then the private data is cleared and the page dirty state is
cleared.

Without this, if the page is truncated and then expanded again by truncate,
zeros from the expanded, but no-longer dirty region may get written back to
the server if the page gets laundered due to a conflicting 3rd-party write.

It mustn't, however, shorten the dirty region of the page if that page is
still mmapped and has been marked dirty by afs_page_mkwrite(), so a flag is
stored in page->private to record this.

Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29 13:53:04 +00:00
David Howells
65dd2d6072 afs: Alter dirty range encoding in page->private
Currently, page->private on an afs page is used to store the range of
dirtied data within the page, where the range includes the lower bound, but
excludes the upper bound (e.g. 0-1 is a range covering a single byte).

This, however, requires a superfluous bit for the last-byte bound so that
on a 4KiB page, it can say 0-4096 to indicate the whole page, the idea
being that having both numbers the same would indicate an empty range.
This is unnecessary as the PG_private bit is clear if it's an empty range
(as is PG_dirty).

Alter the way the dirty range is encoded in page->private such that the
upper bound is reduced by 1 (e.g. 0-0 is then specified the same single
byte range mentioned above).

Applying this to both bounds frees up two bits, one of which can be used in
a future commit.

This allows the afs filesystem to be compiled on ppc32 with 64K pages;
without this, the following warnings are seen:

../fs/afs/internal.h: In function 'afs_page_dirty_to':
../fs/afs/internal.h:881:15: warning: right shift count >= width of type [-Wshift-count-overflow]
  881 |  return (priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK;
      |               ^~
../fs/afs/internal.h: In function 'afs_page_dirty':
../fs/afs/internal.h:886:28: warning: left shift count >= width of type [-Wshift-count-overflow]
  886 |  return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from;
      |                            ^~

Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29 13:53:04 +00:00
David Howells
185f0c7073 afs: Wrap page->private manipulations in inline functions
The afs filesystem uses page->private to store the dirty range within a
page such that in the event of a conflicting 3rd-party write to the server,
we write back just the bits that got changed locally.

However, there are a couple of problems with this:

 (1) I need a bit to note if the page might be mapped so that partial
     invalidation doesn't shrink the range.

 (2) There aren't necessarily sufficient bits to store the entire range of
     data altered (say it's a 32-bit system with 64KiB pages or transparent
     huge pages are in use).

So wrap the accesses in inline functions so that future commits can change
how this works.

Also move them out of the tracing header into the in-directory header.
There's not really any need for them to be in the tracing header.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29 13:53:04 +00:00
David Howells
f792e3ac82 afs: Fix where page->private is set during write
In afs, page->private is set to indicate the dirty region of a page.  This
is done in afs_write_begin(), but that can't take account of whether the
copy into the page actually worked.

Fix this by moving the change of page->private into afs_write_end().

Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-29 13:53:04 +00:00
David Howells
21db2cdc66 afs: Fix page leak on afs_write_begin() failure
Fix the leak of the target page in afs_write_begin() when it fails.

Fixes: 15b4650e55e0 ("afs: convert to new aops")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Nick Piggin <npiggin@gmail.com>
2020-10-29 13:53:04 +00:00
David Howells
fa04a40b16 afs: Fix to take ref on page when PG_private is set
Fix afs to take a ref on a page when it sets PG_private on it and to drop
the ref when removing the flag.

Note that in afs_write_begin(), a lot of the time, PG_private is already
set on a page to which we're going to add some data.  In such a case, we
leave the bit set and mustn't increment the page count.

As suggested by Matthew Wilcox, use attach/detach_page_private() where
possible.

Fixes: 31143d5d515e ("AFS: implement basic file write support")
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2020-10-29 13:53:04 +00:00
David Howells
d383e346f9 afs: Fix afs_launder_page to not clear PG_writeback
Fix afs_launder_page() to not clear PG_writeback on the page it is
laundering as the flag isn't set in this case.

Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-27 22:05:56 +00:00
David Howells
ec0fa0b659 afs: Fix deadlock between writeback and truncate
The afs filesystem has a lock[*] that it uses to serialise I/O operations
going to the server (vnode->io_lock), as the server will only perform one
modification operation at a time on any given file or directory.  This
prevents the the filesystem from filling up all the call slots to a server
with calls that aren't going to be executed in parallel anyway, thereby
allowing operations on other files to obtain slots.

  [*] Note that is probably redundant for directories at least since
      i_rwsem is used to serialise directory modifications and
      lookup/reading vs modification.  The server does allow parallel
      non-modification ops, however.

When a file truncation op completes, we truncate the in-memory copy of the
file to match - but we do it whilst still holding the io_lock, the idea
being to prevent races with other operations.

However, if writeback starts in a worker thread simultaneously with
truncation (whilst notify_change() is called with i_rwsem locked, writeback
pays it no heed), it may manage to set PG_writeback bits on the pages that
will get truncated before afs_setattr_success() manages to call
truncate_pagecache().  Truncate will then wait for those pages - whilst
still inside io_lock:

    # cat /proc/8837/stack
    [<0>] wait_on_page_bit_common+0x184/0x1e7
    [<0>] truncate_inode_pages_range+0x37f/0x3eb
    [<0>] truncate_pagecache+0x3c/0x53
    [<0>] afs_setattr_success+0x4d/0x6e
    [<0>] afs_wait_for_operation+0xd8/0x169
    [<0>] afs_do_sync_operation+0x16/0x1f
    [<0>] afs_setattr+0x1fb/0x25d
    [<0>] notify_change+0x2cf/0x3c4
    [<0>] do_truncate+0x7f/0xb2
    [<0>] do_sys_ftruncate+0xd1/0x104
    [<0>] do_syscall_64+0x2d/0x3a
    [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

The writeback operation, however, stalls indefinitely because it needs to
get the io_lock to proceed:

    # cat /proc/5940/stack
    [<0>] afs_get_io_locks+0x58/0x1ae
    [<0>] afs_begin_vnode_operation+0xc7/0xd1
    [<0>] afs_store_data+0x1b2/0x2a3
    [<0>] afs_write_back_from_locked_page+0x418/0x57c
    [<0>] afs_writepages_region+0x196/0x224
    [<0>] afs_writepages+0x74/0x156
    [<0>] do_writepages+0x2d/0x56
    [<0>] __writeback_single_inode+0x84/0x207
    [<0>] writeback_sb_inodes+0x238/0x3cf
    [<0>] __writeback_inodes_wb+0x68/0x9f
    [<0>] wb_writeback+0x145/0x26c
    [<0>] wb_do_writeback+0x16a/0x194
    [<0>] wb_workfn+0x74/0x177
    [<0>] process_one_work+0x174/0x264
    [<0>] worker_thread+0x117/0x1b9
    [<0>] kthread+0xec/0xf1
    [<0>] ret_from_fork+0x1f/0x30

and thus deadlock has occurred.

Note that whilst afs_setattr() calls filemap_write_and_wait(), the fact
that the caller is holding i_rwsem doesn't preclude more pages being
dirtied through an mmap'd region.

Fix this by:

 (1) Use the vnode validate_lock to mediate access between afs_setattr()
     and afs_writepages():

     (a) Exclusively lock validate_lock in afs_setattr() around the whole
     	 RPC operation.

     (b) If WB_SYNC_ALL isn't set on entry to afs_writepages(), trying to
     	 shared-lock validate_lock and returning immediately if we couldn't
     	 get it.

     (c) If WB_SYNC_ALL is set, wait for the lock.

     The validate_lock is also used to validate a file and to zap its cache
     if the file was altered by a third party, so it's probably a good fit
     for this.

 (2) Move the truncation outside of the io_lock in setattr, using the same
     hook as is used for local directory editing.

     This requires the old i_size to be retained in the operation record as
     we commit the revised status to the inode members inside the io_lock
     still, but we still need to know if we reduced the file size.

Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-08 10:50:55 -07:00
Gustavo A. R. Silva
df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
David Howells
811f04bac1 afs: Fix interruption of operations
The afs filesystem driver allows unstarted operations to be cancelled by
signal, but most of these can easily be restarted (mkdir for example).  The
primary culprits for reproducing this are those applications that use
SIGALRM to display a progress counter.

File lock-extension operation is marked uninterruptible as we have a
limited time in which to do it, and the release op is marked
uninterruptible also as if we fail to unlock a file, we'll have to wait 20
mins before anyone can lock it again.

The store operation logs a warning if it gets interruption, e.g.:

	kAFS: Unexpected error from FS.StoreData -4

because it's run from the background - but it can also be run from
fdatasync()-type things.  However, store options aren't marked
interruptible at the moment.

Fix this in the following ways:

 (1) Mark store operations as uninterruptible.  It might make sense to
     relax this for certain situations, but I'm not sure how to make sure
     that background store ops aren't affected by signals to foreground
     processes that happen to trigger them.

 (2) In afs_get_io_locks(), where we're getting the serialisation lock for
     talking to the fileserver, return ERESTARTSYS rather than EINTR
     because a lot of the operations (e.g. mkdir) are restartable if we
     haven't yet started sending the op to the server.

Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-15 15:49:04 -07:00
David Howells
793fe82ee3 afs: Fix truncation issues and mmap writeback size
Fix the following issues:

 (1) Fix writeback to reduce the size of a store operation to i_size,
     effectively discarding the extra data.

     The problem comes when afs_page_mkwrite() records that a page is about
     to be modified by mmap().  It doesn't know what bits of the page are
     going to be modified, so it records the whole page as being dirty
     (this is stored in page->private as start and end offsets).

     Without this, the marshalling for the store to the server extends the
     size of the file to the end of the page (in afs_fs_store_data() and
     yfs_fs_store_data()).

 (2) Fix setattr to actually truncate the pagecache, thereby clearing
     the discarded part of a file.

 (3) Fix setattr to check that the new size is okay and to disable
     ATTR_SIZE if i_size wouldn't change.

 (4) Force i_size to be updated as the result of a truncate.

 (5) Don't truncate if ATTR_SIZE is not set.

 (6) Call pagecache_isize_extended() if the file was enlarged.

Note that truncate_set_size() isn't used because the setting of i_size is
done inside afs_vnode_commit_status() under the vnode->cb_lock.

Found with the generic/029 and generic/393 xfstests.

Fixes: 31143d5d515e ("AFS: implement basic file write support")
Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15 15:41:02 +01:00
David Howells
da8d075512 afs: Concoct ctimes
The in-kernel afs filesystem ignores ctime because the AFS fileserver
protocol doesn't support ctimes.  This, however, causes various xfstests to
fail.

Work around this by:

 (1) Setting ctime to attr->ia_ctime in afs_setattr().

 (2) Not ignoring ATTR_MTIME_SET, ATTR_TIMES_SET and ATTR_TOUCH settings.

 (3) Setting the ctime from the server mtime when on the target file when
     creating a hard link to it.

 (4) Setting the ctime on directories from their revised mtimes when
     renaming/moving a file.

Found by the generic/221 and generic/309 xfstests.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15 15:41:02 +01:00
David Howells
1f32ef7989 afs: afs_write_end() should change i_size under the right lock
Fix afs_write_end() to change i_size under vnode->cb_lock rather than
->wb_lock so that it doesn't race with afs_vnode_commit_status() and
afs_getattr().

The ->wb_lock is only meant to guard access to ->wb_keys which isn't
accessed by that piece of code.

Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15 15:41:02 +01:00
David Howells
bb41348928 afs: Fix non-setting of mtime when writing into mmap
The mtime on an inode needs to be updated when a write is made into an
mmap'ed section.  There are three ways in which this could be done: update
it when page_mkwrite is called, update it when a page is changed from dirty
to writeback or leave it to the server and fix the mtime up from the reply
to the StoreData RPC.

Found with the generic/215 xfstest.

Fixes: 1cf7a1518aef ("afs: Implement shared-writeable mmap")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15 15:41:02 +01:00