IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmI1HOwACgkQ+7dXa6fL
C2u9mA/+LUdXHqlvET/PAtFTg75bUPeOFGLnuDnYl1Ng2FCKMSodAohpbVtENxsK
E/gTVS7uiVZFQgC+YmNA00z6eIQkAaDVyvKyEcUbKREBbUgONfJ/HLeaK/NvVKxx
TY5gx/POdG6yHRQXL6JGBqSJUB8bZrGKwnJm8ebzeKOji9n7GSJBYiMlYBA7EAhs
Aut/P7Y39ISHLw3y+y5czBeRoubljmTyznbP20xUZEzrRwhTpNwpJVzBGUZU635T
93Sqcp//0U5LIdn6Pg6DUGHBMBTNDNJChb21ZoBusF/HHswXsOOnf/mcRUBSJUTI
M1WSpNLk8PRBgajMdIymQpGU1sCZZzJ3krrSA3RcXdN6GPHwZg8kKjoroHsLDL6l
igPbDSMJ5wfiwA2A2gXbY1CkAl3ik5ccb7ZqhTwS0WBk0vOnHmAsE9cs/bBo7Xii
GTiWXEFOgtJiXANPMS2P9DiOS3ZQNf+wxotCYdkGPOXuX9wnIo1Kmy8XfujQ1bXf
pJsEZKfeyROKrzyKWgqLI64/Kg5xNueoFQZfDpOlZYzF1uDstynADPUt0eQD706q
jcuKaXLN3rn5gSPun5mWOYbRtXVgOLdFL/7zptMVJwFKBFguQENhjG4UMNZcjkVA
3Mr0kGocsgoCSk1oDBkFlrw1wIsXxWbkRBL1Pww6kovivuGUwoo=
=j0yx
-----END PGP SIGNATURE-----
Merge tag 'netfs-prep-20220318' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfs updates from David Howells:
"Netfs prep for write helpers.
Having had a go at implementing write helpers and content encryption
support in netfslib, it seems that the netfs_read_{,sub}request
structs and the equivalent write request structs were almost the same
and so should be merged, thereby requiring only one set of
alloc/get/put functions and a common set of tracepoints.
Merging the structs also has the advantage that if a bounce buffer is
added to the request struct, a read operation can be performed to fill
the bounce buffer, the contents of the buffer can be modified and then
a write operation can be performed on it to send the data wherever it
needs to go using the same request structure all the way through. The
I/O handlers would then transparently perform any required crypto.
This should make it easier to perform RMW cycles if needed.
The potentially common functions and structs, however, by their names
all proclaim themselves to be associated with the read side of things.
The bulk of these changes alter this in the following ways:
- Rename struct netfs_read_{,sub}request to netfs_io_{,sub}request.
- Rename some enums, members and flags to make them more appropriate.
- Adjust some comments to match.
- Drop "read"/"rreq" from the names of common functions. For
instance, netfs_get_read_request() becomes netfs_get_request().
- The ->init_rreq() and ->issue_op() methods become ->init_request()
and ->issue_read(). I've kept the latter as a read-specific
function and in another branch added an ->issue_write() method.
The driver source is then reorganised into a number of files:
fs/netfs/buffered_read.c Create read reqs to the pagecache
fs/netfs/io.c Dispatchers for read and write reqs
fs/netfs/main.c Some general miscellaneous bits
fs/netfs/objects.c Alloc, get and put functions
fs/netfs/stats.c Optional procfs statistics.
and future development can be fitted into this scheme, e.g.:
fs/netfs/buffered_write.c Modify the pagecache
fs/netfs/buffered_flush.c Writeback from the pagecache
fs/netfs/direct_read.c DIO read support
fs/netfs/direct_write.c DIO write support
fs/netfs/unbuffered_write.c Write modifications directly back
Beyond the above changes, there are also some changes that affect how
things work:
- Make fscache_end_operation() generally available.
- In the netfs tracing header, generate enums from the symbol ->
string mapping tables rather than manually coding them.
- Add a struct for filesystems that uses netfslib to put into their
inode wrapper structs to hold extra state that netfslib is
interested in, such as the fscache cookie. This allows netfslib
functions to be set in filesystem operation tables and jumped to
directly without having to have a filesystem wrapper.
- Add a member to the struct added above to track the remote inode
length as that may differ if local modifications are buffered. We
may need to supply an appropriate EOF pointer when storing data (in
AFS for example).
- Pass extra information to netfs_alloc_request() so that the
->init_request() hook can access it and retain information to
indicate the origin of the operation.
- Make the ->init_request() hook return an error, thereby allowing a
filesystem that isn't allowed to cache an inode (ceph or cifs, for
example) to skip readahead.
- Switch to using refcount_t for subrequests and add tracepoints to
log refcount changes for the request and subrequest structs.
- Add a function to consolidate dispatching a read request. Similar
code is used in three places and another couple are likely to be
added in the future"
Link: https://lore.kernel.org/all/2639515.1648483225@warthog.procyon.org.uk/
* tag 'netfs-prep-20220318' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Maintain netfs_i_context::remote_i_size
netfs: Keep track of the actual remote file size
netfs: Split some core bits out into their own file
netfs: Split fs/netfs/read_helper.c
netfs: Rename read_helper.c to io.c
netfs: Prepare to split read_helper.c
netfs: Add a function to consolidate beginning a read
netfs: Add a netfs inode context
ceph: Make ceph_init_request() check caps on readahead
netfs: Change ->init_request() to return an error code
netfs: Refactor arguments for netfs_alloc_read_request
netfs: Adjust the netfs_failure tracepoint to indicate non-subreq lines
netfs: Trace refcounting on the netfs_io_subrequest struct
netfs: Trace refcounting on the netfs_io_request struct
netfs: Adjust the netfs_rreq tracepoint slightly
netfs: Split netfs_io_* object handling out
netfs: Finish off rename of netfs_read_request to netfs_io_request
netfs: Rename netfs_read_*request to netfs_io_*request
netfs: Generate enums from trace symbol mapping lists
fscache: export fscache_end_operation()
Primarily this series converts some of the address_space operations
to take a folio instead of a page.
->is_partially_uptodate() takes a folio instead of a page and changes the
type of the 'from' and 'count' arguments to make it obvious they're bytes.
->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
->launder_page() becomes ->launder_folio()
->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
=cOiq
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache
Pull filesystem folio updates from Matthew Wilcox:
"Primarily this series converts some of the address_space operations to
take a folio instead of a page.
Notably:
- a_ops->is_partially_uptodate() takes a folio instead of a page and
changes the type of the 'from' and 'count' arguments to make it
obvious they're bytes.
- a_ops->invalidatepage() becomes ->invalidate_folio() and has a
similar type change.
- a_ops->launder_page() becomes ->launder_folio()
- a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
address_space as an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request"
* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
fs: Remove aops ->set_page_dirty
fb_defio: Use noop_dirty_folio()
fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
fs: Convert __set_page_dirty_buffers to block_dirty_folio
nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
mm: Convert swap_set_page_dirty() to swap_dirty_folio()
ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
btrfs: Convert extent_range_redirty_for_io() to use folios
fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
btrfs: Convert from set_page_dirty to dirty_folio
fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
fs: Add aops->dirty_folio
fs: Remove aops->launder_page
orangefs: Convert launder_page to launder_folio
nfs: Convert from launder_page to launder_folio
fuse: Convert from launder_page to launder_folio
...
Convert all users of fscache_set_page_dirty to use fscache_dirty_folio.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Straightforward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
In afs_writepages_region(), if the dirty page we find is undergoing
writeback or write to cache, but the sync_mode is WB_SYNC_NONE, we go
round the loop trying the same page again and again with no pausing or
waiting unless and until another thread manages to clear the writeback
and fscache flags.
Fix this with three measures:
(1) Advance start to after the page we found.
(2) Break out of the loop and return if rescheduling is requested.
(3) Arbitrarily give up after a maximum of 5 skips.
Fixes: 31143d5d515e ("AFS: implement basic file write support")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Acked-by: Marc Dionne <marc.dionne@auristor.com>
Link: https://lore.kernel.org/r/164692725757.2097000.2060513769492301854.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the afs filesystem to support the new afs driver.
The following changes have been made:
(1) The fscache_netfs struct is no more, and there's no need to register
the filesystem as a whole. There's also no longer a cell cookie.
(2) The volume cookie is now an fscache_volume cookie, allocated with
fscache_acquire_volume(). This function takes three parameters: a
string representing the "volume" in the index, a string naming the
cache to use (or NULL) and a u64 that conveys coherency metadata for
the volume.
For afs, I've made it render the volume name string as:
"afs,<cell>,<volume_id>"
and the coherency data is currently 0.
(3) The fscache_cookie_def is no more and needed information is passed
directly to fscache_acquire_cookie(). The cache no longer calls back
into the filesystem, but rather metadata changes are indicated at
other times.
fscache_acquire_cookie() is passed the same keying and coherency
information as before, except that these are now stored in big endian
form instead of cpu endian. This makes the cache more copyable.
(4) fscache_use_cookie() and fscache_unuse_cookie() are called when a file
is opened or closed to prevent a cache file from being culled and to
keep resources to hand that are needed to do I/O.
fscache_use_cookie() is given an indication if the cache is likely to
be modified locally (e.g. the file is open for writing).
fscache_unuse_cookie() is given a coherency update if we had the file
open for writing and will update that.
(5) fscache_invalidate() is now given uptodate auxiliary data and a file
size. It can also take a flag to indicate if this was due to a DIO
write. This is wrapped into afs_fscache_invalidate() now for
convenience.
(6) fscache_resize() now gets called from the finalisation of
afs_setattr(), and afs_setattr() does use/unuse of the cookie around
the call to support this.
(7) fscache_note_page_release() is called from afs_release_page().
(8) Use a killable wait in nfs_vm_page_mkwrite() when waiting for
PG_fscache to be cleared.
Render the parts of the cookie key for an afs inode cookie as big endian.
Changes
=======
ver #2:
- Use gfpflags_allow_blocking() rather than using flag directly.
- fscache_acquire_volume() now returns errors.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Tested-by: kafs-testing@auristor.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819661382.215744.1485608824741611837.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906970002.143852.17678518584089878259.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967174665.1823006.1301789965454084220.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021568841.640689.6684240152253400380.stgit@warthog.procyon.org.uk/ # v4
Add memory folios, a new type to represent either order-0 pages or
the head page of a compound page. This should be enough infrastructure
to support filesystems converting from pages to folios.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmF9uI0ACgkQDpNsjXcp
gj7MUAf/R7LCZ+xFiIedw7SAgb/DGK0C9uVjuBEIZgAw21ZUw/GuPI6cuKBMFGGf
rRcdtlvMpwi7yZJcoNXxaqU/xPaaJMjf2XxscIvYJP1mjlZVuwmP9dOx0neNvWOc
T+8lqR6c1TLl82lpqIjGFLwvj2eVowq2d3J5jsaIJFd4odmmYVInrhJXOzC/LQ54
Niloj5ksehf+KUIRLDz7ycppvIHhlVsoAl0eM2dWBAtL0mvT7Nyn/3y+vnMfV2v3
Flb4opwJUgTJleYc16oxTn9svT2yS8q2uuUemRDLW8ABghoAtH3fUUk43RN+5Krd
LYCtbeawtkikPVXZMfWybsx5vn0c3Q==
=7SBe
-----END PGP SIGNATURE-----
Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache
Pull memory folios from Matthew Wilcox:
"Add memory folios, a new type to represent either order-0 pages or the
head page of a compound page. This should be enough infrastructure to
support filesystems converting from pages to folios.
The point of all this churn is to allow filesystems and the page cache
to manage memory in larger chunks than PAGE_SIZE. The original plan
was to use compound pages like THP does, but I ran into problems with
some functions expecting only a head page while others expect the
precise page containing a particular byte.
The folio type allows a function to declare that it's expecting only a
head page. Almost incidentally, this allows us to remove various calls
to VM_BUG_ON(PageTail(page)) and compound_head().
This converts just parts of the core MM and the page cache. For 5.17,
we intend to convert various filesystems (XFS and AFS are ready; other
filesystems may make it) and also convert more of the MM and page
cache to folios. For 5.18, multi-page folios should be ready.
The multi-page folios offer some improvement to some workloads. The
80% win is real, but appears to be an artificial benchmark (postgres
startup, which isn't a serious workload). Real workloads (eg building
the kernel, running postgres in a steady state, etc) seem to benefit
between 0-10%. I haven't heard of any performance losses as a result
of this series. Nobody has done any serious performance tuning; I
imagine that tweaking the readahead algorithm could provide some more
interesting wins. There are also other places where we could choose to
create large folios and currently do not, such as writes that are
larger than PAGE_SIZE.
I'd like to thank all my reviewers who've offered review/ack tags:
Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes
Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil
Babka, William Kucharski, Yu Zhao and Zi Yan.
I'd also like to thank those who gave feedback I incorporated but
haven't offered up review tags for this part of the series: Nick
Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard,
Hugh Dickins, and probably a few others who I forget"
* tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits)
mm/writeback: Add folio_write_one
mm/filemap: Add FGP_STABLE
mm/filemap: Add filemap_get_folio
mm/filemap: Convert mapping_get_entry to return a folio
mm/filemap: Add filemap_add_folio()
mm/filemap: Add filemap_alloc_folio
mm/page_alloc: Add folio allocation functions
mm/lru: Add folio_add_lru()
mm/lru: Convert __pagevec_lru_add_fn to take a folio
mm: Add folio_evictable()
mm/workingset: Convert workingset_refault() to take a folio
mm/filemap: Add readahead_folio()
mm/filemap: Add folio_mkwrite_check_truncate()
mm/filemap: Add i_blocks_per_folio()
mm/writeback: Add folio_redirty_for_writepage()
mm/writeback: Add folio_account_redirty()
mm/writeback: Add folio_clear_dirty_for_io()
mm/writeback: Add folio_cancel_dirty()
mm/writeback: Add folio_account_cleaned()
mm/writeback: Add filemap_dirty_folio()
...
wait_on_page_writeback_killable() only has one caller, so convert it to
call folio_wait_writeback_killable(). For the wait_on_page_writeback()
callers, add a compatibility wrapper around folio_wait_writeback().
Turning PageWriteback() into folio_test_writeback() eliminates a call
to compound_head() which saves 8 bytes and 15 bytes in the two
functions. Unfortunately, that is more than offset by adding the
wait_on_page_writeback compatibility wrapper for a net increase in text
of 7 bytes.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Howells <dhowells@redhat.com>
When an afs file or directory is modified locally such that the total file
size is extended, i_blocks needs to be recalculated too.
Fix this by making afs_write_end() and afs_edit_dir_add() call
afs_set_i_size() rather than setting inode->i_size directly as that also
recalculates inode->i_blocks.
This can be tested by creating and writing into directories and files and
then examining them with du. Without this change, directories show a 4
blocks (they start out at 2048 bytes) and files show 0 blocks; with this
change, they should show a number of blocks proportional to the file size
rounded up to 1024.
Fixes: 31143d5d515e ("AFS: implement basic file write support")
Fixes: 63a4681ff39c ("afs: Locally edit directory data for mkdir/create/unlink/...")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163113612442.352844.11162345591911691150.stgit@warthog.procyon.org.uk/
afs_d_revalidate() should only be validating the directory entry it is
given and the directory to which that belongs; it shouldn't be validating
the inode/vnode to which that dentry points. Besides, validation need to
be done even if we don't call afs_d_revalidate() - which might be the case
if we're starting from a file descriptor.
In order for afs_d_revalidate() to be fixed, validation points must be
added in some other places. Certain directory operations, such as
afs_unlink(), already check this, but not all and not all file operations
either.
Note that the validation of a vnode not only checks to see if the
attributes we have are correct, but also gets a promise from the server to
notify us if that file gets changed by a third party.
Add the following checks:
- Check the vnode we're going to make a hard link to.
- Check the vnode we're going to move/rename.
- Check the vnode we're going to read from.
- Check the vnode we're going to write to.
- Check the vnode we're going to sync.
- Check the vnode we're going to make a mapped page writable for.
Some of these aren't strictly necessary as we're going to perform a server
operation that might get the attributes anyway from which we can determine
if something changed - though it might not get us a callback promise.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Markus Suvanto <markus.suvanto@gmail.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111667354.283156.12720698333342917516.stgit@warthog.procyon.org.uk/
There's a loop in afs_extend_writeback() that adds extra pages to a write
we want to make to improve the efficiency of the writeback by making it
larger. This loop stops, however, if we hit a page we can't write back
from immediately, but it doesn't get rid of the page ref we speculatively
acquired.
This was caused by the removal of the cleanup loop when the code switched
from using find_get_pages_contig() to xarray scanning as the latter only
gets a single page at a time, not a batch.
Fix this by putting the page on a ref on an early break from the loop.
Unfortunately, we can't just add that page to the pagevec we're employing
as we'll go through that and add those pages to the RPC call.
This was found by the generic/074 test. It leaks ~4GiB of RAM each time it
is run - which can be observed with "top".
Fixes: e87b03f5830e ("afs: Prepare for use of THPs")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/163111666635.283156.177701903478910460.stgit@warthog.procyon.org.uk/
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmDQ9Z4ACgkQ+7dXa6fL
C2sVfg/9H3OlL4Vwv5CERgb5kbDoALAh5epYMFyvNVy9l5iTvYekxG3qna6ee0+G
pTRHSU5tZUCUslpafv8LUz1RE/iM127y+BP+qq2joG9jkT4q3QfeV4oDr+dgZWCL
obp+rjQSTKkaGh3eqjDx7gCSp5mqQI/M8MXe1VOXxUzAnsf2nH4iJLv/A9NN17xW
l7sQfZJGHD1BPqsxxFiSr+UCkCLuLDybUDYL6+PZcFu0jSON0h4yEtIwsAA7a1Zw
TvgUpequrbyTORI3sHJ8eIixWosh4yLTJ4pDqs8qDqq3Wm5eTZjss0wxEaVfpaTu
cg/CcNUBiHJ/6Q8r+JiuPbEnfnQ9woL8951/CNi+cOGhTy9LNoG70orSZKKynmW5
QmpYBK5BkM57EPj7DYZJRI1Bwy41pJapFj8tXbMjObU+ZyaXMysmBDanIRK/cLoy
fr4Sz+1D8yJPQ0GDgC4051CxrhynOEnRo8JbESg8CD4PnqFeM7EoCh48H2oVvR4N
9v2xuaRyBvi2KTmKSktRe+s1DS80acVMUYP33nT2zAthvL91VVgMY3Hz2/QrlAp8
h1hREME8aRcN4LrBIgp/ET150hUh44U2A07PqnYYe0653MH7aHHFfk134ZuTmbub
Pfc1MHtWgPAWN4c140ILBRTidJeShszoJGgpD6tflkj10VI2s34=
=2c6l
-----END PGP SIGNATURE-----
Merge tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull netfs fixes from David Howells:
"This contains patches to fix netfs_write_begin() and afs_write_end()
in the following ways:
(1) In netfs_write_begin(), extract the decision about whether to skip
a page out to its own helper and have that clear around the region
to be written, but not clear that region. This requires the
filesystem to patch it up afterwards if the hole doesn't get
completely filled.
(2) Use offset_in_thp() in (1) rather than manually calculating the
offset into the page.
(3) Due to (1), afs_write_end() now needs to handle short data write
into the page by generic_perform_write(). I've adopted an
analogous approach to ceph of just returning 0 in this case and
letting the caller go round again.
It also adds a note that (in the future) the len parameter may extend
beyond the page allocated. This is because the page allocation is
deferred to write_begin() and that gets to decide what size of THP to
allocate."
Jeff Layton points out:
"The netfs fix in particular fixes a data corruption bug in cephfs"
* tag 'netfs-fixes-20210621' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
netfs: fix test for whether we can skip read when writing beyond EOF
afs: Fix afs_write_end() to handle short writes
Fix afs_write_end() to correctly handle a short copy into the intended
write region of the page. Two things are necessary:
(1) If the page is not up to date, then we should just return 0
(ie. indicating a zero-length copy). The loop in
generic_perform_write() will go around again, possibly breaking up the
iterator into discrete chunks[1].
This is analogous to commit b9de313cf05fe08fa59efaf19756ec5283af672a
for ceph.
(2) The page should not have been set uptodate if it wasn't completely set
up by netfs_write_begin() (this will be fixed in the next patch), so
we need to set uptodate here in such a case.
Also remove the assertion that was checking that the page was set uptodate
since it's now set uptodate if it wasn't already a few lines above. The
assertion was from when uptodate was set elsewhere.
Changes:
v3: Remove the handling of len exceeding the end of the page.
Fixes: 3003bbd0697b ("afs: Use the netfs_write_begin() helper")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/YMwVp268KTzTf8cN@zeniv-ca.linux.org.uk/ [1]
Link: https://lore.kernel.org/r/162367682522.460125.5652091227576721609.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162391825688.1173366.3437507255136307904.stgit@warthog.procyon.org.uk/ # v2
If a task is killed during a page fault, it does not currently call
sb_end_pagefault(), which means that the filesystem cannot be frozen
at any time thereafter. This may be reported by lockdep like this:
====================================
WARNING: fsstress/10757 still has locks held!
5.13.0-rc4-build4+ #91 Not tainted
------------------------------------
1 lock held by fsstress/10757:
#0: ffff888104eac530
(
sb_pagefaults
as filesystem freezing is modelled as a lock.
Fix this by removing all the direct returns from within the function,
and using 'ret' to indicate whether we were interrupted or successful.
Fixes: 1cf7a1518aef ("afs: Implement shared-writeable mmap")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210616154900.1958373-1-willy@infradead.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit e87b03f5830e ("afs: Prepare for use of THPs"), the return
value for afs_write_back_from_locked_page was changed from a number
of pages to a length in bytes. The loop in afs_writepages_region uses
the return value to compute the index that will be used to find dirty
pages in the next iteration, but treats it as a number of pages and
wrongly multiplies it by PAGE_SIZE. This gives a very large index value,
potentially skipping any dirty data that was not covered in the first
pass, which is limited to 256M.
This causes fsync(), and indirectly close(), to only do a partial
writeback of a large file's dirty data. The rest is eventually written
back by background threads after dirty_expire_centisecs.
Fixes: e87b03f5830e ("afs: Prepare for use of THPs")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/20210604175504.4055-1-marc.c.dionne@gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The generic/464 xfstest causes kAFS to emit occasional warnings of the
form:
kAFS: vnode modified {100055:8a} 30->31 YFS.StoreData64 (c=6015)
This indicates that the data version received back from the server did not
match the expected value (the DV should be incremented monotonically for
each individual modification op committed to a vnode).
What is happening is that a lookup call is doing a bulk status fetch
speculatively on a bunch of vnodes in a directory besides getting the
status of the vnode it's actually interested in. This is racing with a
StoreData operation (though it could also occur with, say, a MakeDir op).
On the client, a modification operation locks the vnode, but the bulk
status fetch only locks the parent directory, so no ordering is imposed
there (thereby avoiding an avenue to deadlock).
On the server, the StoreData op handler doesn't lock the vnode until it's
received all the request data, and downgrades the lock after committing the
data until it has finished sending change notifications to other clients -
which allows the status fetch to occur before it has finished.
This means that:
- a status fetch can access the target vnode either side of the exclusive
section of the modification
- the status fetch could start before the modification, yet finish after,
and vice-versa.
- the status fetch and the modification RPCs can complete in either order.
- the status fetch can return either the before or the after DV from the
modification.
- the status fetch might regress the locally cached DV.
Some of these are handled by the previous fix[1], but that's not sufficient
because it checks the DV it received against the DV it cached at the start
of the op, but the DV might've been updated in the meantime by a locally
generated modification op.
Fix this by the following means:
(1) Keep track of when we're performing a modification operation on a
vnode. This is done by marking vnode parameters with a 'modification'
note that causes the AFS_VNODE_MODIFYING flag to be set on the vnode
for the duration.
(2) Alter the speculation race detection to ignore speculative status
fetches if either the vnode is marked as being modified or the data
version number is not what we expected.
Note that whilst the "vnode modified" warning does get recovered from as it
causes the client to refetch the status at the next opportunity, it will
also invalidate the pagecache, so changes might get lost.
Fixes: a9e5c87ca744 ("afs: Fix speculative status fetch going out of order wrt to modifications")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-and-reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/160605082531.252452.14708077925602709042.stgit@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/linux-fsdevel/161961335926.39335.2552653972195467566.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When afs_write_end() is called with copied == 0, it tries to set the
dirty region, but there's no way to actually encode a 0-length region in
the encoding in page->private.
"0,0", for example, indicates a 1-byte region at offset 0. The maths
miscalculates this and sets it incorrectly.
Fix it to just do nothing but unlock and put the page in this case. We
don't actually need to mark the page dirty as nothing presumably
changed.
Fixes: 65dd2d6072d3 ("afs: Alter dirty range encoding in page->private")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dirty region bounds stored in page->private on an afs page are 15 bits
on a 32-bit box and can, at most, represent a range of up to 32K within a
32K page with a resolution of 1 byte. This is a problem for powerpc32 with
64K pages enabled.
Further, transparent huge pages may get up to 2M, which will be a problem
for the afs filesystem on all 32-bit arches in the future.
Fix this by decreasing the resolution. For the moment, a 64K page will
have a resolution determined from PAGE_SIZE. In the future, the page will
need to be passed in to the helper functions so that the page size can be
assessed and the resolution determined dynamically.
Note that this might not be the ideal way to handle this, since it may
allow some leakage of undirtied zero bytes to the server's copy in the case
of a 3rd-party conflict. Fixing that would require a separately allocated
record and is a more complicated fix.
Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Fix afs_invalidatepage() to adjust the dirty region recorded in
page->private when truncating a page. If the dirty region is entirely
removed, then the private data is cleared and the page dirty state is
cleared.
Without this, if the page is truncated and then expanded again by truncate,
zeros from the expanded, but no-longer dirty region may get written back to
the server if the page gets laundered due to a conflicting 3rd-party write.
It mustn't, however, shorten the dirty region of the page if that page is
still mmapped and has been marked dirty by afs_page_mkwrite(), so a flag is
stored in page->private to record this.
Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
Currently, page->private on an afs page is used to store the range of
dirtied data within the page, where the range includes the lower bound, but
excludes the upper bound (e.g. 0-1 is a range covering a single byte).
This, however, requires a superfluous bit for the last-byte bound so that
on a 4KiB page, it can say 0-4096 to indicate the whole page, the idea
being that having both numbers the same would indicate an empty range.
This is unnecessary as the PG_private bit is clear if it's an empty range
(as is PG_dirty).
Alter the way the dirty range is encoded in page->private such that the
upper bound is reduced by 1 (e.g. 0-0 is then specified the same single
byte range mentioned above).
Applying this to both bounds frees up two bits, one of which can be used in
a future commit.
This allows the afs filesystem to be compiled on ppc32 with 64K pages;
without this, the following warnings are seen:
../fs/afs/internal.h: In function 'afs_page_dirty_to':
../fs/afs/internal.h:881:15: warning: right shift count >= width of type [-Wshift-count-overflow]
881 | return (priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK;
| ^~
../fs/afs/internal.h: In function 'afs_page_dirty':
../fs/afs/internal.h:886:28: warning: left shift count >= width of type [-Wshift-count-overflow]
886 | return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from;
| ^~
Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
The afs filesystem uses page->private to store the dirty range within a
page such that in the event of a conflicting 3rd-party write to the server,
we write back just the bits that got changed locally.
However, there are a couple of problems with this:
(1) I need a bit to note if the page might be mapped so that partial
invalidation doesn't shrink the range.
(2) There aren't necessarily sufficient bits to store the entire range of
data altered (say it's a 32-bit system with 64KiB pages or transparent
huge pages are in use).
So wrap the accesses in inline functions so that future commits can change
how this works.
Also move them out of the tracing header into the in-directory header.
There's not really any need for them to be in the tracing header.
Signed-off-by: David Howells <dhowells@redhat.com>
In afs, page->private is set to indicate the dirty region of a page. This
is done in afs_write_begin(), but that can't take account of whether the
copy into the page actually worked.
Fix this by moving the change of page->private into afs_write_end().
Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the leak of the target page in afs_write_begin() when it fails.
Fixes: 15b4650e55e0 ("afs: convert to new aops")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Nick Piggin <npiggin@gmail.com>
Fix afs to take a ref on a page when it sets PG_private on it and to drop
the ref when removing the flag.
Note that in afs_write_begin(), a lot of the time, PG_private is already
set on a page to which we're going to add some data. In such a case, we
leave the bit set and mustn't increment the page count.
As suggested by Matthew Wilcox, use attach/detach_page_private() where
possible.
Fixes: 31143d5d515e ("AFS: implement basic file write support")
Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Fix afs_launder_page() to not clear PG_writeback on the page it is
laundering as the flag isn't set in this case.
Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
The afs filesystem has a lock[*] that it uses to serialise I/O operations
going to the server (vnode->io_lock), as the server will only perform one
modification operation at a time on any given file or directory. This
prevents the the filesystem from filling up all the call slots to a server
with calls that aren't going to be executed in parallel anyway, thereby
allowing operations on other files to obtain slots.
[*] Note that is probably redundant for directories at least since
i_rwsem is used to serialise directory modifications and
lookup/reading vs modification. The server does allow parallel
non-modification ops, however.
When a file truncation op completes, we truncate the in-memory copy of the
file to match - but we do it whilst still holding the io_lock, the idea
being to prevent races with other operations.
However, if writeback starts in a worker thread simultaneously with
truncation (whilst notify_change() is called with i_rwsem locked, writeback
pays it no heed), it may manage to set PG_writeback bits on the pages that
will get truncated before afs_setattr_success() manages to call
truncate_pagecache(). Truncate will then wait for those pages - whilst
still inside io_lock:
# cat /proc/8837/stack
[<0>] wait_on_page_bit_common+0x184/0x1e7
[<0>] truncate_inode_pages_range+0x37f/0x3eb
[<0>] truncate_pagecache+0x3c/0x53
[<0>] afs_setattr_success+0x4d/0x6e
[<0>] afs_wait_for_operation+0xd8/0x169
[<0>] afs_do_sync_operation+0x16/0x1f
[<0>] afs_setattr+0x1fb/0x25d
[<0>] notify_change+0x2cf/0x3c4
[<0>] do_truncate+0x7f/0xb2
[<0>] do_sys_ftruncate+0xd1/0x104
[<0>] do_syscall_64+0x2d/0x3a
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
The writeback operation, however, stalls indefinitely because it needs to
get the io_lock to proceed:
# cat /proc/5940/stack
[<0>] afs_get_io_locks+0x58/0x1ae
[<0>] afs_begin_vnode_operation+0xc7/0xd1
[<0>] afs_store_data+0x1b2/0x2a3
[<0>] afs_write_back_from_locked_page+0x418/0x57c
[<0>] afs_writepages_region+0x196/0x224
[<0>] afs_writepages+0x74/0x156
[<0>] do_writepages+0x2d/0x56
[<0>] __writeback_single_inode+0x84/0x207
[<0>] writeback_sb_inodes+0x238/0x3cf
[<0>] __writeback_inodes_wb+0x68/0x9f
[<0>] wb_writeback+0x145/0x26c
[<0>] wb_do_writeback+0x16a/0x194
[<0>] wb_workfn+0x74/0x177
[<0>] process_one_work+0x174/0x264
[<0>] worker_thread+0x117/0x1b9
[<0>] kthread+0xec/0xf1
[<0>] ret_from_fork+0x1f/0x30
and thus deadlock has occurred.
Note that whilst afs_setattr() calls filemap_write_and_wait(), the fact
that the caller is holding i_rwsem doesn't preclude more pages being
dirtied through an mmap'd region.
Fix this by:
(1) Use the vnode validate_lock to mediate access between afs_setattr()
and afs_writepages():
(a) Exclusively lock validate_lock in afs_setattr() around the whole
RPC operation.
(b) If WB_SYNC_ALL isn't set on entry to afs_writepages(), trying to
shared-lock validate_lock and returning immediately if we couldn't
get it.
(c) If WB_SYNC_ALL is set, wait for the lock.
The validate_lock is also used to validate a file and to zap its cache
if the file was altered by a third party, so it's probably a good fit
for this.
(2) Move the truncation outside of the io_lock in setattr, using the same
hook as is used for local directory editing.
This requires the old i_size to be retained in the operation record as
we commit the revised status to the inode members inside the io_lock
still, but we still need to know if we reduced the file size.
Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The afs filesystem driver allows unstarted operations to be cancelled by
signal, but most of these can easily be restarted (mkdir for example). The
primary culprits for reproducing this are those applications that use
SIGALRM to display a progress counter.
File lock-extension operation is marked uninterruptible as we have a
limited time in which to do it, and the release op is marked
uninterruptible also as if we fail to unlock a file, we'll have to wait 20
mins before anyone can lock it again.
The store operation logs a warning if it gets interruption, e.g.:
kAFS: Unexpected error from FS.StoreData -4
because it's run from the background - but it can also be run from
fdatasync()-type things. However, store options aren't marked
interruptible at the moment.
Fix this in the following ways:
(1) Mark store operations as uninterruptible. It might make sense to
relax this for certain situations, but I'm not sure how to make sure
that background store ops aren't affected by signals to foreground
processes that happen to trigger them.
(2) In afs_get_io_locks(), where we're getting the serialisation lock for
talking to the fileserver, return ERESTARTSYS rather than EINTR
because a lot of the operations (e.g. mkdir) are restartable if we
haven't yet started sending the op to the server.
Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following issues:
(1) Fix writeback to reduce the size of a store operation to i_size,
effectively discarding the extra data.
The problem comes when afs_page_mkwrite() records that a page is about
to be modified by mmap(). It doesn't know what bits of the page are
going to be modified, so it records the whole page as being dirty
(this is stored in page->private as start and end offsets).
Without this, the marshalling for the store to the server extends the
size of the file to the end of the page (in afs_fs_store_data() and
yfs_fs_store_data()).
(2) Fix setattr to actually truncate the pagecache, thereby clearing
the discarded part of a file.
(3) Fix setattr to check that the new size is okay and to disable
ATTR_SIZE if i_size wouldn't change.
(4) Force i_size to be updated as the result of a truncate.
(5) Don't truncate if ATTR_SIZE is not set.
(6) Call pagecache_isize_extended() if the file was enlarged.
Note that truncate_set_size() isn't used because the setting of i_size is
done inside afs_vnode_commit_status() under the vnode->cb_lock.
Found with the generic/029 and generic/393 xfstests.
Fixes: 31143d5d515e ("AFS: implement basic file write support")
Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
The in-kernel afs filesystem ignores ctime because the AFS fileserver
protocol doesn't support ctimes. This, however, causes various xfstests to
fail.
Work around this by:
(1) Setting ctime to attr->ia_ctime in afs_setattr().
(2) Not ignoring ATTR_MTIME_SET, ATTR_TIMES_SET and ATTR_TOUCH settings.
(3) Setting the ctime from the server mtime when on the target file when
creating a hard link to it.
(4) Setting the ctime on directories from their revised mtimes when
renaming/moving a file.
Found by the generic/221 and generic/309 xfstests.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix afs_write_end() to change i_size under vnode->cb_lock rather than
->wb_lock so that it doesn't race with afs_vnode_commit_status() and
afs_getattr().
The ->wb_lock is only meant to guard access to ->wb_keys which isn't
accessed by that piece of code.
Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record")
Signed-off-by: David Howells <dhowells@redhat.com>
The mtime on an inode needs to be updated when a write is made into an
mmap'ed section. There are three ways in which this could be done: update
it when page_mkwrite is called, update it when a page is changed from dirty
to writeback or leave it to the server and fix the mtime up from the reply
to the StoreData RPC.
Found with the generic/215 xfstest.
Fixes: 1cf7a1518aef ("afs: Implement shared-writeable mmap")
Signed-off-by: David Howells <dhowells@redhat.com>