58583 Commits

Author SHA1 Message Date
David Howells
b134d687dd afs: Log more information for "kAFS: AFS vnode with undefined type\n"
Log more information when "kAFS: AFS vnode with undefined type\n" is
displayed due to a vnode record being retrieved from the server that
appears to have a duff file type (usually 0).  This prints more information
to try and help pin down the problem.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-07 16:48:44 +01:00
Shenghui Wang
7889f44dd9 io_uring: use cpu_online() to check p->sq_thread_cpu instead of cpu_possible()
This issue is found by running liburing/test/io_uring_setup test.

When test run, the testcase "attempt to bind to invalid cpu" would not
pass with messages like:
   io_uring_setup(1, 0xbfc2f7c8), \
flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, \
resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, \
sq_thread_cpu: 2
   expected -1, got 3
   FAIL

On my system, there is:
   CPU(s) possible : 0-3
   CPU(s) online   : 0-1
   CPU(s) offline  : 2-3
   CPU(s) present  : 0-1

The sq_thread_cpu 2 is offline on my system, so the bind should fail.
But cpu_possible() will pass the check. We shouldn't be able to bind
to an offline cpu. Use cpu_online() to do the check.

After the change, the testcase run as expected: EINVAL will be returned
for cpu offlined.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-07 08:41:26 -06:00
Roberto Bergantinos Corpas
950a578c61 NFS: make nfs_match_client killable
Actually we don't do anything with return value from
    nfs_wait_client_init_complete in nfs_match_client, as a
    consequence if we get a fatal signal and client is not
    fully initialised, we'll loop to "again" label

    This has been proven to cause soft lockups on some scenarios
    (no-carrier but configured network interfaces)

Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-05-07 10:38:14 -04:00
Linus Torvalds
81ff5d2cba Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 "API:
   - Add support for AEAD in simd
   - Add fuzz testing to testmgr
   - Add panic_on_fail module parameter to testmgr
   - Use per-CPU struct instead multiple variables in scompress
   - Change verify API for akcipher

  Algorithms:
   - Convert x86 AEAD algorithms over to simd
   - Forbid 2-key 3DES in FIPS mode
   - Add EC-RDSA (GOST 34.10) algorithm

  Drivers:
   - Set output IV with ctr-aes in crypto4xx
   - Set output IV in rockchip
   - Fix potential length overflow with hashing in sun4i-ss
   - Fix computation error with ctr in vmx
   - Add SM4 protected keys support in ccree
   - Remove long-broken mxc-scc driver
   - Add rfc4106(gcm(aes)) cipher support in cavium/nitrox"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (179 commits)
  crypto: ccree - use a proper le32 type for le32 val
  crypto: ccree - remove set but not used variable 'du_size'
  crypto: ccree - Make cc_sec_disable static
  crypto: ccree - fix spelling mistake "protedcted" -> "protected"
  crypto: caam/qi2 - generate hash keys in-place
  crypto: caam/qi2 - fix DMA mapping of stack memory
  crypto: caam/qi2 - fix zero-length buffer DMA mapping
  crypto: stm32/cryp - update to return iv_out
  crypto: stm32/cryp - remove request mutex protection
  crypto: stm32/cryp - add weak key check for DES
  crypto: atmel - remove set but not used variable 'alg_name'
  crypto: picoxcell - Use dev_get_drvdata()
  crypto: crypto4xx - get rid of redundant using_sd variable
  crypto: crypto4xx - use sync skcipher for fallback
  crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues
  crypto: crypto4xx - fix ctr-aes missing output IV
  crypto: ecrdsa - select ASN1 and OID_REGISTRY for EC-RDSA
  crypto: ux500 - use ccflags-y instead of CFLAGS_<basename>.o
  crypto: ccree - handle tee fips error during power management resume
  crypto: ccree - add function to handle cryptocell tee fips error
  ...
2019-05-06 20:15:06 -07:00
Linus Torvalds
2c6a392cdd Merge branch 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stack trace updates from Ingo Molnar:
 "So Thomas looked at the stacktrace code recently and noticed a few
  weirdnesses, and we all know how such stories of crummy kernel code
  meeting German engineering perfection end: a 45-patch series to clean
  it all up! :-)

  Here's the changes in Thomas's words:

   'Struct stack_trace is a sinkhole for input and output parameters
    which is largely pointless for most usage sites. In fact if embedded
    into other data structures it creates indirections and extra storage
    overhead for no benefit.

    Looking at all usage sites makes it clear that they just require an
    interface which is based on a storage array. That array is either on
    stack, global or embedded into some other data structure.

    Some of the stack depot usage sites are outright wrong, but
    fortunately the wrongness just causes more stack being used for
    nothing and does not have functional impact.

    Another oddity is the inconsistent termination of the stack trace
    with ULONG_MAX. It's pointless as the number of entries is what
    determines the length of the stored trace. In fact quite some call
    sites remove the ULONG_MAX marker afterwards with or without nasty
    comments about it. Not all architectures do that and those which do,
    do it inconsistenly either conditional on nr_entries == 0 or
    unconditionally.

    The following series cleans that up by:

      1) Removing the ULONG_MAX termination in the architecture code

      2) Removing the ULONG_MAX fixups at the call sites

      3) Providing plain storage array based interfaces for stacktrace
         and stackdepot.

      4) Cleaning up the mess at the callsites including some related
         cleanups.

      5) Removing the struct stack_trace based interfaces

    This is not changing the struct stack_trace interfaces at the
    architecture level, but it removes the exposure to the generic
    code'"

* 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  x86/stacktrace: Use common infrastructure
  stacktrace: Provide common infrastructure
  lib/stackdepot: Remove obsolete functions
  stacktrace: Remove obsolete functions
  livepatch: Simplify stack trace retrieval
  tracing: Remove the last struct stack_trace usage
  tracing: Simplify stack trace retrieval
  tracing: Make ftrace_trace_userstack() static and conditional
  tracing: Use percpu stack trace buffer more intelligently
  tracing: Simplify stacktrace retrieval in histograms
  lockdep: Simplify stack trace handling
  lockdep: Remove save argument from check_prev_add()
  lockdep: Remove unused trace argument from print_circular_bug()
  drm: Simplify stacktrace handling
  dm persistent data: Simplify stack trace handling
  dm bufio: Simplify stack trace retrieval
  btrfs: ref-verify: Simplify stack trace retrieval
  dma/debug: Simplify stracktrace retrieval
  fault-inject: Simplify stacktrace retrieval
  mm/page_owner: Simplify stack trace handling
  ...
2019-05-06 13:11:48 -07:00
Theodore Ts'o
db90f41916 ext4: export /sys/fs/ext4/feature/casefold if Unicode support is present
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-05-06 14:03:52 -04:00
Colin Ian King
efeb862bd5 io_uring: fix shadowed variable ret return code being not checked
Currently variable ret is declared in a while-loop code block that
shadows another variable ret. When an error occurs in the while-loop
the error return in ret is not being set in the outer code block and
so the error check on ret is always going to be checking on the wrong
ret variable resulting in check that is always going to be true and
a premature return occurs.

Fix this by removing the declaration of the inner while-loop variable
ret so that shadowing does not occur.

Addresses-Coverity: ("'Constant' variable guards dead code")
Fixes: 6b06314c47e1 ("io_uring: add file set registration")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-06 10:21:34 -06:00
Kirill Smelkov
438ab720c6 vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
This amends commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().

Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
because they are called by vfs_read/vfs_write & friends before
file_operations .read/.write .

Note 2: if file backend uses new-style .read_iter/.write_iter, position
is still passed into there as non-pointer kiocb.ki_pos . Currently
stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
whose file_operations has *_iter methods.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
2019-05-06 17:46:52 +03:00
Linus Torvalds
51987affd6 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:

 - a couple of ->i_link use-after-free fixes

 - regression fix for wrong errno on absent device name in mount(2)
   (this cycle stuff)

 - ancient UFS braino in large GID handling on Solaris UFS images (bogus
   cut'n'paste from large UID handling; wrong field checked to decide
   whether we should look at old (16bit) or new (32bit) field)

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs: fix braino in ufs_get_inode_gid() for solaris UFS flavour
  Abort file_remove_privs() for non-reg. files
  [fix] get rid of checking for absent device name in vfs_get_tree()
  apparmorfs: fix use-after-free on symlink traversal
  securityfs: fix use-after-free on symlink traversal
2019-05-05 09:28:45 -07:00
Martin Brandenburg
33713cd09c orangefs: truncate before updating size
Otherwise we race with orangefs_writepage/orangefs_writepages
which and does not expect i_size < page_offset.

Fixes xfstests generic/129.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:39:10 -04:00
Mike Marshall
dd59a6475c orangefs: copy Orangefs-sized blocks into the pagecache if possible.
->readpage looks in file->private_data to try and find out how the
userspace program set "count" in read(2) or with "dd bs=" or whatever.

->readpage uses "count" and inode->i_size to calculate how much
data Orangefs should deposit in the Orangefs shared buffer, and
remembers which slot the data is in.

After copying data from the Orangefs shared buffer slot into
"the page", readpage tries to increment through the pagecache index
and fill as many pages as it can from the extra data in the shared
buffer. Hopefully these extra pages will soon be needed by the vfs,
and they'll be in the pagecache already.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
2019-05-03 14:32:39 -04:00
Mike Marshall
4077a0f25b orangefs: pass slot index back to readpage.
When userspace deposits more than a page of data into the shared buffer,
we'll need to know which slot it is in when we get back to readpage
so that we can try to use the extra data to fill some extra pages.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
2019-05-03 14:32:39 -04:00
Mike Marshall
c2549f8c7a orangefs: remember count when reading.
Orangefs wins when it can do IO on large (up to four meg) blocks at a time,
and looses when it has to do tiny "small io" reads and writes. Accessing
Orangefs through the pagecache with the kernel module helps with small io,
both reading and writing, a great deal. Readpage generally tries to fetch a
page (four k) at a time. We'll let users use "count" (as in read(2) or
pread(2) for example) as a knob to control how much data they get from
Orangefs at a time and we'll try to use the data to fill extra
pagecache pages when we get to ->readpage, hopefully resulting in
fewer calls to readpage and Orangefs userspace.

We need a way to remember how they set count so that we can still have
it available when we get to ->readpage.

 - We'll use file->private_data to keep track of "count".
   We'll wrap generic_file_open with orangefs_file_open and
   initialize private_data to NULL there.

 - In ->read_iter we have access to both "count" and file, so
   we'll kmalloc some space onto file->private_data and store
   "count" there.

 - We'll kfree file->private_data each time we visit ->flush and
   reinitialize it to NULL.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
2019-05-03 14:32:39 -04:00
Martin Brandenburg
8f04e1be78 orangefs: add orangefs_revalidate_mapping
This is modeled after NFS, except our method is different.  We use a
simple timer to determine whether to invalidate the page cache.  This
is bound to perform.

This addes a sysfs parameter cache_timeout_msecs which controls the time
between page cache invalidations.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:39 -04:00
Martin Brandenburg
c472ebc255 orangefs: implement writepages
Go through pages and look for a consecutive writable region.  After
finding a number of consecutive writable pages or when finding that
the next page's dirty range is not contiguous and cannot be written
as one request, send the write to the server.

The number of pages is determined by the client-core's buffer size.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:39 -04:00
Martin Brandenburg
52e2d0a380 orangefs: write range tracking
Attach the actual range of bytes written to plus the responsible uid/gid
to each dirty page.  This information must be sent to the server when
the page is written out.

Now write_begin, page_mkwrite, and invalidatepage keep up with this
information.  There are several conditions where they must write out the
page immediately to store the new range.  Two non-contiguous ranges
cannot be stored on a single page.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
90fc07065a orangefs: avoid fsync service operation on flush
Without this, an fsync call is sent to the server even if no data
changed.  This resulted in a rather severe (50%) performance regression
under certain metadata-heavy workloads.

In the past, everything was direct IO.  Nothing happend on a close call.
An explicit fsync call would send an fsync request to the server which
in turn fsynced the underlying file.

Now there are cached writes.  Then fsync began writing out dirty pages
in addition to making an fsync request to the server, and close began
calling fsync.

With this commit, close only writes out dirty pages, and does not make
the fsync request.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
8a88bbce6f orangefs: skip inode writeout if nothing to write
Would happen if an inode is dirty but whatever happened is not something
that can be written out to OrangeFS.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
3e9dfc6e1e orangefs: move do_readv_writev to direct_IO
direct_IO was the only caller and all direct_IO did was call it,
so there's no use in having the code spread out into so many functions.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
43f3457604 orangefs: do not return successful read when the client-core disappeared
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
85ac799cf9 orangefs: implement writepage
Now orangefs_inode_getattr fills from cache if an inode has dirty pages.

also if attr_valid and dirty pages and !flags, we spin on inode writeback
before returning if pages still dirty after: should it be other way

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
c453dcfc79 orangefs: migrate to generic_file_read_iter
Remove orangefs_inode_read.  It was used by readpage.  Calling
wait_for_direct_io directly serves the purpose just as well.  There is
now no check of the bufmap size in the readpage path.  There are already
other places the bufmap size is assumed to be greater than PAGE_SIZE.

Important to call truncate_inode_pages now in the write path so a
subsequent read sees the new data.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
0dcac0f781 orangefs: service ops done for writeback are not killable
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
a68d9c606a orangefs: remove orangefs_readpages
It's a copy of the loop which would run in read_pages from
mm/readahead.c.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
afd9fb2a31 orangefs: reorganize setattr functions to track attribute changes
OrangeFS accepts a mask indicating which attributes were changed.  The
kernel must not set any bits except those that were actually changed.
The kernel must set the uid/gid of the request to the actual uid/gid
responsible for the change.

Code path for notify_change initiated setattrs is

orangefs_setattr(dentry, iattr)
-> __orangefs_setattr(inode, iattr)

In kernel changes are initiated by calling __orangefs_setattr.

Code path for writeback is

orangefs_write_inode
-> orangefs_inode_setattr

attr_valid and attr_uid and attr_gid change together under i_lock.
I_DIRTY changes separately.

__orangefs_setattr
	lock
	if needs to be cleaned first, unlock and retry
	set attr_valid
	copy data in
	unlock
	mark_inode_dirty

orangefs_inode_setattr
	lock
	copy attributes out
	unlock
	clear getattr_time
	# __writeback_single_inode clears dirty

orangefs_inode_getattr
	# possible to get here with attr_valid set and not dirty
	lock
	if getattr_time ok or attr_valid set, unlock and return
	unlock
	do server operation
	# another thread may getattr or setattr, so check for that
	lock
	if getattr_time ok or attr_valid, unlock and return
	else, copy in
	update getattr_time
	unlock

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
df2d7337b5 orangefs: let setattr write to cached inode
This is a fairly big change, but ultimately it's not a lot of code.

Implement write_inode and then avoid the call to orangefs_inode_setattr
within orangefs_setattr.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
f2d34c738c orangefs: set up and use backing_dev_info
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
5e4f606e26 orangefs: hold i_lock during inode_getattr
This should be a no-op now.  When inode writeback works, this will
prevent a getattr from overwriting inode data while an inode is
transitioning to dirty.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
5e7f1d4338 orangefs: update attributes rather than relying on server
This should be a no-op now, but once inode writeback works, it'll be
necessary to have the correct attribute in the dirty inode.

Previously the attribute fetch timeout was marked invalid and the server
provided the updated attribute.  When the inode is dirty, the server
cannot be consulted since it does not yet know the pending setattr.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
8b60785c1d orangefs: simplify orangefs_inode_getattr interface
No need to store the received mask.  It is either STATX_BASIC_STATS or
STATX_BASIC_STATS & ~STATX_SIZE.  If STATX_SIZE is requested, the cache
is bypassed anyway, so the cached mask is unnecessary to decide whether
to do a real getattr.

This is a change.  Previously a getattr would want size and use the
cached size.  All of the in-kernel callers that wanted size did not want
a cached size.  Now a getattr cannot use the cached size if it wants
size at all.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:38 -04:00
Martin Brandenburg
66d5477d70 orangefs: do not invalidate attributes on inode create
When an inode is created, we fetch attributes from the server.  There is
no need to turn around and invalidate them.

No need to initialize attributes after the getattr either.  Either it'll
be exactly the same, or it'll be something else and wrong.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:37 -04:00
Martin Brandenburg
fc2e2e9c43 orangefs: implement xattr cache
This uses the same timeout as the getattr cache.  This substantially
increases performance when writing files with smaller buffer sizes.

When writing, the size is (often) changed, which causes a call to
notify_change which calls security_inode_need_killpriv which needs a
getxattr.  Caching it reduces traffic to the server.

Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2019-05-03 14:32:37 -04:00
David S. Miller
ff24e4980a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three trivial overlapping conflicts.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02 22:14:21 -04:00
Stefan Bühler
5dcf877fb1 req->error only used for iopoll
No need to set it in io_poll_add; io_poll_complete doesn't use it to set
the result in the CQE.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-02 14:08:54 -06:00
Jens Axboe
9b402849e8 io_uring: add support for eventfd notifications
Allow registration of an eventfd, which will trigger an event every
time a completion event happens for this io_uring instance.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-02 14:08:54 -06:00
Jens Axboe
5d17b4a4b7 io_uring: add support for IORING_OP_SYNC_FILE_RANGE
This behaves just like sync_file_range(2) does.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-02 14:08:54 -06:00
Jens Axboe
22f96b3808 fs: add sync_file_range() helper
This just pulls out the ksys_sync_file_range() code to work on a struct
file instead of an fd, so we can use it elsewhere.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-02 14:08:53 -06:00
Jens Axboe
de0617e467 io_uring: add support for marking commands as draining
There are no ordering constraints between the submission and completion
side of io_uring. But sometimes that would be useful to have. One common
example is doing an fsync, for instance, and have it ordered with
previous writes. Without support for that, the application must do this
tracking itself.

This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked
with this flag, then it will not be issued before previous commands have
completed, and subsequent commands submitted after the drain will not be
issued before the drain is started.. If there are no pending commands,
setting this flag will not change the behavior of the issue of the
command.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-02 14:08:53 -06:00
Linus Torvalds
5ce3307b6d for-linus-20190502
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzLBg4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsG8EACHxeAgDY3j+ZSLqn9sCFzvDo+hc9q2x64o
 ydscoxcdRSZYAU1dP6bnjQBFZbwVHaDzYG11L9aPwlvbsxis9/XWASc/gCo8sumN
 itmk3jKW5zYXlLw57ojh2+/UbPKnO5fKJcA66AakDGTopp3nYZQvov1kUJKOwTiz
 Ir48vEwST4DU9szkfGGeKOGG735veU/yarUNyADwc+CIigBcIa/A/bZtY8P4xV5d
 6bSLnMayehntZaHpVUPwDqOgP7+nSqIzQ88BPAKQnSzDat/SK4C0pgIX9L2nS1/4
 7fBpg3/HN0sFgUuhAG/3ILD82yD3CWDGbt2xif2CP0ylvVOMB6d/ObDAIxwTBKVd
 MwJ3U0RK+Ph2rvaglp6ifk3YgS0Nzt4N5F9AVmp32RR3QpqzJ2pLN4PUjbdWaEbr
 lugu6uengn0flu90mesBazrVNrZkrZ2w859tq9TNWPYpdIthKrvNn44J1bXx+yUj
 F1d+Bf0Sxxnc1b8hEKn9KOXPRoIY07hd/wUi2a/F4RVGSFa2b9CnaengqXIDzQNC
 duEErt+Pd2xlBA4WhPlKSI2PmrWJvpDr7iuAiqdC8cC69hpF4F2/yWCiL+JYI4pF
 v0zPn214KZ2rg9XZfb0Me8ff1BNvmXJMKzvecX2mOcclcO97PFVkqJhcx/75JE/6
 ooQiTofl+w==
 =KjZ4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190502' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "This is mostly io_uring fixes/tweaks. Most of these were actually done
  in time for the last -rc, but I wanted to ensure that everything
  tested out great before including them. The code delta looks larger
  than it really is, as it's mostly just comment additions/changes.

  Outside of the comment additions/changes, this is mostly removal of
  unnecessary barriers. In all, this pull request contains:

   - Tweak to how we handle errors at submission time. We now post a
     completion event if the error occurs on behalf of an sqe, instead
     of returning it through the system call. If the error happens
     outside of a specific sqe, we return the error through the system
     call. This makes it nicer to use and makes the "normal" use case
     behave the same as the offload cases. (me)

   - Fix for a missing req reference drop from async context (me)

   - If an sqe is submitted with RWF_NOWAIT, don't punt it to async
     context. Return -EAGAIN directly, instead of using it as a hint to
     do async punt. (Stefan)

   - Fix notes on barriers (Stefan)

   - Remove unnecessary barriers (Stefan)

   - Fix potential double free of memory in setup error (Mark)

   - Further improve sq poll CPU validation (Mark)

   - Fix page allocation warning and leak on buffer registration error
     (Mark)

   - Fix iov_iter_type() for new no-ref flag (Ming)

   - Fix a case where dio doesn't honor bio no-page-ref (Ming)"

* tag 'for-linus-20190502' of git://git.kernel.dk/linux-block:
  io_uring: avoid page allocation warnings
  iov_iter: fix iov_iter_type
  block: fix handling for BIO_NO_PAGE_REF
  io_uring: drop req submit reference always in async punt
  io_uring: free allocated io_memory once
  io_uring: fix SQPOLL cpu validation
  io_uring: have submission side sqe errors post a cqe
  io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP
  io_uring: remove unnecessary barrier after incrementing dropped counter
  io_uring: remove unnecessary barrier before reading SQ tail
  io_uring: remove unnecessary barrier after updating SQ head
  io_uring: remove unnecessary barrier before reading cq head
  io_uring: remove unnecessary barrier before wq_has_sleeper
  io_uring: fix notes on barriers
  io_uring: fix handling SQEs requesting NOWAIT
2019-05-02 09:55:04 -07:00
Nikolay Borisov
b1c16ac978 btrfs: Use kvmalloc for allocating compressed path context
Recent refactoring of cow_file_range_async means it's now possible to
request a rather large physically contiguous memory via kmalloc. The
size is dependent on the number of 512k chunks that the compressed range
consists of. David reported multiple OOM messages on such large
allocations. Fix it by switching to using kvmalloc.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02 13:48:19 +02:00
Nikolay Borisov
7447555fe7 btrfs: Factor out common extent locking code in submit_compressed_extents
Irrespective of whether the compress code fell back to uncompressed or
a compressed extent has to be submitted, the extent range is always
locked. So factor out the common lock_extent call at the beginning of
the loop. No functional changes just removes one duplicate lock_extent
call.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02 13:48:19 +02:00
Nikolay Borisov
4336650aff btrfs: Set io_tree only once in submit_compressed_extents
The inode never changes so it's sufficient to dereference it and get
the iotree only once, before the execution of the main loop. No
functional changes, only the size of the function is decreased:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44)
Function                                     old     new   delta
submit_compressed_extents                   1240    1196     -44
Total: Before=88476, After=88432, chg -0.05%

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02 13:48:19 +02:00
Nikolay Borisov
69684c5a88 btrfs: Replace clear_extent_bit with unlock_extent
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02 13:48:19 +02:00
Nikolay Borisov
1368c6dac7 btrfs: Make compress_file_range take only struct async_chunk
All context this function needs is held within struct async_chunk.
Currently we not only pass the struct but also every individual member.
This is redundant, simplify it by only passing struct async_chunk and
leaving it to compress_file_range to extract the values it requires.
No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02 13:48:19 +02:00
Nikolay Borisov
c5a68aec4e btrfs: Remove fs_info from struct async_chunk
The associated btrfs_work already contains a reference to the fs_info so
use that instead of passing it via async_chunk. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02 13:48:19 +02:00
Nikolay Borisov
b5326271e7 btrfs: Rename async_cow to async_chunk
Now that we have an explicit async_chunk struct rename references to
variables of this type to async_chunk. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02 13:48:18 +02:00
Nikolay Borisov
97db120451 btrfs: Preallocate chunks in cow_file_range_async
This commit changes the implementation of cow_file_range_async in order
to get rid of the BUG_ON in the middle of the loop. Additionally it
reworks the inner loop in the hopes of making it more understandable.

The idea is to make async_cow be a top-level structured, shared amongst
all chunks being sent for compression. This allows to perform one memory
allocation at the beginning and gracefully fail the IO if there isn't
enough memory. Now, each chunk is going to be described by an
async_chunk struct. It's the responsibility of the final chunk
to actually free the memory.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02 13:48:18 +02:00
Josef Bacik
c8eaeac7b7 btrfs: reserve delalloc metadata differently
With the per-inode block reserves we started refilling the reserve based
on the calculated size of the outstanding csum bytes and extents for the
inode, including the amount we were adding with the new operation.

However, generic/224 exposed a problem with this approach.  With 1000
files all writing at the same time we ended up with a bunch of bytes
being reserved but unusable.

When you write to a file we reserve space for the csum leaves for those
bytes, the number of extent items required to cover those bytes, and a
single transaction item for updating the inode at ordered extent finish
for that range of bytes.  This is held until the ordered extent finishes
and we release all of the reserved space.

If a second write comes in at this point we would add a single
reservation for the new outstanding extent and however many reservations
for the csum leaves.  At this point we find the delta of how much we
have reserved and how much outstanding size this is and attempt to
reserve this delta.  If the first write finishes it will not release any
space, because the space it had reserved for the initial write is still
needed for the second write.  However some space would have been used,
as we have added csums, extent items, and dirtied the inode.  Our
reserved space would be > 0 but less than the total needed reserved
space.

This is just for a single inode, now consider generic/224.  This has
1000 inodes writing in parallel to a very small file system, 1GiB.  In
my testing this usually means we get about a 120MiB metadata area to
work with, more than enough to allow the writes to continue, but not
enough if all of the inodes are stuck trying to reserve the slack space
while continuing to hold their leftovers from their initial writes.

Fix this by pre-reserved _only_ for the space we are currently trying to
add.  Then once that is successful modify our inodes csum count and
outstanding extents, and then add the newly reserved space to the inodes
block_rsv.  This allows us to actually pass generic/224 without running
out of metadata space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02 13:47:12 +02:00
Al Viro
4e9036042f ufs: fix braino in ufs_get_inode_gid() for solaris UFS flavour
To choose whether to pick the GID from the old (16bit) or new (32bit)
field, we should check if the old gid field is set to 0xffff.  Mainline
checks the old *UID* field instead - cut'n'paste from the corresponding
code in ufs_get_inode_uid().

Fixes: 252e211e90ce
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-02 02:24:50 -04:00
Eric Sandeen
910832697c xfs: change some error-less functions to void types
There are several functions which have no opportunity to return
an error, and don't contain any ASSERTs which could be argued
to be better constructed as error cases.  So, make them voids
to simplify the callers.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-05-01 20:26:30 -07:00