Commit Graph

45 Commits

Author SHA1 Message Date
Pavel Begunkov
6567506b68 io_uring: disallow defer-tw run w/ no submitters
We try to restrict CQ waiters when IORING_SETUP_DEFER_TASKRUN is set,
but if nothing has been submitted yet it'll allow any waiter, which
violates the contract.

Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/b4f0d3f14236d7059d08c5abe2661ef0b78b5528.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:14:55 -06:00
Pavel Begunkov
76de6749d1 io_uring: further limit non-owner defer-tw cq waiting
In case of DEFER_TASK_WORK we try to restrict waiters to only one task,
which is also the only submitter; however, we don't do it reliably,
which might be very confusing and backfire in the future. E.g. we
currently allow multiple tasks in io_iopoll_check().

Fixes: c0e0d6ba25 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/94c83c0a7fe468260ee2ec31bdb0095d6e874ba2.1662652536.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 13:14:46 -06:00
Jens Axboe
dac6a0eae7 io_uring: ensure iopoll runs local task work as well
Combine the two checks we have for task_work running and whether or not
we need to shuffle the mutex into one, so we unify how task_work is run
in the iopoll loop. This helps ensure that local task_work is run when
needed, and also optimizes that path to avoid a mutex shuffle if it's
not needed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Jens Axboe
8ac5d85a89 io_uring: add local task_work run helper that is entered locked
We have a few spots that drop the mutex just to run local task_work,
which immediately tries to grab it again. Add a helper that just passes
in whether we're locked already.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:43 -06:00
Dylan Yudaken
c0e0d6ba25 io_uring: add IORING_SETUP_DEFER_TASKRUN
Allow deferring async tasks until the user calls io_uring_enter(2) with
the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at
io_uring_setup time. This functionality requires that the later
io_uring_enter will be called from the same submission task, and therefore
restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also
set.

Being able to hand pick when tasks are run prevents the problem where
there is current work to be done, however task work runs anyway.

For example, a common workload would obtain a batch of CQEs, and process
each one. Interrupting this to additional taskwork would add latency but
not gain anything. If instead task work is deferred to just before more
CQEs are obtained then no additional latency is added.

The way this is implemented is by trying to keep task work local to a
io_ring_ctx, rather than to the submission task. This is required, as the
application will want to wake up only a single io_ring_ctx at a time to
process work, and so the lists of work have to be kept separate.

This has some other benefits like not having to check the task continually
in handle_tw_list (and potentially unlocking/locking those), and reducing
locks in the submit & process completions path.

There are networking cases where using this option can reduce request
latency by 50%. For example a contrived example using [1] where the client
sends 2k data and receives the same data back while doing some system
calls (to trigger task work) shows this reduction. The reason ends up
being that if sending responses is delayed by processing task work, then
the client side sits idle. Whereas reordering the sends first means that
the client runs it's workload in parallel with the local task work.

[1]:
Using https://github.com/DylanZA/netbench/tree/defer_run
Client:
./netbench  --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000"
Server:
./netbench  --server_only 1 --control_port 10000  --rx "io_uring --defer_taskrun 0 --workload 100"   --rx "io_uring  --defer_taskrun 1 --workload 100"

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 10:30:42 -06:00
Pavel Begunkov
bd1a3783dd io_uring: export req alloc from core
We want to do request allocation out of the core io_uring code, make the
allocation functions public for other io_uring parts.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0314fedd3a02a514210ba42d4720332538c65956.1658913593.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-27 08:50:50 -06:00
Pavel Begunkov
63809137eb io_uring: flush notifiers after sendzc
Allow to flush notifiers as a part of sendzc request by setting
IORING_SENDZC_FLUSH flag. When the sendzc request succeedes it will
flush the used [active] notifier.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e0b4d9a6797e2fd6092824fe42953db7a519bbc8.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:41:07 -06:00
Pavel Begunkov
eb42cebb2c io_uring: add zc notification infrastructure
Add internal part of send zerocopy notifications. There are two main
structures, the first one is struct io_notif, which carries inside
struct ubuf_info and maps 1:1 to it. io_uring will be binding a number
of zerocopy send requests to it and ask to complete (aka flush) it. When
flushed and all attached requests and skbs complete, it'll generate one
and only one CQE. There are intended to be passed into the network layer
as struct msghdr::msg_ubuf.

The second concept is notification slots. The userspace will be able to
register an array of slots and subsequently addressing them by the index
in the array. Slots are independent of each other. Each slot can have
only one notifier at a time (called active notifier) but many notifiers
during the lifetime. When active, a notifier not going to post any
completion but the userspace can attach requests to it by specifying
the corresponding slot while issueing send zc requests. Eventually, the
userspace will want to "flush" the notifier losing any way to attach
new requests to it, however it can use the next atomatically added
notifier of this slot or of any other slot.

When the network layer is done with all enqueued skbs attached to a
notifier and doesn't need the specified in them user data, the flushed
notifier will post a CQE.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3ecf54c31a85762bf679b0a432c9f43ecf7e61cc.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:41:06 -06:00
Pavel Begunkov
e70cb60893 io_uring: export io_put_task()
Make io_put_task() available to non-core parts of io_uring, we'll need
it for notification infrastructure.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3686807d4c03b72e389947b0e8692d4d44334ef0.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:41:06 -06:00
Jens Axboe
f6b543fd03 io_uring: ensure REQ_F_ISREG is set async offload
If we're offloading requests directly to io-wq because IOSQE_ASYNC was
set in the sqe, we can miss hashing writes appropriately because we
haven't set REQ_F_ISREG yet. This can cause a performance regression
with buffered writes, as io-wq then no longer correctly serializes writes
to that file.

Ensure that we set the flags in io_prep_async_work(), which will cause
the io-wq work item to be hashed appropriately.

Fixes: 584b0180f0 ("io_uring: move read/write file prep state into actual opcode handler")
Link: https://lore.kernel.org/io-uring/20220608080054.GB22428@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:18 -06:00
Dylan Yudaken
e0486f3f7c io_uring: only trace one of complete or overflow
In overflow we see a duplcate line in the trace, and in some cases 3
lines (if initial io_post_aux_cqe fails).
Instead just trace once for each CQE

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
52120f0fad io_uring: add allow_overflow to io_post_aux_cqe
Some use cases of io_post_aux_cqe would not want to overflow as is, but
might want to change the flags/result. For example multishot receive
requires in order CQE, and so if there is an overflow it would need to
stop receiving until the overflow is taken care of.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
114eccdf0e io_uring: add IOU_STOP_MULTISHOT return code
For multishot we want a way to signal the caller that multishot has ended
but also this might not be an error return.

For example sockets return 0 when closed, which should end a multishot
recv, but still have a CQE with result 0

Introduce IOU_STOP_MULTISHOT which does this and indicates that the return
code is stored inside req->cqe

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630091231.1456789-7-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:17 -06:00
Dylan Yudaken
ed5ccb3bee io_uring: remove priority tw list optimisation
This optimisation has some built in assumptions that make it easy to
introduce bugs. It also does not have clear wins that make it worth keeping.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Pavel Begunkov
a6b21fbb4c io_uring: move list helpers to a separate file
It's annoying to have io-wq.h as a dependency every time we want some of
struct io_wq_work_list helpers, move them into a separate file.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c1d891ce12b30767d1d2a3b7db2ca3abc1ecc4a2.1655802465.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Pavel Begunkov
625d38b3fd io_uring: improve io_run_task_work()
Since SQPOLL now uses TWA_SIGNAL_NO_IPI, there won't be task work items
without TIF_NOTIFY_SIGNAL. Simplify io_run_task_work() by removing
task->task_works check. Even though looks it doesn't cause extra cache
bouncing, it's still nice to not touch it an extra time when it might be
not cached.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/75d4f34b0c671075892821a409e28da6cb1d64fe.1655802465.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Pavel Begunkov
9da070b142 io_uring: consistent naming for inline completion
Improve naming of the inline/deferred completion helper so it's
consistent with it's *_post counterpart. Add some comments and extra
lockdeps to ensure the locking is done right.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/797c619943dac06529e9d3fcb16e4c3cde6ad1a3.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Pavel Begunkov
46929b0868 io_uring: add io_commit_cqring_flush()
Since __io_commit_cqring_flush users moved to different files, introduce
io_commit_cqring_flush() helper and encapsulate all flags testing details
inside.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0da03887435dd9869ffe46dcd3962bf104afcca3.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:15 -06:00
Pavel Begunkov
253993210b io_uring: introduce locking helpers for CQE posting
spin_lock(&ctx->completion_lock);
/* post CQEs */
io_commit_cqring(ctx);
spin_unlock(&ctx->completion_lock);
io_cqring_ev_posted(ctx);

We have many places repeating this sequence, and the three function
unlock section is not perfect from the maintainance perspective and also
makes it harder to add new locking/sync trick.

Introduce two helpers. io_cq_lock(), which is simple and only grabs
->completion_lock, and io_cq_unlock_post() encapsulating the three call
section.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fe0c682bf7f7b55d9be55b0d034be9c1949277dc.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:14 -06:00
Pavel Begunkov
d9dee4302a io_uring: remove ->flush_cqes optimisation
It's not clear how widely used IOSQE_CQE_SKIP_SUCCESS is, and how often
->flush_cqes flag prevents from completion being flushed. Sometimes it's
high level of concurrency that enables it at least for one CQE, but
sometimes it doesn't save much because nobody waiting on the CQ.

Remove ->flush_cqes flag and the optimisation, it should benefit the
normal use case. Note, that there is no spurious eventfd problem with
that as checks for spuriousness were incorporated into
io_eventfd_signal().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/692e81eeddccc096f449a7960365fa7b4a18f8e6.1655637157.git.asml.silence@gmail.com
[axboe: remove now dead state->flush_cqes variable]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:14 -06:00
Pavel Begunkov
9046c6415b io_uring: reshuffle io_uring/io_uring.h
It's a good idea to first do forward declarations and then inline
helpers, otherwise there will be keep stumbling on dependencies
between them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1d7fa6672ed43f20ccc0c54ae201369ebc3ebfab.1655637157.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:14 -06:00
Pavel Begunkov
ab1c84d855 io_uring: make io_uring_types.h public
Move io_uring types to linux/include, need them public so tracing can
see the definitions and we can clean trace/events/io_uring.h

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a15f12e8cb7289b2de0deaddcc7518d98a132d17.1655384063.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:14 -06:00
Pavel Begunkov
b3659a65be io_uring: change ->cqe_cached invariant for CQE32
With IORING_SETUP_CQE32 ->cqe_cached doesn't store a real address but
rather an implicit offset into cqes. Store the real cqe pointer and
increment it accordingly if CQE32.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1ee1838cba16bed96381a006950b36ba640d998c.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:14 -06:00
Pavel Begunkov
e8c328c391 io_uring: deduplicate io_get_cqe() calls
Deduplicate calls to io_get_cqe() from __io_fill_cqe_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4fa077986cc3abab7c59ff4e7c390c783885465f.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:14 -06:00
Pavel Begunkov
ae5735c69b io_uring: deduplicate __io_fill_cqe_req tracing
Deduplicate two trace_io_uring_complete() calls in __io_fill_cqe_req().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/277ed85dba5189ab7d932164b314013a0f0b0fdc.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:14 -06:00
Pavel Begunkov
68494a65d0 io_uring: introduce io_req_cqe_overflow()
__io_fill_cqe_req() is hot and inlined, we want it to be as small as
possible. Add io_req_cqe_overflow() accepting only a request and doing
all overflow accounting, and replace with it two calls to 6 argument
io_cqring_event_overflow().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/048b9fbcce56814d77a1a540409c98c3d383edcb.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:14 -06:00
Pavel Begunkov
faf88dde06 io_uring: don't inline __io_get_cqe()
__io_get_cqe() is not as hot as io_get_cqe(), no need to inline it, it
sheds ~500B from the binary.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c1ac829198a881b7af8710926f99a3559b9f24c0.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:14 -06:00
Pavel Begunkov
d245bca637 io_uring: don't expose io_fill_cqe_aux()
Deduplicate some code and add a helper for filling an aux CQE, locking
and notification.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b7c6557c8f9dc5c4cfb01292116c682a0ff61081.1655455613.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:14 -06:00
Pavel Begunkov
75d7b3aec1 io_uring: kill REQ_F_COMPLETE_INLINE
REQ_F_COMPLETE_INLINE is only needed to delay queueing into the
completion list to io_queue_sqe() as __io_req_complete() is inlined and
we don't want to bloat the kernel.

As now we complete in a more centralised fashion in io_issue_sqe() we
can get rid of the flag and queue to the list directly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/600ba20a9338b8a39b249b23d3d177803613dde4.1655371007.git.asml.silence@gmail.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:13 -06:00
Pavel Begunkov
aa1e90f64e io_uring: move small helpers to headers
There is a bunch of inline helpers that will be useful not only to the
core of io_uring, move them to headers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/22df99c83723e44cba7e945e8519e64e3642c064.1655310733.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:13 -06:00
Jens Axboe
f3b44f92e5 io_uring: move read/write related opcodes to its own file
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:12 -06:00
Jens Axboe
c98817e6cd io_uring: move remaining file table manipulation to filetable.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:12 -06:00
Jens Axboe
7357298448 io_uring: move rsrc related data, core, and commands
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:12 -06:00
Jens Axboe
3b77495a97 io_uring: split provided buffers handling into its own file
Move both the opcodes related to it, and the internals code dealing with
it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:12 -06:00
Jens Axboe
7aaff708a7 io_uring: move cancelation into its own file
This also helps cleanup the io_uring.h cancel parts, as we can make
things static in the cancel.c file, mostly.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:12 -06:00
Jens Axboe
329061d3e2 io_uring: move poll handling into its own file
Add a io_poll_issue() rather than export the general task_work locking
and io_issue_sqe(), and put the io_op_defs definition and structure into
a separate header file so that poll can use it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:12 -06:00
Jens Axboe
c9f06aa7de io_uring: move io_uring_task (tctx) helpers into its own file
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:12 -06:00
Jens Axboe
17437f3114 io_uring: move SQPOLL related handling into its own file
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:12 -06:00
Jens Axboe
59915143e8 io_uring: move timeout opcodes and handling into its own file
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:12 -06:00
Jens Axboe
f9ead18c10 io_uring: split network related opcodes into its own file
While at it, convert the handlers to just use io_eopnotsupp_prep()
if CONFIG_NET isn't set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:11 -06:00
Jens Axboe
99f15d8d61 io_uring: move uring_cmd handling to its own file
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:11 -06:00
Jens Axboe
cd40cae29e io_uring: split out open/close operations
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:11 -06:00
Jens Axboe
531113bbd5 io_uring: split out splice related operations
This splits out splice and tee support.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:11 -06:00
Jens Axboe
97b388d70b io_uring: handle completions in the core
Normally request handlers complete requests themselves, if they don't
return an error. For the latter case, the core will complete it for
them.

This is unhandy for pushing opcode handlers further out, as we don't
want a bunch of inline completion code and we don't want to make the
completion path slower than it is now.

Let the core handle any completion, unless the handler explicitly
asks us not to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:11 -06:00
Jens Axboe
de23077eda io_uring: set completion results upfront
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-24 18:39:11 -06:00