linux

iv/linux

Author	SHA1	Message	Date
Pavel Begunkov	6567506b68	io_uring: disallow defer-tw run w/ no submitters We try to restrict CQ waiters when IORING_SETUP_DEFER_TASKRUN is set, but if nothing has been submitted yet it'll allow any waiter, which violates the contract. Fixes: `c0e0d6ba25` ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/b4f0d3f14236d7059d08c5abe2661ef0b78b5528.1662652536.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-21 13:14:55 -06:00
Pavel Begunkov	76de6749d1	io_uring: further limit non-owner defer-tw cq waiting In case of DEFER_TASK_WORK we try to restrict waiters to only one task, which is also the only submitter; however, we don't do it reliably, which might be very confusing and backfire in the future. E.g. we currently allow multiple tasks in io_iopoll_check(). Fixes: `c0e0d6ba25` ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/94c83c0a7fe468260ee2ec31bdb0095d6e874ba2.1662652536.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-21 13:14:46 -06:00
Jens Axboe	dac6a0eae7	io_uring: ensure iopoll runs local task work as well Combine the two checks we have for task_work running and whether or not we need to shuffle the mutex into one, so we unify how task_work is run in the iopoll loop. This helps ensure that local task_work is run when needed, and also optimizes that path to avoid a mutex shuffle if it's not needed. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-21 10:30:43 -06:00
Jens Axboe	8ac5d85a89	io_uring: add local task_work run helper that is entered locked We have a few spots that drop the mutex just to run local task_work, which immediately tries to grab it again. Add a helper that just passes in whether we're locked already. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-21 10:30:43 -06:00
Dylan Yudaken	c0e0d6ba25	io_uring: add IORING_SETUP_DEFER_TASKRUN Allow deferring async tasks until the user calls io_uring_enter(2) with the IORING_ENTER_GETEVENTS flag. Enable this mode with a flag at io_uring_setup time. This functionality requires that the later io_uring_enter will be called from the same submission task, and therefore restrict this flag to work only when IORING_SETUP_SINGLE_ISSUER is also set. Being able to hand pick when tasks are run prevents the problem where there is current work to be done, however task work runs anyway. For example, a common workload would obtain a batch of CQEs, and process each one. Interrupting this to additional taskwork would add latency but not gain anything. If instead task work is deferred to just before more CQEs are obtained then no additional latency is added. The way this is implemented is by trying to keep task work local to a io_ring_ctx, rather than to the submission task. This is required, as the application will want to wake up only a single io_ring_ctx at a time to process work, and so the lists of work have to be kept separate. This has some other benefits like not having to check the task continually in handle_tw_list (and potentially unlocking/locking those), and reducing locks in the submit & process completions path. There are networking cases where using this option can reduce request latency by 50%. For example a contrived example using [1] where the client sends 2k data and receives the same data back while doing some system calls (to trigger task work) shows this reduction. The reason ends up being that if sending responses is delayed by processing task work, then the client side sits idle. Whereas reordering the sends first means that the client runs it's workload in parallel with the local task work. [1]: Using https://github.com/DylanZA/netbench/tree/defer_run Client: ./netbench --client_only 1 --control_port 10000 --host <host> --tx "epoll --threads 16 --per_thread 1 --size 2048 --resp 2048 --workload 1000" Server: ./netbench --server_only 1 --control_port 10000 --rx "io_uring --defer_taskrun 0 --workload 100" --rx "io_uring --defer_taskrun 1 --workload 100" Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220830125013.570060-5-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-09-21 10:30:42 -06:00
Pavel Begunkov	bd1a3783dd	io_uring: export req alloc from core We want to do request allocation out of the core io_uring code, make the allocation functions public for other io_uring parts. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0314fedd3a02a514210ba42d4720332538c65956.1658913593.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-27 08:50:50 -06:00
Pavel Begunkov	63809137eb	io_uring: flush notifiers after sendzc Allow to flush notifiers as a part of sendzc request by setting IORING_SENDZC_FLUSH flag. When the sendzc request succeedes it will flush the used [active] notifier. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e0b4d9a6797e2fd6092824fe42953db7a519bbc8.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:41:07 -06:00
Pavel Begunkov	eb42cebb2c	io_uring: add zc notification infrastructure Add internal part of send zerocopy notifications. There are two main structures, the first one is struct io_notif, which carries inside struct ubuf_info and maps 1:1 to it. io_uring will be binding a number of zerocopy send requests to it and ask to complete (aka flush) it. When flushed and all attached requests and skbs complete, it'll generate one and only one CQE. There are intended to be passed into the network layer as struct msghdr::msg_ubuf. The second concept is notification slots. The userspace will be able to register an array of slots and subsequently addressing them by the index in the array. Slots are independent of each other. Each slot can have only one notifier at a time (called active notifier) but many notifiers during the lifetime. When active, a notifier not going to post any completion but the userspace can attach requests to it by specifying the corresponding slot while issueing send zc requests. Eventually, the userspace will want to "flush" the notifier losing any way to attach new requests to it, however it can use the next atomatically added notifier of this slot or of any other slot. When the network layer is done with all enqueued skbs attached to a notifier and doesn't need the specified in them user data, the flushed notifier will post a CQE. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3ecf54c31a85762bf679b0a432c9f43ecf7e61cc.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:41:06 -06:00
Pavel Begunkov	e70cb60893	io_uring: export io_put_task() Make io_put_task() available to non-core parts of io_uring, we'll need it for notification infrastructure. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3686807d4c03b72e389947b0e8692d4d44334ef0.1657643355.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:41:06 -06:00
Jens Axboe	f6b543fd03	io_uring: ensure REQ_F_ISREG is set async offload If we're offloading requests directly to io-wq because IOSQE_ASYNC was set in the sqe, we can miss hashing writes appropriately because we haven't set REQ_F_ISREG yet. This can cause a performance regression with buffered writes, as io-wq then no longer correctly serializes writes to that file. Ensure that we set the flags in io_prep_async_work(), which will cause the io-wq work item to be hashed appropriately. Fixes: `584b0180f0` ("io_uring: move read/write file prep state into actual opcode handler") Link: https://lore.kernel.org/io-uring/20220608080054.GB22428@xsang-OptiPlex-9020/ Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:18 -06:00
Dylan Yudaken	e0486f3f7c	io_uring: only trace one of complete or overflow In overflow we see a duplcate line in the trace, and in some cases 3 lines (if initial io_post_aux_cqe fails). Instead just trace once for each CQE Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	52120f0fad	io_uring: add allow_overflow to io_post_aux_cqe Some use cases of io_post_aux_cqe would not want to overflow as is, but might want to change the flags/result. For example multishot receive requires in order CQE, and so if there is an overflow it would need to stop receiving until the overflow is taken care of. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	114eccdf0e	io_uring: add IOU_STOP_MULTISHOT return code For multishot we want a way to signal the caller that multishot has ended but also this might not be an error return. For example sockets return 0 when closed, which should end a multishot recv, but still have a CQE with result 0 Introduce IOU_STOP_MULTISHOT which does this and indicates that the return code is stored inside req->cqe Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-7-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	ed5ccb3bee	io_uring: remove priority tw list optimisation This optimisation has some built in assumptions that make it easy to introduce bugs. It also does not have clear wins that make it worth keeping. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220622134028.2013417-2-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:15 -06:00
Pavel Begunkov	a6b21fbb4c	io_uring: move list helpers to a separate file It's annoying to have io-wq.h as a dependency every time we want some of struct io_wq_work_list helpers, move them into a separate file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c1d891ce12b30767d1d2a3b7db2ca3abc1ecc4a2.1655802465.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:15 -06:00
Pavel Begunkov	625d38b3fd	io_uring: improve io_run_task_work() Since SQPOLL now uses TWA_SIGNAL_NO_IPI, there won't be task work items without TIF_NOTIFY_SIGNAL. Simplify io_run_task_work() by removing task->task_works check. Even though looks it doesn't cause extra cache bouncing, it's still nice to not touch it an extra time when it might be not cached. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/75d4f34b0c671075892821a409e28da6cb1d64fe.1655802465.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:15 -06:00
Pavel Begunkov	9da070b142	io_uring: consistent naming for inline completion Improve naming of the inline/deferred completion helper so it's consistent with it's *_post counterpart. Add some comments and extra lockdeps to ensure the locking is done right. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/797c619943dac06529e9d3fcb16e4c3cde6ad1a3.1655684496.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:15 -06:00
Pavel Begunkov	46929b0868	io_uring: add io_commit_cqring_flush() Since __io_commit_cqring_flush users moved to different files, introduce io_commit_cqring_flush() helper and encapsulate all flags testing details inside. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0da03887435dd9869ffe46dcd3962bf104afcca3.1655684496.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:15 -06:00
Pavel Begunkov	253993210b	io_uring: introduce locking helpers for CQE posting spin_lock(&ctx->completion_lock); /* post CQEs */ io_commit_cqring(ctx); spin_unlock(&ctx->completion_lock); io_cqring_ev_posted(ctx); We have many places repeating this sequence, and the three function unlock section is not perfect from the maintainance perspective and also makes it harder to add new locking/sync trick. Introduce two helpers. io_cq_lock(), which is simple and only grabs ->completion_lock, and io_cq_unlock_post() encapsulating the three call section. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fe0c682bf7f7b55d9be55b0d034be9c1949277dc.1655684496.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	d9dee4302a	io_uring: remove ->flush_cqes optimisation It's not clear how widely used IOSQE_CQE_SKIP_SUCCESS is, and how often ->flush_cqes flag prevents from completion being flushed. Sometimes it's high level of concurrency that enables it at least for one CQE, but sometimes it doesn't save much because nobody waiting on the CQ. Remove ->flush_cqes flag and the optimisation, it should benefit the normal use case. Note, that there is no spurious eventfd problem with that as checks for spuriousness were incorporated into io_eventfd_signal(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/692e81eeddccc096f449a7960365fa7b4a18f8e6.1655637157.git.asml.silence@gmail.com [axboe: remove now dead state->flush_cqes variable] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	9046c6415b	io_uring: reshuffle io_uring/io_uring.h It's a good idea to first do forward declarations and then inline helpers, otherwise there will be keep stumbling on dependencies between them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1d7fa6672ed43f20ccc0c54ae201369ebc3ebfab.1655637157.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	ab1c84d855	io_uring: make io_uring_types.h public Move io_uring types to linux/include, need them public so tracing can see the definitions and we can clean trace/events/io_uring.h Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a15f12e8cb7289b2de0deaddcc7518d98a132d17.1655384063.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	b3659a65be	io_uring: change ->cqe_cached invariant for CQE32 With IORING_SETUP_CQE32 ->cqe_cached doesn't store a real address but rather an implicit offset into cqes. Store the real cqe pointer and increment it accordingly if CQE32. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1ee1838cba16bed96381a006950b36ba640d998c.1655455613.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	e8c328c391	io_uring: deduplicate io_get_cqe() calls Deduplicate calls to io_get_cqe() from __io_fill_cqe_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4fa077986cc3abab7c59ff4e7c390c783885465f.1655455613.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	ae5735c69b	io_uring: deduplicate __io_fill_cqe_req tracing Deduplicate two trace_io_uring_complete() calls in __io_fill_cqe_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/277ed85dba5189ab7d932164b314013a0f0b0fdc.1655455613.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	68494a65d0	io_uring: introduce io_req_cqe_overflow() __io_fill_cqe_req() is hot and inlined, we want it to be as small as possible. Add io_req_cqe_overflow() accepting only a request and doing all overflow accounting, and replace with it two calls to 6 argument io_cqring_event_overflow(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/048b9fbcce56814d77a1a540409c98c3d383edcb.1655455613.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	faf88dde06	io_uring: don't inline __io_get_cqe() __io_get_cqe() is not as hot as io_get_cqe(), no need to inline it, it sheds ~500B from the binary. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c1ac829198a881b7af8710926f99a3559b9f24c0.1655455613.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	d245bca637	io_uring: don't expose io_fill_cqe_aux() Deduplicate some code and add a helper for filling an aux CQE, locking and notification. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b7c6557c8f9dc5c4cfb01292116c682a0ff61081.1655455613.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:14 -06:00
Pavel Begunkov	75d7b3aec1	io_uring: kill REQ_F_COMPLETE_INLINE REQ_F_COMPLETE_INLINE is only needed to delay queueing into the completion list to io_queue_sqe() as __io_req_complete() is inlined and we don't want to bloat the kernel. As now we complete in a more centralised fashion in io_issue_sqe() we can get rid of the flag and queue to the list directly. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/600ba20a9338b8a39b249b23d3d177803613dde4.1655371007.git.asml.silence@gmail.com Reviewed-by: Hao Xu <howeyxu@tencent.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:13 -06:00
Pavel Begunkov	aa1e90f64e	io_uring: move small helpers to headers There is a bunch of inline helpers that will be useful not only to the core of io_uring, move them to headers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/22df99c83723e44cba7e945e8519e64e3642c064.1655310733.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:13 -06:00
Jens Axboe	f3b44f92e5	io_uring: move read/write related opcodes to its own file Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00
Jens Axboe	c98817e6cd	io_uring: move remaining file table manipulation to filetable.c Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00
Jens Axboe	7357298448	io_uring: move rsrc related data, core, and commands Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00
Jens Axboe	3b77495a97	io_uring: split provided buffers handling into its own file Move both the opcodes related to it, and the internals code dealing with it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00
Jens Axboe	7aaff708a7	io_uring: move cancelation into its own file This also helps cleanup the io_uring.h cancel parts, as we can make things static in the cancel.c file, mostly. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00
Jens Axboe	329061d3e2	io_uring: move poll handling into its own file Add a io_poll_issue() rather than export the general task_work locking and io_issue_sqe(), and put the io_op_defs definition and structure into a separate header file so that poll can use it. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00
Jens Axboe	c9f06aa7de	io_uring: move io_uring_task (tctx) helpers into its own file Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00
Jens Axboe	17437f3114	io_uring: move SQPOLL related handling into its own file Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00
Jens Axboe	59915143e8	io_uring: move timeout opcodes and handling into its own file Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:12 -06:00
Jens Axboe	f9ead18c10	io_uring: split network related opcodes into its own file While at it, convert the handlers to just use io_eopnotsupp_prep() if CONFIG_NET isn't set. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:11 -06:00
Jens Axboe	99f15d8d61	io_uring: move uring_cmd handling to its own file Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:11 -06:00
Jens Axboe	cd40cae29e	io_uring: split out open/close operations Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:11 -06:00
Jens Axboe	531113bbd5	io_uring: split out splice related operations This splits out splice and tee support. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:11 -06:00
Jens Axboe	97b388d70b	io_uring: handle completions in the core Normally request handlers complete requests themselves, if they don't return an error. For the latter case, the core will complete it for them. This is unhandy for pushing opcode handlers further out, as we don't want a bunch of inline completion code and we don't want to make the completion path slower than it is now. Let the core handle any completion, unless the handler explicitly asks us not to. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:11 -06:00
Jens Axboe	de23077eda	io_uring: set completion results upfront Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:11 -06:00

45 Commits