Jens Axboe
2be2eb02e2 io_uring: ensure reads re-import for selected buffers
If we drop buffers between scheduling a retry and re-executing the request,
then we need to re-import them when we start the request again.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:55:01 -07:00
Jens Axboe
9af177ee3e io_uring: retry early for reads if we can poll
Most of the logic in io_read() deals with regular files, and in some ways
it would make sense to split the handling into S_IFREG and others. But
at least for retry, we don't need to bother setting up a bunch of state
just to abort in the loop later. In particular, don't bother forcing
setup of async data for a normal non-vectored read when we don't need it.
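
A minimal sketch of the early-exit shape this describes (identifiers are
illustrative, not the exact fs/io_uring.c code):

```
/* sketch: if the file is pollable, don't build up async retry state,
 * just punt back and let poll-driven retry re-issue the read */
if (ret == -EAGAIN && file_can_poll(req->file))
	return -EAGAIN;
```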

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:45:57 -07:00
Stefan Roesch
1b6fe6e0df io-uring: Make statx API stable
One of the key architectural tenets of io-uring is to keep the parameters
for a call stable: after the call has been submitted, the application may
change or reuse the values it passed in. Unfortunately this is not the
case for the current statx implementation.

IO-Uring change:
This change replaces the const char * filename pointer in the io_statx
structure with a struct filename *. In addition, the filename object is
now created during the prepare phase.

With this change, the opcode also needs to invoke cleanup, so the
filename object gets freed after processing the request.

fs change:
This replaces the const char __user * filename parameter in the two
functions do_statx and vfs_statx with a struct filename *. In addition,
to be able to correctly construct a filename object, a new helper
function, getname_statx_lookup_flags, is introduced. It makes sure that
do_statx and vfs_statx are invoked with the correct lookup flags.
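
A hedged sketch of the resulting shape (field layout and the local names
sx, path and lookup_flags are stand-ins, not the verbatim kernel code):

```
/* io_statx now carries a kernel-owned filename, resolved at prep time */
struct io_statx {
	struct file		*file;
	int			dfd;
	unsigned int		mask;
	unsigned int		flags;
	struct filename		*filename;	/* was: const char __user * */
	struct statx __user	*buffer;
};

/* in the prep phase, while the SQE parameters are still stable: */
lookup_flags = getname_statx_lookup_flags(sx->flags);
sx->filename = getname_flags(path, lookup_flags, NULL);
```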

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220225185326.1373304-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:33:55 -07:00
Olivier Langlois
adc8682ec6 io_uring: Add support for napi_busy_poll
The sqpoll thread can be used to perform napi busy polling in a similar
way to how it does io polling for file systems that support direct
access bypassing the page cache.

The other way that io_uring can be used for napi busy poll is by
calling io_uring_enter() to get events.

If the user specifies a timeout value, it is distributed between polling
and sleeping according to the systemwide setting
/proc/sys/net/core/busy_poll.
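
As a hedged userspace illustration of the io_uring_enter() mode
(liburing style; assumes 'ring' is an initialized struct io_uring with
requests pending on a busy-poll capable socket):

```
/* with /proc/sys/net/core/busy_poll set, a slice of this timeout is
 * spent busy polling the NAPI context before the task sleeps */
struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = 200 * 1000 };
struct io_uring_cqe *cqe;

int ret = io_uring_wait_cqe_timeout(&ring, &cqe, &ts);
```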

The changes have been tested with this program:
https://github.com/lano1106/io_uring_udp_ping

and the result is:
Without sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 40.631/42.050/58.667/1.547 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 30.619/31.753/61.433/1.456 us

With sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 42.087/44.438/59.508/1.533 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 35.779/37.347/52.201/0.924 us

Co-developed-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/810bd9408ffc510ff08269e78dca9df4af0b9e4e.1646777484.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:18:30 -07:00
Olivier Langlois
950e79dd73 io_uring: minor io_cqring_wait() optimization
Move the block manipulating the sig variable up, so that code which may
encounter an error executes and exits first, before the rest of the
function runs. This avoids useless computation on the error path.

Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/84513f7cc1b1fb31d8f4cb910aee033391d036b4.1646777484.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:18:30 -07:00
Jens Axboe
4f57f06ce2 io_uring: add support for IORING_OP_MSG_RING command
This adds support for IORING_OP_MSG_RING, which allows an SQE to signal
another ring. That allows either waking up someone waiting on the ring,
or even passing a 64-bit value via the user_data field in the CQE.

sqe->fd must contain the fd of a ring that should receive the CQE.
sqe->off will be propagated to the cqe->user_data on the target ring,
and sqe->len will be propagated to cqe->res. The resulting CQE will have
IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated
from a messaging request rather than an SQE issued locally on that ring.
This effectively allows passing a 64-bit and a 32-bit quantity between
the two rings.

This request type has the following request specific error cases:

- -EBADFD. Set if the sqe->fd doesn't point to a file descriptor that is
  of the io_uring type.
- -EOVERFLOW. Set if we were not able to deliver a request to the target
  ring.
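
For illustration, filling the SQE by hand per the field mapping above
(a sketch; target_ring_fd is a placeholder and error handling is
omitted):

```
#include <liburing.h>
#include <string.h>

/* assumes 'ring' is an initialized struct io_uring, and target_ring_fd
 * refers to another ring (possibly in another thread or process) */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

memset(sqe, 0, sizeof(*sqe));
sqe->opcode = IORING_OP_MSG_RING;
sqe->fd = target_ring_fd;	/* must be an io_uring fd, else -EBADFD */
sqe->off = 0xcafef00d;		/* becomes cqe->user_data on the target */
sqe->len = 42;			/* becomes cqe->res on the target */
io_uring_submit(&ring);
```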

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 09:16:04 -07:00
Jens Axboe
cc3cec8367 io_uring: speedup provided buffer handling
In testing high frequency workloads with provided buffers, we spend a
lot of time allocating and freeing the buffer units themselves. Rather
than repeatedly freeing and allocating them, add a recycling cache
instead. There are two caches:

- ctx->io_buffers_cache. This is the one we grab from in the submission
  path, and it's protected by ctx->uring_lock. For inline completions,
  we can recycle straight back to this cache and not need any extra
  locking.

- ctx->io_buffers_comp. If we're not under uring_lock, then we use this
  list to recycle buffers. It's protected by the completion_lock.

On adding a new buffer, check io_buffers_cache. If it's empty, check if
we can splice entries from the io_buffers_comp list.

This removes about 5-10% of the overhead from provided buffers, bringing
them pretty close to the non-provided path.
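
A sketch of the recycle decision as described (illustrative only; the
holding_uring_lock condition and buf layout are assumptions):

```
/* recycle a provided buffer instead of kfree()ing it */
if (holding_uring_lock) {		/* e.g. an inline completion */
	list_add(&buf->list, &ctx->io_buffers_cache);
} else {
	spin_lock(&ctx->completion_lock);
	list_add(&buf->list, &ctx->io_buffers_comp);
	spin_unlock(&ctx->completion_lock);
}
```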

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:33:14 -07:00
Jens Axboe
e7a6c00dc7 io_uring: add support for registering ring file descriptors
Lots of workloads use multiple threads, in which case the file table is
shared between them. This makes getting and putting the ring file
descriptor for each io_uring_enter(2) system call more expensive, as it
involves an atomic get and put for each call.

Similarly to how we allow registering normal file descriptors to avoid
this overhead, add support for an io_uring_register(2) API that allows
the application to register the ring fds themselves:

1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update
   structs, and registers them with the task.
2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_rsrc_update
   structs, and unregisters them.

When a ring fd is registered, it is internally represented by an offset.
This offset is returned to the application, and the application then
uses this offset and sets IORING_ENTER_REGISTERED_RING for the
io_uring_enter(2) system call. This works just like using a registered
file descriptor, rather than a real one, in an SQE, where
IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal
offset/descriptor rather than a real file descriptor.

In initial testing, this provides a nice bump in performance for
threaded applications in real world cases where the batch count (eg
number of requests submitted per io_uring_enter(2) invocation) is low.
In a microbenchmark, submitting NOP requests, we see the following
increases in performance:

Requests per syscall	Baseline	Registered	Increase
----------------------------------------------------------------
1			 ~7030K		 ~8080K		+15%
2			~13120K		~14800K		+13%
4			~22740K		~25300K		+11%
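
In userspace terms the flow looks roughly like this (a sketch using
pseudo-wrappers for the raw io_uring_register(2)/io_uring_enter(2)
syscalls; the -1U slot-allocation convention is an assumption):

```
struct io_uring_rsrc_update up = {
	.offset	= -1U,		/* let the kernel pick a slot */
	.data	= ring_fd,	/* the real ring fd to register */
};

/* kernel writes the assigned index back into up.offset */
io_uring_register(ring_fd, IORING_REGISTER_RING_FDS, &up, 1);

/* afterwards, pass the offset plus the new flag instead of the real fd */
io_uring_enter(up.offset, to_submit, min_complete,
	       flags | IORING_ENTER_REGISTERED_RING, NULL);
```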

Co-developed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
63c3654973 io_uring: documentation fixup
Fix incorrect name reference in comment. ki_filp does not exist in the
struct, but file does.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220224105157.1332353-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
b4aec40015 io_uring: do not recalculate ppos unnecessarily
There is a slight optimisation to be had by calculating the correct pos
pointer inside io_kiocb_update_pos and then using that later.

It seems code size drops by a bit:
000000000000a1b0 0000000000000400 t io_read
000000000000a5b0 0000000000000319 t io_write

vs
000000000000a1b0 00000000000003f6 t io_read
000000000000a5b0 0000000000000310 t io_write

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
d34e1e5b39 io_uring: update kiocb->ki_pos at execution time
Update kiocb->ki_pos at execution time rather than in io_prep_rw().
io_prep_rw() happens before the job is enqueued to a worker, and so the
offset might be read multiple times for a request that executes only once.

This ensures that the file position in a set of _linked_ SQEs will only
be obtained after earlier SQEs have completed, and so will include their
incremented file position.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Dylan Yudaken
af9c45eceb io_uring: remove duplicated calls to io_kiocb_ppos
io_kiocb_ppos is called in both branches, and it seems that the compiler
does not fuse this. Fusing removes a few bytes from loop_rw_iter.

Before:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 0000000000000124 t loop_rw_iter

After:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 000000000000010d t loop_rw_iter

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Olivier Langlois
c5020bc8d9 io_uring: Remove unneeded test in io_run_task_work_sig()
Avoid testing TIF_NOTIFY_SIGNAL twice by calling task_sigpending()
directly from io_run_task_work_sig().

Signed-off-by: Olivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/bd7c0495f7656e803e5736708591bb665e6eaacd.1645041650.git.olivier@trillion01.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Stefan Roesch
502c87d655 io-uring: Make tracepoints consistent.
This makes the io-uring tracepoints consistent. Where it makes sense
the tracepoints start with the following four fields:
- context (ring)
- request
- user_data
- opcode.

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Stefan Roesch
d5ec1dfaf5 io-uring: add __fill_cqe function
This introduces the __fill_cqe function. This is necessary
to correctly issue the io_uring_complete tracepoint.

Signed-off-by: Stefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-2-shr@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Nathan Chancellor
f0a4e62bb5 io_uring: Fix use of uninitialized ret in io_eventfd_register()
Clang warns:

  fs/io_uring.c:9396:9: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
          return ret;
                 ^~~
  fs/io_uring.c:9373:13: note: initialize the variable 'ret' to silence this warning
          int fd, ret;
                     ^
                      = 0
  1 warning generated.

Just return 0 directly and reduce the scope of ret to the if statement,
as that is the only place where it is used; this is how the function
looked before the commit referenced in the Fixes tag.
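
The resulting shape, per that description (an approximate sketch, not
the exact fs/io_uring.c code):

```
ev_fd->cq_ev_fd = eventfd_ctx_fdget(fd);
if (IS_ERR(ev_fd->cq_ev_fd)) {
	int ret = PTR_ERR(ev_fd->cq_ev_fd);	/* ret scoped to this branch */

	kfree(ev_fd);
	return ret;
}
/* ... */
return 0;	/* return 0 directly, not a maybe-uninitialized ret */
```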

Fixes: 1a75fac9a0f9 ("io_uring: avoid ring quiesce while registering/unregistering eventfd")
Link: https://github.com/ClangBuiltLinux/linux/issues/1579
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20220207162410.1013466-1-nathan@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
8bb649ee1d io_uring: remove ring quiesce for io_uring_register
None of the opcodes in io_uring_register use ring quiesce anymore. Hence
io_register_op_must_quiesce always returns false and io_ctx_quiesce is
never called.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-6-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
ff16cfcfda io_uring: avoid ring quiesce while registering restrictions and enabling rings
IORING_SETUP_R_DISABLED prevents submitting requests and so there will be
no requests until IORING_REGISTER_ENABLE_RINGS is called. And
IORING_REGISTER_RESTRICTIONS works only before
IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not needed
for these opcodes.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-5-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
c75312dd59 io_uring: avoid ring quiesce while registering async eventfd
This is done using the RCU data structure (io_ev_fd). eventfd_async is
moved from io_ring_ctx to io_ev_fd, which is RCU protected, hence
avoiding ring quiesce, which is much more expensive than an RCU lock.
The place where eventfd_async is read is already under rcu_read_lock, so
no extra RCU read-side critical section is needed.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-4-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
77bc59b498 io_uring: avoid ring quiesce while registering/unregistering eventfd
This is done by creating a new RCU data structure (io_ev_fd) as part of
io_ring_ctx that holds the eventfd_ctx.

The function io_eventfd_signal is executed under rcu_read_lock with a
single rcu_dereference to io_ev_fd so that if another thread unregisters
the eventfd while io_eventfd_signal is still being executed, the
eventfd_signal for which io_eventfd_signal was called completes
successfully.

The process of registering/unregistering eventfd is already done under
uring_lock so multiple threads won't enter a race condition while
registering/unregistering eventfd.

With the above approach, ring quiesce, which is much more expensive than
an RCU lock, can be avoided. On the system tested, io_uring_register
with IORING_REGISTER_EVENTFD takes less than 1ms with the RCU lock,
compared to 15ms before with ring quiesce.
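
A hedged sketch of the signalling side described above (approximate, not
the verbatim kernel code):

```
static void io_eventfd_signal(struct io_ring_ctx *ctx)
{
	struct io_ev_fd *ev_fd;

	rcu_read_lock();
	/* single rcu_dereference: a concurrent unregister can't free
	 * ev_fd under us, the RCU grace period keeps it alive */
	ev_fd = rcu_dereference(ctx->io_ev_fd);
	if (ev_fd)
		eventfd_signal(ev_fd->cq_ev_fd, 1);
	rcu_read_unlock();
}
```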

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-3-usama.arif@bytedance.com
[axboe: long line fixups]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Usama Arif
2757be22c0 io_uring: remove trace for eventfd
The information on whether eventfd is registered is not very useful, and
would result in the tracepoint being enclosed in an rcu_read_lock section
in a later patch that tries to avoid ring quiesce for registering eventfd.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-2-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-10 06:32:49 -07:00
Christoph Hellwig
41d36a9f3e fs: remove kiocb.ki_hint
This field is entirely unused now except for a tracepoint in f2fs, so
remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220308060529.736277-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08 17:55:03 -07:00
Dylan Yudaken
80912cef18 io_uring: disallow modification of rsrc_data during quiesce
io_rsrc_ref_quiesce will unlock the uring while it waits for references to
the io_rsrc_data to be killed.
There are other places that might add references to the data via calls
to io_rsrc_node_switch.
There is a race condition where this reference can be added after the
completion has been signalled. At this point the io_rsrc_ref_quiesce call
will wake up and relock the uring, assuming the data is unused and can be
freed - although it is actually being used.

To fix this check in io_rsrc_ref_quiesce if a resource has been revived.

Reported-by: syzbot+ca8bf833622a1662745b@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220222161751.995746-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-22 09:57:32 -07:00
Jens Axboe
228339662b io_uring: don't convert to jiffies for waiting on timeouts
If an application calls io_uring_enter(2) with a timespec passed in,
convert that timespec to ktime_t rather than jiffies. The latter does
not provide the granularity the application may expect, and may in
fact provide different granularity on different systems, depending
on what HZ value is configured.

Turn the timespec into an absolute ktime_t, and use that with
schedule_hrtimeout() instead.
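
Roughly, as a sketch of the conversion described (uts and the
surrounding wait loop are placeholders):

```
struct timespec64 ts;
ktime_t timeout;

if (get_timespec64(&ts, uts))
	return -EFAULT;
/* absolute deadline instead of a jiffies-granularity relative count */
timeout = ktime_add_ns(ktime_get(), timespec64_to_ns(&ts));
/* ... later, in the wait loop ... */
ret = schedule_hrtimeout(&timeout, HRTIMER_MODE_ABS);
```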

Link: https://github.com/axboe/liburing/issues/531
Cc: stable@vger.kernel.org
Reported-by: Bob Chen <chenbo.chen@alibaba-inc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-21 05:55:42 -07:00
Eric Dumazet
f240762f88 io_uring: add a schedule point in io_add_buffers()
Looping ~65535 times doing kmalloc() calls can trigger soft lockups,
especially with DEBUG features (like KASAN).
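
The fix is the classic pattern (a sketch of the loop in
io_add_buffers(); locals and gfp flags are illustrative):

```
while (nbufs--) {
	buf = kmalloc(sizeof(*buf), GFP_KERNEL);
	if (!buf)
		break;
	/* ... fill in buf and link it into the buffer list ... */
	cond_resched();	/* the added schedule point: avoids soft lockups */
}
```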

[  253.536212] watchdog: BUG: soft lockup - CPU#64 stuck for 26s! [b219417889:12575]
[  253.544433] Modules linked in: vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[  253.544451] CPU: 64 PID: 12575 Comm: b219417889 Tainted: G S         O      5.17.0-smp-DEV #801
[  253.544457] RIP: 0010:kernel_text_address (./include/asm-generic/sections.h:192 ./include/linux/kallsyms.h:29 kernel/extable.c:67 kernel/extable.c:98)
[  253.544464] Code: 0f 93 c0 48 c7 c1 e0 63 d7 a4 48 39 cb 0f 92 c1 20 c1 0f b6 c1 5b 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 53 48 89 fb <48> c7 c0 00 00 80 a0 41 be 01 00 00 00 48 39 c7 72 0c 48 c7 c0 40
[  253.544468] RSP: 0018:ffff8882d8baf4c0 EFLAGS: 00000246
[  253.544471] RAX: 1ffff1105b175e00 RBX: ffffffffa13ef09a RCX: 00000000a13ef001
[  253.544474] RDX: ffffffffa13ef09a RSI: ffff8882d8baf558 RDI: ffffffffa13ef09a
[  253.544476] RBP: ffff8882d8baf4d8 R08: ffff8882d8baf5e0 R09: 0000000000000004
[  253.544479] R10: ffff8882d8baf5e8 R11: ffffffffa0d59a50 R12: ffff8882eab20380
[  253.544481] R13: ffffffffa0d59a50 R14: dffffc0000000000 R15: 1ffff1105b175eb0
[  253.544483] FS:  00000000016d3380(0000) GS:ffff88af48c00000(0000) knlGS:0000000000000000
[  253.544486] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  253.544488] CR2: 00000000004af0f0 CR3: 00000002eabfa004 CR4: 00000000003706e0
[  253.544491] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  253.544492] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  253.544494] Call Trace:
[  253.544496]  <TASK>
[  253.544498] ? io_queue_sqe (fs/io_uring.c:7143)
[  253.544505] __kernel_text_address (kernel/extable.c:78)
[  253.544508] unwind_get_return_address (arch/x86/kernel/unwind_frame.c:19)
[  253.544514] arch_stack_walk (arch/x86/kernel/stacktrace.c:27)
[  253.544517] ? io_queue_sqe (fs/io_uring.c:7143)
[  253.544521] stack_trace_save (kernel/stacktrace.c:123)
[  253.544527] ____kasan_kmalloc (mm/kasan/common.c:39 mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:515)
[  253.544531] ? ____kasan_kmalloc (mm/kasan/common.c:39 mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:515)
[  253.544533] ? __kasan_kmalloc (mm/kasan/common.c:524)
[  253.544535] ? kmem_cache_alloc_trace (./include/linux/kasan.h:270 mm/slab.c:3567)
[  253.544541] ? io_issue_sqe (fs/io_uring.c:4556 fs/io_uring.c:4589 fs/io_uring.c:6828)
[  253.544544] ? __io_queue_sqe (fs/io_uring.c:?)
[  253.544551] __kasan_kmalloc (mm/kasan/common.c:524)
[  253.544553] kmem_cache_alloc_trace (./include/linux/kasan.h:270 mm/slab.c:3567)
[  253.544556] ? io_issue_sqe (fs/io_uring.c:4556 fs/io_uring.c:4589 fs/io_uring.c:6828)
[  253.544560] io_issue_sqe (fs/io_uring.c:4556 fs/io_uring.c:4589 fs/io_uring.c:6828)
[  253.544564] ? __kasan_slab_alloc (mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:469)
[  253.544567] ? __kasan_slab_alloc (mm/kasan/common.c:39 mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:469)
[  253.544569] ? kmem_cache_alloc_bulk (mm/slab.h:732 mm/slab.c:3546)
[  253.544573] ? __io_alloc_req_refill (fs/io_uring.c:2078)
[  253.544578] ? io_submit_sqes (fs/io_uring.c:7441)
[  253.544581] ? __se_sys_io_uring_enter (fs/io_uring.c:10154 fs/io_uring.c:10096)
[  253.544584] ? __x64_sys_io_uring_enter (fs/io_uring.c:10096)
[  253.544587] ? do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[  253.544590] ? entry_SYSCALL_64_after_hwframe (??:?)
[  253.544596] __io_queue_sqe (fs/io_uring.c:?)
[  253.544600] io_queue_sqe (fs/io_uring.c:7143)
[  253.544603] io_submit_sqe (fs/io_uring.c:?)
[  253.544608] io_submit_sqes (fs/io_uring.c:?)
[  253.544612] __se_sys_io_uring_enter (fs/io_uring.c:10154 fs/io_uring.c:10096)
[  253.544616] __x64_sys_io_uring_enter (fs/io_uring.c:10096)
[  253.544619] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[  253.544623] entry_SYSCALL_64_after_hwframe (??:?)

Fixes: ddf0322db79c ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring <io-uring@vger.kernel.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20220215041003.2394784-1-eric.dumazet@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-15 07:47:16 -07:00
Shakeel Butt
0a3f1e0bea mm: io_uring: allow oom-killer from io_uring_setup
On an overcommitted system which is running multiple workloads of
varying priorities, it is preferable to trigger the oom-killer to kill a
low priority workload rather than to let the high priority workload
receive ENOMEMs. On our memory overcommitted systems, we are seeing a
lot of ENOMEMs instead of oom-kills because the io_uring_setup callchain
uses the __GFP_NORETRY gfp flag, which avoids the oom-killer. Let's remove it and
allow the oom-killer to kill a lower priority job.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Link: https://lore.kernel.org/r/20220125051736.2981459-1-shakeelb@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-07 08:44:01 -07:00
Alviro Iskandar Setiawan
0d7c1153d9 io_uring: Clean up a false-positive warning from GCC 9.3.0
In io_recv(), if import_single_range() fails, the @flags variable is
left uninitialized, and the code jumps to out_free.

After the goto, the compiler doesn't know that (ret < min_ret) is
always true, so it thinks the "if ((flags & MSG_WAITALL) ..."  path
could be taken.

The complaint comes from gcc-9 (Debian 9.3.0-22) 9.3.0:
```
  fs/io_uring.c:5238 io_recvfrom() error: uninitialized symbol 'flags'
```
Fix this by bypassing the @ret and @flags check when
import_single_range() fails.

Reasons:
 1. import_single_range() only returns -EFAULT when it fails.
 2. At that point, @flags is uninitialized and shouldn't be read.
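
Shape of the fix (a sketch, using the io_recv() locals named above):

```
ret = import_single_range(READ, sr->buf, sr->len, &iov, &msg.msg_iter);
if (unlikely(ret))
	goto out_free;	/* -EFAULT; skip the @flags/min_ret checks below */
```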

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: "Chen, Rong A" <rong.a.chen@intel.com>
Link: https://lore.gnuweeb.org/timl/d33bb5a9-8173-f65b-f653-51fc0681c6d6@intel.com/
Cc: Pavel Begunkov <asml.silence@gmail.com>
Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Fixes: 7297ce3d59449de49d3c9e1f64ae25488750a1fc ("io_uring: improve send/recv error handling")
Signed-off-by: Alviro Iskandar Setiawan <alviro.iskandar@gmail.com>
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Link: https://lore.kernel.org/r/20220207140533.565411-1-ammarfaizi2@gnuweeb.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-07 08:38:07 -07:00
Usama Arif
f6133fbd37 io_uring: remove unused argument from io_rsrc_node_alloc
io_ring_ctx is not used in the function.

Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220127140444.4016585-1-usama.arif@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-27 10:18:53 -07:00
Dylan Yudaken
b36a205004 io_uring: fix bug in slow unregistering of nodes
In some cases io_rsrc_ref_quiesce will call io_rsrc_node_switch_start,
and then immediately flush the delayed work queue &ctx->rsrc_put_work.

However percpu_ref_put does not immediately destroy the node; the
release callback is invoked asynchronously via RCU. That ends up with
io_rsrc_node_ref_zero only being called after rsrc_put_work has been
flushed, so the process ends up sleeping for 1 second unnecessarily.

This patch executes the put code immediately if we are busy
quiescing.

Fixes: 4a38aed2a0a7 ("io_uring: batch reap of dead file registrations")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220121123856.3557884-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-23 09:12:53 -07:00
Jens Axboe
ccbf726171 io_uring: perform poll removal even if async work removal is successful
An active work can have poll armed, hence it's not enough to just do
the async work removal and return the value if it's different from "not
found". Rather than make poll removal special, just fall through to do
the remaining type lookups and removals.

Reported-by: Florian Fischer <florian.fl.fischer@fau.de>
Link: https://lore.kernel.org/io-uring/20220118151337.fac6cthvbnu7icoc@pasture/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-18 19:28:43 -07:00
Pavel Begunkov
791f3465c4 io_uring: fix UAF due to missing POLLFREE handling
Fixes a problem described in 50252e4b5e989
("aio: fix use-after-free due to missing POLLFREE handling")
and copies the approach used there.

In short, we have to forcibly eject a poll entry when we meet POLLFREE.
We can't rely on io_poll_get_ownership() as we can't wait for
potentially running tw handlers, so we use the fact that wqs are RCU
freed. See Eric's patch and comments for more details.

Reported-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org
Reported-and-tested-by: syzbot+5426c7ed6868c705ca14@syzkaller.appspotmail.com
Fixes: 221c5eb233823 ("io_uring: add support for IORING_OP_POLL")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ed56b6f548f7ea337603a82315750449412748a.1642161259.git.asml.silence@gmail.com
[axboe: drop non-functional change from patch]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-14 06:48:35 -07:00
Jiapeng Chong
c84b8a3fef io_uring: Remove unused function req_ref_put
Fix the following clang warnings:

fs/io_uring.c:1195:20: warning: unused function 'req_ref_put'
[-Wunused-function].

Fixes: aa43477b0402 ("io_uring: poll rework")
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220113162005.3011-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-13 12:43:05 -07:00
Linus Torvalds
d3c8108035 for-5.17/block-2022-01-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHd8DAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnhRD/wMAjsNO65PCA+o/bPpVi4ulx9EejAzrJnB
 5vHFvREAoOOGKvRpYGe4w3TcKyW+zPb+GtlXFjPfK+wuVzWhrQtW/+vkjKlBt8wK
 o7rzeMwTKJ9ZGvYaaQpp1yC0WURBB3qnCRQhb8dOQzhJgEXinhIOznZsut4mniLv
 fTqcDmKAb/+G6K6CQCCqnH0I/+OJZyUeSFo1kk2i4ZqCBepQpBkOL6H2rBOtGxUg
 bt1jiGHbbhCRYEE3u2kV0HP10qAChNaMQC705jV4Qpf4+3EntSxs+6nSb74dvMkX
 3+Wmp8Ctq6lpPnDL1nrAFGz3jZnB0Y+GdgOclQn3ViQd1FCXZzuYWQ3fTaBfURCZ
 /RE5nc047SqpwCFLOynM++OkaeQZ1zSxeyoFTtzDaPF4tLuaX3JHswvTzNGPw8SN
 BnexseNnNBCjJliZSEE7fOkjJDcev2dvRxPtI8/wkF4lHUgETc5IW563C53xo/Tx
 32yFjZwCVIpNWk21su/0H3iEq80wZ7PnriiN/E3JA6XbnevlRPu0NPMb0D258GCm
 yCcdPVDNZsQCB8hluqZcu0g6LSgZRo90Yg1oqKqEpAllJJMBaEAPPPuUIJh998mo
 iKGxZzgr7d9jrbGJTInp0F8b3B3/oV/hxgzy0Hu/mHP3AsnaAk9o/oEQZ7rX4Khr
 6biloqkIMA==
 =RWnJ
 -----END PGP SIGNATURE-----

Merge tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Unify where the struct request handling code is located in the blk-mq
   code (Christoph)

 - Header cleanups (Christoph)

 - Clean up the io_context handling code (Christoph, me)

 - Get rid of ->rq_disk in struct request (Christoph)

 - Error handling fix for add_disk() (Christoph)

 - request allocation cleanups (Christoph)

 - Documentation updates (Eric, Matthew)

 - Remove trivial crypto unregister helper (Eric)

 - Reduce shared tag overhead (John)

 - Reduce poll_stats memory overhead (me)

 - Known indirect function call for dio (me)

 - Use atomic references for struct request (me)

 - Support request list issue for block and NVMe (me)

 - Improve queue dispatch pinning (Ming)

 - Improve the direct list issue code (Keith)

 - BFQ improvements (Jan)

 - Direct completion helper and use it in mmc block (Sebastian)

 - Use raw spinlock for the blktrace code (Wander)

 - fsync error handling fix (Ye)

 - Various fixes and cleanups (Lukas, Randy, Yang, Tetsuo, Ming, me)

* tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block: (132 commits)
  MAINTAINERS: add entries for block layer documentation
  docs: block: remove queue-sysfs.rst
  docs: sysfs-block: document virt_boundary_mask
  docs: sysfs-block: document stable_writes
  docs: sysfs-block: fill in missing documentation from queue-sysfs.rst
  docs: sysfs-block: add contact for nomerges
  docs: sysfs-block: sort alphabetically
  docs: sysfs-block: move to stable directory
  block: don't protect submit_bio_checks by q_usage_counter
  block: fix old-style declaration
  nvme-pci: fix queue_rqs list splitting
  block: introduce rq_list_move
  block: introduce rq_list_for_each_safe macro
  block: move rq_list macros to blk-mq.h
  block: drop needless assignment in set_task_ioprio()
  block: remove unnecessary trailing '\'
  bio.h: fix kernel-doc warnings
  block: check minor range in device_add_disk()
  block: use "unsigned long" for blk_validate_block_size().
  block: fix error unwinding in device_add_disk
  ...
2022-01-12 10:26:52 -08:00
Linus Torvalds
42a7b4ed45 for-5.17/io_uring-2022-01-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHd8BkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkgED/4vyoNlKfuTqRzuHAZzq+PBYlMpVbn54ebm
 z2AJ2StLrZ8rflz/9ERNVYxbSS6ELf/8H2Mdjnb5uCoRza/QkMf4JnGtgT0pBavV
 uoruKTiYgQAQJmGuhxAo0TQwyW7F6qadp1eZH8CqxB2Cwj5wgSb3SB/yIlNkj6+2
 fSzVGCh0CljhrkgFG10z507VGJYmgBqE9kl2tnfYkAA4blqyowZN4boTMx3l9fm5
 2GWt0xg80lgMCzHb5m/sbqdOrMJV77TkuiAuVc+FjIfdGiVWo7XCn3q3e0qqJhRM
 A3x9mmUyF/xNrRLMgTyYxHzKLrytBibr+VxvDLr/LmLtV8viVCsgJiP5EIiAWIAt
 BCBks31QUxHaB9fyiQ43/nsRZLHmhXErLEK4D0loAXa50p4Xl4ZdaegD6txSH3FI
 mGprv4Si1kMNqSxmzrOUfFk8Xdv/vC08RWL4UbRa8xkgwWUAIhpRaYmAtmw913kb
 MgwFQuGpE9b7ae7HNdgWzyI424srCwY5UawgpzI25ZwGAVXTXDQ2qw1Lmyyo6swT
 bsYVyH/vJLvCS1tjmBtrKfJQ0Mokm1sJpaMeTs/SfSyHmAUXEsEFdXVu6bRnVkYF
 9vjHZKOl5jA5nx6/JDpW993GnHV3FGkCDfENXs3wUYY7Hu0DDYZt0CGiqGLjC7Oq
 Ow4q3aEJQA==
 =syIB
 -----END PGP SIGNATURE-----

Merge tag 'for-5.17/io_uring-2022-01-11' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - Support for prioritized work completions (Hao)

 - Simplification of reissue (Pavel)

 - Add support for CQE skip (Pavel)

 - Memory leak fix going to 5.15-stable (Pavel)

 - Re-write of internal poll. This both cleans up that code, and gets us
   ready to fix the POLLFREE issue (Pavel)

 - Various cleanups (GuoYong, Pavel, Hao)

* tag 'for-5.17/io_uring-2022-01-11' of git://git.kernel.dk/linux-block: (31 commits)
  io_uring: fix not released cached task refs
  io_uring: remove redundant tab space
  io_uring: remove unused function parameter
  io_uring: use completion batching for poll rem/upd
  io_uring: single shot poll removal optimisation
  io_uring: poll rework
  io_uring: kill poll linking optimisation
  io_uring: move common poll bits
  io_uring: refactor poll update
  io_uring: remove double poll on poll update
  io_uring: code clean for some ctx usage
  io_uring: batch completion in prior_task_list
  io_uring: split io_req_complete_post() and add a helper
  io_uring: add helper for task work execution code
  io_uring: add a priority tw list for irq completion work
  io-wq: add helper to merge two wq_lists
  io_uring: reuse io_req_task_complete for timeouts
  io_uring: tweak iopoll CQE_SKIP event counting
  io_uring: simplify selected buf handling
  io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  ...
2022-01-12 10:20:35 -08:00
Pavel Begunkov
3cc7fdb9f9 io_uring: fix not released cached task refs
tctx_task_work() may get run after io_uring cancellation, and then there
will be no one to put the cached tctx task refs that may have been added
back by tw handlers using the inline completion infra. Call
io_uring_drop_tctx_refs() at the end of the main tw handler to release
them.

Cc: stable@vger.kernel.org # 5.15+
Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Fixes: e98e49b2bbf7 ("io_uring: extend task put optimisations")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/69f226b35fbdb996ab799a8bbc1c06bf634ccec1.1641688805.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-09 09:22:49 -07:00
GuoYong Zheng
c0235652ee io_uring: remove redundant tab space
When showing fdinfo, SqMask is followed by two tabs, which is
inconsistent with the other parameters. Remove one so it lines up nicely.

Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1641377585-1891-1-git-send-email-zhenggy@chinatelecom.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-05 12:31:37 -07:00
GuoYong Zheng
00f6e68b8d io_uring: remove unused function parameter
The res2 parameter is not used in __io_complete_rw; remove it.

Fixes: 6b19b766e8f0 ("fs: get rid of the res2 iocb->ki_complete argument")
Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1641377522-1851-1-git-send-email-zhenggy@chinatelecom.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-05 12:29:26 -07:00
Keith Busch
edce22e19b block: move rq_list macros to blk-mq.h
Move the request list macros to the header file that defines the struct
they operate on.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220105170518.3181469-2-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-05 12:25:42 -07:00
Pavel Begunkov
cc8e9ba71a io_uring: use completion batching for poll rem/upd
Use __io_req_complete() in io_poll_update(), so we can utilise
completion batching for both the update/remove request and the poll
we're killing (if any).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e2bdc6c5abd9e9b80f09b86d8823eb1c780362cd.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28 09:51:15 -08:00
Pavel Begunkov
eb0089d629 io_uring: single shot poll removal optimisation
We don't need to keep polling a oneshot request once we've got the
desired mask in io_poll_wake(); task_work will clean it up correctly.
But as we already hold a wq spinlock, we can remove ourselves right away
and save on additional spinlocking in io_poll_remove_entries().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ee170a344a18c9ef36b554d806c64caadfd61c31.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28 09:51:15 -08:00
Pavel Begunkov
aa43477b04 io_uring: poll rework
It's not possible to go forward with the current state of io_uring
polling; we need more straightforward and easier synchronisation.
There are a lot of problems with how it is at the moment, including
missing events on rewait.

The main idea here is to introduce a notion of request ownership while
polling: no one but the owner can modify any part of struct io_kiocb
except ->poll_refs, which grants us protection against all sorts of
races.

The main user of such exclusivity is the poll task_work handler: before
queueing a tw, one should have/acquire ownership, which will be handed
off to the tw handler.
The other user is __io_arm_poll_handler(), doing the initial poll arming.
It starts by taking ownership, so tw handlers won't be run until it's
released later in the function, after vfs_poll(). Note: this also
prevents races in __io_queue_proc().
Poll wake/etc. may not be able to get ownership; in that case they
increase the poll refcount, and the task_work should notice it and retry
if necessary, see io_poll_check_events(). There is also an
IO_POLL_CANCEL_FLAG flag to notify that we want to kill the request.

It makes cancellations more reliable, enables double multishot polling,
fixes double poll rewait, fixes missing poll events and fixes another
bunch of races.

Even though it adds some overhead for the new refcounting, there are a
couple of nice performance wins:
- no req->refs refcounting for poll requests anymore
- if the data is already there (once measured for some test to be 1-2%
  of all apoll requests), it doesn't add atomics and removes a
  spin_lock/unlock pair.
- works well with multishots, we don't do remove from queue / add to
  queue for each new poll event.
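
The ownership rule can be pictured like this (a hedged sketch; the
flag/mask names follow the message, the exact bit layout is an
assumption):

```
#define IO_POLL_CANCEL_FLAG	BIT(31)
#define IO_POLL_REF_MASK	GENMASK(30, 0)

/*
 * Returns true if we took ownership, i.e. the refcount part was zero
 * before our increment; otherwise the current owner will observe the
 * raised refcount and retry, see io_poll_check_events().
 */
static bool io_poll_get_ownership(struct io_kiocb *req)
{
	return !(atomic_fetch_inc(&req->poll_refs) & IO_POLL_REF_MASK);
}
```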

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b652927c77ed9580ea4330ac5612f0e0848c946.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28 09:51:15 -08:00
Pavel Begunkov
ab1dab960b io_uring: kill poll linking optimisation
With IORING_FEAT_FAST_POLL in place, io_put_req_find_next() for poll
requests doesn't make much sense, and in any case re-adding it
shouldn't be a problem considering batching in tctx_task_work(). We can
remove it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/15699682bf81610ec901d4e79d6da64baa9f70be.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28 09:51:15 -08:00
Pavel Begunkov
5641897a5e io_uring: move common poll bits
Move some poll helpers/etc. up; we'll need them there shortly.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6c5c3dba24c86aad5cd389a54a8c7412e6a0621d.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28 09:51:14 -08:00
Pavel Begunkov
2bbb146d96 io_uring: refactor poll update
Clean up io_poll_update() and unify cancellation paths for remove and
update.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5937138b6265a1285220e2fab1b28132c1d73ce3.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28 09:51:14 -08:00
Pavel Begunkov
e840b4baf3 io_uring: remove double poll on poll update
Before updating a poll request we should remove it from poll queues,
including the double poll entry.

Fixes: b69de288e913 ("io_uring: allow events and user_data update of running poll requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ac39e7f80152613603b8a6cc29a2b6063ac2434f.1639605189.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-28 09:51:14 -08:00
Jens Axboe
7b9762a5e8 io_uring: zero iocb->ki_pos for stream file types
io_uring supports using offset == -1 for using the current file position,
and we read that in as part of read/write command setup. For the non-iter
read/write types we pass in NULL for the position pointer, but for the
iter types we should not be passing anything but 0 as the position
for a stream.

Clear kiocb->ki_pos if the file is a stream, don't leave it as -1. If we
do, then the request will error with -ESPIPE.
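
A sketch of the described fix (approximate, not the verbatim kernel
code):

```
kiocb->ki_pos = READ_ONCE(sqe->off);
if (kiocb->ki_pos == -1) {
	if (!(file->f_mode & FMODE_STREAM)) {
		req->flags |= REQ_F_CUR_POS;
		kiocb->ki_pos = file->f_pos;
	} else {
		kiocb->ki_pos = 0;	/* stream: don't leave -1 in ki_pos */
	}
}
```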

Fixes: ba04291eb66e ("io_uring: allow use of offset == -1 to mean file position")
Link: https://github.com/axboe/liburing/discussions/501
Reported-by: Samuel Williams <samuel.williams@oriontransfer.co.nz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-22 20:34:32 -07:00
Hao Xu
33ce2aff7d io_uring: code clean for some ctx usage
There are some functions doing ctx = req->ctx while still using
req->ctx afterwards; update those places to use the local variable.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211214055904.61772-1-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-14 06:49:06 -07:00
Jens Axboe
78a7806020 io_uring: ensure task_work gets run as part of cancelations
If we successfully cancel a work item but that work item needs to be
processed through task_work, then we can be sleeping uninterruptibly
in io_uring_cancel_generic() and never process it. Hence we don't
make forward progress and we end up with an uninterruptible sleep
warning.

While in there, correct a comment that should be IFF, not IIF.

Reported-and-tested-by: syzbot+21e6887c0be14181206d@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-10 13:56:28 -07:00
Hao Xu
f28c240e71 io_uring: batch completion in prior_task_list
In previous patches we have already gathered, in prior_task_list, some
tw items with io_req_task_complete() as the callback; let's complete
them in batch when we cannot grab the uring lock. In this way, we batch
the req_complete_post path.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211208052125.351587-1-haoxu@linux.alibaba.com
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-08 11:34:48 -07:00
Hao Xu
a37fae8aaa io_uring: split io_req_complete_post() and add a helper
Split io_req_complete_post(), this is a prep for the next patch.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20211207093951.247840-5-haoxu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-07 15:02:42 -07:00