linux

iv/linux

Author	SHA1	Message	Date
Helge Deller	56675f8b9f	io_uring/parisc: Adjust pgoff in io_uring mmap() for parisc The changes from commit `32832a407a` ("io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()") to the parisc implementation of get_unmapped_area() broke glibc's locale-gen executable when running on parisc. This patch reverts those architecture-specific changes, and instead adjusts in io_uring_mmu_get_unmapped_area() the pgoff offset which is then given to parisc's get_unmapped_area() function. This is much cleaner than the previous approach, and we still will get a coherent addresss. This patch has no effect on other architectures (SHM_COLOUR is only defined on parisc), and the liburing testcase stil passes on parisc. Cc: stable@vger.kernel.org # 6.4 Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Fixes: `32832a407a` ("io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area()") Fixes: `d808459b2e` ("io_uring: Adjust mapping wrt architecture aliasing requirements") Link: https://lore.kernel.org/r/ZNEyGV0jyI8kOOfz@p100 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-08-08 12:37:01 -06:00
Jens Axboe	7b72d661f1	io_uring: gate iowait schedule on having pending requests A previous commit made all cqring waits marked as iowait, as a way to improve performance for short schedules with pending IO. However, for use cases that have a special reaper thread that does nothing but wait on events on the ring, this causes a cosmetic issue where we know have one core marked as being "busy" with 100% iowait. While this isn't a grave issue, it is confusing to users. Rather than always mark us as being in iowait, gate setting of current->in_iowait to 1 by whether or not the waiting task has pending requests. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/io-uring/CAMEGJJ2RxopfNQ7GNLhr7X9=bHXKo+G5OOe0LUq=+UgLXsv1Xg@mail.gmail.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=217699 Link: https://bugzilla.kernel.org/show_bug.cgi?id=217700 Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Phil Elwell <phil@raspberrypi.com> Tested-by: Andres Freund <andres@anarazel.de> Fixes: `8a796565ce` ("io_uring: Use io_schedule* in cqring wait") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-24 11:44:35 -06:00
Helge Deller	32832a407a	io_uring: Fix io_uring mmap() by using architecture-provided get_unmapped_area() The io_uring testcase is broken on IA-64 since commit `d808459b2e` ("io_uring: Adjust mapping wrt architecture aliasing requirements"). The reason is, that this commit introduced an own architecture independend get_unmapped_area() search algorithm which finds on IA-64 a memory region which is outside of the regular memory region used for shared userspace mappings and which can't be used on that platform due to aliasing. To avoid similar problems on IA-64 and other platforms in the future, it's better to switch back to the architecture-provided get_unmapped_area() function and adjust the needed input parameters before the call. Beside fixing the issue, the function now becomes easier to understand and maintain. This patch has been successfully tested with the io_uring testcase on physical x86-64, ppc64le, IA-64 and PA-RISC machines. On PA-RISC the LTP mmmap testcases did not report any regressions. Cc: stable@vger.kernel.org # 6.4 Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: matoro <matoro_mailinglist_kernel@matoro.tk> Fixes: `d808459b2e` ("io_uring: Adjust mapping wrt architecture aliasing requirements") Link: https://lore.kernel.org/r/20230721152432.196382-2-deller@gmx.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-21 09:41:29 -06:00
Jens Axboe	a9be202269	io_uring: treat -EAGAIN for REQ_F_NOWAIT as final for io-wq io-wq assumes that an issue is blocking, but it may not be if the request type has asked for a non-blocking attempt. If we get -EAGAIN for that case, then we need to treat it as a final result and not retry or arm poll for it. Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/issues/897 Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-20 13:16:53 -06:00
Ondrej Mosnacek	6adc2272aa	io_uring: don't audit the capability check in io_uring_create() The check being unconditional may lead to unwanted denials reported by LSMs when a process has the capability granted by DAC, but denied by an LSM. In the case of SELinux such denials are a problem, since they can't be effectively filtered out via the policy and when not silenced, they produce noise that may hide a true problem or an attack. Since not having the capability merely means that the created io_uring context will be accounted against the current user's RLIMIT_MEMLOCK limit, we can disable auditing of denials for this check by using ns_capable_noaudit() instead of capable(). Fixes: `2b188cc1bb` ("Add io_uring IO interface") Link: https://bugzilla.redhat.com/show_bug.cgi?id=2193317 Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/20230718115607.65652-1-omosnace@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-18 14:16:25 -06:00
Andres Freund	8a796565ce	io_uring: Use io_schedule* in cqring wait I observed poor performance of io_uring compared to synchronous IO. That turns out to be caused by deeper CPU idle states entered with io_uring, due to io_uring using plain schedule(), whereas synchronous IO uses io_schedule(). The losses due to this are substantial. On my cascade lake workstation, t/io_uring from the fio repository e.g. yields regressions between 20% and 40% with the following command: ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0 This is repeatable with different filesystems, using raw block devices and using different block devices. Use io_schedule_prepare() / io_schedule_finish() in io_cqring_wait_schedule() to address the difference. After that using io_uring is on par or surpassing synchronous IO (using registered files etc makes it reliably win, but arguably is a less fair comparison). There are other calls to schedule() in io_uring/, but none immediately jump out to be similarly situated, so I did not touch them. Similarly, it's possible that mutex_lock_io() should be used, but it's not clear if there are cases where that matters. Cc: stable@vger.kernel.org # 5.10+ Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Andres Freund <andres@anarazel.de> Link: https://lore.kernel.org/r/20230707162007.194068-1-andres@anarazel.de [axboe: minor style fixup] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-07-07 11:24:29 -06:00
Jens Axboe	dfbe5561ae	io_uring: flush offloaded and delayed task_work on exit io_uring offloads task_work for cancelation purposes when the task is exiting. This is conceptually fine, but we should be nicer and actually wait for that work to complete before returning. Add an argument to io_fallback_tw() telling it to flush the deferred work when it's all queued up, and have it flush a ctx behind whenever the ctx changes. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-28 11:06:05 -06:00
Jens Axboe	10e1c0d590	io_uring: remove io_fallback_tw() forward declaration It's used just one function higher up, get rid of the declaration and just move it up a bit. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-27 16:07:24 -06:00
Pavel Begunkov	c98c81a4ac	io_uring: merge conditional unlock flush helpers There is no reason not to use __io_cq_unlock_post_flush for intermediate aux CQE flushing, all ->task_complete should apply there, i.e. if set it should be the submitter task. Combine them, get rid of of __io_cq_unlock_post() and rename the left function. This place was also taking a couple percents of CPU according to profiles for max throughput net benchmarks due to multishot recv flooding it with completions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bbed60734cbec2e833d9c7bdcf9741aada5d8aab.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:40 -06:00
Pavel Begunkov	0fdb9a196c	io_uring: make io_cq_unlock_post static io_cq_unlock_post() is exclusively used in io_uring/io_uring.c, mark it static and don't expose to other files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3dc8127dda4514e1dd24bb32035faac887c5fa37.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:40 -06:00
Pavel Begunkov	ff12617728	io_uring: inline __io_cq_unlock __io_cq_unlock is not very helpful, and users should be calling flush variants anyway. Open code the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d875c4cfb69f38ccecb58a57111446c77a614caa.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:40 -06:00
Pavel Begunkov	55b6a69fed	io_uring: fix acquire/release annotations We do conditional locking, so __io_cq_lock() and friends not always actually grab/release the lock, so kill misleading annotations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2a098f9144c24cab622f8bf90b39f44da5d0401e.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:40 -06:00
Pavel Begunkov	f432b76bcc	io_uring: kill io_cq_unlock() We're abusing ->completion_lock helpers. io_cq_unlock() neither locking conditionally nor doing CQE flushing, which means that callers must have some side reason of taking the lock and should do it directly. Open code io_cq_unlock() into io_cqring_overflow_kill() and clean it up. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7dabb36856db2b562e78780480396c52c29b2bf4.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:39 -06:00
Pavel Begunkov	91c7884ac9	io_uring: remove IOU_F_TWQ_FORCE_NORMAL Extract a function for non-local task_work_add, and use it directly from io_move_task_work_from_local(). Now we don't use IOU_F_TWQ_FORCE_NORMAL and it can be killed. As a small positive side effect we don't grab task->io_uring in io_req_normal_work_add anymore, which is not needed for io_req_local_work_add(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2e55571e8ff2927ae3cc12da606d204e2485525b.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:39 -06:00
Pavel Begunkov	2fdd6fb5ff	io_uring: don't batch task put on reqs free We're trying to batch io_put_task() in io_free_batch_list(), but considering that the hot path is a simple inc, it's most cerainly and probably faster to just do io_put_task() instead of task tracking. We don't care about io_put_task_remote() as it's only for IOPOLL where polling/waiting is done by not the submitter task. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4a7ef7dce845fe2bd35507bf389d6bd2d5c1edf0.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:39 -06:00
Pavel Begunkov	5a754dea27	io_uring: move io_clean_op() Move io_clean_op() up in the source file and remove the forward declaration, as the function doesn't have tricky dependencies anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1b7163b2ba7c3a8322d972c79c1b0a9301b3057e.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:39 -06:00
Pavel Begunkov	3b7a612fd0	io_uring: inline io_dismantle_req() io_dismantle_req() is only used in __io_req_complete_post(), open code it there. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ba8f20cb2c914eefa2e7d120a104a198552050db.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:39 -06:00
Pavel Begunkov	6ec9afc7f4	io_uring: remove io_free_req_tw Request completion is a very hot path in general, but there are 3 places that can be doing it: io_free_batch_list(), io_req_complete_post() and io_free_req_tw(). io_free_req_tw() is used rather marginally and we don't care about it. Killing it can help to clean up and optimise the left two, do that by replacing it with io_req_task_complete(). There are two things to consider: 1) io_free_req() is called when all refs are put, so we need to reinit references. The easiest way to do that is to clear REQ_F_REFCOUNT. 2) We also don't need a cqe from it, so silence it with REQ_F_CQE_SKIP. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/434a2be8f33d474ad888ce1c17fe5ea7bbcb2a55.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:39 -06:00
Pavel Begunkov	247f97a5f1	io_uring: open code io_put_req_find_next There is only one user of io_put_req_find_next() and it doesn't make much sense to have it. Open code the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/38b5c5e48e4adc8e6a0cd16fdd5c1531d7ff81a9.1687518903.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-23 08:19:39 -06:00
Christoph Hellwig	4bfb0c9af8	io_uring: add helpers to decode the fixed file file_ptr Remove all the open coded magic on slot->file_ptr by introducing two helpers that return the file pointer and the flags instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 09:36:22 -06:00
Christoph Hellwig	8487f083c6	io_uring: return REQ_F_ flags from io_file_get_flags Two of the three callers want them, so return the more usual format, and shift into the FFS_ form only for the fixed file table. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 09:36:22 -06:00
Christoph Hellwig	3beed235d1	io_uring: remove io_req_ffs_set Just checking the flag directly makes it a lot more obvious what is going on here. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 09:36:22 -06:00
Christoph Hellwig	b57c7cd1c1	io_uring: remove a confusing comment above io_file_get_flags The SCM inflight mechanism has nothing to do with the fact that a file might be a regular file or not and if it supports non-blocking operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 09:36:22 -06:00
Christoph Hellwig	53cfd5cea7	io_uring: remove the mode variable in io_file_get_flags The variable is only once now, so don't bother with it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 09:36:22 -06:00
Christoph Hellwig	b9a6c9459a	io_uring: remove __io_file_supports_nowait Now that this only checks O_NONBLOCK and FMODE_NOWAIT, the helper is complete overkilļ, and the comments are confusing bordering to wrong. Just inline the check into the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230620113235.920399-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-20 09:36:21 -06:00
Jens Axboe	4826c59453	io_uring: wait interruptibly for request completions on exit WHen the ring exits, cleanup is done and the final cancelation and waiting on completions is done by io_ring_exit_work. That function is invoked by kworker, which doesn't take any signals. Because of that, it doesn't really matter if we wait for completions in TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state. However, it does matter to the hung task detection checker! Normally we expect cancelations and completions to happen rather quickly. Some test cases, however, will exit the ring and park the owning task stopped (eg via SIGSTOP). If the owning task needs to run task_work to complete requests, then io_ring_exit_work won't make any progress until the task is runnable again. Hence io_ring_exit_work can trigger the hung task detection, which is particularly problematic if panic-on-hung-task is enabled. As the ring exit doesn't take signals to begin with, have it wait interruptibly rather than uninterruptibly. io_uring has a separate stuck-exit warning that triggers independently anyway, so we're not really missing anything by making this switch. Cc: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/r/b0e4aaef-7088-56ce-244c-976edeac0e66@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-12 09:43:57 -06:00
Jens Axboe	003f242b0d	io_uring: get rid of unnecessary 'length' variable Just use the ARRAY_SIZE directly, we don't use length for anything else in this function. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-07 15:00:07 -06:00
Jens Axboe	d86eaed185	io_uring: cleanup io_aux_cqe() API Everybody is passing in the request, so get rid of the io_ring_ctx and explicit user_data pass-in. Both the ctx and user_data can be deduced from the request at hand. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-07 14:59:22 -06:00
Jens Axboe	c92fcfc2ba	io_uring: avoid indirect function calls for the hottest task_work We use task_work for a variety of reasons, but doing completions or triggering rety after poll are by far the hottest two. Use the indirect funtion call wrappers to avoid the indirect function call if CONFIG_RETPOLINE is set. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-06-02 08:55:37 -06:00
Jens Axboe	3af0356c16	io_uring: maintain ordering for DEFER_TASKRUN tw list We use lockless lists for the local and deferred task_work, which means that when we queue up events for processing, we ultimately process them in reverse order to how they were received. This usually doesn't matter, but for some cases, it does seem to make a big difference. Do the right thing and reverse the list before processing it, so that we know it's processed in the same order in which it was received. This makes a rather big difference for some medium load network tests, where consistency of performance was a bit all over the place. Here's a case that has 4 connections each doing two sends and receives: io_uring port=10002: rps:161.13k Bps: 1.45M idle=256ms io_uring port=10002: rps:107.27k Bps: 0.97M idle=413ms io_uring port=10002: rps:136.98k Bps: 1.23M idle=321ms io_uring port=10002: rps:155.58k Bps: 1.40M idle=268ms and after the change: io_uring port=10002: rps:205.48k Bps: 1.85M idle=140ms user=40ms io_uring port=10002: rps:203.57k Bps: 1.83M idle=139ms user=20ms io_uring port=10002: rps:218.79k Bps: 1.97M idle=106ms user=30ms io_uring port=10002: rps:217.88k Bps: 1.96M idle=110ms user=20ms io_uring port=10002: rps:222.31k Bps: 2.00M idle=101ms user=0ms io_uring port=10002: rps:218.74k Bps: 1.97M idle=102ms user=20ms io_uring port=10002: rps:208.43k Bps: 1.88M idle=125ms user=40ms using more of the time to actually process work rather than sitting idle. No effects have been observed at the peak end of the spectrum, where performance is still the same even with deep batch depths (and hence more items to sort). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-19 13:49:51 -06:00
Josh Triplett	6e76ac5958	io_uring: Add io_uring_setup flag to pre-register ring fd and never install it With IORING_REGISTER_USE_REGISTERED_RING, an application can register the ring fd and use it via registered index rather than installed fd. This allows using a registered ring for everything except the initial mmap. With IORING_SETUP_NO_MMAP, io_uring_setup uses buffers allocated by the user, rather than requiring a subsequent mmap. The combination of the two allows a user to operate entirely via a registered ring fd, making it unnecessary to ever install the fd in the first place. So, add a flag IORING_SETUP_REGISTERED_FD_ONLY to make io_uring_setup register the fd and return a registered index, without installing the fd. This allows an application to avoid touching the fd table at all, and allows a library to never even momentarily install a file descriptor. This splits out an io_ring_add_registered_file helper from io_ring_add_registered_fd, for use by io_uring_setup. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Link: https://lore.kernel.org/r/bc8f431bada371c183b95a83399628b605e978a3.1682699803.git.josh@joshtriplett.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-16 08:06:00 -06:00
Jens Axboe	03d89a2de2	io_uring: support for user allocated memory for rings/sqes Currently io_uring applications must call mmap(2) twice to map the rings themselves, and the sqes array. This works fine, but it does not support using huge pages to back the rings/sqes. Provide a way for the application to pass in pre-allocated memory for the rings/sqes, which can then suitably be allocated from shmfs or via mmap to get huge page support. Particularly for larger rings, this reduces the TLBs needed. If an application wishes to take advantage of that, it must pre-allocate the memory needed for the sq/cq ring, and the sqes. The former must be passed in via the io_uring_params->cq_off.user_data field, while the latter is passed in via the io_uring_params->sq_off.user_data field. Then it must set IORING_SETUP_NO_MMAP in the io_uring_params->flags field, and io_uring will then map the existing memory into the kernel for shared use. The application must not call mmap(2) to map rings as it otherwise would have, that will now fail with -EINVAL if this setup flag was used. The pages used for the rings and sqes must be contigious. The intent here is clearly that huge pages should be used, otherwise the normal setup procedure works fine as-is. The application may use one huge page for both the rings and sqes. Outside of those initialization changes, everything works like it did before. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-16 08:04:55 -06:00
Jens Axboe	9c189eee73	io_uring: add ring freeing helper We do rings and sqes separately, move them into a helper that does both the freeing and clearing of the memory. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-16 08:04:49 -06:00
Jens Axboe	e27cef86a0	io_uring: return error pointer from io_mem_alloc() In preparation for having more than one time of ring allocator, make the existing one return valid/error-pointer rather than just NULL. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-16 08:04:42 -06:00
Jens Axboe	9b1b58cacc	io_uring: remove sq/cq_off memset We only have two reserved members we're not clearing, do so manually instead. This is in preparation for using one of these members for a new feature. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-16 08:04:37 -06:00
Jens Axboe	caec5ebe77	io_uring: rely solely on FMODE_NOWAIT Now that we have both sockets and block devices setting FMODE_NOWAIT appropriately, we can get rid of all the odd special casing in __io_file_supports_nowait() and rely soley on FMODE_NOWAIT and O_NONBLOCK rather than special case sockets and (in particular) bdevs. Link: https://lore.kernel.org/r/20230509151910.183637-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-05-15 10:12:27 -06:00
Linus Torvalds	5b9a7bb72f	for-6.4/io_uring-2023-04-21 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvawQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiKTEACvp0jm3Lyhxb8RMsx5T6Ko0pFH3DIymiL4 xpoZmAUflOjD0c+99FwHRQqKKXuo3OelhW+YOm0N6OOAt6JMSGmKpZh0UNJx+Fgj wMwiQ0X3Y5SaAsr5ZpXM+G1BV7ajihsMpu8/a718ERB3U3cLDz2qJfnzJh+E5Ip5 pYB4vS3+/FAER2MYQ7IPeovch2wWYtxDPztOxNX6SORu3OvpWiz1GR/+8u0tqj50 ROq97Jwjh5Tl4zP356EUSj/Vkfdr2yb+NlLbun8My5x8tYftZjnrNQ/+qeJNLwB8 tWTrg166ox/VX3aYZruAgUPv0IyGPZg7qZV5R72ChBK3VhIbOOLOCm/V6dhvl/XH vu2FG7J8WylWHmc+OU8u7TeSJdrwxTLs4e2IFUBK9ymAYFp0Q9S924fgvSYsFvVB iNn58SPRIbuA4SPtRfCd7pENtZW/QKfBC5CYK+pjsZVX40c9dbe40foVu4t2/EAo gi9+gSWEUVRRW2osxjaHXh78cW63g0j9bNfS6n1Vy32Oo5Mwm7n+bVWqCU5bCBXI MpPOk6AgME3UPwFzGzSmx+PVw8VacPxYP1NF8RFTCwj7OowFnrolJtruDmKJgXWY BN41EDo41k/C5mEu16Jr9rAkHeVhHaNZ+JhyDrzv8llJ/rv+4zEJw9SrhnpufmOX +YERd/ndAw== =Erfk -----END PGP SIGNATURE----- Merge tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - Cleanup of the io-wq per-node mapping, notably getting rid of it so we just have a single io_wq entry per ring (Breno) - Followup to the above, move accounting to io_wq as well and completely drop struct io_wqe (Gabriel) - Enable KASAN for the internal io_uring caches (Breno) - Add support for multishot timeouts. Some applications use timeouts to wake someone waiting on completion entries, and this makes it a bit easier to just have a recurring timer rather than needing to rearm it every time (David) - Support archs that have shared cache coloring between userspace and the kernel, and hence have strict address requirements for mmap'ing the ring into userspace. This should only be parisc/hppa. (Helge, me) - XFS has supported O_DIRECT writes without needing to lock the inode exclusively for a long time, and ext4 now supports it as well. This is true for the common cases of not extending the file size. Flag the fs as having that feature, and utilize that to avoid serializing those writes in io_uring (me) - Enable completion batching for uring commands (me) - Revert patch adding io_uring restriction to what can be GUP mapped or not. This does not belong in io_uring, as io_uring isn't really special in this regard. Since this is also getting in the way of cleanups and improvements to the GUP code, get rid of if (me) - A few series greatly reducing the complexity of registered resources, like buffers or files. Not only does this clean up the code a lot, the simplified code is also a LOT more efficient (Pavel) - Series optimizing how we wait for events and run task_work related to it (Pavel) - Fixes for file/buffer unregistration with DEFER_TASKRUN (Pavel) - Misc cleanups and improvements (Pavel, me) * tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux: (71 commits) Revert "io_uring/rsrc: disallow multi-source reg buffers" io_uring: add support for multishot timeouts io_uring/rsrc: disassociate nodes and rsrc_data io_uring/rsrc: devirtualise rsrc put callbacks io_uring/rsrc: pass node to io_rsrc_put_work() io_uring/rsrc: inline io_rsrc_put_work() io_uring/rsrc: add empty flag in rsrc_node io_uring/rsrc: merge nodes and io_rsrc_put io_uring/rsrc: infer node from ctx on io_queue_rsrc_removal io_uring/rsrc: remove unused io_rsrc_node::llist io_uring/rsrc: refactor io_queue_rsrc_removal io_uring/rsrc: simplify single file node switching io_uring/rsrc: clean up __io_sqe_buffers_update() io_uring/rsrc: inline switch_start fast path io_uring/rsrc: remove rsrc_data refs io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce io_uring/rsrc: use wq for quiescing io_uring/rsrc: refactor io_rsrc_ref_quiesce io_uring/rsrc: remove io_rsrc_node::done io_uring/rsrc: use nospec'ed indexes ...	2023-04-26 12:40:31 -07:00
Pavel Begunkov	0b222eeb65	io_uring/rsrc: remove rsrc_data refs Instead of waiting for rsrc_data->refs to be downed to zero, check whether there are rsrc nodes queued for completion, that's easier then maintaining references. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8e33fd143d83e11af3e386aea28eb6d6c6a1be10.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-15 14:44:57 -06:00
Pavel Begunkov	4ea15b56f0	io_uring/rsrc: use wq for quiescing Replace completions with waitqueues for rsrc data quiesce, the main wakeup condition is when data refs hit zero. Note that data refs are only changes under ->uring_lock, so we prepare before mutex_unlock() reacquire it after taking the lock back. This change will be needed in the next patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1d0dbc74b3b4fd67c8f01819e680c5e0da252956.1681395792.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-15 14:44:57 -06:00
Ming Lei	860e1c7f8b	io_uring: complete request via task work in case of DEFER_TASKRUN So far io_req_complete_post() only covers DEFER_TASKRUN by completing request via task work when the request is completed from IOWQ. However, uring command could be completed from any context, and if io uring is setup with DEFER_TASKRUN, the command is required to be completed from current context, otherwise wait on IORING_ENTER_GETEVENTS can't be wakeup, and may hang forever. The issue can be observed on removing ublk device, but turns out it is one generic issue for uring command & DEFER_TASKRUN, so solve it in io_uring core code. Fixes: `e6aeb2721d` ("io_uring: complete all requests in task context") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-block/b3fc9991-4c53-9218-a8cc-5b4dd3952108@kernel.dk/ Reported-by: Jens Axboe <axboe@kernel.dk> Cc: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-14 06:38:23 -06:00
Pavel Begunkov	2933ae6eaa	io_uring/rsrc: refactor io_rsrc_node_switch We use io_rsrc_node_switch() coupled with io_rsrc_node_switch_start() for a bunch of cases including initialising ctx->rsrc_node, i.e. by passing NULL instead of rsrc_data. Leave it to only deal with actual node changing. For that, first remove it from io_uring_create() and add a function allocating the first node. Then also remove all calls to io_rsrc_node_switch() from files/buffers register as we already have a node installed and it does essentially nothing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d146fe306ff98b1a5a60c997c252534f03d423d7.1681210788.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-12 12:09:41 -06:00
Pavel Begunkov	528407b1e0	io_uring/rsrc: consolidate node caching We store one pre-allocated rsrc node in ->rsrc_backup_node, merge it with ->rsrc_node_cache. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6d5410e51ccd29be7a716be045b51d6b371baef6.1681210788.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-12 12:09:41 -06:00
Pavel Begunkov	8b1df11f97	io_uring: shut io_prep_async_work warning io_uring/io_uring.c:432 io_prep_async_work() error: we previously assumed 'req->file' could be null (see line 425). Even though it's a false positive as there will not be REQ_F_ISREG set without a file, let's add a simple check to make the kernel test robot happy. We don't care about performance here, but assumingly it'll be optimised out by the compiler. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a6cfbe92c74b789c0b4f046f7f98d19b1ca2e5b7.1681210788.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-12 12:09:41 -06:00
Pavel Begunkov	360cd42c4e	io_uring: optimise io_req_local_work_add Chains of memory accesses are never good for performance. The req->task->io_uring->in_cancel in io_req_local_work_add() is there so that when a task is exiting via io_uring_try_cancel_requests() and starts waiting for completions, it gets woken up by every new task_work item queued. Do a little trick by announcing waiting in io_uring_try_cancel_requests(), making io_req_local_work_add() wake us up. We also need to check for deferred tw items after prepare_to_wait(TASK_INTERRUPTIBLE); Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/fb11597e9bbcb365901824f8c5c2cf0d6ee100d0.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 16:24:36 -06:00
Pavel Begunkov	c66ae3ec38	io_uring: refactor __io_cq_unlock_post_flush() Separate ->task_complete path in __io_cq_unlock_post_flush(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/baa9b8d822f024e4ee01c40209dbbe38d9c8c11d.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 16:23:28 -06:00
Pavel Begunkov	8751d15426	io_uring: reduce scheduling due to tw Every task_work will try to wake the task to be executed, which causes excessive scheduling and additional overhead. For some tw it's justified, but others won't do much but post a single CQE. When a task waits for multiple cqes, every such task_work will wake it up. Instead, the task may give a hint about how many cqes it waits for, io_req_local_work_add() will compare against it and skip wake ups if #cqes + #tw is not enough to satisfy the waiting condition. Task_work that uses the optimisation should be simple enough and never post more than one CQE. It's also ignored for non DEFER_TASKRUN rings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d2b77e99d1e86624d8a69f7037d764b739dcd225.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 16:23:28 -06:00
Pavel Begunkov	5150940079	io_uring: inline llist_add() We'll need to grab some information from the previous request in the tw list, inline llist_add(), it'll be used in the following patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f0165493af7b379943c792114b972f331e7d7d10.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 16:23:28 -06:00
Pavel Begunkov	8501fe70ae	io_uring: add tw add flags We pass 'allow_local' into io_req_task_work_add() but will need more flags. Replace it with a flags bit field and name this allow_local flag. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4c0f01e7ef4e6feebfb199093cc995af7a19befa.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 16:23:28 -06:00
Pavel Begunkov	6e7248adf8	io_uring: refactor io_cqring_wake() Instead of smp_mb() + __io_cqring_wake() in __io_cq_unlock_post_flush() use equivalent io_cqring_wake(). With that we can clean it up further and remove __io_cqring_wake(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/662ee5d898168ac206be06038525e97b64072a46.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 16:23:28 -06:00
Pavel Begunkov	d73a572df2	io_uring: optimize local tw add ctx pinning We currently pin the ctx for io_req_local_work_add() with percpu_ref_get/put, which implies two rcu_read_lock/unlock pairs and some extra overhead on top in the fast path. Replace it with a pure rcu read and let io_ring_exit_work() synchronise against it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cbdfcb6b232627f30e9e50ef91f13c4f05910247.1680782017.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-04-06 16:23:10 -06:00

1 2 3 4 5 ...

313 Commits