IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
__io_cq_unlock_post() is identical to io_cq_unlock_post(), and
io_cqring_ev_posted() has a single caller so migth as well just inline
it there.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 7fdbc5f014c3f71bc44673a2d6c5bb2d12d45f25.
This patch dealt with a subset of the real problem, which is a potential
circular dependency on the wakup path for io_uring itself. Outside of
io_uring, eventfd can also trigger this (see details in 03e02acda8e2)
and so can epoll (see details in caf1aeaffc3b). Now that we have a
generic solution to this problem, get rid of the io_uring specific
work-around.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass in EPOLL_URING_WAKE when signaling eventfd or doing poll related
wakups, so that we can check for a circular event dependency between
eventfd and epoll. If this flag is set when our wakeup handlers are
called, then we know we have a dependency that needs to terminate
multishot requests.
eventfd and epoll are the only such possible dependencies.
Cc: stable@vger.kernel.org # 6.0
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the target process is dying and so task_work_add() is not allowed
we push all task_work item to the fallback workqueue. Move the part
responsible for moving tw items out of __io_req_task_work_add() into
a separate function. Makes it a bit cleaner and gives the compiler a bit
of extra info.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e503dab9d7af95470ca6b214c6de17715ae4e748.1668162751.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_req_task_work_add() is huge but marked inline, that makes compilers
to generate lots of garbage. Inline the wrapper caller
io_req_task_work_add() instead.
before and after:
text data bss dec hex filename
47347 16248 8 63603 f873 io_uring/io_uring.o
text data bss dec hex filename
45303 16248 8 61559 f077 io_uring/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26dc8c28ca0160e3269ef3e55c5a8b917c4d4450.1668162751.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Previous commit 13a99017ff19 ("io_uring: remove events caching
atavisms") entirely removes the events caching optimization introduced
by commit 81459350d581 ("io_uring: cache req->apoll->events in
req->cflags"). Hence the related comment should also be removed to avoid
misunderstanding.
Fixes: 13a99017ff19 ("io_uring: remove events caching atavisms")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20221110060313.16303-1-linma@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With commit aa1df3a360a0 ("io_uring: fix CQE reordering"), there are
stronger guarantees for overflow ordering. Specifically ensuring that
userspace will not receive out of order receive CQEs. Therefore this is
not needed any more for recv/recvmsg.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107125236.260132-4-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is no longer needed after commit aa1df3a360a0 ("io_uring: fix CQE
reordering"), since all reordering is now taken care of.
This reverts commit cbd25748545c ("io_uring: fix multishot accept
ordering").
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107125236.260132-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Running task work when not needed can unnecessarily delay
operations. Specifically IORING_SETUP_DEFER_TASKRUN tries to avoid running
task work until the user requests it. Therefore do not run it in
io_uring_register any more.
The one catch is that io_rsrc_ref_quiesce expects it to have run in order
to process all outstanding references, and so reorder it's loop to do this.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107123349.4106213-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fixes two errors:
"ERROR: do not use assignment in if condition
130: FILE: io_uring/net.c:130:
+ if (!(issue_flags & IO_URING_F_UNLOCKED) &&
ERROR: do not use assignment in if condition
599: FILE: io_uring/poll.c:599:
+ } else if (!(issue_flags & IO_URING_F_UNLOCKED) &&"
reported by checkpatch.pl in net.c and poll.c .
Signed-off-by: Xinghui Li <korantli@tencent.com>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/20221102082503.32236-1-korantwork@gmail.com
[axboe: style tweaks]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can also move mm accounting to the extended callbacks. It removes a
few cycles from the hot path including skipping one function call and
setting io_req_task_complete as a callback directly. For user backed I/O
it shouldn't make any difference taking into considering atomic mm
accounting and page pinning.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1062f270273ad11c1b7b45ec59a6a317533d5e64.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we post a CQE we wake all ring pollers as it normally should be.
However, if a CQE was generated by a multishot poll request targeting
its own ring, it'll wake that request up, which will make it to post
a new CQE, which will wake the request and so on until it exhausts all
CQ entries.
Don't allow multishot polling io_uring files but downgrade them to
oneshots, which was always stated as a correct behaviour that the
userspace should check for.
Cc: stable@vger.kernel.org
Fixes: aa43477b04025 ("io_uring: poll rework")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3124038c0e7474d427538c2d915335ec28c92d21.1668785722.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Having REQ_F_POLLED set doesn't guarantee that the request is
executed as a multishot from the polling path. Fortunately for us, if
the code thinks it's multishot issue when it's not, it can only ask to
skip completion so leaking the request. Use issue_flags to mark
multipoll issues.
Cc: stable@vger.kernel.org
Fixes: 1300ebb20286b ("io_uring: multishot recv")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/37762040ba9c52b81b92a2f5ebfd4ee484088951.1668710222.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Having REQ_F_POLLED set doesn't guarantee that the request is
executed as a multishot from the polling path. Fortunately for us, if
the code thinks it's multishot issue when it's not, it can only ask to
skip completion so leaking the request. Use issue_flags to mark
multipoll issues.
Cc: stable@vger.kernel.org
Fixes: 390ed29b5e425 ("io_uring: add IORING_ACCEPT_MULTISHOT for accept")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7700ac57653f2823e30b34dc74da68678c0c5f13.1668710222.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When io_poll_check_events() collides with someone attempting to queue a
task work, it'll spin for one more time. However, it'll continue to use
the mask from the first iteration instead of updating it. For example,
if the first wake up was a EPOLLIN and the second EPOLLOUT, the
userspace will not get EPOLLOUT in time.
Clear the mask for all subsequent iterations to force vfs_poll().
Cc: stable@vger.kernel.org
Fixes: aa43477b04025 ("io_uring: poll rework")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2dac97e8f691231049cb259c4ae57e79e40b537c.1668710222.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_poll_double_prepare() | io_poll_wake()
| poll->head = NULL
smp_load(&poll->head); /* NULL */ |
flags = req->flags; |
| req->flags &= ~SINGLE_POLL;
req->flags = flags | DOUBLE_POLL |
The idea behind io_poll_double_prepare() is to serialise with the
first poll entry by taking the wq lock. However, it's not safe to assume
that io_poll_wake() is not running when we can't grab the lock and so we
may race modifying req->flags.
Skip double poll setup if that happens. It's ok because the first poll
entry will only be removed when it's definitely completing, e.g.
pollfree or oneshot with a valid mask.
Fixes: 49f1c68e048f1 ("io_uring: optimise submission side poll_refs")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b7fab2d502f6121a7d7b199fe4d914a43ca9cdfd.1668184658.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We already check if the chosen starting offset for the buffer IDs fit
within an unsigned short, as 65535 is the maximum value for a provided
buffer. But if the caller asks to add N buffers at offset M, and M + N
would exceed the size of the unsigned short, we simply add buffers with
wrapping around the ID.
This is not necessarily a bug and could in fact be a valid use case, but
it seems confusing and inconsistent with the initial check for starting
offset. Let's check for wrap consistently, and error the addition if we
do need to wrap.
Reported-by: Olivier Langlois <olivier@trillion01.com>
Link: https://github.com/axboe/liburing/issues/726
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_cqring_wait (and it's wake function io_has_work) used cached_cq_tail in
order to calculate the number of CQEs. cached_cq_tail is set strictly
before the user visible rings->cq.tail
However as far as userspace is concerned, if io_uring_enter(2) is called
with a minimum number of events, they will verify by checking
rings->cq.tail.
It is therefore possible for io_uring_enter(2) to return early with fewer
events visible to the user.
Instead make the wait functions read from the user visible value, so there
will be no discrepency.
This is triggered eventually by the following reproducer:
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
unsigned int cqe_ready;
struct io_uring ring;
int ret, i;
ret = io_uring_queue_init(N, &ring, 0);
assert(!ret);
while(true) {
for (i = 0; i < N; i++) {
sqe = io_uring_get_sqe(&ring);
io_uring_prep_nop(sqe);
sqe->flags |= IOSQE_ASYNC;
}
ret = io_uring_submit(&ring);
assert(ret == N);
do {
ret = io_uring_wait_cqes(&ring, &cqe, N, NULL, NULL);
} while(ret == -EINTR);
cqe_ready = io_uring_cq_ready(&ring);
assert(!ret);
assert(cqe_ready == N);
io_uring_cq_advance(&ring, N);
}
Fixes: ad3eb2c89fb2 ("io_uring: split overflow state into SQ and CQ side")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221108153016.1854297-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert an initial portion to rely on struct mnt_idmap by converting the
high level xattr helpers.
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
It is possible for tw to lock the ring, and this was not propogated out to
io_run_local_work. This can cause an unlock to be missed.
Instead pass a pointer to locked into __io_run_local_work.
Fixes: 8ac5d85a89b4 ("io_uring: add local task_work run helper that is entered locked")
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221027144429.3971400-3-dylany@meta.com
[axboe: WARN_ON() -> WARN_ON_ONCE() and add a minor comment]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a protocol doesn't support zerocopy it will silently fall back to
copying. This type of behaviour has always been a source of troubles
so it's better to fail such requests instead.
Cc: <stable@vger.kernel.org> # 6.0
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2db3c7f16bb6efab4b04569cd16e6242b40c5cb3.1666346426.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the CPU mask allocation for a node fails, then the memory allocated for
the 'io_wqe' struct of the current node doesn't get freed on the error
handling path, since it has not yet been added to the 'wqes' array.
This was spotted when fuzzing v6.1-rc1 with Syzkaller:
BUG: memory leak
unreferenced object 0xffff8880093d5000 (size 1024):
comm "syz-executor.2", pid 7701, jiffies 4295048595 (age 13.900s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000cb463369>] __kmem_cache_alloc_node+0x18e/0x720
[<00000000147a3f9c>] kmalloc_node_trace+0x2a/0x130
[<000000004e107011>] io_wq_create+0x7b9/0xdc0
[<00000000c38b2018>] io_uring_alloc_task_context+0x31e/0x59d
[<00000000867399da>] __io_uring_add_tctx_node.cold+0x19/0x1ba
[<000000007e0e7a79>] io_uring_setup.cold+0x1b80/0x1dce
[<00000000b545e9f6>] __x64_sys_io_uring_setup+0x5d/0x80
[<000000008a8a7508>] do_syscall_64+0x5d/0x90
[<000000004ac08bec>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: 0e03496d1967 ("io-wq: use private CPU mask")
Cc: stable@vger.kernel.org
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Link: https://lore.kernel.org/r/20221020014710.902201-1-rafaelmendsr@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This debug statement was never meant to go into the upstream release,
kill it off before it ends up in a release. It was just part of the
testing for the initial version of the patch.
Fixes: 2ec33a6c3cca ("io_uring/rw: ensure kiocb_end_write() is always called")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should not be completing requests from a task context that has already
undergone io_uring cancellations, i.e. __io_uring_cancel(), as there are
some assumptions, e.g. around cached task refs draining. Remove
iopolling from io_ring_ctx_wait_and_kill() as it can be called later
after PF_EXITING is set with the last task_work run.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7c03cc91455c4a1af49c6b9cbda4e57ea467aa11.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We test file_table.bitmap in io_file_get_fixed() to check invariants,
don't do it, it's expensive and was showing up in profiles. No reports of
this triggering has come in. Move the check to the file clear instead,
which will still catch any wrong usage.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cf77f2ded68d2e5b2bc7355784d969837d48e023.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
THe lifetime of SCM'ed files is bound to ring_sock, which is destroyed
strictly after we're done with registered file tables. This means there
is no need for the FFS_SCM hack, which was not available on 32-bit builds
anyway.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/984226a1045adf42dc35d8bd7fb5a8bbfa472ce1.1665891182.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit moved the notifications and end-write handling, but
it is now missing a few spots where we also want to call both of those.
Without that, we can potentially be missing file notifications, and
more importantly, have an imbalance in the super_block writers sem
accounting.
Fixes: b000145e9907 ("io_uring/rw: defer fsnotify calls to task context")
Reported-by: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/all/20221010050319.GC2703033@dread.disaster.area/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This fixes the shadowing of the outer variable rw in the function
io_write(). No issue is caused by this, but let's silence the shadowing
warning anyway.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Link: https://lore.kernel.org/r/20221010234330.244244-1-shr@devkernel.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>