From bd6fc5da4c51107e1e0cec4a3a07963d1dae2c84 Mon Sep 17 00:00:00 2001
From: Gabriel Krisman Bertazi
Date: Mon, 28 Aug 2023 19:42:49 -0400
Subject: [PATCH 1/7] io_uring: Don't set affinity on a dying sqpoll thread

Syzbot reported a null-ptr-deref of sqd->thread inside
io_sqpoll_wq_cpu_affinity. It turns out the sqd->thread can go away
from under us during io_uring_register, in case the process gets a
fatal signal during io_uring_register.

It is not particularly hard to hit the race, and while I am not sure
this is the exact case hit by syzbot, this patch resolves it. Checking
->thread is enough to close the race because we locked sqd while
"parking" the thread, thus preventing it from going away.

I reproduced it fairly consistently with a program that does:

int main(void) {
	...
	io_uring_queue_init(RING_LEN, &ring, IORING_SETUP_SQPOLL);
	while (1) {
		io_uring_register_iowq_aff(ring, 1, &mask);
	}
}

Executed in a loop with timeout to trigger SIGTERM:

while true; do timeout 1 /a.out ; done

This will hit the following BUG() in very few attempts.

BUG: kernel NULL pointer dereference, address: 00000000000007a8
PGD 800000010e949067 P4D 800000010e949067 PUD 10e46e067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 15715 Comm: dead-sqpoll Not tainted 6.5.0-rc7-next-20230825-g193296236fa0-dirty #23
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:io_sqpoll_wq_cpu_affinity+0x27/0x70
Code: 90 90 90 0f 1f 44 00 00 55 53 48 8b 9f 98 03 00 00 48 85 db 74 4f 48 89 df 48 89 f5 e8 e2 f8 ff ff 48 8b 43 38 48 85 c0 74 22 <48> 8b b8 a8 07 00 00 48 89 ee e8 ba b1 00 00 48 89 df 89 c5 e8 70
RSP: 0018:ffffb04040ea7e70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff93c010749e40 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffffa7653331 RDI: 00000000ffffffff
RBP: ffffb04040ea7eb8 R08: 0000000000000000 R09: c0000000ffffdfff
R10: ffff93c01141b600 R11: ffffb04040ea7d18 R12: ffff93c00ea74840
R13: 0000000000000011 R14: 0000000000000000 R15: ffff93c00ea74800
FS:  00007fb7c276ab80(0000) GS:ffff93c36f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000007a8 CR3: 0000000111634003 CR4: 0000000000370ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ? __die_body+0x1a/0x60
 ? page_fault_oops+0x154/0x440
 ? do_user_addr_fault+0x174/0x7b0
 ? exc_page_fault+0x63/0x140
 ? asm_exc_page_fault+0x22/0x30
 ? io_sqpoll_wq_cpu_affinity+0x27/0x70
 __io_register_iowq_aff+0x2b/0x60
 __io_uring_register+0x614/0xa70
 __x64_sys_io_uring_register+0xaa/0x1a0
 do_syscall_64+0x3a/0x90
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7fb7c226fec9
Code: 2e 00 b8 ca 00 00 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 7f 2d 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe2c0674f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb7c226fec9
RDX: 00007ffe2c067530 RSI: 0000000000000011 RDI: 0000000000000003
RBP: 00007ffe2c0675d0 R08: 00007ffe2c067550 R09: 00007ffe2c067550
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffe2c067750 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
CR2: 00000000000007a8
---[ end trace 0000000000000000 ]---

Reported-by: syzbot+c74fea926a78b8a91042@syzkaller.appspotmail.com
Fixes: ebdfefc09c6d ("io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used")
Signed-off-by: Gabriel Krisman Bertazi
Link: https://lore.kernel.org/r/87v8cybuo6.fsf@suse.de
Signed-off-by: Jens Axboe
---
 io_uring/sqpoll.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index ee2d2c687fda..bd6c2c7959a5 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -430,7 +430,9 @@ __cold int io_sqpoll_wq_cpu_affinity(struct io_ring_ctx *ctx,
 
 	if (sqd) {
 		io_sq_thread_park(sqd);
-		ret = io_wq_cpu_affinity(sqd->thread->io_uring, mask);
+		/* Don't set affinity for a dying thread */
+		if (sqd->thread)
+			ret = io_wq_cpu_affinity(sqd->thread->io_uring, mask);
 		io_sq_thread_unpark(sqd);
 	}
 

From b484a40dc1f16edb58e5430105a021e1916e6f27 Mon Sep 17 00:00:00 2001
From: Ming Lei
Date: Fri, 1 Sep 2023 21:49:16 +0800
Subject: [PATCH 2/7] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()

io_wq_put_and_exit() is called from do_exit(), but FIXED_FILE requests
in io_wq aren't canceled by io_uring_cancel_generic(), which is also
called from do_exit(). Meanwhile, the io_wq IO code path may share
resources with the normal iopoll code path. So if any HIPRI request is
submitted via io_wq, that request may never get the resources it needs
to make progress, given that iopoll isn't possible in
io_wq_put_and_exit().

The issue can be triggered when terminating 't/io_uring -n4 /dev/nullb0'
with default null_blk parameters.

Fix it by adding a helper, io_uring_cancel_wq(), that always cancels
all requests in io_wq. This is reasonable because the io_wq is
destroyed immediately after its requests are canceled.
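As an illustration only (untested sketch, not a verified reproducer;
it assumes liburing and a default null_blk device at /dev/nullb0), the
stuck pattern is roughly an iopoll ring with a registered file whose
HIPRI request is pushed to io-wq, and a task that exits before reaping:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	void *buf;
	int fd;

	io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL);
	fd = open("/dev/nullb0", O_RDONLY | O_DIRECT);
	io_uring_register_files(&ring, &fd, 1);
	posix_memalign(&buf, 4096, 4096);

	/* HIPRI read on the registered (FIXED_FILE) fd, forced to io-wq */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, 0, buf, 4096, 0);
	sqe->flags |= IOSQE_FIXED_FILE | IOSQE_ASYNC;
	io_uring_submit(&ring);

	/* exit without reaping: do_exit() ends up in io_wq_put_and_exit() */
	return 0;
}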
Closes: https://lore.kernel.org/linux-block/3893581.1691785261@warthog.procyon.org.uk/
Reported-by: David Howells
Cc: Chengming Zhou
Signed-off-by: Ming Lei
Link: https://lore.kernel.org/r/20230901134916.2415386-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe
---
 io_uring/io_uring.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e7675355048d..c6d9e4677073 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3290,6 +3290,37 @@ static s64 tctx_inflight(struct io_uring_task *tctx, bool tracked)
 	return percpu_counter_sum(&tctx->inflight);
 }
 
+static void io_uring_cancel_wq(struct io_uring_task *tctx)
+{
+	int ret;
+
+	if (!tctx->io_wq)
+		return;
+
+	/*
+	 * FIXED_FILE request isn't tracked in do_exit(), and these
+	 * requests may be submitted to our io_wq as iopoll, so have to
+	 * cancel them before destroying io_wq for avoiding IO hang
+	 */
+	do {
+		struct io_tctx_node *node;
+		unsigned long index;
+
+		ret = 0;
+		xa_for_each(&tctx->xa, index, node) {
+			struct io_ring_ctx *ctx = node->ctx;
+			struct io_task_cancel cancel = { .task = current, .all = true, };
+			enum io_wq_cancel cret;
+
+			io_iopoll_try_reap_events(ctx);
+			cret = io_wq_cancel_cb(tctx->io_wq, io_cancel_task_cb,
+					       &cancel, true);
+			ret |= (cret != IO_WQ_CANCEL_NOTFOUND);
+			cond_resched();
+		}
+	} while (ret);
+}
+
 /*
  * Find any io_uring ctx that this task has registered or done IO on, and cancel
  * requests. @sqd should be not-null IFF it's an SQPOLL thread cancellation.
@@ -3361,6 +3392,7 @@ end_wait:
 		finish_wait(&tctx->wait, &wait);
 	} while (1);
 
+	io_uring_cancel_wq(tctx);
 	io_uring_clean_tctx(tctx);
 	if (cancel_all) {
 		/*

From 32f5dea040ee6e3cc30ac52d23f1674fd5110d03 Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Fri, 1 Sep 2023 13:59:19 -0600
Subject: [PATCH 3/7] io_uring/fdinfo: only print ->sq_array[] if it's there

If a ring is set up with IORING_SETUP_NO_SQARRAY, then we don't have
the SQ array. Don't try to dump info from it through fdinfo if that is
the case.

Reported-by: syzbot+216e2ea6e0bf4a0acdd7@syzkaller.appspotmail.com
Fixes: 2af89abda7d9 ("io_uring: add option to remove SQ indirection")
Reviewed-by: Gabriel Krisman Bertazi
Signed-off-by: Jens Axboe
---
 io_uring/fdinfo.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index 300455b4bc12..c53678875416 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -93,6 +93,8 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
 		struct io_uring_sqe *sqe;
 		unsigned int sq_idx;
 
+		if (ctx->flags & IORING_SETUP_NO_SQARRAY)
+			break;
 		sq_idx = READ_ONCE(ctx->sq_array[entry & sq_mask]);
 		if (sq_idx > sq_mask)
 			continue;

From 76d3ccecfa186af3120e206d62f03db1a94a535f Mon Sep 17 00:00:00 2001
From: Matteo Rizzo
Date: Mon, 21 Aug 2023 17:15:52 -0400
Subject: [PATCH 4/7] io_uring: add a sysctl to disable io_uring system-wide

Introduce a new sysctl (io_uring_disabled) which can be either 0, 1, or
2. When 0 (the default), all processes are allowed to create io_uring
instances, which is the current behavior. When 1, io_uring creation is
disabled (io_uring_setup() will fail with -EPERM) for unprivileged
processes not in the kernel.io_uring_group group. When 2, calls to
io_uring_setup() fail with -EPERM regardless of privilege.
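For illustration only (not part of the patch), once an administrator
sets kernel.io_uring_disabled=2, an unprivileged caller should see ring
setup rejected; a hypothetical check with liburing:

#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	int ret = io_uring_queue_init(8, &ring, 0);

	/* With io_uring_disabled=2 (or =1 for a process lacking both
	 * CAP_SYS_ADMIN and io_uring_group membership), setup is
	 * rejected with -EPERM before any ring is created. */
	if (ret < 0) {
		printf("io_uring_setup: %s\n", strerror(-ret));
		return 1;
	}

	io_uring_queue_exit(&ring);
	return 0;
}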
Signed-off-by: Matteo Rizzo
[JEM: modified to add io_uring_group]
Signed-off-by: Jeff Moyer
Link: https://lore.kernel.org/r/x49y1i42j1z.fsf@segfault.boston.devel.redhat.com
Signed-off-by: Jens Axboe
---
 Documentation/admin-guide/sysctl/kernel.rst | 29 ++++++++++++
 io_uring/io_uring.c                         | 50 +++++++++++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 3800fab1619b..0795d790cc56 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -450,6 +450,35 @@ this allows system administrators to override the
 ``IA64_THREAD_UAC_NOPRINT`` ``prctl`` and avoid logs being flooded.
 
 
+io_uring_disabled
+=================
+
+Prevents all processes from creating new io_uring instances. Enabling this
+shrinks the kernel's attack surface.
+
+= ======================================================================
+0 All processes can create io_uring instances as normal. This is the
+  default setting.
+1 io_uring creation is disabled (io_uring_setup() will fail with
+  -EPERM) for unprivileged processes not in the io_uring_group group.
+  Existing io_uring instances can still be used. See the
+  documentation for io_uring_group for more information.
+2 io_uring creation is disabled for all processes. io_uring_setup()
+  always fails with -EPERM. Existing io_uring instances can still be
+  used.
+= ======================================================================
+
+
+io_uring_group
+==============
+
+When io_uring_disabled is set to 1, a process must either be
+privileged (CAP_SYS_ADMIN) or be in the io_uring_group group in order
+to create an io_uring instance. If io_uring_group is set to -1 (the
+default), only processes with the CAP_SYS_ADMIN capability may create
+io_uring instances.
+
+
 kexec_load_disabled
 ===================
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c6d9e4677073..0f0ba31c3850 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -150,6 +150,31 @@ static void io_queue_sqe(struct io_kiocb *req);
 
 struct kmem_cache *req_cachep;
 
+static int __read_mostly sysctl_io_uring_disabled;
+static int __read_mostly sysctl_io_uring_group = -1;
+
+#ifdef CONFIG_SYSCTL
+static struct ctl_table kernel_io_uring_disabled_table[] = {
+	{
+		.procname = "io_uring_disabled",
+		.data = &sysctl_io_uring_disabled,
+		.maxlen = sizeof(sysctl_io_uring_disabled),
+		.mode = 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = SYSCTL_TWO,
+	},
+	{
+		.procname = "io_uring_group",
+		.data = &sysctl_io_uring_group,
+		.maxlen = sizeof(gid_t),
+		.mode = 0644,
+		.proc_handler = proc_dointvec,
+	},
+	{},
+};
+#endif
+
 struct sock *io_uring_get_socket(struct file *file)
 {
 #if defined(CONFIG_UNIX)
@@ -4070,9 +4095,30 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 	return io_uring_create(entries, &p, params);
 }
 
+static inline bool io_uring_allowed(void)
+{
+	int disabled = READ_ONCE(sysctl_io_uring_disabled);
+	kgid_t io_uring_group;
+
+	if (disabled == 2)
+		return false;
+
+	if (disabled == 0 || capable(CAP_SYS_ADMIN))
+		return true;
+
+	io_uring_group = make_kgid(&init_user_ns, sysctl_io_uring_group);
+	if (!gid_valid(io_uring_group))
+		return false;
+
+	return in_group_p(io_uring_group);
+}
+
 SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 		struct io_uring_params __user *, params)
 {
+	if (!io_uring_allowed())
+		return -EPERM;
+
 	return io_uring_setup(entries, params);
 }
 
@@ -4666,6 +4712,10 @@ static int __init io_uring_init(void)
 				offsetof(struct io_kiocb, cmd.data),
 				sizeof_field(struct io_kiocb, cmd.data), NULL);
 
+#ifdef CONFIG_SYSCTL
+	register_sysctl_init("kernel", kernel_io_uring_disabled_table);
+#endif
+
 	return 0;
 };
 __initcall(io_uring_init);

From 45500dc4e01c167ee063f3dcc22f51ced5b2b1e9 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov
Date: Thu, 7 Sep 2023 13:50:07 +0100
Subject: [PATCH 5/7] io_uring: break out of iowq iopoll on teardown

io-wq will retry iopoll even when it fails with -EAGAIN. If that races
with task exit, which sets TIF_NOTIFY_SIGNAL for all its workers, such
workers might spin retrying iopoll indefinitely, each time failing on
some allocation / waiting / etc. Don't keep spinning if io-wq is dying.
Fixes: 561fb04a6a225 ("io_uring: replace workqueue usage with io-wq")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
---
 io_uring/io-wq.c    | 10 ++++++++++
 io_uring/io-wq.h    |  1 +
 io_uring/io_uring.c |  2 ++
 3 files changed, 13 insertions(+)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 62f345587df5..1ecc8c748768 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -174,6 +174,16 @@ static void io_worker_ref_put(struct io_wq *wq)
 		complete(&wq->worker_done);
 }
 
+bool io_wq_worker_stopped(void)
+{
+	struct io_worker *worker = current->worker_private;
+
+	if (WARN_ON_ONCE(!io_wq_current_is_worker()))
+		return true;
+
+	return test_bit(IO_WQ_BIT_EXIT, &worker->wq->state);
+}
+
 static void io_worker_cancel_cb(struct io_worker *worker)
 {
 	struct io_wq_acct *acct = io_wq_get_acct(worker);
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index 06d9ca90c577..2b2a6406dd8e 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -52,6 +52,7 @@ void io_wq_hash_work(struct io_wq_work *work, void *val);
 
 int io_wq_cpu_affinity(struct io_uring_task *tctx, cpumask_var_t mask);
 int io_wq_max_workers(struct io_wq *wq, int *new_count);
+bool io_wq_worker_stopped(void);
 
 static inline bool io_wq_is_hashed(struct io_wq_work *work)
 {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0f0ba31c3850..58d8dd34a45f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1975,6 +1975,8 @@ fail:
 		if (!needs_poll) {
 			if (!(req->ctx->flags & IORING_SETUP_IOPOLL))
 				break;
+			if (io_wq_worker_stopped())
+				break;
 			cond_resched();
 			continue;
 		}

From 27122c079f5b4b4ecf1323b65700edc57e07bf6e Mon Sep 17 00:00:00 2001
From: Pavel Begunkov
Date: Thu, 7 Sep 2023 13:50:08 +0100
Subject: [PATCH 6/7] io_uring: fix unprotected iopoll overflow

[ 71.490669] WARNING: CPU: 3 PID: 17070 at io_uring/io_uring.c:769 io_cqring_event_overflow+0x47b/0x6b0
[ 71.498381] Call Trace:
[ 71.498590]
[ 71.501858]  io_req_cqe_overflow+0x105/0x1e0
[ 71.502194]  __io_submit_flush_completions+0x9f9/0x1090
[ 71.503537]  io_submit_sqes+0xebd/0x1f00
[ 71.503879]  __do_sys_io_uring_enter+0x8c5/0x2380
[ 71.507360]  do_syscall_64+0x39/0x80

We decoupled CQ locking from ->task_complete but haven't fixed up
places forcing locking for CQ overflows.
Fixes: ec26c225f06f5 ("io_uring: merge iopoll and normal completion paths")
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
---
 io_uring/io_uring.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 58d8dd34a45f..090913acf1db 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -908,7 +908,7 @@ static void __io_flush_post_cqes(struct io_ring_ctx *ctx)
 		struct io_uring_cqe *cqe = &ctx->completion_cqes[i];
 
 		if (!io_fill_cqe_aux(ctx, cqe->user_data, cqe->res, cqe->flags)) {
-			if (ctx->task_complete) {
+			if (ctx->lockless_cq) {
 				spin_lock(&ctx->completion_lock);
 				io_cqring_event_overflow(ctx, cqe->user_data,
 							cqe->res, cqe->flags, 0, 0);
@@ -1566,7 +1566,7 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 
 		if (!(req->flags & REQ_F_CQE_SKIP) &&
 		    unlikely(!io_fill_cqe_req(ctx, req))) {
-			if (ctx->task_complete) {
+			if (ctx->lockless_cq) {
 				spin_lock(&ctx->completion_lock);
 				io_req_cqe_overflow(req);
 				spin_unlock(&ctx->completion_lock);

From 023464fe33a53d7e3fa0a1967a2adcb17e5e40e3 Mon Sep 17 00:00:00 2001
From: Jens Axboe
Date: Thu, 7 Sep 2023 09:41:49 -0600
Subject: [PATCH 7/7] Revert "io_uring: fix IO hang in io_wq_put_and_exit from do_exit()"

This reverts commit b484a40dc1f16edb58e5430105a021e1916e6f27.

This commit cancels all requests with io-wq, not just the ones from the
originating task. This breaks use cases that have thread pools, or just
multiple tasks issuing requests on the same ring. The liburing
regression test for this also shows that problem:

$ test/thread-exit.t
cqe->res=-125, Expected 512

where an IO thread gets its request canceled rather than completing
successfully.

Signed-off-by: Jens Axboe
---
 io_uring/io_uring.c | 32 --------------------------------
 1 file changed, 32 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 090913acf1db..783ed0fff71b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3317,37 +3317,6 @@ static s64 tctx_inflight(struct io_uring_task *tctx, bool tracked)
 	return percpu_counter_sum(&tctx->inflight);
 }
 
-static void io_uring_cancel_wq(struct io_uring_task *tctx)
-{
-	int ret;
-
-	if (!tctx->io_wq)
-		return;
-
-	/*
-	 * FIXED_FILE request isn't tracked in do_exit(), and these
-	 * requests may be submitted to our io_wq as iopoll, so have to
-	 * cancel them before destroying io_wq for avoiding IO hang
-	 */
-	do {
-		struct io_tctx_node *node;
-		unsigned long index;
-
-		ret = 0;
-		xa_for_each(&tctx->xa, index, node) {
-			struct io_ring_ctx *ctx = node->ctx;
-			struct io_task_cancel cancel = { .task = current, .all = true, };
-			enum io_wq_cancel cret;
-
-			io_iopoll_try_reap_events(ctx);
-			cret = io_wq_cancel_cb(tctx->io_wq, io_cancel_task_cb,
-					       &cancel, true);
-			ret |= (cret != IO_WQ_CANCEL_NOTFOUND);
-			cond_resched();
-		}
-	} while (ret);
-}
-
 /*
  * Find any io_uring ctx that this task has registered or done IO on, and cancel
  * requests. @sqd should be not-null IFF it's an SQPOLL thread cancellation.
@@ -3419,7 +3388,6 @@ end_wait:
 		finish_wait(&tctx->wait, &wait);
 	} while (1);
 
-	io_uring_cancel_wq(tctx);
 	io_uring_clean_tctx(tctx);
 	if (cancel_all) {
 		/*