linux

iv/linux

Author	SHA1	Message	Date
Paul E. McKenney	ad4f8c117b	rcu: Prevent expedited GP from enabling tick on offline CPU commit `147f04b14a` upstream. If an RCU expedited grace period starts just when a CPU is in the process of going offline, so that the outgoing CPU has completed its pass through stop-machine but has not yet completed its final dive into the idle loop, RCU will attempt to enable that CPU's scheduling-clock tick via a call to tick_dep_set_cpu(). For this to happen, that CPU has to have been online when the expedited grace period completed its CPU-selection phase. This is pointless: The outgoing CPU has interrupts disabled, so it cannot take a scheduling-clock tick anyway. In addition, the tick_dep_set_cpu() function's eventual call to irq_work_queue_on() will splat as follows: smpboot: CPU 1 is now offline WARNING: CPU: 6 PID: 124 at kernel/irq_work.c:95 +irq_work_queue_on+0x57/0x60 Modules linked in: CPU: 6 PID: 124 Comm: kworker/6:2 Not tainted 5.15.0-rc1+ #3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS +rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014 Workqueue: rcu_gp wait_rcu_exp_gp RIP: 0010:irq_work_queue_on+0x57/0x60 Code: 8b 05 1d c7 ea 62 a9 00 00 f0 00 75 21 4c 89 ce 44 89 c7 e8 +9b 37 fa ff ba 01 00 00 00 89 d0 c3 4c 89 cf e8 3b ff ff ff eb ee <0f> 0b eb b7 +0f 0b eb db 90 48 c7 c0 98 2a 02 00 65 48 03 05 91 6f RSP: 0000:ffffb12cc038fe48 EFLAGS: 00010282 RAX: 0000000000000001 RBX: 0000000000005208 RCX: 0000000000000020 RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff9ad01f45a680 RBP: 000000000004c990 R08: 0000000000000001 R09: ffff9ad01f45a680 R10: ffffb12cc0317db0 R11: 0000000000000001 R12: 00000000fffecee8 R13: 0000000000000001 R14: 0000000000026980 R15: ffffffff9e53ae00 FS: 0000000000000000(0000) GS:ffff9ad01f580000(0000) +knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000000de0c000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: tick_nohz_dep_set_cpu+0x59/0x70 rcu_exp_wait_wake+0x54e/0x870 ? sync_rcu_exp_select_cpus+0x1fc/0x390 process_one_work+0x1ef/0x3c0 ? process_one_work+0x3c0/0x3c0 worker_thread+0x28/0x3c0 ? process_one_work+0x3c0/0x3c0 kthread+0x115/0x140 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 ---[ end trace c5bf75eb6aa80bc6 ]--- This commit therefore avoids invoking tick_dep_set_cpu() on offlined CPUs to limit both futility and false-positive splats. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-09-02 09:18:13 +02:00
Zqiang	a7d21b8585	rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access [ Upstream commit `3c1566bca3` ] For kernels built with CONFIG_PREEMPT_RCU=y, the following scenario can result in a NULL-pointer dereference: CPU1 CPU2 rcu_preempt_deferred_qs_irqrestore rcu_print_task_exp_stall if (special.b.blocked) READ_ONCE(rnp->exp_tasks) != NULL raw_spin_lock_rcu_node np = rcu_next_node_entry(t, rnp) if (&t->rcu_node_entry == rnp->exp_tasks) WRITE_ONCE(rnp->exp_tasks, np) .... raw_spin_unlock_irqrestore_rcu_node raw_spin_lock_irqsave_rcu_node t = list_entry(rnp->exp_tasks->prev, struct task_struct, rcu_node_entry) (if rnp->exp_tasks is NULL, this will dereference a NULL pointer) The problem is that CPU2 accesses the rcu_node structure's->exp_tasks field without holding the rcu_node structure's ->lock and CPU2 did not observe CPU1's change to rcu_node structure's ->exp_tasks in time. Therefore, if CPU1 sets rcu_node structure's->exp_tasks pointer to NULL, then CPU2 might dereference that NULL pointer. This commit therefore holds the rcu_node structure's ->lock while accessing that structure's->exp_tasks field. [ paulmck: Apply Frederic Weisbecker feedback. ] Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-30 12:57:47 +01:00
Paul E. McKenney	2bf501f1bc	rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait() [ Upstream commit `2d7f00b2f0` ] The normal grace period's RCU CPU stall warnings are invoked from the scheduling-clock interrupt handler, and can thus invoke smp_processor_id() with impunity, which allows them to directly invoke dump_cpu_task(). In contrast, the expedited grace period's RCU CPU stall warnings are invoked from process context, which causes the dump_cpu_task() function's calls to smp_processor_id() to complain bitterly in debug kernels. This commit therefore causes synchronize_rcu_expedited_wait() to disable preemption around its call to dump_cpu_task(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-03-11 16:39:48 +01:00
Frederic Weisbecker	0836f94040	rcu/exp: Mark current CPU as exp-QS in IPI loop second pass [ Upstream commit `81f6d49cce` ] Expedited RCU grace periods invoke sync_rcu_exp_select_node_cpus(), which takes two passes over the leaf rcu_node structure's CPUs. The first pass gathers up the current CPU and CPUs that are in dynticks idle mode. The workqueue will report a quiescent state on their behalf later. The second pass sends IPIs to the rest of the CPUs, but excludes the current CPU, incorrectly assuming it has been included in the first pass's list of CPUs. Unfortunately the current CPU may have changed between the first and second pass, due to the fact that the various rcu_node structures' ->lock fields have been dropped, thus momentarily enabling preemption. This means that if the second pass's CPU was not on the first pass's list, it will be ignored completely. There will be no IPI sent to it, and there will be no reporting of quiescent states on its behalf. Unfortunately, the expedited grace period will nevertheless be waiting for that CPU to report a quiescent state, but with that CPU having no reason to believe that such a report is needed. The result will be an expedited grace period stall. Fix this by no longer excluding the current CPU from consideration during the second pass. Fixes: `b9ad4d6ed1` ("rcu: Avoid self-IPI in sync_rcu_exp_select_node_cpus()") Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-01-27 10:53:55 +01:00
Neeraj Upadhyay	226d68fb6c	rcu: Fix existing exp request check in sync_sched_exp_online_cleanup() [ Upstream commit `f0b2b2df54` ] The sync_sched_exp_online_cleanup() checks to see if RCU needs an expedited quiescent state from the incoming CPU, sending it an IPI if so. Before sending IPI, it checks whether expedited qs need has been already requested for the incoming CPU, by checking rcu_data.cpu_no_qs.b.exp for the current cpu, on which sync_sched_exp_online_cleanup() is running. This works for the case where incoming CPU is same as self. However, for the case where incoming CPU is different from self, expedited request won't get marked, which can potentially delay reporting of expedited quiescent state for the incoming CPU. Fixes: `e015a34112` ("rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-11-18 14:04:02 +01:00
Paul E. McKenney	7487ea07df	rcu: Initialize at declaration time in rcu_exp_handler() This commit moves the initialization of the CONFIG_PREEMPT=n version of the rcu_exp_handler() function's rdp and rnp local variables into their respective declarations to save a couple lines of code. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-08-24 18:36:03 -07:00
Paul E. McKenney	68c2f27e01	rcu: Expedited grace-period sleeps to idle priority This commit converts the schedule_timeout_uninterruptible() call used by RCU's expedited grace-period processing to schedule_timeout_idle(). This conversion avoids polluting the load-average with RCU-related sleeping. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-06-29 11:58:50 -07:00
Paul E. McKenney	f736e0f1a5	Merge branches 'fixes.2020.04.27a', 'kfree_rcu.2020.04.27a', 'rcu-tasks.2020.04.27a', 'stall.2020.04.27a' and 'torture.2020.05.07a' into HEAD fixes.2020.04.27a: Miscellaneous fixes. kfree_rcu.2020.04.27a: Changes related to kfree_rcu(). rcu-tasks.2020.04.27a: Addition of new RCU-tasks flavors. stall.2020.04.27a: RCU CPU stall-warning updates. torture.2020.05.07a: Torture-test updates.	2020-05-07 10:18:32 -07:00
Paul E. McKenney	654db05cee	rcu: Use data_race() for RCU expedited CPU stall-warning prints Although the accesses used to determine whether or not an expedited stall should be printed are an integral part of the concurrency algorithm governing use of the corresponding variables, the values that are simply printed are ancillary. As such, it is best to use data_race() for these accesses in order to provide the greatest latitude in the use of KCSAN for the other accesses that are an integral part of the algorithm. This commit therefore changes the relevant uses of READ_ONCE() to data_race(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-04-27 11:04:37 -07:00
Lai Jiangshan	5f5fa7ea89	rcu: Don't use negative nesting depth in __rcu_read_unlock() Now that RCU flavors have been consolidated, an RCU-preempt rcu_read_unlock() in an interrupt or softirq handler cannot possibly end the RCU read-side critical section. Consider the old vulnerability involving rcu_read_unlock() being invoked within such a handler that interrupted an __rcu_read_unlock_special(), in which a wakeup might be invoked with a scheduler lock held. Because rcu_read_unlock_special() no longer does wakeups in such situations, it is no longer necessary for __rcu_read_unlock() to set the nesting level negative. This commit therefore removes this recursion-protection code from __rcu_read_unlock(). [ paulmck: Let rcu_exp_handler() continue to call rcu_report_exp_rdp(). ] [ paulmck: Adjust other checks given no more negative nesting. ] Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-04-27 11:03:50 -07:00
Paul E. McKenney	314eeb43e5	rcu: Add *_ONCE() and data_race() to rcu_node ->exp_tasks plus locking There are lockless loads from the rcu_node structure's ->exp_tasks field, so this commit causes all stores to use WRITE_ONCE() and all lockless loads to use READ_ONCE() or data_race(), with the latter for debug prints. This code also did a unprotected traversal of the linked list pointed into by ->exp_tasks, so this commit also acquires the rcu_node structure's ->lock to properly protect this traversal. This list was traversed unprotected only when printing an RCU CPU stall warning for an expedited grace period, so the odds of seeing this in production are not all that high. This data race was reported by KCSAN. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-04-27 11:01:15 -07:00
Paul E. McKenney	aa93ec620b	Merge branches 'doc.2020.02.27a', 'fixes.2020.03.21a', 'kfree_rcu.2020.02.20a', 'locktorture.2020.02.20a', 'ovld.2020.02.20a', 'rcu-tasks.2020.02.20a', 'srcu.2020.02.20a' and 'torture.2020.02.20a' into HEAD doc.2020.02.27a: Documentation updates. fixes.2020.03.21a: Miscellaneous fixes. kfree_rcu.2020.02.20a: Updates to kfree_rcu(). locktorture.2020.02.20a: Lock torture-test updates. ovld.2020.02.20a: Updates to callback-overload handling. rcu-tasks.2020.02.20a: RCU-tasks updates. srcu.2020.02.20a: SRCU updates. torture.2020.02.20a: Torture-test updates.	2020-03-21 17:15:11 -07:00
Paul E. McKenney	58c53360b3	rcutorture: Allow boottime stall warnings to be suppressed In normal production, an RCU CPU stall warning at boottime is often just as bad as at any other time. In fact, given the desire for fast boot, any sort of long-term stall at boot is a bad idea. However, heavy rcutorture testing on large hyperthreaded systems can generate boottime RCU CPU stalls as a matter of course. This commit therefore provides a kernel boot parameter that suppresses reporting of boottime RCU CPU stall warnings and similarly of rcutorture writer stalls. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 16:03:30 -08:00
Paul E. McKenney	59ee0326cc	rcutorture: Suppress forward-progress complaints during early boot Some larger systems can take in excess of 50 seconds to complete their early boot initcalls prior to spawing init. This does not in any way help the forward-progress judgments of built-in rcutorture (when rcutorture is built as a module, the insmod or modprobe command normally cannot happen until some time after boot completes). This commit therefore suppresses such complaints until about the time that init is spawned. This also includes a fix to a stupid error located by kbuild test robot. [ paulmck: Apply kbuild test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> [ paulmck: Fix to nohz_full slow-expediting recovery logic, per bpetkov. ] [ paulmck: Restrict splat to CONFIG_PREEMPT_RT=y kernels and simplify. ] Tested-by: Borislav Petkov <bp@alien8.de>	2020-02-20 16:03:30 -08:00
Paul E. McKenney	b0c18c8773	rcu: Add WRITE_ONCE to rcu_node ->exp_seq_rq store The rcu_node structure's ->exp_seq_rq field is read locklessly, so this commit adds the WRITE_ONCE() to a load in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:22 -08:00
Paul E. McKenney	24bb9eccf7	rcu: Fix exp_funnel_lock()/rcu_exp_wait_wake() datarace The rcu_node structure's ->exp_seq_rq field is accessed locklessly, so updates must use WRITE_ONCE(). This commit therefore adds the needed WRITE_ONCE() invocation where it was missed. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-02-20 15:58:21 -08:00
Paul E. McKenney	59d8cc6b2e	rcu: Forgive slow expedited grace periods at boot time Boot-time processing often loops in the kernel longer than one might prefer, which can prevent expedited grace periods from completing in a timely manner. This in turn triggers a splat In nohz_full CPUs One could argue that long-looping code should be fixed, but on the other hand, boot time is a bit special. This commit therefore removes the splat. Later commits will add the splat back in, but in a way that removes false positives. Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-01-25 12:00:40 -08:00
Paul E. McKenney	0e247386d9	Merge branches 'doc.2019.12.10a', 'exp.2019.12.09a', 'fixes.2020.01.24a', 'kfree_rcu.2020.01.24a', 'list.2020.01.10a', 'preempt.2020.01.24a' and 'torture.2019.12.09a' into HEAD doc.2019.12.10a: Documentations updates exp.2019.12.09a: Expedited grace-period updates fixes.2020.01.24a: Miscellaneous fixes kfree_rcu.2020.01.24a: Batch kfree_rcu() work list.2020.01.10a: RCU-protected-list updates preempt.2020.01.24a: Preemptible RCU updates torture.2019.12.09a: Torture-test updates	2020-01-24 10:37:27 -08:00
Lai Jiangshan	77339e61aa	rcu: Provide wrappers for uses of ->rcu_read_lock_nesting This commit provides wrapper functions for uses of ->rcu_read_lock_nesting to improve readability and to ease future changes to support inlining of __rcu_read_lock() and __rcu_read_unlock(). Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2020-01-24 10:27:33 -08:00
Sebastian Andrzej Siewior	90326f0521	rcu: Use CONFIG_PREEMPTION where appropriate The config option `CONFIG_PREEMPT' is used for the preemption model "Low-Latency Desktop". The config option `CONFIG_PREEMPTION' is enabled when kernel preemption is enabled which is true for the preemption model `CONFIG_PREEMPT' and `CONFIG_PREEMPT_RT'. Use `CONFIG_PREEMPTION' if it applies to both preemption models and not just to `CONFIG_PREEMPT'. Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: rcu@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-12-09 12:37:51 -08:00
Paul E. McKenney	df1e849ae4	rcu: Enable tick for nohz_full CPUs slow to provide expedited QS An expedited grace period can be stalled by a nohz_full CPU looping in kernel context. This possibility is currently handled by some carefully crafted checks in rcu_read_unlock_special() that enlist help from ksoftirqd when permitted by the scheduler. However, it is exactly these checks that require the scheduler avoid holding any of its rq or pi locks across rcu_read_unlock() without also having held them across the entire RCU read-side critical section. It would therefore be very nice if expedited grace periods could handle nohz_full CPUs looping in kernel context without such checks. This commit therefore adds code to the expedited grace period's wait and cleanup code that forces the scheduler-clock interrupt on for CPUs that fail to quickly supply a quiescent state. "Quickly" is currently a hard-coded single-jiffy delay. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-12-09 12:32:59 -08:00
Paul E. McKenney	28f0361fdf	rcu: Replace synchronize_sched_expedited_wait() "_sched" with "_rcu" After RCU flavor consolidation, synchronize_sched_expedited_wait() does both RCU-preempt and RCU-sched, whichever happens to have been built into the running kernel. This commit therefore changes this function's name to synchronize_rcu_expedited_wait() to reflect its new generic nature. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-12-09 12:24:59 -08:00
Paul E. McKenney	de8cd0a533	rcu: Update tree_exp.h function-header comments The function-header comments in kernel/rcu/tree_exp.h have gotten a bit out of date, so this commit updates a number of them. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-12-09 12:24:58 -08:00
Paul E. McKenney	6c7d7dbf5b	rcu: Rename sync_rcu_preempt_exp_done() to sync_rcu_exp_done() Now that the RCU flavors have been consolidated, there is one common function for checking to see if an expedited RCU grace period has completed, namely sync_rcu_preempt_exp_done(). Because this function is no longer specific to RCU-preempt, this commit removes the "_preempt" from its name. This commit also changes sync_rcu_preempt_exp_done_unlocked() to sync_rcu_exp_done_unlocked() for the same reason. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-12-09 12:24:58 -08:00
Neeraj Upadhyay	4bc6b745e5	rcu: Allow only one expedited GP to run concurrently with wakeups The current expedited RCU grace-period code expects that a task requesting an expedited grace period cannot awaken until that grace period has reached the wakeup phase. However, it is possible for a long preemption to result in the waiting task never sleeping. For example, consider the following sequence of events: 1. Task A starts an expedited grace period by invoking synchronize_rcu_expedited(). It proceeds normally up to the wait_event() near the end of that function, and is then preempted (or interrupted or whatever). 2. The expedited grace period completes, and a kworker task starts the awaken phase, having incremented the counter and acquired the rcu_state structure's .exp_wake_mutex. This kworker task is then preempted or interrupted or whatever. 3. Task A resumes and enters wait_event(), which notes that the expedited grace period has completed, and thus doesn't sleep. 4. Task B starts an expedited grace period exactly as did Task A, complete with the preemption (or whatever delay) just before the call to wait_event(). 5. The expedited grace period completes, and another kworker task starts the awaken phase, having incremented the counter. However, it blocks when attempting to acquire the rcu_state structure's .exp_wake_mutex because step 2's kworker task has not yet released it. 6. Steps 4 and 5 repeat, resulting in overflow of the rcu_node structure's ->exp_wq[] array. In theory, this is harmless. Tasks waiting on the various ->exp_wq[] array will just be spuriously awakened, but they will just sleep again on noting that the rcu_state structure's ->expedited_sequence value has not advanced far enough. In practice, this wastes CPU time and is an accident waiting to happen. This commit therefore moves the rcu_exp_gp_seq_end() call that officially ends the expedited grace period (along with associate tracing) until after the ->exp_wake_mutex has been acquired. This prevents Task A from awakening prematurely, thus preventing more than one expedited grace period from being in flight during a previous expedited grace period's wakeup phase. Fixes: `3b5f668e71` ("rcu: Overlap wakeups with next expedited grace period") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> [ paulmck: Added updated comment. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-12-09 12:24:57 -08:00
Neeraj Upadhyay	fd6bc19d76	rcu: Fix missed wakeup of exp_wq waiters Tasks waiting within exp_funnel_lock() for an expedited grace period to elapse can be starved due to the following sequence of events: 1. Tasks A and B both attempt to start an expedited grace period at about the same time. This grace period will have completed when the lower four bits of the rcu_state structure's ->expedited_sequence field are 0b'0100', for example, when the initial value of this counter is zero. Task A wins, and thus does the actual work of starting the grace period, including acquiring the rcu_state structure's .exp_mutex and sets the counter to 0b'0001'. 2. Because task B lost the race to start the grace period, it waits on ->expedited_sequence to reach 0b'0100' inside of exp_funnel_lock(). This task therefore blocks on the rcu_node structure's ->exp_wq[1] field, keeping in mind that the end-of-grace-period value of ->expedited_sequence (0b'0100') is shifted down two bits before indexing the ->exp_wq[] field. 3. Task C attempts to start another expedited grace period, but blocks on ->exp_mutex, which is still held by Task A. 4. The aforementioned expedited grace period completes, so that ->expedited_sequence now has the value 0b'0100'. A kworker task therefore acquires the rcu_state structure's ->exp_wake_mutex and starts awakening any tasks waiting for this grace period. 5. One of the first tasks awakened happens to be Task A. Task A therefore releases the rcu_state structure's ->exp_mutex, which allows Task C to start the next expedited grace period, which causes the lower four bits of the rcu_state structure's ->expedited_sequence field to become 0b'0101'. 6. Task C's expedited grace period completes, so that the lower four bits of the rcu_state structure's ->expedited_sequence field now become 0b'1000'. 7. The kworker task from step 4 above continues its wakeups. Unfortunately, the wake_up_all() refetches the rcu_state structure's .expedited_sequence field: wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]); This results in the wakeup being applied to the rcu_node structure's ->exp_wq[2] field, which is unfortunate given that Task B is instead waiting on ->exp_wq[1]. On a busy system, no harm is done (or at least no permanent harm is done). Some later expedited grace period will redo the wakeup. But on a quiet system, such as many embedded systems, it might be a good long time before there was another expedited grace period. On such embedded systems, this situation could therefore result in a system hang. This issue manifested as DPM device timeout during suspend (which usually qualifies as a quiet time) due to a SCSI device being stuck in _synchronize_rcu_expedited(), with the following stack trace: schedule() synchronize_rcu_expedited() synchronize_rcu() scsi_device_quiesce() scsi_bus_suspend() dpm_run_callback() __device_suspend() This commit therefore prevents such delays, timeouts, and hangs by making rcu_exp_wait_wake() use its "s" argument consistently instead of refetching from rcu_state.expedited_sequence. Fixes: `3b5f668e71` ("rcu: Overlap wakeups with next expedited grace period") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-12-09 12:24:57 -08:00
Paul E. McKenney	aca2991a25	rcu: Substitute lookup for bit-twiddling in sync_rcu_exp_select_node_cpus() The code in sync_rcu_exp_select_node_cpus() calculates the current CPU's mask within its rcu_node structure's bitmasks, but this has already been computed in the ->grpmask field of that CPU's rcu_data structure. This commit therefore just uses this ->grpmask field. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-12-09 12:24:57 -08:00
Boqun Feng	9f08cf0886	rcu: Avoid modifying mask_ofl_ipi in sync_rcu_exp_select_node_cpus() The "mask_ofl_ipi" is used to track which CPUs get IPIed, however in the IPI sending loop, "mask_ofl_ipi" along with another variable "mask_ofl_test" might also get modified to record which CPUs' quiesent states must be reported by the sync_rcu_exp_select_node_cpus() at the end of sync_rcu_exp_select_node_cpus(). This overlap of roles can be confusing, so this patch cleans things a little by using "mask_ofl_ipi" solely for determining which CPUs must be IPIed and "mask_ofl_test" for solely determining on behalf of which CPUs sync_rcu_exp_select_node_cpus() must report a quiscent state. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Marco Elver <elver@google.com>	2019-12-09 12:24:56 -08:00
Paul E. McKenney	15c7c972cd	rcu: Use *_ONCE() to protect lockless ->expmask accesses The rcu_node structure's ->expmask field is accessed locklessly when starting a new expedited grace period and when reporting an expedited RCU CPU stall warning. This commit therefore handles the former by taking a snapshot of ->expmask while the lock is held and the latter by applying READ_ONCE() to lockless reads and WRITE_ONCE() to the corresponding updates. Link: https://lore.kernel.org/lkml/CANpmjNNmSOagbTpffHr4=Yedckx9Rm2NuGqC9UqE+AOz5f1-ZQ@mail.gmail.com Reported-by: syzbot+134336b86f728d6e55a0@syzkaller.appspotmail.com Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Marco Elver <elver@google.com>	2019-12-09 12:24:56 -08:00
Mukesh Ojha	511b44f759	rcu: Fix spelling mistake "greate"->"great" This commit fixes a spelling mistake in file tree_exp.h. Signed-off-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-08-12 11:25:06 -07:00
Paul E. McKenney	fbad01af8c	rcu: Add destroy_work_on_stack() to match INIT_WORK_ONSTACK() The synchronize_rcu_expedited() function has an INIT_WORK_ONSTACK(), but lacks the corresponding destroy_work_on_stack(). This commit therefore adds destroy_work_on_stack(). Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com>	2019-08-01 14:05:51 -07:00
Paul E. McKenney	11ca7a9d54	Merge branches 'consolidate.2019.05.28a', 'doc.2019.05.28a', 'fixes.2019.06.13a', 'srcu.2019.05.28a', 'sync.2019.05.28a' and 'torture.2019.05.28a' into HEAD consolidate.2019.05.28a: RCU flavor consolidation cleanups and optmizations. doc.2019.05.28a: Documentation updates. fixes.2019.06.13a: Miscellaneous fixes. srcu.2019.05.28a: SRCU updates. sync.2019.05.28a: RCU-sync flavor consolidation. torture.2019.05.28a: Torture-test updates.	2019-06-19 09:21:46 -07:00
Paul E. McKenney	96050c68be	rcu: Upgrade sync_exp_work_done() to smp_mb() The sync_exp_work_done() function uses smp_mb__before_atomic(), but there is no obvious atomic in the ensuing code. The ordering is absolutely required for grace periods to work correctly, so this commit upgrades the smp_mb__before_atomic() to smp_mb(). Fixes: `6fba2b3767` ("rcu: Remove deprecated RCU debugfs tracing code") Reported-by: Andrea Parri <andrea.parri@amarulasolutions.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-06-13 15:33:19 -07:00
Jiang Biao	f0b6356273	rcu: Remove unused rdp local from synchronize_rcu_expedited() Because rdp is initialized but never used in synchronize_rcu_expedited(), this commit removes it. Signed-off-by: Jiang Biao <benbjiang@tencent.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-05-28 08:48:19 -07:00
Paul E. McKenney	1bb336443c	rcu: Rename rcu_data's ->deferred_qs to ->exp_deferred_qs The rcu_data structure's ->deferred_qs field is used to indicate that the current CPU is blocking an expedited grace period (perhaps a future one). Given that it is used only for expedited grace periods, its current name is misleading, so this commit renames it to ->exp_deferred_qs. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-05-28 08:48:19 -07:00
Paul E. McKenney	e015a34112	rcu: Avoid self-IPI in sync_sched_exp_online_cleanup() The sync_sched_exp_online_cleanup() is invoked at online time to handle the case where the start of an expedited grace period ran concurrently with a CPU being taken offline and then immediately being placed online. It checks to see if RCU needs an expedited quiescent state from the incoming CPU, sending it an IPI if so. However, it is quite possible that sync_sched_exp_online_cleanup() is running on that CPU, in which case it is considerably less overhead to simply request the quiescent state locally instead of simulating a self-IPI. This commit therefore places the last few lines of rcu_exp_handler() into a new rcu_exp_need_qs() function, which is invoked both by rcu_exp_handler() and by sync_sched_exp_online_cleanup() in the self-IPI case. This also reduces the rcu_exp_handler() function's state space by removing the direct call that this smp_call_function_single() uses to emulate the requested self-IPI. This in turn will allow tighter error checking in rcu_is_cpu_rrupt_from_idle(). Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>	2019-05-25 14:50:50 -07:00
Paul E. McKenney	b9ad4d6ed1	rcu: Avoid self-IPI in sync_rcu_exp_select_node_cpus() Although sync_rcu_exp_select_node_cpus() treats the current CPU as being in a quiescent state, it might well migrate to some other CPU before reaching the smp_call_function_single(), which could then result in an unnecessary simulated self-IPI. This commit therefore instead simply refuses to invoke smp_call_function_single() on the current CPU, which causes the later rcu_report_exp_cpu_mult() to report this CPU's quiescent state with less overhead. This also reduces the rcu_exp_handler() function's state space by removing the direct call that this smp_call_function_single() uses to emulate the requested self-IPI. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> [ paulmck: Use get_cpu() instead of preempt_disable() per Joel Fernandes. ]	2019-05-25 14:50:50 -07:00
Paul E. McKenney	6cdbc07a5a	Merge branches 'consolidate.2019.04.09a', 'doc.2019.03.26b', 'fixes.2019.03.26b', 'srcu.2019.03.26b', 'stall.2019.03.26b' and 'torture.2019.03.26b' into HEAD consolidate.2019.04.09a: Lingering RCU flavor consolidation cleanups. doc.2019.03.26b: Documentation updates. fixes.2019.03.26b: Miscellaneous fixes. srcu.2019.03.26b: SRCU updates. stall.2019.03.26b: RCU CPU stall warning updates. torture.2019.03.26b: Torture-test updates.	2019-04-09 08:08:13 -07:00
Paul E. McKenney	d87cda5094	rcu: Move rcu_print_task_exp_stall() to tree_exp.h Because expedited CPU stall warnings are contained within the kernel/rcu/tree_exp.h file, rcu_print_task_exp_stall() should live there too. This commit carries out the required code motion. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-03-26 14:40:13 -07:00
Paul E. McKenney	add0d37b4f	rcu: Correct READ_ONCE()/WRITE_ONCE() for ->rcu_read_unlock_special The task_struct structure's ->rcu_read_unlock_special field is only ever read or written by the owning task, but it is accessed both at process and interrupt levels. It may therefore be accessed using plain reads and writes while interrupts are disabled, but must be accessed using READ_ONCE() and WRITE_ONCE() or better otherwise. This commit makes a few adjustments to align with this discipline. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-03-26 14:38:38 -07:00
Paul E. McKenney	f1a98045ab	rcu: Fix typo in tree_exp.h comment This commit changes a rcu_exp_handler() comment from rcu_preempt_defer_qs() to rcu_preempt_deferred_qs() in order to better match reality. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-03-26 14:38:38 -07:00
Paul E. McKenney	e7ffb4eb9a	Merge branches 'doc.2019.01.26a', 'fixes.2019.01.26a', 'sil.2019.01.26a', 'spdx.2019.02.09a', 'srcu.2019.01.26a' and 'torture.2019.01.26a' into HEAD doc.2019.01.26a: Documentation updates. fixes.2019.01.26a: Miscellaneous fixes. sil.2019.01.26a: Removal of a few more spin_is_locked() instances. spdx.2019.02.09a: Add SPDX identifiers to RCU files srcu.2019.01.26a: SRCU updates. torture.2019.01.26a: Torture-test updates.	2019-02-09 08:47:52 -08:00
Paul E. McKenney	22e4092531	rcu/tree: Convert to SPDX license identifier Replace the license boiler plate with a SPDX license identifier. While in the area, update an email address. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> [ paulmck: Update .h file SPDX comment format per Joe Perches. ] Reviewed-by: Thomas Gleixner <tglx@linutronix.de>	2019-02-09 08:44:10 -08:00
Paul E. McKenney	5a0874c1d1	rcu: Remove preemption disabling from expedited CPU selection It turns out that it is queue_delayed_work_on() rather than queue_work_on() that has difficulties when used concurrently with CPU-hotplug removal operations. It is therefore unnecessary to protect CPU identification and queue_work_on() with preempt_disable(). This commit therefore removes the preempt_disable() and preempt_enable() from sync_rcu_exp_select_cpus(), which has the further benefit of reducing the number of changes that must be maintained in the -rt patchset. Reported-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Sebastian Siewior <bigeasy@linutronix.de> Suggested-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-01-25 15:35:23 -08:00
Paul E. McKenney	8923072664	rcu: Inline _synchronize_rcu_expedited() into synchronize_rcu_expedited() Now that _synchronize_rcu_expedited() has only one caller, and given that this is a tail call, this commit inlines _synchronize_rcu_expedited() into synchronize_rcu_expedited(). Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-01-25 15:28:29 -08:00
Paul E. McKenney	e5bc3af773	rcu: Consolidate PREEMPT and !PREEMPT synchronize_rcu() Now that rcu_blocking_is_gp() makes the correct immediate-return decision for both PREEMPT and !PREEMPT, a single implementation of synchronize_rcu() will work correctly under both configurations. This commit therefore eliminates a few lines of code by consolidating the two implementations of synchronize_rcu(). Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-01-25 15:28:28 -08:00
Paul E. McKenney	3cd4ca47aa	rcu: Consolidate PREEMPT and !PREEMPT synchronize_rcu_expedited() The CONFIG_PREEMPT=n and CONFIG_PREEMPT=y implementations of synchronize_rcu_expedited() are quite similar, and with small modifications to rcu_blocking_is_gp() can be made identical. This commit therefore makes this change in order to save a few lines of code and to reduce the amount of duplicate code. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-01-25 15:28:27 -08:00
Paul E. McKenney	142d106d5e	rcu: Determine expedited-GP IPI handler at build time Back when there could be multiple RCU flavors running in the same kernel at the same time, it was necessary to specify the expedited grace-period IPI handler at runtime. Now that there is only one RCU flavor, the IPI handler can be determined at build time. There is therefore no longer any reason for the RCU-preempt and RCU-sched IPI handlers to have different names, nor is there any reason to pass these handlers in function arguments and in the data structures enclosing workqueues. This commit therefore makes all these changes, pushing the specification of the expedited grace-period IPI handler down to the point of use. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-01-25 15:28:27 -08:00
Paul E. McKenney	1de462ed85	rcu: Make expedited IPI handler return after handling critical section During expedited RCU grace-period initialization, IPIs are sent to all non-idle online CPUs. The IPI handler checks to see if the CPU is in quiescent state, reporting one if so. This handler looks at three different cases: (1) The CPU is not in an rcu_read_lock()-based critical section, (2) The CPU is in the process of exiting an rcu_read_lock()-based critical section, and (3) The CPU is in an rcu_read_lock()-based critical section. In case (2), execution falls through into case (3). This is harmless from a functionality viewpoint, but can result in needless overhead during an improbable corner case. This commit therefore adds the "return" statement needed to prevent fall-through. Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>	2019-01-25 15:28:24 -08:00
Paul E. McKenney	05f415715c	rcu: Speed up expedited GPs when interrupting RCU reader In PREEMPT kernels, an expedited grace period might send an IPI to a CPU that is executing an RCU read-side critical section. In that case, it would be nice if the rcu_read_unlock() directly interacted with the RCU core code to immediately report the quiescent state. And this does happen in the case where the reader has been preempted. But it would also be a nice performance optimization if immediate reporting also happened in the preemption-free case. This commit therefore adds an ->exp_hint field to the task_struct structure's ->rcu_read_unlock_special field. The IPI handler sets this hint when it has interrupted an RCU read-side critical section, and this causes the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(), which, if preemption is enabled, reports the quiescent state immediately. If preemption is disabled, then the report is required to be deferred until preemption (or bottom halves or interrupts or whatever) is re-enabled. Because this is a hint, it does nothing for more complicated cases. For example, if the IPI interrupts an RCU reader, but interrupts are disabled across the rcu_read_unlock(), but another rcu_read_lock() is executed before interrupts are re-enabled, the hint will already have been cleared. If you do crazy things like this, reporting will be deferred until some later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar. Reported-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com> Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>	2018-11-12 09:03:59 -08:00

1 2

100 Commits