diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html
new file mode 100644
index 000000000000..e5b42a798ff3
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html
@@ -0,0 +1,9 @@
+ [9-line HTML wrapper page omitted]
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
new file mode 100644
index 000000000000..8651b0b4fd79
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
@@ -0,0 +1,707 @@

A Tour Through TREE_RCU's Grace-Period Memory Ordering
August 8, 2017
+This article was contributed by Paul E. McKenney
This document gives a rough visual overview of how Tree RCU's
grace-period memory ordering guarantee is provided.
RCU grace periods provide extremely strong memory-ordering guarantees
for non-idle non-offline code.
Any code that happens after the end of a given RCU grace period is guaranteed
to see the effects of all accesses prior to the beginning of that grace
period that are within RCU read-side critical sections.
Similarly, any code that happens before the beginning of a given RCU grace
period is guaranteed to not see the effects of all accesses following the end
of that grace period that are within RCU read-side critical sections.
This guarantee is particularly pervasive for synchronize_sched(), +for which RCU-sched read-side critical sections include any region +of code for which preemption is disabled. +Given that each individual machine instruction can be thought of as +an extremely small region of preemption-disabled code, one can think of +synchronize_sched() as smp_mb() on steroids. + +
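For example, consider the following sketch, in which x, y, r1, and r2
are hypothetical variables used only for illustration.
Because the reader's critical section cannot span the entire grace
period, a reader that sees the updater's post-grace-period store is
guaranteed to also see the pre-grace-period store:

	int x, y;

	void reader(void)
	{
		int r1, r2;

		preempt_disable();	/* Begin RCU-sched read-side critical section. */
		r1 = READ_ONCE(y);
		r2 = READ_ONCE(x);
		preempt_enable();	/* End RCU-sched read-side critical section. */
		WARN_ON(r1 == 1 && r2 == 0);	/* Guaranteed not to trigger. */
	}

	void updater(void)
	{
		WRITE_ONCE(x, 1);
		synchronize_sched();	/* Wait for pre-existing readers. */
		WRITE_ONCE(y, 1);
	}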
RCU updaters use this guarantee by splitting their updates into +two phases, one of which is executed before the grace period and +the other of which is executed after the grace period. +In the most common use case, phase one removes an element from +a linked RCU-protected data structure, and phase two frees that element. +For this to work, any readers that have witnessed state prior to the +phase-one update (in the common case, removal) must not witness state +following the phase-two update (in the common case, freeing). + +
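A minimal sketch of this two-phase pattern, assuming a hypothetical
struct foo on an RCU-protected linked list guarded by a hypothetical
foo_lock:

	struct foo {
		struct list_head list;
		int data;
	};

	static DEFINE_SPINLOCK(foo_lock);

	void remove_and_free(struct foo *fp)
	{
		spin_lock(&foo_lock);
		list_del_rcu(&fp->list);	/* Phase one: removal. */
		spin_unlock(&foo_lock);
		synchronize_rcu();		/* Wait for pre-existing readers. */
		kfree(fp);			/* Phase two: freeing. */
	}

Any reader that can still observe fp must have started its critical
section before the grace period began, and therefore must finish
before synchronize_rcu() returns, so the kfree() is safe.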
The RCU implementation provides this guarantee using a network +of lock-based critical sections, memory barriers, and per-CPU +processing, as is described in the following sections. + +
The workhorse for RCU's grace-period memory ordering is the +critical section for the rcu_node structure's +->lock. +These critical sections use helper functions for lock acquisition, including +raw_spin_lock_rcu_node(), +raw_spin_lock_irq_rcu_node(), and +raw_spin_lock_irqsave_rcu_node(). +Their lock-release counterparts are +raw_spin_unlock_rcu_node(), +raw_spin_unlock_irq_rcu_node(), and +raw_spin_unlock_irqrestore_rcu_node(), +respectively. +For completeness, a +raw_spin_trylock_rcu_node() +is also provided. +The key point is that the lock-acquisition functions, including +raw_spin_trylock_rcu_node(), all invoke +smp_mb__after_unlock_lock() immediately after successful +acquisition of the lock. + +
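These helpers are thin wrappers.
For example, raw_spin_lock_rcu_node() is defined in kernel/rcu/rcu.h
roughly as follows:

	#define raw_spin_lock_rcu_node(p)				\
	do {								\
		raw_spin_lock(&ACCESS_PRIVATE(p, lock));		\
		smp_mb__after_unlock_lock();			\
	} while (0)

On strongly ordered systems smp_mb__after_unlock_lock() is a no-op,
but on weakly ordered systems it upgrades the lock's ordering to the
full ordering that the rest of this document relies upon.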
Therefore, for any given rcu_node structure, any access
happening before one of the above lock-release functions will be seen
by all CPUs as happening before any access happening after a later
one of the above lock-acquisition functions.
Furthermore, any access happening before one of the
above lock-release functions on any given CPU will be seen by all
CPUs as happening before any access happening after a later one
of the above lock-acquisition functions executing on that same CPU,
even if the lock-release and lock-acquisition functions are operating
on different rcu_node structures.
Tree RCU uses these two ordering guarantees to form an ordering
network among all CPUs that were in any way involved in the grace
period, including any CPUs that came online or went offline during
the grace period in question.
The following litmus test exhibits the ordering effects of these +lock-acquisition and lock-release functions: + +
	int x, y, z;

	void task0(void)
	{
		raw_spin_lock_rcu_node(rnp);
		WRITE_ONCE(x, 1);
		r1 = READ_ONCE(y);
		raw_spin_unlock_rcu_node(rnp);
	}

	void task1(void)
	{
		raw_spin_lock_rcu_node(rnp);
		WRITE_ONCE(y, 1);
		r2 = READ_ONCE(z);
		raw_spin_unlock_rcu_node(rnp);
	}

	void task2(void)
	{
		WRITE_ONCE(z, 1);
		smp_mb();
		r3 = READ_ONCE(x);
	}

	WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
The WARN_ON() is evaluated at “the end of time”,
after all changes have propagated throughout the system.
Without the smp_mb__after_unlock_lock() provided by the
acquisition functions, this WARN_ON() could trigger, for example,
on PowerPC.
The smp_mb__after_unlock_lock() invocations prevent this
WARN_ON() from triggering.
This approach must be extended to include idle CPUs, which need +RCU's grace-period memory ordering guarantee to extend to any +RCU read-side critical sections preceding and following the current +idle sojourn. +This case is handled by calls to the strongly ordered +atomic_add_return() read-modify-write atomic operation that +is invoked within rcu_dynticks_eqs_enter() at idle-entry +time and within rcu_dynticks_eqs_exit() at idle-exit time. +The grace-period kthread invokes rcu_dynticks_snap() and +rcu_dynticks_in_eqs_since() (both of which invoke +an atomic_add_return() of zero) to detect idle CPUs. + +
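The following toy model (not the actual implementation, which uses a
per-CPU counter and packs additional state into it) illustrates the
convention described in stallwarn.txt: the counter is even while the
CPU is idle and odd otherwise, and every transition and every sample
is a fully ordered atomic operation:

	static atomic_t toy_dynticks = ATOMIC_INIT(1);	/* Odd: non-idle. */

	static void toy_eqs_enter(void)		/* Idle entry. */
	{
		int seq = atomic_add_return(1, &toy_dynticks);	/* Full barrier. */
		WARN_ON_ONCE(seq & 1);		/* Counter now even: idle. */
	}

	static void toy_eqs_exit(void)		/* Idle exit. */
	{
		int seq = atomic_add_return(1, &toy_dynticks);	/* Full barrier. */
		WARN_ON_ONCE(!(seq & 1));	/* Counter now odd: non-idle. */
	}

	static int toy_dynticks_snap(void)	/* Grace-period kthread. */
	{
		return atomic_add_return(0, &toy_dynticks);	/* Ordered read. */
	}

	static bool toy_in_eqs_since(int snap)
	{
		/* Was idle at snapshot time, or passed through idle since. */
		return !(snap & 1) || toy_dynticks_snap() != snap;
	}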
Quick Quiz:
	But what about CPUs that remain offline for the entire
	grace period?

Answer:
	Such CPUs will be offline at the beginning of the grace period,
	so the grace period won't expect quiescent states from them.
	Races between grace-period start and CPU-hotplug operations
	are mediated by the CPU's leaf rcu_node structure's
	->lock as described above.
The approach must be extended to handle one final case, that
of waking a task blocked in synchronize_rcu().
This task might be affined to a CPU that is not yet aware that
the grace period has ended, and thus might not yet be subject to
the grace period's memory ordering.
Therefore, there is an smp_mb() after the return from
wait_for_completion() in the synchronize_rcu()
code path.
Quick Quiz:
	What?  Where???
	I don't see any smp_mb() after the return from
	wait_for_completion()!!!

Answer:
	That would be because I spotted the need for that
	smp_mb() during the creation of this documentation,
	and it is therefore unlikely to hit mainline before v4.14.
	Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
	Jonathan Cameron for asking questions that sensitized me
	to the rather elaborate sequence of events that demonstrate
	the need for this memory barrier.
Tree RCU's grace-period memory-ordering guarantees rely most
heavily on the rcu_node structure's ->lock
field, so much so that it is necessary to abbreviate this pattern
in the diagrams in the next section.
For example, consider the rcu_prepare_for_idle() function
shown below, which is one of several functions that enforce ordering
of newly arrived RCU callbacks against future grace periods:
+ 1 static void rcu_prepare_for_idle(void) + 2 { + 3 bool needwake; + 4 struct rcu_data *rdp; + 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); + 6 struct rcu_node *rnp; + 7 struct rcu_state *rsp; + 8 int tne; + 9 +10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || +11 rcu_is_nocb_cpu(smp_processor_id())) +12 return; +13 tne = READ_ONCE(tick_nohz_active); +14 if (tne != rdtp->tick_nohz_enabled_snap) { +15 if (rcu_cpu_has_callbacks(NULL)) +16 invoke_rcu_core(); +17 rdtp->tick_nohz_enabled_snap = tne; +18 return; +19 } +20 if (!tne) +21 return; +22 if (rdtp->all_lazy && +23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { +24 rdtp->all_lazy = false; +25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; +26 invoke_rcu_core(); +27 return; +28 } +29 if (rdtp->last_accelerate == jiffies) +30 return; +31 rdtp->last_accelerate = jiffies; +32 for_each_rcu_flavor(rsp) { +33 rdp = this_cpu_ptr(rsp->rda); +34 if (rcu_segcblist_pend_cbs(&rdp->cblist)) +35 continue; +36 rnp = rdp->mynode; +37 raw_spin_lock_rcu_node(rnp); +38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp); +39 raw_spin_unlock_rcu_node(rnp); +40 if (needwake) +41 rcu_gp_kthread_wake(rsp); +42 } +43 } ++ +
But the only part of rcu_prepare_for_idle() that really
matters for this discussion is lines 37–39.
We will therefore abbreviate this function as follows:
[Figure: rcu_node-lock.svg]
The box represents the rcu_node structure's ->lock +critical section, with the double line on top representing the additional +smp_mb__after_unlock_lock(). + +
Tree RCU's grace-period memory-ordering guarantee is provided by
a number of RCU components:

1.	Callback Registry
2.	Grace-Period Initialization
3.	Self-Reported Quiescent States
4.	Dynamic Tick Interface
5.	CPU-Hotplug Interface
6.	Forcing Quiescent States
7.	Grace-Period Cleanup
8.	Callback Invocation

Each of the following sections looks at the corresponding component
in detail.
If RCU's grace-period guarantee is to mean anything at all, any +access that happens before a given invocation of call_rcu() +must also happen before the corresponding grace period. +The implementation of this portion of RCU's grace period guarantee +is shown in the following figure: + +
[Figure: TreeRCU-callback-registry.svg]
Because call_rcu() normally acts only on CPU-local state, +it provides no ordering guarantees, either for itself or for +phase one of the update (which again will usually be removal of +an element from an RCU-protected data structure). +It simply enqueues the rcu_head structure on a per-CPU list, +which cannot become associated with a grace period until a later +call to rcu_accelerate_cbs(), as shown in the diagram above. + +
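A hedged sketch of a typical caller, extending the earlier
hypothetical struct foo (and its foo_lock) with an embedded
rcu_head; note that call_rcu() itself does nothing but a CPU-local
enqueue:

	struct foo {
		struct list_head list;
		struct rcu_head rh;
		int data;
	};

	static void foo_reclaim(struct rcu_head *rhp)
	{
		kfree(container_of(rhp, struct foo, rh));	/* Phase two. */
	}

	void remove_foo(struct foo *fp)
	{
		spin_lock(&foo_lock);
		list_del_rcu(&fp->list);	/* Phase one: removal. */
		spin_unlock(&foo_lock);
		call_rcu(&fp->rh, foo_reclaim);	/* CPU-local enqueue only. */
	}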
One set of code paths shown on the left invokes
rcu_accelerate_cbs() via
note_gp_changes(), either directly from call_rcu() (if
the current CPU is inundated with queued rcu_head structures)
or more likely from an RCU_SOFTIRQ handler.
Another code path in the middle is taken only in kernels built with
CONFIG_RCU_FAST_NO_HZ=y, which invokes
rcu_accelerate_cbs() via rcu_prepare_for_idle().
The final code path on the right is taken only in kernels built with
CONFIG_HOTPLUG_CPU=y, which invokes
rcu_accelerate_cbs() via
rcu_advance_cbs(), rcu_migrate_callbacks(),
rcutree_migrate_callbacks(), and takedown_cpu(),
which in turn is invoked on a surviving CPU after the outgoing
CPU has been completely offlined.
There are a few other code paths within grace-period processing
that opportunistically invoke rcu_accelerate_cbs().
However, either way, all of the CPU's recently queued rcu_head
structures are associated with a future grace-period number under
the protection of the CPU's leaf rcu_node structure's
->lock.
In all cases, there is full ordering against any prior critical section
for that same rcu_node structure's ->lock, and
also full ordering against any of the current task's or CPU's prior critical
sections for any rcu_node structure's ->lock.
The next section will show how this ordering ensures that any +accesses prior to the call_rcu() (particularly including phase +one of the update) +happen before the start of the corresponding grace period. + +
Quick Quiz:
	But what about synchronize_rcu()?

Answer:
	The synchronize_rcu() function passes call_rcu()
	to wait_rcu_gp(), which invokes it.
	So either way, it eventually comes down to call_rcu().
Grace-period initialization is carried out by
the grace-period kernel thread, which makes several passes over the
rcu_node tree within the rcu_gp_init() function.
This means that showing the full flow of ordering through the
grace-period computation will require duplicating this tree.
If you find this confusing, please note that the state of the
rcu_node tree changes over time, just like Heraclitus's river.
However, to keep the rcu_node river tractable, the
grace-period kernel thread's traversals are presented in multiple
parts, starting in this section with the various phases of
grace-period initialization.
The first ordering-related grace-period initialization action is to +increment the rcu_state structure's ->gpnum +grace-period-number counter, as shown below: + +
[Figure: TreeRCU-gp-init-1.svg]
The actual increment is carried out using smp_store_release(), +which helps reject false-positive RCU CPU stall detection. +Note that only the root rcu_node structure is touched. + +
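This step of rcu_gp_init() looks roughly as follows (lightly
abridged; the release store orders the just-recorded stall-check
timestamps before the new grace-period number):

	/* Advance to a new grace period and initialize state. */
	record_gp_stall_check_time(rsp);
	/* Record GP times before starting GP, hence smp_store_release(). */
	smp_store_release(&rsp->gpnum, rsp->gpnum + 1);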
The first pass through the rcu_node tree updates bitmasks +based on CPUs having come online or gone offline since the start of +the previous grace period. +In the common case where the number of online CPUs for this rcu_node +structure has not transitioned to or from zero, +this pass will scan only the leaf rcu_node structures. +However, if the number of online CPUs for a given leaf rcu_node +structure has transitioned from zero, +rcu_init_new_rnp() will be invoked for the first incoming CPU. +Similarly, if the number of online CPUs for a given leaf rcu_node +structure has transitioned to zero, +rcu_cleanup_dead_rnp() will be invoked for the last outgoing CPU. +The diagram below shows the path of ordering if the leftmost +rcu_node structure onlines its first CPU and if the next +rcu_node structure has no online CPUs +(or, alternatively if the leftmost rcu_node structure offlines +its last CPU and if the next rcu_node structure has no online CPUs). + +
[Figure: TreeRCU-gp-init-2.svg]
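A hedged sketch of the upward propagation performed by
rcu_init_new_rnp() (details and error checking elided; the point is
that each ancestor's ->lock acquisition supplies the ordering):

	static void rcu_init_new_rnp(struct rcu_node *rnp_leaf)
	{
		unsigned long mask;
		struct rcu_node *rnp = rnp_leaf;

		for (;;) {
			mask = rnp->grpmask;
			rnp = rnp->parent;
			if (rnp == NULL)
				return;		/* Propagated to the root. */
			raw_spin_lock_rcu_node(rnp);	/* Ordering. */
			rnp->qsmaskinit |= mask;
			raw_spin_unlock_rcu_node(rnp);
		}
	}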
The final rcu_gp_init() pass through the rcu_node +tree traverses breadth-first, setting each rcu_node structure's +->gpnum field to the newly incremented value from the +rcu_state structure, as shown in the following diagram. + +
[Figure: TreeRCU-gp-init-3.svg]
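In code, this pass looks roughly as follows (abridged from
rcu_gp_init(), with rsp, rnp, and rdp as in that function):

	rcu_for_each_node_breadth_first(rsp, rnp) {
		raw_spin_lock_irq_rcu_node(rnp);	/* Ordering hub. */
		rnp->qsmask = rnp->qsmaskinit;		/* Who must report QSes. */
		WRITE_ONCE(rnp->gpnum, rsp->gpnum);	/* Publish the new GP. */
		if (rnp == rdp->mynode)			/* This CPU's own leaf? */
			(void)__note_gp_changes(rsp, rnp, rdp);
		raw_spin_unlock_irq_rcu_node(rnp);
	}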
This change will also cause each CPU's next call to +__note_gp_changes() +to notice that a new grace period has started, as described in the next +section. +But because the grace-period kthread started the grace period at the +root (with the increment of the rcu_state structure's +->gpnum field) before setting each leaf rcu_node +structure's ->gpnum field, each CPU's observation of +the start of the grace period will happen after the actual start +of the grace period. + +
Quick Quiz:
	But what about the CPU that started the grace period?
	Why wouldn't it see the start of the grace period right when
	it started that grace period?

Answer:
	In some deep philosophical and overly anthropomorphized
	sense, yes, the CPU starting the grace period is immediately
	aware of having done so.
	However, if we instead assume that RCU is not self-aware,
	then even the CPU starting the grace period does not really
	become aware of the start of this grace period until its
	first call to __note_gp_changes().
	On the other hand, this CPU potentially gets early notification
	because it invokes __note_gp_changes() during its
	last rcu_gp_init() pass through its leaf
	rcu_node structure.
When all entities that might block the grace period have reported +quiescent states (or as described in a later section, had quiescent +states reported on their behalf), the grace period can end. +Online non-idle CPUs report their own quiescent states, as shown +in the following diagram: + +
[Figure: TreeRCU-qs.svg]
This is for the last CPU to report a quiescent state, which signals +the end of the grace period. +Earlier quiescent states would push up the rcu_node tree +only until they encountered an rcu_node structure that +is waiting for additional quiescent states. +However, ordering is nevertheless preserved because some later quiescent +state will acquire that rcu_node structure's ->lock. + +
Any number of events can lead up to a CPU invoking
note_gp_changes() (or alternatively, directly invoking
__note_gp_changes()), at which point that CPU will notice
the start of a new grace period while holding its leaf
rcu_node lock.
Therefore, all execution shown in this diagram happens after the
start of the grace period.
In addition, this CPU will consider any RCU read-side critical
section that started before the invocation of __note_gp_changes()
to have started before the grace period, and thus a critical
section that the grace period must wait on.
Quick Quiz:
	But an RCU read-side critical section might have started
	after the beginning of the grace period
	(the ->gpnum++ from earlier), so why should
	the grace period wait on such a critical section?

Answer:
	It is indeed not necessary for the grace period to wait on such
	a critical section.
	However, it is permissible to wait on it.
	And it is furthermore important to wait on it, as this
	lazy approach is far more scalable than a “big bang”
	all-at-once grace-period start could possibly be.
If the CPU does a context switch, a quiescent state will be
noted by rcu_note_context_switch() on the left.
On the other hand, if the CPU takes a scheduler-clock interrupt
while executing in usermode, a quiescent state will be noted by
rcu_check_callbacks() on the right.
Either way, the passage through a quiescent state will be noted
in a per-CPU variable.
The next time an RCU_SOFTIRQ handler executes on
this CPU (for example, after the next scheduler-clock
interrupt), __rcu_process_callbacks() will invoke
rcu_check_quiescent_state(), which will notice the
recorded quiescent state, and invoke
rcu_report_qs_rdp().
If rcu_report_qs_rdp() verifies that the quiescent state
really does apply to the current grace period, it invokes
rcu_report_qs_rnp() which traverses up the rcu_node
tree as shown at the bottom of the diagram, clearing bits from
each rcu_node structure's ->qsmask field,
and propagating up the tree when the result is zero.
Note that traversal passes upwards out of a given rcu_node +structure only if the current CPU is reporting the last quiescent +state for the subtree headed by that rcu_node structure. +A key point is that if a CPU's traversal stops at a given rcu_node +structure, then there will be a later traversal by another CPU +(or perhaps the same one) that proceeds upwards +from that point, and the rcu_node ->lock +guarantees that the first CPU's quiescent state happens before the +remainder of the second CPU's traversal. +Applying this line of thought repeatedly shows that all CPUs' +quiescent states happen before the last CPU traverses through +the root rcu_node structure, the “last CPU” +being the one that clears the last bit in the root rcu_node +structure's ->qsmask field. + +
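An abridged sketch of this traversal, from rcu_report_qs_rnp()
(error checking and the blocked-readers case are elided, and mask,
flags, rnp, and rsp come from the enclosing function):

	for (;;) {
		rnp->qsmask &= ~mask;	/* Clear our bits, under ->lock. */
		if (rnp->qsmask != 0) {	/* Others still owe QSes here. */
			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
			return;		/* A later QS will continue upward. */
		}
		mask = rnp->grpmask;
		if (rnp->parent == NULL)
			break;		/* At the root: GP can end. */
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
		rnp = rnp->parent;
		raw_spin_lock_irqsave_rcu_node(rnp, flags);
	}
	rcu_report_qs_rsp(rsp, flags);	/* Wake the grace-period kthread. */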
Due to energy-efficiency considerations, RCU is forbidden from +disturbing idle CPUs. +CPUs are therefore required to notify RCU when entering or leaving idle +state, which they do via fully ordered value-returning atomic operations +on a per-CPU variable. +The ordering effects are as shown below: + +
[Figure: TreeRCU-dyntick.svg]
The RCU grace-period kernel thread samples the per-CPU idleness +variable while holding the corresponding CPU's leaf rcu_node +structure's ->lock. +This means that any RCU read-side critical sections that precede the +idle period (the oval near the top of the diagram above) will happen +before the end of the current grace period. +Similarly, the beginning of the current grace period will happen before +any RCU read-side critical sections that follow the +idle period (the oval near the bottom of the diagram above). + +
Plumbing this into the full grace-period execution is described +below. + +
RCU is also forbidden from disturbing offline CPUs, which might well +be powered off and removed from the system completely. +CPUs are therefore required to notify RCU of their comings and goings +as part of the corresponding CPU hotplug operations. +The ordering effects are shown below: + +
[Figure: TreeRCU-hotplug.svg]
Because CPU hotplug operations are much less frequent than idle transitions, +they are heavier weight, and thus acquire the CPU's leaf rcu_node +structure's ->lock and update this structure's +->qsmaskinitnext. +The RCU grace-period kernel thread samples this mask to detect CPUs +having gone offline since the beginning of this grace period. + +
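A hedged sketch of the CPU-online side (compare rcu_cpu_starting();
the offline path clears the bit in the same lock-protected manner):

	mask = rdp->grpmask;
	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	rnp->qsmaskinitnext |= mask;	/* Future GPs must wait on us. */
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);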
Plumbing this into the full grace-period execution is described +below. + +
As noted above, idle and offline CPUs cannot report their own +quiescent states, and therefore the grace-period kernel thread +must do the reporting on their behalf. +This process is called “forcing quiescent states”, it is +repeated every few jiffies, and its ordering effects are shown below: + +
[Figure: TreeRCU-gp-fqs.svg]
Each pass of quiescent state forcing is guaranteed to traverse the +leaf rcu_node structures, and if there are no new quiescent +states due to recently idled and/or offlined CPUs, then only the +leaves are traversed. +However, if there is a newly offlined CPU as illustrated on the left +or a newly idled CPU as illustrated on the right, the corresponding +quiescent state will be driven up towards the root. +As with self-reported quiescent states, the upwards driving stops +once it reaches an rcu_node structure that has quiescent +states outstanding from other CPUs. + +
Quick Quiz:
	The leftmost drive to root stopped before it reached
	the root rcu_node structure, which means that
	there are still CPUs subordinate to that structure on
	which the current grace period is waiting.
	Given that, how is it possible that the rightmost drive
	to root ended the grace period?

Answer:
	Good analysis!
	It is in fact impossible in the absence of bugs in RCU.
	But this diagram is complex enough as it is, so simplicity
	overrode accuracy.
	You can think of it as poetic license, or you can think of
	it as misdirection that is resolved in the
	stitched-together diagram.
Grace-period cleanup first scans the rcu_node tree +breadth-first setting all the ->completed fields equal +to the number of the newly completed grace period, then it sets +the rcu_state structure's ->completed field, +again to the number of the newly completed grace period. +The ordering effects are shown below: + +
[Figure: TreeRCU-gp-cleanup.svg]
As indicated by the oval at the bottom of the diagram, once +grace-period cleanup is complete, the next grace period can begin. + +
Quick Quiz:
	But when precisely does the grace period end?

Answer:
	There is no useful single point at which the grace period
	can be said to end.
	The earliest reasonable candidate is as soon as the last
	CPU has reported its quiescent state, but it may be some
	milliseconds before RCU becomes aware of this.
	The latest reasonable candidate is once the rcu_state
	structure's ->completed field has been updated,
	but it is quite possible that some CPUs have already completed
	phase two of their updates by that time.
	In short, if you are going to work with RCU, you need to
	learn to embrace uncertainty.
Once a given CPU's leaf rcu_node structure's
->completed field has been updated, that CPU can begin
invoking its RCU callbacks that were waiting for this grace period
to end.
These callbacks are identified by rcu_advance_cbs(),
which is usually invoked by __note_gp_changes().
As shown in the diagram below, this invocation can be triggered by
the scheduling-clock interrupt (rcu_check_callbacks() on
the left) or by idle entry (rcu_cleanup_after_idle() on
the right, but only for kernels built with
CONFIG_RCU_FAST_NO_HZ=y).
Either way, RCU_SOFTIRQ is raised, which results in
rcu_do_batch() invoking the callbacks, which in turn
allows those callbacks to carry out (either directly or indirectly
via wakeup) the needed phase-two processing for each update.
[Figure: TreeRCU-callback-invocation.svg]
Please note that callback invocation can also be prompted by any +number of corner-case code paths, for example, when a CPU notes that +it has excessive numbers of callbacks queued. +In all cases, the CPU acquires its leaf rcu_node structure's +->lock before invoking callbacks, which preserves the +required ordering against the newly completed grace period. + +
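An abridged sketch of the invocation loop in rcu_do_batch()
(callback counting, lazy accounting, and interrupt handling are
elided, and the real code invokes each callback via __rcu_reclaim()):

	struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl);
	struct rcu_head *rhp;

	rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl);
	rhp = rcu_cblist_dequeue(&rcl);
	for (; rhp; rhp = rcu_cblist_dequeue(&rcl)) {
		debug_rcu_head_unqueue(rhp);
		rhp->func(rhp);	/* Phase-two processing for this update. */
	}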
However, if the callback function communicates to other CPUs,
for example, doing a wakeup, then it is that function's responsibility
to maintain ordering.
For example, if the callback function wakes up a task that runs on
some other CPU, proper ordering must be in place in both the callback
function and the task being awakened.
To see why this is important, consider the top half of the
grace-period cleanup diagram.
The callback might be running on a CPU corresponding to the leftmost
leaf rcu_node structure, and awaken a task that is to run on
a CPU corresponding to the rightmost leaf rcu_node structure,
and the grace-period kernel thread might not yet have reached the
rightmost leaf.
In this case, the grace period's memory ordering might not yet have
reached that CPU, so again the callback function and the awakened
task must supply proper ordering.
A stitched-together diagram is available in TreeRCU-gp.svg.
This work represents the view of the author and does not necessarily +represent the view of IBM. + +
Linux is a registered trademark of Linus Torvalds. + +
Other company, product, and service names may be trademarks or
+service marks of others.
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg
new file mode 100644
index 000000000000..832408313d93
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-invocation.svg
@@ -0,0 +1,486 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg
new file mode 100644
index 000000000000..7ac6f9269806
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-callback-registry.svg
@@ -0,0 +1,655 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg
new file mode 100644
index 000000000000..423df00c4df9
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-dyntick.svg
@@ -0,0 +1,700 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg
new file mode 100644
index 000000000000..754f426b297a
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-cleanup.svg
@@ -0,0 +1,1126 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg
new file mode 100644
index 000000000000..7ddc094d7f28
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-fqs.svg
@@ -0,0 +1,1309 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg
new file mode 100644
index 000000000000..0161262904ec
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-1.svg
@@ -0,0 +1,656 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg
new file mode 100644
index 000000000000..4d956a732685
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-2.svg
@@ -0,0 +1,656 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg
new file mode 100644
index 000000000000..de6ecc51b00e
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp-init-3.svg
@@ -0,0 +1,632 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
new file mode 100644
index 000000000000..b13b7b01bb3a
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-gp.svg
@@ -0,0 +1,5135 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg
new file mode 100644
index 000000000000..2c9310ba29ba
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-hotplug.svg
@@ -0,0 +1,775 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
new file mode 100644
index 000000000000..de3992f4cbe1
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/TreeRCU-qs.svg
@@ -0,0 +1,1095 @@
diff --git a/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg b/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg
new file mode 100644
index 000000000000..94c96c595aed
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/rcu_node-lock.svg
@@ -0,0 +1,229 @@
diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index 96a3d81837e1..a08f928c8557 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -40,7 +40,9 @@ o Booting Linux using a console connection that is too slow to
o Anything that prevents RCU's grace-period kthreads from running.
This can result in the "All QSes seen" console-log message.
This message will include information on when the kthread last
- ran and how often it should be expected to run.
+ ran and how often it should be expected to run. It can also
+ result in the "rcu_.*kthread starved for" console-log message,
+ which will include additional debugging information.
o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
happen to preempt a low-priority task in the middle of an RCU
@@ -60,6 +62,20 @@ o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
CONFIG_PREEMPT_RCU case, you might see stall-warning
messages.
+o A periodic interrupt whose handler takes longer than the time
+ interval between successive pairs of interrupts. This can
+ prevent RCU's kthreads and softirq handlers from running.
+ Note that certain high-overhead debugging options, for example
+	the function_graph tracer, can result in the interrupt handler
+	taking considerably longer than normal, which can in turn
	result in RCU CPU stall warnings.
+ RCU CPU stall warnings.
+
+o Testing a workload on a fast system, tuning the stall-warning
+ timeout down to just barely avoid RCU CPU stall warnings, and then
+ running the same workload with the same stall-warning timeout on a
+ slow system. Note that thermal throttling and on-demand governors
+ can cause a single system to be sometimes fast and sometimes slow!
+
o A hardware or software issue shuts off the scheduler-clock
interrupt on a CPU that is not in dyntick-idle mode. This
problem really has happened, and seems to be most likely to
@@ -155,67 +171,32 @@ Interpreting RCU's CPU Stall-Detector "Splats"
For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling,
it will print a message similar to the following:
-INFO: rcu_sched_state detected stall on CPU 5 (t=2500 jiffies)
+ INFO: rcu_sched detected stalls on CPUs/tasks:
+ 2-...: (3 GPs behind) idle=06c/0/0 softirq=1453/1455 fqs=0
+ 16-...: (0 ticks this GP) idle=81c/0/0 softirq=764/764 fqs=0
+ (detected by 32, t=2603 jiffies, g=7073, c=7072, q=625)
-This message indicates that CPU 5 detected that it was causing a stall,
-and that the stall was affecting RCU-sched. This message will normally be
-followed by a stack dump of the offending CPU. On TREE_RCU kernel builds,
-RCU and RCU-sched are implemented by the same underlying mechanism,
-while on PREEMPT_RCU kernel builds, RCU is instead implemented
-by rcu_preempt_state.
-
-On the other hand, if the offending CPU fails to print out a stall-warning
-message quickly enough, some other CPU will print a message similar to
-the following:
-
-INFO: rcu_bh_state detected stalls on CPUs/tasks: { 3 5 } (detected by 2, 2502 jiffies)
-
-This message indicates that CPU 2 detected that CPUs 3 and 5 were both
-causing stalls, and that the stall was affecting RCU-bh. This message
+This message indicates that CPU 32 detected that CPUs 2 and 16 were both
+causing stalls, and that the stall was affecting RCU-sched. This message
will normally be followed by stack dumps for each CPU. Please note that
-PREEMPT_RCU builds can be stalled by tasks as well as by CPUs,
-and that the tasks will be indicated by PID, for example, "P3421".
-It is even possible for a rcu_preempt_state stall to be caused by both
-CPUs -and- tasks, in which case the offending CPUs and tasks will all
-be called out in the list.
+PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, and that
+the tasks will be indicated by PID, for example, "P3421". It is even
+possible for a rcu_preempt_state stall to be caused by both CPUs -and-
+tasks, in which case the offending CPUs and tasks will all be called
+out in the list.
-Finally, if the grace period ends just as the stall warning starts
-printing, there will be a spurious stall-warning message:
-
-INFO: rcu_bh_state detected stalls on CPUs/tasks: { } (detected by 4, 2502 jiffies)
-
-This is rare, but does happen from time to time in real life. It is also
-possible for a zero-jiffy stall to be flagged in this case, depending
-on how the stall warning and the grace-period initialization happen to
-interact. Please note that it is not possible to entirely eliminate this
-sort of false positive without resorting to things like stop_machine(),
-which is overkill for this sort of problem.
-
-Recent kernels will print a long form of the stall-warning message:
-
- INFO: rcu_preempt detected stall on CPU
- 0: (63959 ticks this GP) idle=241/3fffffffffffffff/0 softirq=82/543
- (t=65000 jiffies)
-
-In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed:
-
- INFO: rcu_preempt detected stall on CPU
- 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D
- (t=65000 jiffies)
-
-The "(64628 ticks this GP)" indicates that this CPU has taken more
-than 64,000 scheduling-clock interrupts during the current stalled
-grace period. If the CPU was not yet aware of the current grace
-period (for example, if it was offline), then this part of the message
-indicates how many grace periods behind the CPU is.
+CPU 2's "(3 GPs behind)" indicates that this CPU has not interacted with
+the RCU core for the past three grace periods. In contrast, CPU 16's "(0
+ticks this GP)" indicates that this CPU has not taken any scheduling-clock
+interrupts during the current stalled grace period.
The "idle=" portion of the message prints the dyntick-idle state.
The hex number before the first "/" is the low-order 12 bits of the
-dynticks counter, which will have an even-numbered value if the CPU is
-in dyntick-idle mode and an odd-numbered value otherwise. The hex
-number between the two "/"s is the value of the nesting, which will
-be a small positive number if in the idle loop and a very large positive
-number (as shown above) otherwise.
+dynticks counter, which will have an even-numbered value if the CPU
+is in dyntick-idle mode and an odd-numbered value otherwise. The hex
+number between the two "/"s is the value of the nesting, which will be
+a small non-negative number if in the idle loop (as shown above) and a
+very large positive number otherwise.
The "softirq=" portion of the message tracks the number of RCU softirq
handlers that the stalled CPU has executed. The number before the "/"
@@ -230,24 +211,72 @@ handlers are no longer able to execute on this CPU. This can happen if
the stalled CPU is spinning with interrupts disabled, or, in -rt
kernels, if a high-priority process is starving RCU's softirq handler.
-For CONFIG_RCU_FAST_NO_HZ kernels, the "last_accelerate:" prints the
-low-order 16 bits (in hex) of the jiffies counter when this CPU last
-invoked rcu_try_advance_all_cbs() from rcu_needs_cpu() or last invoked
-rcu_accelerate_cbs() from rcu_prepare_for_idle(). The "nonlazy_posted:"
-prints the number of non-lazy callbacks posted since the last call to
-rcu_needs_cpu(). Finally, an "L" indicates that there are currently
-no non-lazy callbacks ("." is printed otherwise, as shown above) and
-"D" indicates that dyntick-idle processing is enabled ("." is printed
-otherwise, for example, if disabled via the "nohz=" kernel boot parameter).
+The "fqs=" shows the number of force-quiescent-state idle/offline
+detection passes that the grace-period kthread has made across this
+CPU since the last time that this CPU noted the beginning of a grace
+period.
+
+The "detected by" line indicates which CPU detected the stall (in this
+case, CPU 32), how many jiffies have elapsed since the start of the
+grace period (in this case 2603), the number of the last grace period
+to start and to complete (7073 and 7072, respectively), and an estimate
+of the total number of RCU callbacks queued across all CPUs (625 in
+this case).
+
+In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed
+for each CPU:
+
+ 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D
+
+The "last_accelerate:" prints the low-order 16 bits (in hex) of the
+jiffies counter when this CPU last invoked rcu_try_advance_all_cbs()
+from rcu_needs_cpu() or last invoked rcu_accelerate_cbs() from
+rcu_prepare_for_idle(). The "nonlazy_posted:" prints the number
+of non-lazy callbacks posted since the last call to rcu_needs_cpu().
+Finally, an "L" indicates that there are currently no non-lazy callbacks
+("." is printed otherwise, as shown above) and "D" indicates that
+dyntick-idle processing is enabled ("." is printed otherwise, for example,
+if disabled via the "nohz=" kernel boot parameter).
+
+If the grace period ends just as the stall warning starts printing,
+there will be a spurious stall-warning message, which will include
+the following:
+
+ INFO: Stall ended before state dump start
+
+This is rare, but does happen from time to time in real life. It is also
+possible for a zero-jiffy stall to be flagged in this case, depending
+on how the stall warning and the grace-period initialization happen to
+interact. Please note that it is not possible to entirely eliminate this
+sort of false positive without resorting to things like stop_machine(),
+which is overkill for this sort of problem.
+
+If all CPUs and tasks have passed through quiescent states, but the
+grace period has nevertheless failed to end, the stall-warning splat
+will include something like the following:
+
+ All QSes seen, last rcu_preempt kthread activity 23807 (4297905177-4297881370), jiffies_till_next_fqs=3, root ->qsmask 0x0
+
+The "23807" indicates that it has been more than 23 thousand jiffies
+since the grace-period kthread ran. The "jiffies_till_next_fqs"
+indicates how frequently that kthread should run, giving the number
+of jiffies between force-quiescent-state scans, in this case three,
+which is way less than 23807. Finally, the root rcu_node structure's
+->qsmask field is printed, which will normally be zero.
If the relevant grace-period kthread has been unable to run prior to
-the stall warning, the following additional line is printed:
+the stall warning, as was the case in the "All QSes seen" line above,
+the following additional line is printed:
- rcu_preempt kthread starved for 2023 jiffies!
+ kthread starved for 23807 jiffies! g7073 c7072 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
-Starving the grace-period kthreads of CPU time can of course result in
-RCU CPU stall warnings even when all CPUs and tasks have passed through
-the required quiescent states.
+Starving the grace-period kthreads of CPU time can of course result
+in RCU CPU stall warnings even when all CPUs and tasks have passed
+through the required quiescent states. The "g" and "c" numbers flag the
+number of the last grace period started and completed, respectively,
+the "f" precedes the ->gp_flags command to the grace-period kthread,
+the "RCU_GP_WAIT_FQS" indicates that the kthread is waiting for a short
+timeout, and the "state" precedes the value of the task_struct ->state field.
Multiple Warnings From One Stall
@@ -264,13 +293,28 @@ Stall Warnings for Expedited Grace Periods
If an expedited grace period detects a stall, it will place a message
like the following in dmesg:
- INFO: rcu_sched detected expedited stalls on CPUs: { 1 2 6 } 26009 jiffies s: 1043
+ INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 7-... } 21119 jiffies s: 73 root: 0x2/.
-This indicates that CPUs 1, 2, and 6 have failed to respond to a
-reschedule IPI, that the expedited grace period has been going on for
-26,009 jiffies, and that the expedited grace-period sequence counter is
-1043. The fact that this last value is odd indicates that an expedited
-grace period is in flight.
+This indicates that CPU 7 has failed to respond to a reschedule IPI.
+The three periods (".") following the CPU number indicate that the CPU
+is online (otherwise the first period would instead have been "O"),
+that the CPU was online at the beginning of the expedited grace period
+(otherwise the second period would have instead been "o"), and that
+the CPU has been online at least once since boot (otherwise, the third
+period would instead have been "N"). The number before the "jiffies"
+indicates that the expedited grace period has been going on for 21,119
+jiffies. The number following the "s:" indicates that the expedited
+grace-period sequence counter is 73. The fact that this last value is
+odd indicates that an expedited grace period is in flight. The number
+following "root:" is a bitmask that indicates which children of the root
+rcu_node structure correspond to CPUs and/or tasks that are blocking the
+current expedited grace period. If the tree had more than one level,
+additional hex numbers would be printed for the states of the other
+rcu_node structures in the tree.
+
+As with normal grace periods, PREEMPT_RCU builds can be stalled by
+tasks as well as by CPUs, with the tasks being indicated by PID,
+for example, "P3421".
It is entirely possible to see stall warnings from normal and from
-expedited grace periods at about the same time from the same run.
+expedited grace periods at about the same time during the same run.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05496622b4ef..bc94c47085a5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3539,6 +3539,9 @@
rcutorture.stall_cpu_holdoff= [KNL]
Time to wait (s) after boot before inducing stall.
+ rcutorture.stall_cpu_irqsoff= [KNL]
+ Disable interrupts while stalling if set.
+
rcutorture.stat_interval= [KNL]
Time (s) between statistics printk()s.
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index b759a60624fd..519940ec767f 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -53,7 +53,7 @@ CONTENTS
- SMP barrier pairing.
- Examples of memory barrier sequences.
- Read memory barriers vs load speculation.
- - Transitivity
+ - Multicopy atomicity.
(*) Explicit kernel barriers.
@@ -383,8 +383,8 @@ Memory barriers come in four basic varieties:
to have any effect on loads.
A CPU can be viewed as committing a sequence of store operations to the
- memory system as time progresses. All stores before a write barrier will
- occur in the sequence _before_ all the stores after the write barrier.
+ memory system as time progresses. All stores _before_ a write barrier
+ will occur _before_ all the stores after the write barrier.
[!] Note that write barriers should normally be paired with read or data
dependency barriers; see the "SMP barrier pairing" subsection.
@@ -635,6 +635,11 @@ can be used to record rare error conditions and the like, and the CPUs'
naturally occurring ordering prevents such records from being lost.
+Note well that the ordering provided by a data dependency is local to
+the CPU containing it. See the section on "Multicopy atomicity" for
+more information.
+
+
The data dependency barrier is very important to the RCU system,
for example. See rcu_assign_pointer() and rcu_dereference() in
include/linux/rcupdate.h. This permits the current target of an RCU'd
@@ -851,38 +856,11 @@ In short, control dependencies apply only to the stores in the then-clause
and else-clause of the if-statement in question (including functions
invoked by those two clauses), not to code following that if-statement.
-Finally, control dependencies do -not- provide transitivity. This is
-demonstrated by two related examples, with the initial values of
-'x' and 'y' both being zero:
- CPU 0 CPU 1
- ======================= =======================
- r1 = READ_ONCE(x); r2 = READ_ONCE(y);
- if (r1 > 0) if (r2 > 0)
- WRITE_ONCE(y, 1); WRITE_ONCE(x, 1);
+Note well that the ordering provided by a control dependency is local
+to the CPU containing it. See the section on "Multicopy atomicity"
+for more information.
- assert(!(r1 == 1 && r2 == 1));
-
-The above two-CPU example will never trigger the assert(). However,
-if control dependencies guaranteed transitivity (which they do not),
-then adding the following CPU would guarantee a related assertion:
-
- CPU 2
- =====================
- WRITE_ONCE(x, 2);
-
- assert(!(r1 == 2 && r2 == 1 && x == 2)); /* FAILS!!! */
-
-But because control dependencies do -not- provide transitivity, the above
-assertion can fail after the combined three-CPU example completes. If you
-need the three-CPU example to provide ordering, you will need smp_mb()
-between the loads and stores in the CPU 0 and CPU 1 code fragments,
-that is, just before or just after the "if" statements. Furthermore,
-the original two-CPU example is very fragile and should be avoided.
-
-These two examples are the LB and WWC litmus tests from this paper:
-http://www.cl.cam.ac.uk/users/pes20/ppc-supplemental/test6.pdf and this
-site: https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html.
In summary:
@@ -922,8 +900,8 @@ In summary:
(*) Control dependencies pair normally with other types of barriers.
- (*) Control dependencies do -not- provide transitivity. If you
- need transitivity, use smp_mb().
+ (*) Control dependencies do -not- provide multicopy atomicity. If you
+ need all the CPUs to see a given store at the same time, use smp_mb().
(*) Compilers do not understand control dependencies. It is therefore
your job to ensure that they do not break your code.
@@ -936,13 +914,14 @@ When dealing with CPU-CPU interactions, certain types of memory barrier should
always be paired. A lack of appropriate pairing is almost certainly an error.
General barriers pair with each other, though they also pair with most
-other types of barriers, albeit without transitivity. An acquire barrier
-pairs with a release barrier, but both may also pair with other barriers,
-including of course general barriers. A write barrier pairs with a data
-dependency barrier, a control dependency, an acquire barrier, a release
-barrier, a read barrier, or a general barrier. Similarly a read barrier,
-control dependency, or a data dependency barrier pairs with a write
-barrier, an acquire barrier, a release barrier, or a general barrier:
+other types of barriers, albeit without multicopy atomicity. An acquire
+barrier pairs with a release barrier, but both may also pair with other
+barriers, including of course general barriers. A write barrier pairs
+with a data dependency barrier, a control dependency, an acquire barrier,
+a release barrier, a read barrier, or a general barrier. Similarly a
+read barrier, control dependency, or a data dependency barrier pairs
+with a write barrier, an acquire barrier, a release barrier, or a
+general barrier:
CPU 1 CPU 2
=============== ===============
@@ -968,7 +947,7 @@ Or even:
=============== ===============================
r1 = READ_ONCE(y);