189 Commits

Author SHA1 Message Date
Paul E. McKenney
3f95aa81d2 rcu: Make TASKS_RCU handle tasks that are almost done exiting
Once a task has passed exit_notify() in the do_exit() code path, it
is no longer on the task lists, and is therefore no longer visible
to rcu_tasks_kthread().  This means that an almost-exited task might
be preempted while within a trampoline, and this task won't be waited
on by rcu_tasks_kthread().  This commit fixes this bug by adding an
srcu_struct.  An exiting task does srcu_read_lock() just before calling
exit_notify(), and does the corresponding srcu_read_unlock() after
doing the final preempt_disable().  This means that rcu_tasks_kthread()
can do synchronize_srcu() to wait for all mostly-exited tasks to reach
their final preempt_disable() region, and then use synchronize_sched()
to wait for those tasks to finish exiting.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:22 -07:00
Paul E. McKenney
53c6d4edf8 rcu: Add synchronous grace-period waiting for RCU-tasks
It turns out to be easier to add the synchronous grace-period waiting
functions to RCU-tasks than to work around their absense in rcutorture,
so this commit adds them.  The key point is that the existence of
call_rcu_tasks() means that rcutorture needs an rcu_barrier_tasks().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:21 -07:00
Paul E. McKenney
bde6c3aa99 rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks.  In some cases, this requires
instrumenting cond_resched().  However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks.  Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:20 -07:00
Paul E. McKenney
8315f42295 rcu: Add call_rcu_tasks()
This commit adds a new RCU-tasks flavor of RCU, which provides
call_rcu_tasks().  This RCU flavor's quiescent states are voluntary
context switch (not preemption!) and userspace execution (not the idle
loop -- use some sort of schedule_on_each_cpu() if you need to handle the
idle tasks.  Note that unlike other RCU flavors, these quiescent states
occur in tasks, not necessarily CPUs.  Includes fixes from Steven Rostedt.

This RCU flavor is assumed to have very infrequent latency-tolerant
updaters.  This assumption permits significant simplifications, including
a single global callback list protected by a single global lock, along
with a single task-private linked list containing all tasks that have not
yet passed through a quiescent state.  If experience shows this assumption
to be incorrect, the required additional complexity will be added.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:19 -07:00
Paul E. McKenney
38706bc5a2 rcutorture: Add callback-flood test
Although RCU is designed to handle arbitrary floods of callbacks, this
capability is not routinely tested.   This commit therefore adds a
cbflood capability in which kthreads repeatedly registers large numbers
of callbacks.  One such kthread is created for each four CPUs (rounding
up), and the test may be controlled by several cbflood_* kernel boot
parameters, which control the number of bursts per flood, the number
of callbacks per burst, the time between bursts, and the time between
floods.  The default values are large enough to exercise RCU's emergency
responses to callback flooding.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Miller <davem@davemloft.net>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-09-07 16:24:48 -07:00
Joe Perches
eea203fea3 rcu: Use pr_alert/pr_cont for printing logs
User pr_alert/pr_cont for printing the logs from rcutorture module directly
instead of writing it to a buffer and then printing it. This allows us from not
having to allocate such buffers. Also remove a resulting empty function.

I tested this using the parse-torture.sh script as follows:

$ dmesg | grep torture > log.txt
$ bash parse-torture.sh log.txt test
$

There were no warnings which means that parsing went fine.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:24:47 -07:00
Pranith Kumar
58ade2dbe9 rcutorture: Fix a sparse warning by marking boost_mutex static
This commit fixes the following sparse warning by marking boost_mutex
static:

kernel/rcu/rcutorture.c:185:1: warning: symbol 'boost_mutex' was not declared. Should it be static?

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-09-07 16:24:17 -07:00
Paul E. McKenney
73a860cd58 rcu: Replace flush_signals() with WARN_ON(signal_pending())
Currently, when RCU awakens from a wait_event_interruptible() that
might have awakened prematurely, it does a flush_signals(). This is
done on the off-chance that someone figured out how to deliver a signal
to a kthread, which is supposed to be impossible.  Given that this
is supposed to be impossible, this commit changes the flush_signals()
calls into WARN_ON(signal_pending()).

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:20 -07:00
Pranith Kumar
2aa792e6fa rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreads
The rcu_gp_kthread_wake() function checks for three conditions before
waking up grace period kthreads:

*  Is the thread we are trying to wake up the current thread?
*  Are the gp_flags zero? (all threads wait on non-zero gp_flags condition)
*  Is there no thread created for this flavour, hence nothing to wake up?

If any one of these condition is true, we do not call wake_up().
It was found that there are quite a few avoidable wake ups both during
idle time and under stress induced by rcutorture.

Idle:

Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0
Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0

rcutorture:

Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0
Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0

Here case{1-3} are the cases listed above. We can avoid these wake
ups by using rcu_gp_kthread_wake() to conditionally wake up the grace
period kthreads.

There is a comment about an implied barrier supplied by the wake_up()
logic.  This barrier is necessary for the awakened thread to see the
updated ->gp_flags.  This flag is always being updated with the root node
lock held. Also, the awakened thread tries to acquire the root node lock
before reading ->gp_flags because of which there is proper ordering.

Hence this commit tries to avoid calling wake_up() whenever we can by
using rcu_gp_kthread_wake() function.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:19 -07:00
Paul E. McKenney
ade9862470 rcu: Make TINY_RCU tinier by putting error checks under #ifdef
The rcu_idle_enter_common() and rcu_idle_exit_common() functions contain
error checks that have to the best of my knowledge have never triggered
over the past several years.  These are nevertheless valuable when
creating new architectures or doing other low-level changes, so the
checks should not be deleted.  This commit instead places these checks
under #ifdef CONFIG_RCU_TRACE so that they are executed only when
specifically requested.

The savings are significant:

	Before:

	   text    data     bss     dec     hex filename
	   1749      39       0    1788     6fc /tmp/b/kernel/rcu/tiny.o
	    632     152       0     784     310 /tmp/b/kernel/rcu/update.o
				   ----
				   2572

	After:

	   text    data     bss     dec     hex filename
	   1281      37       0    1318     526 /tmp/b/kernel/rcu/tiny.o
	    632     152       0     784     310 /tmp/b/kernel/rcu/update.o
				   ----
				   2102

This amounts to 470 bytes, or 18% of the original.

Switched from #ifdef to IS_ENABLED() on Josh Triplett's advice.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-09-07 16:18:18 -07:00
Paul E. McKenney
9fdd3bc900 rcu: Break more call_rcu() deadlock involving scheduler and perf
Commit 96d3fd0d315a9 (rcu: Break call_rcu() deadlock involving scheduler
and perf) covered the case where __call_rcu_nocb_enqueue() needs to wake
the rcuo kthread due to the queue being initially empty, but did not
do anything for the case where the queue was overflowing.  This commit
therefore also defers wakeup for the overflow case.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:17 -07:00
Pranith Kumar
66d701ea7e rcu: Remove stale comment in tree.c
This commit removes a stale comment in rcu/tree.c which was left
out when some code was moved around previously in commit 2036d94a7b61
("rcu:  Rework detection of use of RCU by offline CPUs") For reference,
the following updated comment exists a few lines below this which means
the same:

/* Remove the outgoing CPU from the masks in the rcu_node hierarchy. */

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:16 -07:00
Pranith Kumar
fafb6e843f rcu: Update tiny.c references to tree.c
This commit updates the references to rcutree.c which is now rcu/tree.c

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:15 -07:00
Ard Biesheuvel
a8a29b3b7b rcu: Define tracepoint strings only if CONFIG_TRACING is set
Commit f7f7bac9cb1c ("rcu: Have the RCU tracepoints use the tracepoint_string
infrastructure") unconditionally populates the __tracepoint_str input section,
but this section is not assigned an output section if CONFIG_TRACING is not set.
This results in the __tracepoint_str turning up in unexpected places, i.e.,
after _edata.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:14 -07:00
Oleg Nesterov
85b39d305b rcu: Uninline rcu_read_lock_held()
This commit uninlines rcu_read_lock_held(). According to "size vmlinux"
this saves 28549 in .text:

	- 5541731 3014560 14757888 23314179
	+ 5513182 3026848 14757888 23297918

Note: it looks as if the data grows by 12288 bytes but this is not true,
it does not actually grow. But .data starts with ALIGN(THREAD_SIZE) and
since .text shrinks the padding grows, and thus .data grows too as it
seen by /bin/size. diff System.map:

	- ffffffff81510000 D _sdata
	- ffffffff81510000 D init_thread_union
	+ ffffffff81509000 D _sdata
	+ ffffffff8150c000 D init_thread_union

Perhaps we can change vmlinux.lds.S to .data itself, so that /bin/size
can't "wrongly" report that .data grows if .text shinks.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:13 -07:00
Pranith Kumar
e02b2edfa1 rcu: Use true/false instead of 1/0 for a bool type
This commit uses true/false instead of 1/0 for bool types in rcu_gp_fqs()
and force_qs_rnp().

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:12 -07:00
Pranith Kumar
d0bc90fd37 rcu: Return bool type for rcu_try_advance_all_cbs()
Return a bool type instead of 0 in rcu_try_advance_all_cbs().

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:10 -07:00
Pranith Kumar
f534ed1fd7 rcu: Use bool type for return value in rcu_is_watching()
Use a bool type for return in rcu_is_watching().

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:09 -07:00
Pranith Kumar
bf33eb1aef rcu: Fix sparse warning about rcu_batches_completed_preempt() being non-static
fix sparse warning about rcu_batches_completed_preempt() being non-static by
marking it as static

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:08 -07:00
Pranith Kumar
4de376a1b1 rcu: Remove remaining read-modify-write ACCESS_ONCE() calls
Change the remaining uses of ACCESS_ONCE() so that each ACCESS_ONCE() either does a load or a store, but not both.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:18:07 -07:00
Pranith Kumar
11ed7f934c rcu: Make nocb leader kthreads process pending callbacks after spawning
The nocb callbacks generated before the nocb kthreads are spawned are
enqueued in the nocb queue for later processing. Commit fbce7497ee5af ("rcu:
Parallelize and economize NOCB kthread wakeups") introduced nocb leader kthreads
which checked the nocb_leader_wake flag to see if there were any such pending
callbacks. A case was reported in which newly spawned leader kthreads were not
processing the pending callbacks as this flag was not set, which led to a boot
hang.

The following commit ensures that the newly spawned nocb kthreads process the
pending callbacks by allowing the kthreads to run immediately after spawning
instead of waiting. This is done by inverting the logic of nocb_leader_wake
tests to nocb_leader_sleep which allows us to use the default initialization of
this flag to 0 to let the kthreads run.

Reported-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Link: http://www.spinics.net/lists/kernel/msg1802899.html
[ paulmck: Backported to v3.17-rc2. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Amit Shah <amit.shah@redhat.com>
2014-08-28 05:59:59 -07:00
Linus Torvalds
5bda4f638f Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molar:
 "The main changes:

   - torture-test updates
   - callback-offloading changes
   - maintainership changes
   - update RCU documentation
   - miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing
  rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp()
  rcu: Fix a sparse warning in rcu_initiate_boost()
  rcu: Fix __rcu_reclaim() to use true/false for bool
  rcu: Remove CONFIG_PROVE_RCU_DELAY
  rcu: Use __this_cpu_read() instead of per_cpu_ptr()
  rcu: Don't use NMIs to dump other CPUs' stacks
  rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs
  rcu: Simplify priority boosting by putting rt_mutex in rcu_node
  rcu: Check both root and current rcu_node when setting up future grace period
  rcu: Allow post-unlock reference for rt_mutex
  rcu: Loosen __call_rcu()'s rcu_head alignment constraint
  rcu: Eliminate read-modify-write ACCESS_ONCE() calls
  rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu
  rcu: Make rcu node arrays static const char * const
  signal: Explain local_irq_save() call
  rcu: Handle obsolete references to TINY_PREEMPT_RCU
  rcu: Document deadlock-avoidance information for rcu_read_unlock()
  scripts: Teach get_maintainer.pl about the new "R:" tag
  rcu: Update rcu torture maintainership filename patterns
  ...
2014-08-04 15:55:08 -07:00
Josh Triplett
e0198b290d Josh has moved
My IBM email addresses haven't worked for years; also map some
old-but-functional forwarding addresses to my canonical address.

Update my GPG key fingerprint; I moved to 4096R a long time ago.

Update description.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-30 17:16:13 -07:00
Paul E. McKenney
187497fa5e rcu: Allow for NULL tick_nohz_full_mask when nohz_full= missing
If there isn't a nohz_full= kernel parameter specified, then
tick_nohz_full_mask can legitimately be NULL.  This can cause
problems when RCU's boot code tries to cpumask_or() this value into
rcu_nocb_mask.  In addition, if NO_HZ_FULL_ALL=y, there is no point
in doing the cpumask_or() in the first place because this will cause
RCU_NOCB_CPU_ALL=y, which in turn will have all bits already set in
rcu_nocb_mask.

This commit therefore avoids the cpumask_or() if NO_HZ_FULL_ALL=y
and checks for !tick_nohz_full_running otherwise, this latter check
catching cases when there was no nohz_full= kernel parameter specified.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-16 10:44:46 -07:00
Paul E. McKenney
1823172ab5 Merge branches 'doc.2014.07.08a', 'fixes.2014.07.09a', 'maintainers.2014.07.08b', 'nocbs.2014.07.07a' and 'torture.2014.07.07a' into HEAD
doc.2014.07.08a: Documentation updates.
fixes.2014.07.09a: Miscellaneous fixes.
maintainers.2014.07.08b: Maintainership updates.
nocbs.2014.07.07a: Callback-offloading fixes.
torture.2014.07.07a: Torture-test updates.
2014-07-09 09:16:54 -07:00
Pranith Kumar
b41d1b924d rcu: Fix a sparse warning in rcu_report_unblock_qs_rnp()
This commit annotates rcu_report_unblock_qs_rnp() in order to fix the
following sparse warning:

kernel/rcu/tree_plugin.h:990:13: warning: context imbalance in 'rcu_report_unblock_qs_rnp' - unexpected unlock

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:15:51 -07:00
Pranith Kumar
615e41c605 rcu: Fix a sparse warning in rcu_initiate_boost()
This commit annotates rcu_initiate_boost() fixes the following sparse
warning:

	kernel/rcu/tree_plugin.h:1494:13: warning: context imbalance in 'rcu_initiate_boost' - unexpected unlock

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:15:45 -07:00
Paul E. McKenney
406e3e5365 rcu: Fix __rcu_reclaim() to use true/false for bool
The __rcu_reclaim() function returned 0/1, which is not proper for a
function of type bool.  This commit therefore converts to false/true.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:15:32 -07:00
Paul E. McKenney
11992c703a rcu: Remove CONFIG_PROVE_RCU_DELAY
The CONFIG_PROVE_RCU_DELAY Kconfig parameter doesn't appear to be very
effective at finding race conditions, so this commit removes it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
[ paulmck: Remove definition and uses as noted by Paul Bolle. ]
2014-07-09 09:15:31 -07:00
Shan Wei
d860d40327 rcu: Use __this_cpu_read() instead of per_cpu_ptr()
The __this_cpu_read() function produces better code than does
per_cpu_ptr() on both ARM and x86.  For example, gcc (Ubuntu/Linaro
4.7.3-12ubuntu1) 4.7.3 produces the following:

ARMv7 per_cpu_ptr():

force_quiescent_state:
    mov    r3, sp    @,
    bic    r1, r3, #8128    @ tmp171,,
    ldr    r2, .L98    @ tmp169,
    bic    r1, r1, #63    @ tmp170, tmp171,
    ldr    r3, [r0, #220]    @ __ptr, rsp_6(D)->rda
    ldr    r1, [r1, #20]    @ D.35903_68->cpu, D.35903_68->cpu
    mov    r6, r0    @ rsp, rsp
    ldr    r2, [r2, r1, asl #2]    @ tmp173, __per_cpu_offset
    add    r3, r3, r2    @ tmp175, __ptr, tmp173
    ldr    r5, [r3, #12]    @ rnp_old, D.29162_13->mynode

ARMv7 __this_cpu_read():

force_quiescent_state:
    ldr    r3, [r0, #220]    @ rsp_7(D)->rda, rsp_7(D)->rda
    mov    r6, r0    @ rsp, rsp
    add    r3, r3, #12    @ __ptr, rsp_7(D)->rda,
    ldr    r5, [r2, r3]    @ rnp_old, *D.29176_13

Using gcc 4.8.2:

x86_64 per_cpu_ptr():

    movl %gs:cpu_number,%edx    # cpu_number, pscr_ret__
    movslq    %edx, %rdx    # pscr_ret__, pscr_ret__
    movq    __per_cpu_offset(,%rdx,8), %rdx    # __per_cpu_offset, tmp93
    movq    %rdi, %r13    # rsp, rsp
    movq    1000(%rdi), %rax    # rsp_9(D)->rda, __ptr
    movq    24(%rdx,%rax), %r12    # _15->mynode, rnp_old

x86_64 __this_cpu_read():

    movq    %rdi, %r13    # rsp, rsp
    movq    1000(%rdi), %rax    # rsp_9(D)->rda, rsp_9(D)->rda
    movq %gs:24(%rax),%r12    # _10->mynode, rnp_old

Because this change produces significant benefits for these two very
diverse architectures, this commit makes this change.

Signed-off-by: Shan Wei <davidshan@tencent.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:15:21 -07:00
Paul E. McKenney
bc1dce514e rcu: Don't use NMIs to dump other CPUs' stacks
Although NMI-based stack dumps are in principle more accurate, they are
also more likely to trigger deadlocks.  This commit therefore replaces
all uses of trigger_all_cpu_backtrace() with rcu_dump_cpu_stacks(), so
that the CPU detecting an RCU CPU stall does the stack dumping.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:15:04 -07:00
Paul E. McKenney
c0f489d2c6 rcu: Bind grace-period kthreads to non-NO_HZ_FULL CPUs
Binding the grace-period kthreads to the timekeeping CPU resulted in
significant performance decreases for some workloads.  For more detail,
see:

https://lkml.org/lkml/2014/6/3/395 for benchmark numbers

https://lkml.org/lkml/2014/6/4/218 for CPU statistics

It turns out that it is necessary to bind the grace-period kthreads
to the timekeeping CPU only when all but CPU 0 is a nohz_full CPU
on the one hand or if CONFIG_NO_HZ_FULL_SYSIDLE=y on the other.
In other cases, it suffices to bind the grace-period kthreads to the
set of non-nohz_full CPUs.

This commit therefore creates a tick_nohz_not_full_mask that is the
complement of tick_nohz_full_mask, and then binds the grace-period
kthread to the set of CPUs indicated by this new mask, which covers
the CONFIG_NO_HZ_FULL_SYSIDLE=n case.  The CONFIG_NO_HZ_FULL_SYSIDLE=y
case still binds the grace-period kthreads to the timekeeping CPU.
This commit also includes the tick_nohz_full_enabled() check suggested
by Frederic Weisbecker.

Reported-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Created housekeeping_affine() and housekeeping_mask per
  fweisbec feedback. ]
2014-07-09 09:15:02 -07:00
Paul E. McKenney
abaa93d9e1 rcu: Simplify priority boosting by putting rt_mutex in rcu_node
RCU priority boosting currently checks for boosting via a pointer in
task_struct.  However, this is not needed: As Oleg noted, if the
rt_mutex is placed in the rcu_node instead of on the booster's stack,
the boostee can simply check it see if it owns the lock.  This commit
makes this change, shrinking task_struct by one pointer and the kernel
by thirteen lines.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-09 09:15:01 -07:00
Pranith Kumar
48bd8e9b82 rcu: Check both root and current rcu_node when setting up future grace period
The rcu_start_future_gp() function checks the current rcu_node's ->gpnum
and ->completed twice, once without ACCESS_ONCE() and once with it.
Which is pointless because we hold that rcu_node's ->lock at that point.
The intent was to check the current rcu_node structure and the root
rcu_node structure, the latter locklessly with ACCESS_ONCE().  This
commit therefore makes that change.

The reason that it is safe to locklessly check the root rcu_nodes's
->gpnum and ->completed fields is that we hold the current rcu_node's
->lock, which constrains the root rcu_node's ability to change its
->gpnum and ->completed fields.  Of course, if there is a single rcu_node
structure, then rnp_root==rnp, and holding the lock prevents all changes.
If there is more than one rcu_node structure, then the code updates the
fields in the following order:

1.	Increment rnp_root->gpnum to start new grace period.
2.	Increment rnp->gpnum to initialize the current rcu_node,
	continuing initialization for the new grace period.
3.	Increment rnp_root->completed to end the current grace period.
4.	Increment rnp->completed to continue cleaning up after the
	old grace period.

So there are four possible combinations of relative values of these
four fields:

N   N   N   N:  RCU idle, new grace period must be initiated.
		Although rnp_root->gpnum might be incremented immediately
		after we check, that will just result in unnecessary work.
		The grace period already started, and we try to start it.

N+1 N   N   N:  RCU grace period just started.  No further change is
		possible because we hold rnp->lock, so the checks of
		rnp_root->gpnum and rnp_root->completed are stable.
		We know that our request for a future grace period will
		be seen during grace-period cleanup.

N+1 N   N+1 N:  RCU grace period is ongoing.  Because rnp->gpnum is
		different than rnp->completed, we won't even look at
		rnp_root->gpnum and rnp_root->completed, so the possible
		concurrent change to rnp_root->completed does not matter.
		We know that our request for a future grace period will
		be seen during grace-period cleanup, which cannot pass
		this rcu_node because we hold its ->lock.

N+1 N+1 N+1 N:  RCU grace period has ended, but not yet been cleaned up.
		Because rnp->gpnum is different than rnp->completed, we
		won't look at rnp_root->gpnum and rnp_root->completed, so
		the possible concurrent change to rnp_root->completed does
		not matter.  We know that our request for a future grace
		period will be seen during grace-period cleanup, which
		cannot pass this rcu_node because we hold its ->lock.

Therefore, despite initial appearances, the lockless check is safe.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
[ paulmck: Update comment to say why the lockless check is safe. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-09 09:15:01 -07:00
Paul E. McKenney
dfeb9765ce rcu: Allow post-unlock reference for rt_mutex
The current approach to RCU priority boosting uses an rt_mutex strictly
for its priority-boosting side effects.  The rt_mutex_init_proxy_locked()
function is used by the booster to initialize the lock as held by the
boostee.  The booster then uses rt_mutex_lock() to acquire this rt_mutex,
which priority-boosts the boostee.  When the boostee reaches the end
of its outermost RCU read-side critical section, it checks a field in
its task structure to see whether it has been boosted, and, if so, uses
rt_mutex_unlock() to release the rt_mutex.  The booster can then go on
to boost the next task that is blocking the current RCU grace period.

But reasonable implementations of rt_mutex_unlock() might result in the
boostee referencing the rt_mutex's data after releasing it.  But the
booster might have re-initialized the rt_mutex between the time that the
boostee released it and the time that it later referenced it.  This is
clearly asking for trouble, so this commit introduces a completion that
forces the booster to wait until the boostee has completely finished with
the rt_mutex, thus avoiding the case where the booster is re-initializing
the rt_mutex before the last boostee's last reference to that rt_mutex.

This of course does introduce some overhead, but the priority-boosting
code paths are miles from any possible fastpath, and the overhead of
executing the completion will normally be quite small compared to the
overhead of priority boosting and deboosting, so this should be OK.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-09 09:15:00 -07:00
Paul E. McKenney
1146edcbef rcu: Loosen __call_rcu()'s rcu_head alignment constraint
The m68k architecture aligns only to 16-bit boundaries, which can cause
the align-to-32-bits check in __call_rcu() to trigger.  Because there is
currently no known potential need for more than one low-order bit, this
commit loosens the check to 16-bit boundaries.

Reported-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:14:50 -07:00
Paul E. McKenney
a792563bd4 rcu: Eliminate read-modify-write ACCESS_ONCE() calls
RCU contains code of the following forms:

	ACCESS_ONCE(x)++;
	ACCESS_ONCE(x) += y;
	ACCESS_ONCE(x) -= y;

Now these constructs do operate correctly, but they really result in a
pair of volatile accesses, one to do the load and another to do the store.
This can be confusing, as the casual reader might well assume that (for
example) gcc might generate a memory-to-memory add instruction for each
of these three cases.  In fact, gcc will do no such thing.  Also, there
is a good chance that the kernel will move to separate load and store
variants of ACCESS_ONCE(), and constructs like the above could easily
confuse both people and scripts attempting to make that sort of change.
Finally, most of RCU's read-modify-write uses of ACCESS_ONCE() really
only need the store to be volatile, so that the read-modify-write form
might be misleading.

This commit therefore changes the above forms in RCU so that each instance
of ACCESS_ONCE() either does a load or a store, but not both.  In a few
cases, ACCESS_ONCE() was not critical, for example, for maintaining
statisitics.  In these cases, ACCESS_ONCE() has been dispensed with
entirely.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-09 09:14:49 -07:00
Paul E. McKenney
4da117cfa7 rcu: Remove redundant ACCESS_ONCE() from tick_do_timer_cpu
In kernels built with CONFIG_NO_HZ_FULL, tick_do_timer_cpu is constant
once boot completes.  Thus, there is no need to wrap it in ACCESS_ONCE()
in code that is built only when CONFIG_NO_HZ_FULL.  This commit therefore
removes the redundant ACCESS_ONCE().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-07-09 09:14:35 -07:00
Fabian Frederick
b4426b49c6 rcu: Make rcu node arrays static const char * const
Those two arrays are being passed to lockdep_init_map(), which expects
const char *, and are stored in lockdep_map the same way.

Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-09 09:14:34 -07:00
Paul E. McKenney
b58cc46c5f rcu: Don't offload callbacks unless specifically requested
Enabling NO_HZ_FULL currently has the side effect of enabling callback
offloading on all CPUs.  This results in lots of additional rcuo kthreads,
and can also increase context switching and wakeups, even in cases where
callback offloading is neither needed nor particularly desirable.  This
commit therefore enables callback offloading on a given CPU only if
specifically requested at build time or boot time, or if that CPU has
been specifically designated (again, either at build time or boot time)
as a nohz_full CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-07 15:13:44 -07:00
Paul E. McKenney
fbce7497ee rcu: Parallelize and economize NOCB kthread wakeups
An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers.  By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders.  In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.

Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-07-07 15:13:44 -07:00
Paul E. McKenney
4a81e8328d rcu: Reduce overhead of cond_resched() checks for RCU
Commit ac1bea85781e (Make cond_resched() report RCU quiescent states)
fixed a problem where a CPU looping in the kernel with but one runnable
task would give RCU CPU stall warnings, even if the in-kernel loop
contained cond_resched() calls.  Unfortunately, in so doing, it introduced
performance regressions in Anton Blanchard's will-it-scale "open1" test.
The problem appears to be not so much the increased cond_resched() path
length as an increase in the rate at which grace periods complete, which
increased per-update grace-period overhead.

This commit takes a different approach to fixing this bug, mainly by
moving the RCU-visible quiescent state from cond_resched() to
rcu_note_context_switch(), and by further reducing the check to a
simple non-zero test of a single per-CPU variable.  However, this
approach requires that the force-quiescent-state processing send
resched IPIs to the offending CPUs.  These will be sent only once
the grace period has reached an age specified by the boot/sysfs
parameter rcutree.jiffies_till_sched_qs, or once the grace period
reaches an age halfway to the point at which RCU CPU stall warnings
will be emitted, whichever comes first.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
[ paulmck: Made rcu_momentary_dyntick_idle() as suggested by the
  ktest build robot.  Also fixed smp_mb() comment as noted by
  Oleg Nesterov. ]

Merge with e552592e (Reduce overhead of cond_resched() checks for RCU)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-06-23 11:19:32 -07:00
Paul E. McKenney
546a9d8519 rcu: Export debug_init_rcu_head() and and debug_init_rcu_head()
Currently, call_rcu() relies on implicit allocation and initialization
for the debug-objects handling of RCU callbacks.  If you hammer the
kernel hard enough with Sasha's modified version of trinity, you can end
up with the sl*b allocators recursing into themselves via this implicit
call_rcu() allocation.

This commit therefore exports the debug_init_rcu_head() and
debug_rcu_head_free() functions, which permits the allocators to allocated
and pre-initialize the debug-objects information, so that there no longer
any need for call_rcu() to do that initialization, which in turn prevents
the recursion into the memory allocators.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Looks-good-to: Christoph Lameter <cl@linux.com>
2014-06-23 11:19:29 -07:00
Linus Torvalds
776edb5931 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
Pull core locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - reduced/streamlined smp_mb__*() interface that allows more usecases
     and makes the existing ones less buggy, especially in rarer
     architectures

   - add rwsem implementation comments

   - bump up lockdep limits"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  rwsem: Add comments to explain the meaning of the rwsem's count field
  lockdep: Increase static allocations
  arch: Mass conversion of smp_mb__*()
  arch,doc: Convert smp_mb__*()
  arch,xtensa: Convert smp_mb__*()
  arch,x86: Convert smp_mb__*()
  arch,tile: Convert smp_mb__*()
  arch,sparc: Convert smp_mb__*()
  arch,sh: Convert smp_mb__*()
  arch,score: Convert smp_mb__*()
  arch,s390: Convert smp_mb__*()
  arch,powerpc: Convert smp_mb__*()
  arch,parisc: Convert smp_mb__*()
  arch,openrisc: Convert smp_mb__*()
  arch,mn10300: Convert smp_mb__*()
  arch,mips: Convert smp_mb__*()
  arch,metag: Convert smp_mb__*()
  arch,m68k: Convert smp_mb__*()
  arch,m32r: Convert smp_mb__*()
  arch,ia64: Convert smp_mb__*()
  ...
2014-06-03 12:57:53 -07:00
Rik van Riel
61f38db3e3 rcu: Provide API to suppress stall warnings while sysrc runs
Some sysrq handlers can run for a long time, because they dump a lot
of data onto a serial console. Having RCU stall warnings pop up in
the middle of them only makes the problem worse.

This commit provides rcu_sysrq_start() and rcu_sysrq_end() APIs to
temporarily suppress RCU CPU stall warnings while a sysrq request is
handled.

Signed-off-by: Rik van Riel <riel@redhat.com>
[ paulmck: Fix TINY_RCU build error. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-19 10:52:04 -07:00
Uma Sharma
e534165bbf rcu: Variable name changed in tree_plugin.h and used in tree.c
The variable and struct both having the name "rcu_state" confuses
sparse in some situations, so this commit changes the variable to
"rcu_state_p" in order to avoid this confusion.  This also makes
things easier for human readers.

Signed-off-by: Uma Sharma <uma.sharma523@gmail.com>
[ paulmck: Changed the declaration and several additional uses. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-14 11:41:04 -07:00
Paul E. McKenney
f5d2a0450d Merge branches 'doc.2014.04.29a', 'fixes.2014.04.29a' and 'torture.2014.05.14a' into HEAD
doc.2014.04.29a:  Documentation updates.
fixes.2014.04.29a:  Miscellaneous fixes.
torture.2014.05.14a:  RCU/Lock torture tests.
2014-05-14 10:57:31 -07:00
Paul E. McKenney
5228084eed torture: Check for multiple concurrent torture tests
The torture tests are designed to run in isolation, but do not enforce
this isolation.  This commit therefore checks for concurrent torture
tests, and refuses to start new tests while old tests are running.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:29 -07:00
Paul E. McKenney
48d684fdad rcutorture: Run rcu_torture_writer at normal priority
There are usually lots of readers and only one writer, so if there has
to be a choice, we would want rcu_torture_writer to win.  This commit
therefore removes the set_user_nice() from rcu_torture_writer().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-05-14 09:46:26 -07:00
Thomas Gleixner
424c1b6820 rcutorture: Add missing destroy_timer_on_stack()
The rcu_torture_reader() function uses an on-stack timer_list structure
which it initializes with setup_timer_on_stack().  However, it fails to
use destroy_timer_on_stack() before exiting, which results in leaking a
tracking object if DEBUG_OBJECTS is enabled.  This commit therefore
invokes destroy_timer_on_stack() to avoid this leakage.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-05-14 09:46:24 -07:00