15077 Commits

Author SHA1 Message Date
Mike Galbraith
e0a79f529d sched: Fix select_idle_sibling() bouncing cow syndrome
If the previous CPU is cache affine and idle, select it.

The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.

1 tbench pair (worst case) in a 10 core + SMT package:

  pre   15.22 MB/sec 1 procs
  post 252.01 MB/sec 1 procs

Signed-off-by: Mike Galbraith <bitbucket@online.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1359371965.5783.127.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-04 20:07:24 +01:00
Ingo Molnar
9228b5f243 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

1.	Changes to rcutorture and to RCU documentation. Posted to LKML at
        https://lkml.org/lkml/2013/1/26/188.

2.	Enhancements to uniprocessor handling in tiny RCU. Posted to LKML
        at https://lkml.org/lkml/2013/1/27/2.

3.	Tag RCU callbacks with grace-period number to simplify callback
        advancement. Posted to LKML at https://lkml.org/lkml/2013/1/26/203.

4.	Miscellaneous fixes. Posted to LKML at https://lkml.org/lkml/2013/1/26/204.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-04 19:06:34 +01:00
anish kumar
c02cf5f8ed irq_work: Remove return value from the irq_work_queue() function
As no one is using the return value of irq_work_queue(),
so it is better to just make it void.

Signed-off-by: anish kumar <anish198519851985@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
[ Fix stale comments, remove now unnecessary __irq_work_queue() intermediate function ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1359925703-24304-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-04 11:50:59 +01:00
Thomas Gleixner
90889a635a Merge branch 'fortglx/3.9/time' of git://git.linaro.org/people/jstultz/linux into timers/core
Trivial conflict in arch/x86/Kconfig

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-02-04 11:03:03 +01:00
Al Viro
bde208d2e1 switch mips to generic rt_sigsuspend(), make it unconditional
mips was the last architecture not using the generic variant.
Both native and compat variants switched to generic, which is
made unconditional now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 18:32:49 -05:00
Al Viro
2ce5da1757 new helper: signal_setup_done()
usual "call force_sigsegv or signal_delivered" logics.  Takes
ksignal instead of separate siginfo/k_sigaction.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:26 -05:00
Al Viro
90caf58dad convert futex compat syscalls to COMPAT_SYSCALL_DEFINE
ppc is stepping into a nasal daemon territory a bit...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:24 -05:00
Al Viro
495dfbf767 generic sys_sigaction() and compat_sys_sigaction()
conditional on OLD_SIGACTION/COMPAT_OLD_SIGACTION

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:23 -05:00
Al Viro
08d32fe504 generic sys_compat_rt_sigaction()
Again, protected by a temporary config symbol (GENERIC_COMPAT_RT_SIGACTION);
will be gone by the end of series.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:23 -05:00
Al Viro
6883da8c6c switch compat_sys_sched_rr_get_interval to COMPAT_SYSCALL_DEFINE
... and make it unconditional - we want the sucker on all biarch
platforms, really.  All kinds of wrappers and private implementations
can go now; fortunately, they don't cause name conflicts, so we can
do that one first without any bisect hazards.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:22 -05:00
Al Viro
9aae8fc05d switch rt_tgsigqueueinfo to COMPAT_SYSCALL_DEFINE
C ABI violations on sparc, ppc and mips

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:21 -05:00
Al Viro
5cf2210022 switch compat_sys_sigprocmask to COMPAT_SYSCALL_DEFINE
In principle, C ABI violation on ppc and mips...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:21 -05:00
Al Viro
28d27f2d25 switch compat_sys_rt_sigtimedwait to COMPAT_SYSCALL_DEFINE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:20 -05:00
Al Viro
0a0e8cdf73 old sigsuspend variants in kernel/signal.c
conditional on OLD_SIGSUSPEND/OLD_SIGSUSPEND3, depending on which
variety of that fossil is needed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:20 -05:00
Al Viro
75907d4d7b generic compat_sys_rt_sigqueueinfo()
conditional on GENERIC_COMPAT_RT_SIGQUEUEINFO; by the end of that series
it will become the same thing as COMPAT and conditional will die out.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:19 -05:00
Al Viro
fe9c1db2cf generic compat_sys_rt_sigpending()
conditional on GENERIC_COMPAT_RT_SIGPENDING; by the end of that series
it will become the same thing as COMPAT and conditional will die out.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:19 -05:00
Al Viro
322a56cb1f generic compat_sys_rt_sigprocmask()
conditional on GENERIC_COMPAT_RT_SIGPROCMASK; by the end of that series
it will become the same thing as COMPAT and conditional will die out.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:19 -05:00
Al Viro
ad4b65a434 consolidate rt_sigsuspend()
* pull compat version alongside with the native one
* make little-endian compat variant just call the native
* don't bother with separate conditional for compat (both native and
compat are going to become unconditional very soon).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:18 -05:00
Al Viro
eaca6eae3e sanitize rt_sigaction() situation a bit
Switch from __ARCH_WANT_SYS_RT_SIGACTION to opposite
(!CONFIG_ODD_RT_SIGACTION); the only two architectures that
need it are alpha and sparc.  The reason for use of CONFIG_...
instead of __ARCH_... is that it's needed only kernel-side
and doing it that way avoids a mess with include order on many
architectures.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:18 -05:00
Al Viro
377840744b switch compat_sys_[gs]etitimer(2) to COMPAT_SYSCALL_DEFINE
again, strictly speaking we are in nasal daemon territory on ppc
and mips - we need to sign-extend int arguments.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-03 15:09:17 -05:00
Kirill Tkhai
60334caf37 sched/rt: Further simplify pick_rt_task()
Function next_prio() has been removed and pull_rt_task() is the
only user of pick_next_highest_task_rt() at the moment.

pull_rt_task is not interested in p->nr_cpus_allowed, its only
interest is the fact that cpu is allowed to execute p. If
nr_cpus_allowed == 1, cpu != task_cpu(p) and cpu is allowed then
it means that task p is in the middle of the migration
techniques; the task waits until it is moved by migration
thread. So, lets pull it earlier.

Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
CC: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/70871359644177@web16d.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-03 19:54:58 +01:00
Jiri Olsa
0231bb5336 perf: Fix event group context move
When we have group with mixed events (hw/sw) we want to end up
with group leader being in hw context. So if group leader is
initialy sw event, we move all the events under hw context.

The move is done for each event by removing it from its context
and adding it back into proper one. As a part of the removal the
event is automatically disabled, which is not what we want at
this stage of creating groups.

The fix is to initialize event state after removal from sw
context.

This fix resulted from the following discussion:

  http://thread.gmane.org/gmane.linux.kernel.perf.user/1144

Reported-by: Andreas Hollmann <hollmann@in.tum.de>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vince@deater.net>
Link: http://lkml.kernel.org/r/1359714225-4231-1-git-send-email-jolsa@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-03 12:01:29 +01:00
Ingo Molnar
f7355a5e7c Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
Pull tracing updated from Steve Rostedt.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-03 11:14:06 +01:00
Steven Rostedt (Red Hat)
d840f718d2 tracing: Init current_trace to nop_trace and remove NULL checks
On early boot up, when the ftrace ring buffer is initialized, the
static variable current_trace is initialized to &nop_trace.
Before this initialization, current_trace is NULL and will never
become NULL again. It is always reassigned to a ftrace tracer.

Several places check if current_trace is NULL before it uses
it, and this check is frivolous, because at the point in time
when the checks are made the only way current_trace could be
NULL is if ftrace failed its allocations at boot up, and the
paths to these locations would probably not be possible.

By initializing current_trace to &nop_trace where it is declared,
current_trace will never be NULL, and we can remove all these
checks of current_trace being NULL which never needed to be
checked in the first place.

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-02-01 18:38:47 -05:00
Mark Rutland
12ad100046 clockevents: Add generic timer broadcast function
Currently, the timer broadcast mechanism is defined by a function
pointer on struct clock_event_device. As the fundamental mechanism for
broadcast is architecture-specific, this means that clock_event_device
drivers cannot be shared across multiple architectures.

This patch adds an (optional) architecture-specific function for timer
tick broadcast, allowing drivers which may require broadcast
functionality to be shared across multiple architectures.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: nico@linaro.org
Cc: Will.Deacon@arm.com
Cc: Marc.Zyngier@arm.com
Cc: john.stultz@linaro.org
Link: http://lkml.kernel.org/r/1358183124-28461-3-git-send-email-mark.rutland@arm.com
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-01-31 22:15:36 +01:00
Mark Rutland
12572dbb53 clockevents: Add generic timer broadcast receiver
Currently the broadcast mechanism used for timers is abstracted by a
function pointer on struct clock_event_device. As the fundamental
mechanism for broadcast is architecture-specific, this ties each
clock_event_device driver to a single architecture, even where the
driver is otherwise generic.

This patch adds a standard path for the receipt of timer broadcasts, so
drivers and/or architecture backends need not manage redundant lists of
timers for the purpose of routing broadcast timer ticks.

[tglx: Made the implementation depend on the config switch as well ]

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: nico@linaro.org
Cc: Will.Deacon@arm.com
Cc: Marc.Zyngier@arm.com
Cc: john.stultz@linaro.org
Link: http://lkml.kernel.org/r/1358183124-28461-2-git-send-email-mark.rutland@arm.com
Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-01-31 22:15:35 +01:00
Kirill Tkhai
fc79e240be sched/rt: Do not account zero delta_exec in update_curr_rt()
There are several places of consecutive calls of
dequeue_task_rt() and put_prev_task_rt() in the scheduler.
For example, function rt_mutex_setprio() does it.

The both calls lead to update_curr_rt(), the second of it
receives zeroed delta_exec. The only effective action in this
case is call of sched_rt_avg_update(), which can change
rq->age_stamp and rq->rt_avg. But it is possible in case of
""floating"" rq->clock. This fact is not reasonable to be
accounted. Another actions do nothing.

Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
CC: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/931541359550236@web1g.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-31 10:31:13 +01:00
Linus Torvalds
bdb0ae6a76 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
 "This is a collection of miscellaneous fixes, the most important one is
  the fix for the Samsung laptop bricking issue (auto-blacklisting the
  samsung-laptop driver); the efi_enabled() changes you see below are
  prerequisites for that fix.

  The other issues fixed are booting on OLPC XO-1.5, an UV fix, NMI
  debugging, and requiring CAP_SYS_RAWIO for MSR references, just as
  with I/O port references."

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  samsung-laptop: Disable on EFI hardware
  efi: Make 'efi_enabled' a function to query EFI facilities
  smp: Fix SMP function call empty cpu mask race
  x86/msr: Add capabilities check
  x86/dma-debug: Bump PREALLOC_DMA_DEBUG_ENTRIES
  x86/olpc: Fix olpc-xo1-sci.c build errors
  arch/x86/platform/uv: Fix incorrect tlb flush all issue
  x86-64: Fix unwind annotations in recent NMI changes
  x86-32: Start out cr0 clean, disable paging before modifying cr3/4
2013-01-31 17:08:43 +11:00
Dave Airlie
ff0d05bf73 Revert "console: implement lockdep support for console_lock"
This reverts commit daee779718a319ff9f83e1ba3339334ac650bb22.

I'll requeue this after the console locking fixes, so lockdep
is useful again for people until fbcon is fixed.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2013-01-31 15:46:56 +11:00
Hiraku Toyooka
debdd57f51 tracing: Make a snapshot feature available from userspace
Ftrace has a snapshot feature available from kernel space and
latency tracers (e.g. irqsoff) are using it. This patch enables
user applictions to take a snapshot via debugfs.

Add "snapshot" debugfs file in "tracing" directory.

  snapshot:
    This is used to take a snapshot and to read the output of the
    snapshot.

     # echo 1 > snapshot

    This will allocate the spare buffer for snapshot (if it is
    not allocated), and take a snapshot.

     # cat snapshot

    This will show contents of the snapshot.

     # echo 0 > snapshot

    This will free the snapshot if it is allocated.

    Any other positive values will clear the snapshot contents if
    the snapshot is allocated, or return EINVAL if it is not allocated.

Link: http://lkml.kernel.org/r/20121226025300.3252.86850.stgit@liselsia

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: David Sharp <dhsharp@google.com>
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
[
   Fixed irqsoff selftest and also a conflict with a change
   that fixes the update_max_tr.
]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-30 11:02:06 -05:00
Hiraku Toyooka
2fd196ec1e tracing: Replace static old_tracer check of tracer name
Currently the trace buffer read functions use a static variable
"old_tracer" for detecting if the current tracer changes. This
was suitable for a single trace file ("trace"), but to add a
snapshot feature that will use the same function for its file,
a check against a static variable is not sufficient.

To use the output functions for two different files, instead of
storing the current tracer in a static variable, as the trace
iterator descriptor contains a pointer to the original current
tracer's name, that pointer can now be used to check if the
current tracer has changed between different reads of the trace
file.

Link: http://lkml.kernel.org/r/20121226025252.3252.9276.stgit@liselsia

Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-30 11:02:05 -05:00
Namhyung Kim
5e67b51e3f tracing: Use sched_clock_cpu for trace_clock_global
For systems with an unstable sched_clock, all cpu_clock() does is enable/
disable local irq during the call to sched_clock_cpu().  And for stable
systems they are same.

trace_clock_global() already disables interrupts, so it can call
sched_clock_cpu() directly.

Link: http://lkml.kernel.org/r/1356576585-28782-2-git-send-email-namhyung@kernel.org

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-30 11:02:05 -05:00
Steven Rostedt (Red Hat)
ad964704ba ring-buffer: Add stats field for amount read from trace ring buffer
Add a stat about the number of events read from the ring buffer:

 #  cat /debug/tracing/per_cpu/cpu0/stats
entries: 39869
overrun: 870512
commit overrun: 0
bytes: 1449912
oldest event ts:  6561.368690
now ts:  6565.246426
dropped events: 0
read events: 112    <-- Added

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-30 11:01:53 -05:00
Yinghai Lu
0212f91596 x86: Add Crash kernel low reservation
During kdump kernel's booting stage, it need to find low ram for
swiotlb buffer when system does not support intel iommu/dmar remapping.

kexed-tools is appending memmap=exactmap and range from /proc/iomem
with "Crash kernel", and that range is above 4G for 64bit after boot
protocol 2.12.

We need to add another range in /proc/iomem like "Crash kernel low",
so kexec-tools could find that info and append to kdump kernel
command line.

Try to reserve some under 4G if the normal "Crash kernel" is above 4G.

User could specify the size with crashkernel_low=XX[KMG].

-v2: fix warning that is found by Fengguang's test robot.
-v3: move out get_mem_size change to another patch, to solve compiling
     warning that is found by Borislav Petkov <bp@alien8.de>
-v4: user must specify crashkernel_low if system does not support
     intel or amd iommu.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1359058816-7615-31-git-send-email-yinghai@kernel.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Rob Landley <rob@landley.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29 19:32:58 -08:00
John Stultz
6f16eebe1f timekeeping: Switch HAS_PERSISTENT_CLOCK to ALWAYS_USE_PERSISTENT_CLOCK
Jason pointed out the HAS_PERSISTENT_CLOCK name isn't
quite accurate for the config, as some systems may have
the persistent_clock in some cases, but not always.

So change the config name to the more clear
ALWAYS_USE_PERSISTENT_CLOCK.

Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-01-29 14:40:12 -08:00
Steven Rostedt (Red Hat)
03274a3ffb tracing/fgraph: Adjust fgraph depth before calling trace return callback
While debugging the virtual cputime with the function graph tracer
with a max_depth of 1 (most common use of the max_depth so far),
I found that I was missing kernel execution because of a race condition.

The code for the return side of the function has a slight race:

	ftrace_pop_return_trace(&trace, &ret, frame_pointer);
	trace.rettime = trace_clock_local();
	ftrace_graph_return(&trace);
	barrier();
	current->curr_ret_stack--;

The ftrace_pop_return_trace() initializes the trace structure for
the callback. The ftrace_graph_return() uses the trace structure
for its own use as that structure is on the stack and is local
to this function. Then the curr_ret_stack is decremented which
is what the trace.depth is set to.

If an interrupt comes in after the ftrace_graph_return() but
before the curr_ret_stack, then the called function will get
a depth of 2. If max_depth is set to 1 this function will be
ignored.

The problem is that the trace has already been called, and the
timestamp for that trace will not reflect the time the function
was about to re-enter userspace. Calls to the interrupt will not
be traced because the max_depth has prevented this.

To solve this issue, the ftrace_graph_return() can safely be
moved after the current->curr_ret_stack has been updated.
This way the timestamp for the return callback will reflect
the actual time.

If an interrupt comes in after the curr_ret_stack update and
ftrace_graph_return(), it will be traced. It may look a little
confusing to see it within the other function, but at least
it will not be lost.

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-29 17:30:31 -05:00
David S. Miller
f1e7b73acc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Bring in the 'net' tree so that we can get some ipv4/ipv6 bug
fixes that some net-next work will build upon.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 15:32:13 -05:00
Jovi Zhang
38dbe0b137 tracing: Remove second iterator initializer
The trace iterator is already initialized by trace_init_global_iter(),
so there is no need to initialize it again.

Link: http://lkml.kernel.org/r/CACV3sb+G1YnO6168JhY3dEadmJi58pA5-2cSZT8E0WVHJNFt9Q@mail.gmail.com

Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-01-29 09:57:49 -05:00
Peter Zijlstra
7b270f6099 sched: Bail out of yield_to when source and target runqueue has one task
In case of undercomitted scenarios, especially in large guests
yield_to overhead is significantly high. when run queue length of
source and target is one, take an opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of PLE handler.

(History: Raghavendra initially worked on break out of kvm ple handler upon
 seeing source runqueue length = 1, but it had to export rq length).
 Peter came up with the elegant idea of return -ESRCH in scheduler core.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Raghavendra, Checking the rq length of target vcpu condition added.(thanks Avi)
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Andrew Jones <drjones@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 15:38:37 +02:00
Paul E. McKenney
40393f525f Merge branches 'doctorture.2013.01.29a', 'fixes.2013.01.26a', 'tagcb.2013.01.24a' and 'tiny.2013.01.29b' into HEAD
doctorture.2013.01.11a: Changes to rcutorture and to RCU documentation.

fixes.2013.01.26a: Miscellaneous fixes.

tagcb.2013.01.24a: Tag RCU callbacks with grace-period number to
	simplify callback advancement.

tiny.2013.01.29b: Enhancements to uniprocessor handling in tiny RCU.
2013-01-28 22:25:21 -08:00
Paul E. McKenney
0e11c8e8a6 rcu: Make rcutorture's shuffler task shuffle recently added tasks
A number of kthreads have been added to rcutorture, but the shuffler
task was not informed of them, and thus did not shuffle them.  This
commit therefore adds the requisite shuffling, and, while in the area
fixes up some whitespace issues.

However, the shuffling is intended to keep randomly selected CPUs
idle, which means that the RCU priority boosting kthreads need to
avoid waking up every jiffy.  This commit also makes that fix.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2013-01-28 22:19:54 -08:00
Paul E. McKenney
6bfc09e232 rcu: Provide RCU CPU stall warnings for tiny RCU
Tiny RCU has historically omitted RCU CPU stall warnings in order to
reduce memory requirements, however, lack of these warnings caused
Thomas Gleixner some debugging pain recently.  Therefore, this commit
adds RCU CPU stall warnings to tiny RCU if RCU_TRACE=y.  This keeps
the memory footprint small, while still enabling CPU stall warnings
in kernels built to enable them.

Updated to include Josh Triplett's suggested use of RCU_STALL_COMMON
config variable to simplify #if expressions.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2013-01-28 22:06:21 -08:00
Wang YanQing
f44310b98d smp: Fix SMP function call empty cpu mask race
I get the following warning every day with v3.7, once or
twice a day:

  [ 2235.186027] WARNING: at /mnt/sda7/kernel/linux/arch/x86/kernel/apic/ipi.c:109 default_send_IPI_mask_logical+0x2f/0xb8()

As explained by Linus as well:

 |
 | Once we've done the "list_add_rcu()" to add it to the
 | queue, we can have (another) IPI to the target CPU that can
 | now see it and clear the mask.
 |
 | So by the time we get to actually send the IPI, the mask might
 | have been cleared by another IPI.
 |

This patch also fixes a system hang problem, if the data->cpumask
gets cleared after passing this point:

        if (WARN_ONCE(!mask, "empty IPI mask"))
                return;

then the problem in commit 83d349f35e1a ("x86: don't send an IPI to
the empty set of CPU's") will happen again.

Signed-off-by: Wang YanQing <udknight@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: peterz@infradead.org
Cc: mina86@mina86.org
Cc: srivatsa.bhat@linux.vnet.ibm.com
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20130126075357.GA3205@udknight
[ Tidied up the changelog and the comment in the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-28 11:21:57 +01:00
Olof Johansson
6b914c9987 Linux 3.8-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJRAuO3AAoJEHm+PkMAQRiGbfAH/1C3QQKB11aBpYLAw7qijAze
 yOui26UCnwRryxsO8zBCQjGoByy5DvY/Q0zyUCWUE6nf/JFSoKGUHzfJ1ATyzGll
 3vENP6Fnmq0Hgc4t8/gXtXrZ1k/c43cYA2XEhDnEsJlFNmNj2wCQQj9njTNn2cl1
 k6XhZ9U1V2hGYpLL5bmsZiLVI6dIpkCVw8d4GZ8BKxSLUacVKMS7ml2kZqxBTMgt
 AF6T2SPagBBxxNq8q87x4b7vyHYchZmk+9tAV8UMs1ecimasLK8vrRAJvkXXaH1t
 xgtR0sfIp5raEjoFYswCK+cf5NEusLZDKOEvoABFfEgL4/RKFZ8w7MMsmG8m0rk=
 =m68Y
 -----END PGP SIGNATURE-----

Merge tag 'v3.8-rc5' into next/cleanup

Linux 3.8-rc5

Signed-off-by: Olof Johansson <olof@lixom.net>
2013-01-27 22:07:20 -08:00
Frederic Weisbecker
6a61671bb2 cputime: Safely read cputime of full dynticks CPUs
While remotely reading the cputime of a task running in a
full dynticks CPU, the values stored in utime/stime fields
of struct task_struct may be stale. Its values may be those
of the last kernel <-> user transition time snapshot and
we need to add the tickless time spent since this snapshot.

To fix this, flush the cputime of the dynticks CPUs on
kernel <-> user transition and record the time / context
where we did this. Then on top of this snapshot and the current
time, perform the fixup on the reader side from task_times()
accessors.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[fixed kvm module related build errors]
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
2013-01-27 20:35:47 +01:00
Frederic Weisbecker
c11f11fcbd kvm: Prepare to add generic guest entry/exit callbacks
Do some ground preparatory work before adding guest_enter()
and guest_exit() context tracking callbacks. Those will
be later used to read the guest cputime safely when we
run in full dynticks mode.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-01-27 20:35:40 +01:00
Frederic Weisbecker
6fac4829ce cputime: Use accessors to read task cputime stats
This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-01-27 19:23:31 +01:00
Frederic Weisbecker
3f4724ea85 cputime: Allow dynamic switch between tick/virtual based cputime accounting
Allow to dynamically switch between tick and virtual based
cputime accounting. This way we can provide a kind of "on-demand"
virtual based cputime accounting. In this mode, the kernel relies
on the context tracking subsystem to dynamically probe on kernel
boundaries.

This is in preparation for being able to stop the timer tick in
more places than just the idle state. Doing so will depend on
CONFIG_VIRT_CPU_ACCOUNTING_GEN which makes it possible to account
the cputime without the tick by hooking on kernel/user boundaries.

Depending whether the tick is stopped or not, we can switch between
tick and vtime based accounting anytime in order to minimize the
overhead associated to user hooks.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-01-27 19:23:29 +01:00
Frederic Weisbecker
abf917cd91 cputime: Generic on-demand virtual cputime accounting
If we want to stop the tick further idle, we need to be
able to account the cputime without using the tick.

Virtual based cputime accounting solves that problem by
hooking into kernel/user boundaries.

However implementing CONFIG_VIRT_CPU_ACCOUNTING require
low level hooks and involves more overhead. But we already
have a generic context tracking subsystem that is required
for RCU needs by archs which plan to shut down the tick
outside idle.

This patch implements a generic virtual based cputime
accounting that relies on these generic kernel/user hooks.

There are some upsides of doing this:

- This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
if context tracking is already built (already necessary for RCU in full
tickless mode).

- We can rely on the generic context tracking subsystem to dynamically
(de)activate the hooks, so that we can switch anytime between virtual
and tick based accounting. This way we don't have the overhead
of the virtual accounting when the tick is running periodically.

And one downside:

- There is probably more overhead than a native virtual based cputime
accounting. But this relies on hooks that are already set anyway.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-01-27 19:23:27 +01:00
Frederic Weisbecker
ae8dda5c47 cputime: Move default nsecs_to_cputime() to jiffies based cputime file
If the architecture doesn't provide an implementation of
nsecs_to_cputime(), the cputime accounting core uses a
default one that converts the nanoseconds to jiffies. However
this only makes sense if we use the jiffies based cputime.

For now it doesn't matter much because this API is only
called on code that uses jiffies based cputime accounting.

But the code may evolve and this API may be used more
broadly in the future. Keeping this default implementation
around is very error prone as it may introduce a bug and
hide it on architectures that don't override this API.

Fix this by moving this definition to the jiffies based
cputime headers as it is the only place where it belongs to.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-01-27 19:23:25 +01:00