IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
The following patch fixes a bug introduced by the following
commit:
e050e3f0a71b ("perf: Fix broken interrupt rate throttling")
The patch caused the following warning to pop up depending on
the sampling frequency adjustments:
------------[ cut here ]------------
WARNING: at arch/x86/kernel/cpu/perf_event.c:995 x86_pmu_start+0x79/0xd4()
It was caused by the following call sequence:
perf_adjust_freq_unthr_context.part() {
stop()
if (delta > 0) {
perf_adjust_period() {
if (period > 8*...) {
stop()
...
start()
}
}
}
start()
}
Which caused a double start and a double stop, thus triggering
the assert in x86_pmu_start().
The patch fixes the problem by avoiding the double calls. We
pass a new argument to perf_adjust_period() to indicate whether
or not the event is already stopped. We can't just remove the
start/stop from that function because it's called from
__perf_event_overflow where the event needs to be reloaded via a
stop/start back-toback call.
The patch reintroduces the assertion in x86_pmu_start() which
was removed by commit:
84f2b9b ("perf: Remove deprecated WARN_ON_ONCE()")
In this second version, we've added calls to disable/enable PMU
during unthrottling or frequency adjustment based on bug report
of spurious NMI interrupts from Eric Dumazet.
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: markus@trippelsdorf.de
Cc: paulus@samba.org
Link: http://lkml.kernel.org/r/20120207133956.GA4932@quad
[ Minor edits to the changelog and to the code ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
put_io_context() performed a complex trylock dancing to avoid
deferring ioc release to workqueue. It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.
While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization. Strip it out. If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
So that we can get the perf bench exec stack fixes and then apply the
remaining fix for the files added after what is in perf/urgent.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Three power management regression fixes, one for a recent regression introcuded
by the freezer changes during the 3.3 merge window and two for regressions
in cpuidle (resulting from PM QoS changes) and in the hibernate user space
interface, both introduced during the 3.2 development cycle.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJPLbiPAAoJEKhOf7ml8uNscvwQAJYYhSBL+ouK8ERS0OLkeEoB
k4O1Ap0hb5Kv54Sr85WKEm5zGRDJXUxlWeMklo9K/fvs04CU1gsBb8jhDdbZ2ovE
rnyybPjfieExQbLxX6nYIP4qKMLtnZvHhHpafuDSUz0RWq/7sCTiFI2htNj97gGu
DzXYpeePFgvzG6AaznywWkvNdXoQfmsTC0adDrXWcuKXnNrH6h8o/OIB+pO70Szw
gmU8SjVGGQjrlnuQ+Ku4WqbSyXs1bXlUkyTHJilg6CNJySrA/LUHhKPrRnP1i3Hu
LxX/rsrTqohhD1tz1qQOpnMiu86FSez+UVA65b2cF3EqZbNROY2+O1/V+OlczKYy
V9Q3rk+J4uRJtnL8DEgcniMGrRsjyle5USN5KDX50BkrC56h3mZirnEu1yaiMIJn
K8NWI/4JdK7JbA6f2hXuPuesmudSP4uo8vuUzKthEUi88QReYXYSMcz/Fy/G9z8n
JW7PimC5OmeTwYIqBcjZf+8j/1u6cHaEkvjPAJhIUgCR/ZVi6VFySnUByDD6JKTJ
bQcUSqZZ8TvEc4A6JjG18/QfmWIZMErfuG0WAKb8sqtXoPkHKR/XXjbaXof9Oppn
nRS5iJUaZGY4YivSHZZOFAk24ThqKx5ZK3qXq/dBbj9JwtJdc+++b9f0RwXUHjd9
ECoM3bFtO8ewINmZ7wRQ
=EKGs
-----END PGP SIGNATURE-----
Merge tag 'pm-fixes-for-3.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Power management fixes for 3.3-rc3
Three power management regression fixes, one for a recent regression introcuded
by the freezer changes during the 3.3 merge window and two for regressions
in cpuidle (resulting from PM QoS changes) and in the hibernate user space
interface, both introduced during the 3.2 development cycle.
They include:
* Two hibernate (s2disk) regression fixes from Srivatsa S. Bhat (for
regressions introduced during the 3.3 merge window and during the 3.2
development cycle).
* A cpuidle fix from Venki Pallipadi for a regression resulting from PM QoS
changes during the 3.2 development cycle causing cpuidle to work incorrectly
for CONFIG_PM unset.
* tag 'pm-fixes-for-3.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / QoS: CPU C-state breakage with PM Qos change
PM / Freezer: Thaw only kernel threads if freezing of kernel threads fails
PM / Hibernate: Thaw kernel threads in SNAPSHOT_CREATE_IMAGE ioctl path
If freezing of kernel threads fails, we are expected to automatically
thaw tasks in the error recovery path. However, at times, we encounter
situations in which we would like the automatic error recovery path
to thaw only the kernel threads, because we want to be able to do
some more cleanup before we thaw userspace. Something like:
error = freeze_kernel_threads();
if (error) {
/* Do some cleanup */
/* Only then thaw userspace tasks*/
thaw_processes();
}
An example of such a situation is where we freeze/thaw filesystems
during suspend/hibernation. There, if freezing of kernel threads
fails, we would like to thaw the frozen filesystems before thawing
the userspace tasks.
So, modify freeze_kernel_threads() to thaw only kernel threads in
case of freezing failure. And change suspend_freeze_processes()
accordingly. (At the same time, let us also get rid of the rather
cryptic usage of the conditional operator (:?) in that function.)
[rjw: In fact, this patch fixes a regression introduced during the
3.3 merge window, because without it thaw_processes() may be called
before swsusp_free() in some situations and that may lead to massive
memory allocation failures.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
In function pre_handler_kretprobe(), the allocated kretprobe_instance
object will get leaked if the entry_handler callback returns non-zero.
This may cause all the preallocated kretprobe_instance objects exhausted.
This issue can be reproduced by changing
samples/kprobes/kretprobe_example.c to probe "mutex_unlock". And the fix
is straightforward: just put the allocated kretprobe_instance object back
onto the free_instances list.
[akpm@linux-foundation.org: use raw_spin_lock/unlock]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
EXPORT_SYMBOL(adjust_resource) should be right after adjust_resource().
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The __raise_softirq_irqoff() contains a tracepoint. As tracepoints in headers
can cause issues, and not to mention, bloats the kernel when they are
in a static inline, it is best to move the function that contains the
tracepoint out of the header and into softirq.c.
Link: http://lkml.kernel.org/r/20120118120711.GB14863@elte.hu
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the ftrace_set_filter and ftrace_set_notrace functions
do not return any return code. So there's no way for ftrace_ops
user to tell wether the filter was correctly applied.
The set_ftrace_filter interface returns error in case the filter
did not match:
# echo krava > set_ftrace_filter
bash: echo: write error: Invalid argument
Changing both ftrace_set_filter and ftrace_set_notrace functions
to return zero if the filter was applied correctly or -E* values
in case of error.
Link: http://lkml.kernel.org/r/1325495060-6402-2-git-send-email-jolsa@redhat.com
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As noted by Arve and others, since wall time can jump backwards, it is
difficult to use for input because one cannot determine if one event
occurred before another or for how long a key was pressed.
However, the timestamp field is part of the kernel ABI, and cannot be
changed without possibly breaking existing users.
This patch adds a new IOCTL that allows a clockid to be set in the
evdev_client struct that will specify which time base to use for event
timestamps (ie: CLOCK_MONOTONIC instead of CLOCK_REALTIME).
For now we only support CLOCK_MONOTONIC and CLOCK_REALTIME, but
in the future we could support other clockids if appropriate.
The default remains CLOCK_REALTIME, so we don't change the ABI.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Daniel Kurtz <djkurtz@google.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
This fixes the race in process_vm_core found by Oleg (see
http://article.gmane.org/gmane.linux.kernel/1235667/
for details).
This has been updated since I last sent it as the creation of the new
mm_access() function did almost exactly the same thing as parts of the
previous version of this patch did.
In order to use mm_access() even when /proc isn't enabled, we move it to
kernel/fork.c where other related process mm access functions already
are.
Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bugs, x86: Fix printk levels for panic, softlockups and stack dumps
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf top: Fix number of samples displayed
perf tools: Fix strlen() bug in perf_event__synthesize_event_type()
perf tools: Fix broken build by defining _GNU_SOURCE in Makefile
x86/dumpstack: Remove unneeded check in dump_trace()
perf: Fix broken interrupt rate throttling
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW
sched: Fix ancient race in do_exit()
sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug
sched/s390: Fix compile error in sched/core.c
sched: Fix rq->nr_uninterruptible update race
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/reboot: Remove VersaLogic Menlow reboot quirk
x86/reboot: Skip DMI checks if reboot set by user
x86: Properly parenthesize cmpxchg() macro arguments
The argument is not used at all, and it's not necessary, because
a specific callback handler of course knows which subsys it
belongs to.
Now only ->pupulate() takes this argument, because the handlers of
this callback always call cgroup_add_file()/cgroup_add_files().
So we reduce a few lines of code, though the shrinking of object size
is minimal.
16 files changed, 113 insertions(+), 162 deletions(-)
text data bss dec hex filename
5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig
5486170 656987 7039960 13183117 c9288d vmlinux.o
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In the SNAPSHOT_CREATE_IMAGE ioctl, if the call to hibernation_snapshot()
fails, the frozen tasks are not thawed.
And in the case of success, if we happen to exit due to a successful freezer
test, all tasks (including those of userspace) are thawed, whereas actually
we should have thawed only the kernel threads at that point. Fix both these
issues.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
This also fixes the following compilation warning on !SMP:
CC kernel/sched/fair.o
kernel/sched/fair.c:218:36: warning: 'max_load_balance_interval' defined but not used [-Wunused-variable]
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F2754A0.9090306@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In cgroup_attach_proc, we indirectly call find_existing_css_set 3
times. It is an expensive call so we want to call it a minimum
of times. This patch only calls it once and stores the result so
that it can be used later on when we call cgroup_task_migrate.
This required modifying cgroup_task_migrate to take the new css_set
(which we obtained from find_css_set) as a parameter. The nice side
effect of this is that cgroup_task_migrate is now identical for
cgroup_attach_task and cgroup_attach_proc. It also now returns a
void since it can never fail.
Changes in V5:
* https://lkml.org/lkml/2012/1/20/344 (Tejun Heo)
* Remove css_set_refs
Changes in V4:
* https://lkml.org/lkml/2011/12/22/421 (Li Zefan)
* Avoid GFP_KERNEL (sleep) in rcu_read_lock by getting css_set in
a separate loop not under an rcu_read_lock
Changes in V3:
* https://lkml.org/lkml/2011/12/22/13 (Li Zefan)
* Fixed earlier bug by creating a seperate patch to remove tasklist_lock
Changes in V2:
* https://lkml.org/lkml/2011/12/20/372 (Tejun Heo)
* Move find_css_set call into loop which creates the flex array
* Author
* Kill css_set_refs and use group_size instead
* Fix an off-by-one error in counting css_set refs
* Add a retval check in out_list_teardown
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
- Replace class ID #define with enumeration
- Loop through PM QoS objects during initialization (rather than
initializing them one-by-one)
Signed-off-by: Alex Frid <afrid@nvidia.com>
Reviewed-by: Antti Miettinen <amiettinen@nvidia.com>
Reviewed-by: Diwakar Tundlam <dtundlam@nvidia.com>
Reviewed-by: Scott Williams <scwilliams@nvidia.com>
Reviewed-by: Yu-Huan Hsu <yhsu@nvidia.com>
Acked-by: markgross <markgross@thegnar.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
The current device suspend/resume phases during system-wide power
transitions appear to be insufficient for some platforms that want
to use the same callback routines for saving device states and
related operations during runtime suspend/resume as well as during
system suspend/resume. In principle, they could point their
.suspend_noirq() and .resume_noirq() to the same callback routines
as their .runtime_suspend() and .runtime_resume(), respectively,
but at least some of them require device interrupts to be enabled
while the code in those routines is running.
It also makes sense to have device suspend-resume callbacks that will
be executed with runtime PM disabled and with device interrupts
enabled in case someone needs to run some special code in that
context during system-wide power transitions.
Apart from this, .suspend_noirq() and .resume_noirq() were introduced
as a workaround for drivers using shared interrupts and failing to
prevent their interrupt handlers from accessing suspended hardware.
It appears to be better not to use them for other porposes, or we may
have to deal with some serious confusion (which seems to be happening
already).
For the above reasons, introduce new device suspend/resume phases,
"late suspend" and "early resume" (and analogously for hibernation)
whose callback will be executed with runtime PM disabled and with
device interrupts enabled and whose callback pointers generally may
point to runtime suspend/resume routines.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Commit 2aede851ddf08666f68ffc17be446420e9d2a056
PM / Hibernate: Freeze kernel threads after preallocating memory
introduced a mechanism by which kernel threads were frozen after
the preallocation of hibernate image memory to avoid problems with
frozen kernel threads not responding to memory freeing requests.
However, it overlooked the s2disk code path in which the
SNAPSHOT_CREATE_IMAGE ioctl was run directly after SNAPSHOT_FREE,
which caused freeze_workqueues_begin() to BUG(), because it saw
that worqueues had been already frozen.
Although in principle this issue might be addressed by removing
the relevant BUG_ON() from freeze_workqueues_begin(), that would
reintroduce the very problem that commit 2aede851ddf08666f68ffc17be4
attempted to avoid into that particular code path. For this reason,
to fix the issue at hand, introduce thaw_kernel_threads() and make
the SNAPSHOT_FREE ioctl execute it.
Special thanks to Srivatsa S. Bhat for detailed analysis of the
problem.
Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@kernel.org
Currently we don't utilize the sched_switch field anymore.
But, simply removing sched_switch field from the middle of the
sched_stat output will break tools.
So, to stay compatible we hardcode it to zero and remove the
field from the scheduler data structures.
Update the schedstat documentation accordingly.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1327422836.27181.5.camel@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With a lot of small tasks, the softirq sched is nearly never called
when no_hz is enabled. In this case load_balance() is mainly called
with the newly_idle mode which doesn't update the cpu_power.
Add a next_update field which ensure a maximum update period when
there is short activity.
Having stale cpu_power information can skew the load-balancing
decisions, this is cured by the guaranteed update.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1323717668-2143-1-git-send-email-vincent.guittot@linaro.org
The block layer has some code trying to determine if two CPUs share a
cache, the scheduler has a similar function. Expose the function used
by the scheduler and make the block layer use it, thereby removing the
block layers usage of CONFIG_SCHED* and topology bits.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jens Axboe <axboe@kernel.dk>
Link: http://lkml.kernel.org/r/1327579450.2446.95.camel@twins
This issue happens under the following conditions:
1. preemption is off
2. __ARCH_WANT_INTERRUPTS_ON_CTXSW is defined
3. RT scheduling class
4. SMP system
Sequence is as follows:
1.suppose current task is A. start schedule()
2.task A is enqueued pushable task at the entry of schedule()
__schedule
prev = rq->curr;
...
put_prev_task
put_prev_task_rt
enqueue_pushable_task
4.pick the task B as next task.
next = pick_next_task(rq);
3.rq->curr set to task B and context_switch is started.
rq->curr = next;
4.At the entry of context_swtich, release this cpu's rq->lock.
context_switch
prepare_task_switch
prepare_lock_switch
raw_spin_unlock_irq(&rq->lock);
5.Shortly after rq->lock is released, interrupt is occurred and start IRQ context
6.try_to_wake_up() which called by ISR acquires rq->lock
try_to_wake_up
ttwu_remote
rq = __task_rq_lock(p)
ttwu_do_wakeup(rq, p, wake_flags);
task_woken_rt
7.push_rt_task picks the task A which is enqueued before.
task_woken_rt
push_rt_tasks(rq)
next_task = pick_next_pushable_task(rq)
8.At find_lock_lowest_rq(), If double_lock_balance() returns 0,
lowest_rq can be the remote rq.
(But,If preemption is on, double_lock_balance always return 1 and it
does't happen.)
push_rt_task
find_lock_lowest_rq
if (double_lock_balance(rq, lowest_rq))..
9.find_lock_lowest_rq return the available rq. task A is migrated to
the remote cpu/rq.
push_rt_task
...
deactivate_task(rq, next_task, 0);
set_task_cpu(next_task, lowest_rq->cpu);
activate_task(lowest_rq, next_task, 0);
10. But, task A is on irq context at this cpu.
So, task A is scheduled by two cpus at the same time until restore from IRQ.
Task A's stack is corrupted.
To fix it, don't migrate an RT task if it's still running.
Signed-off-by: Chanho Min <chanho.min@lge.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/CAOAMb1BHA=5fm7KTewYyke6u-8DP0iUuJMpgQw54vNeXFsGpoQ@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes the sampling interrupt throttling mechanism.
It was broken in v3.2. Events were not being unthrottled. The
unthrottling mechanism required that events be checked at each
timer tick.
This patch solves this problem and also separates:
- unthrottling
- multiplexing
- frequency-mode period adjustments
Not all of them need to be executed at each timer tick.
This third version of the patch is based on my original patch +
PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).
At each timer tick, for each context:
- if the current CPU has throttled events, we unthrottle events
- if context has frequency-based events, we adjust sampling periods
- if we have reached the jiffies interval, we multiplex (rotate)
We decoupled rotation (multiplexing) from frequency-mode sampling
period adjustments. They should not necessarily happen at the same
rate. Multiplexing is subject to jiffies_interval (currently at 1
but could be higher once the tunable is exposed via sysfs).
We have grouped frequency-mode adjustment and unthrottling into the
same routine to minimize code duplication. When throttled while in
frequency mode, we scan the events only once.
We have fixed the threshold enforcement code in __perf_event_overflow().
There was a bug whereby it would allow more than the authorized rate
because an increment of hwc->interrupts was not executed at the right
place.
The patch was tested with low sampling limit (2000) and fixed periods,
frequency mode, overcommitted PMU.
On a 2.1GHz AMD CPU:
$ cat /proc/sys/kernel/perf_event_max_sample_rate
2000
We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):
$ perf record -e cycles,cycles -c 700000 noploop 10
$ perf report -D | tail -21
Aggregated stats:
TOTAL events: 80086
MMAP events: 88
COMM events: 2
EXIT events: 4
THROTTLE events: 19996
UNTHROTTLE events: 19996
SAMPLE events: 40000
cycles stats:
TOTAL events: 40006
MMAP events: 5
COMM events: 1
EXIT events: 4
THROTTLE events: 9998
UNTHROTTLE events: 9998
SAMPLE events: 20000
cycles stats:
TOTAL events: 39996
THROTTLE events: 9998
UNTHROTTLE events: 9998
SAMPLE events: 20000
For 10s, the cap is 2x2000x10 = 40000 samples.
We get exactly that: 20000 samples/event.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: <stable@kernel.org> # v3.2+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120126160319.GA5655@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
try_to_wake_up() has a problem which may change status from TASK_DEAD to
TASK_RUNNING in race condition with SMI or guest environment of virtual
machine. As a result, exited task is scheduled() again and panic occurs.
Here is the sequence how it occurs:
----------------------------------+-----------------------------
|
CPU A | CPU B
----------------------------------+-----------------------------
TASK A calls exit()....
do_exit()
exit_mm()
down_read(mm->mmap_sem);
rwsem_down_failed_common()
set TASK_UNINTERRUPTIBLE
set waiter.task <= task A
list_add to sem->wait_list
:
raw_spin_unlock_irq()
(I/O interruption occured)
__rwsem_do_wake(mmap_sem)
list_del(&waiter->list);
waiter->task = NULL
wake_up_process(task A)
try_to_wake_up()
(task is still
TASK_UNINTERRUPTIBLE)
p->on_rq is still 1.)
ttwu_do_wakeup()
(*A)
:
(I/O interruption handler finished)
if (!waiter.task)
schedule() is not called
due to waiter.task is NULL.
tsk->state = TASK_RUNNING
:
check_preempt_curr();
:
task->state = TASK_DEAD
(*B)
<--- set TASK_RUNNING (*C)
schedule()
(exit task is running again)
BUG_ON() is called!
--------------------------------------------------------
The execution time between (*A) and (*B) is usually very short,
because the interruption is disabled, and setting TASK_RUNNING at (*C)
must be executed before setting TASK_DEAD.
HOWEVER, if SMI is interrupted between (*A) and (*B),
(*C) is able to execute AFTER setting TASK_DEAD!
Then, exited task is scheduled again, and BUG_ON() is called....
If the system works on guest system of virtual machine, the time
between (*A) and (*B) may be also long due to scheduling of hypervisor,
and same phenomenon can occur.
By this patch, do_exit() waits for releasing task->pi_lock which is used
in try_to_wake_up(). It guarantees the task becomes TASK_DEAD after
waking up.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Keep all the interesting data in a single cache line.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Now that ntp.c's locking is reworked, we can remove most
of the xtime_lock usage in timekeeping.c
The remaining xtime_lock presence is really for jiffies access
and the global load calculation.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Use a ntp_lock spin lock to replace xtime_lock locking in ntp.c
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Currently the NTP managed tick_length value is accessed globally,
in preparations for locking cleanups, make sure it is accessed via
a function and mark it as static.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Move ntp_sycned to ntp.c and mark time_status as static.
Also yank function declaration for non-existant function.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Now that all the timekeeping variables are stored in
the timekeeper structure, add a new lock to protect the
structure.
For now, this lock nests under the xtime_lock for writes.
For readers, we don't need to take xtime_lock anymore.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Move global xtime_lock and timekeeping_suspended values up
to the top of timekeeping.c
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
In preparation for locking cleanups, move raw_time into
timekeeper structure.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
In preparation for locking cleanups, move xtime into
timekeeper structure.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
In preparation for locking cleanups, move wall_to_monotonic
into the timekeeper structure.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Move total_sleep_time into the timekeeper structure in preparation
for locking cleanups
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
rsyslog will display KERN_EMERG messages on a connected
terminal. However, these messages are useless/undecipherable
for a general user.
For example, after a softlockup we get:
Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
kernel:Stack:
Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
kernel:Call Trace:
Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
kernel:Code: ff ff a8 08 75 25 31 d2 48 8d 86 38 e0 ff ff 48 89
d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <e8> ea 69 dd ff 4c 29 e8 48 89 c7 e8 0f bc da ff 49 89 c4 49 89
This happens because the printk levels for these messages are
incorrect. Only an informational message should be displayed on
a terminal.
I modified the printk levels for various messages in the kernel
and tested the output by using the drivers/misc/lkdtm.c kernel
modules (ie, softlockups, panics, hard lockups, etc.) and
confirmed that the console output was still the same and that
the output to the terminals was correct.
For example, in the case of a softlockup we now see the much
more informative:
Message from syslogd@intel-s3e37-04 at Jan 25 10:18:06 ...
BUG: soft lockup - CPU4 stuck for 60s!
instead of the above confusing messages.
AFAICT, the messages no longer have to be KERN_EMERG. In the
most important case of a panic we set console_verbose(). As for
the other less severe cases the correct data is output to the
console and /var/log/messages.
Successfully tested by me using the drivers/misc/lkdtm.c module.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: dzickus@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1327586134-11926-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With the recent nohz scheduler changes, rq's nohz flag
'NOHZ_TICK_STOPPED' and its associated state doesn't get cleared
immediately after the cpu exits idle. This gets cleared as part
of the next tick seen on that cpu.
For the cpu offline support, we need to clear this state
manually. Fix it by registering a cpu notifier, which clears the
nohz idle load balance state for this rq explicitly during the
CPU_DYING notification.
There won't be any nohz updates for that cpu, after the
CPU_DYING notification. But lets be extra paranoid and skip
updating the nohz state in the select_nohz_load_balancer() if
the cpu is not in active state anymore.
Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Reviewed-and-tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1327026538.16150.40.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit 029632fbb7b7c9d85063cc9eb470de6c54873df3 ("sched: Make
separate sched*.c translation units") removed the include of
asm/mutex.h from sched.c.
This breaks the combination of:
CONFIG_MUTEX_SPIN_ON_OWNER=yes
CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX=yes
like s390 without mutex debugging:
CC kernel/sched/core.o
kernel/sched/core.c: In function ‘mutex_spin_on_owner’:
kernel/sched/core.c:3287: error: implicit declaration of function ‘arch_mutex_cpu_relax’
Lets re-add the include to kernel/sched/core.c
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1326268696-30904-1-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
KOSAKI Motohiro noticed the following race:
> CPU0 CPU1
> --------------------------------------------------------
> deactivate_task()
> task->state = TASK_UNINTERRUPTIBLE;
> activate_task()
> rq->nr_uninterruptible--;
>
> schedule()
> deactivate_task()
> rq->nr_uninterruptible++;
>
Kosaki-San's scenario is possible when CPU0 runs
__sched_setscheduler() against CPU1's current @task.
__sched_setscheduler() does a dequeue/enqueue in order to move
the task to its new queue (position) to reflect the newly provided
scheduling parameters. However it should be completely invariant to
nr_uninterruptible accounting, sched_setscheduler() doesn't affect
readyness to run, merely policy on when to run.
So convert the inappropriate activate/deactivate_task usage to
enqueue/dequeue_task, which avoids the nr_uninterruptible accounting.
Also convert the two other sites: __migrate_task() and
normalize_task() that still use activate/deactivate_task. These sites
aren't really a problem since __migrate_task() will only be called on
non-running task (and therefore are immume to the described problem)
and normalize_task() isn't ever used on regular systems.
Also remove the comments from activate/deactivate_task since they're
misleading at best.
Reported-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1327486224.2614.45.camel@laptop
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On ARM, we don't want SPARSE_IRQ to be a user visible option. Make
SPARSE_IRQ visible based on MAY_HAVE_SPARSE_IRQ instead of depending
on HAVE_SPARSE_IRQ.
With this, SPARSE_IRQ is not visible on C6X and ARM.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
Move the core sysctl code from kernel/sysctl.c and kernel/sysctl_check.c
into fs/proc/proc_sysctl.c.
Currently sysctl maintenance is hampered by the sysctl implementation
being split across 3 files with artificial layering between them.
Consolidate the entire sysctl implementation into 1 file so that
it is easier to see what is going on and hopefully allowing for
simpler maintenance.
For functions that are now only used in fs/proc/proc_sysctl.c remove
their declarations from sysctl.h and make them static in fs/proc/proc_sysctl.c
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Simplify the code by treating the base sysctl table like any other
sysctl table and register it with register_sysctl_table.
To ensure this table is registered early enough to avoid problems
call sysctl_init from proc_sys_init.
Rename sysctl_net.c:sysctl_init() to net_sysctl_init() to avoid
name conflicts now that kernel/sysctl.c:sysctl_init() is no longer
static.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>