23055 Commits

Author SHA1 Message Date
Thomas Gleixner
7ee681b252 workqueue: Convert to state machine callbacks
Get rid of the prio ordering of the separate notifiers and use a proper state
callback pair.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153335.197083890@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 09:34:43 +02:00
Thomas Gleixner
00e16c3d68 perf/core: Convert to hotplug state machine
Actually a nice symmetric startup/teardown pair which fits properly into
the state machine concept. In the long run we should be able to invoke
the startup callback for the boot CPU via the state machine and get
rid of the init function which invokes it on the boot CPU.

Note: This comes actually before the perf hardware callbacks. In the notifier
model the hardware callbacks have a higher priority than the core
callback. But that's solely for CPU offline so that hardware migration of
events happens before the core is notified about the outgoing CPU.

With the symetric state array model we have the following ordering:

 UP:     core -> hardware
 DOWN:   hardware -> core

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153333.587514098@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 09:34:31 +02:00
Thomas Gleixner
6a4e24518c cpu/hotplug: Handle early registration gracefully
We switched the hotplug machinery to smpboot threads. Early registration of
hotplug callbacks, i.e. from do_pre_smp_initcalls(), happens before the
threads are initialized. Instead of moving the thread init, we simply handle
it in the hotplug code itself and invoke the function directly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153332.896450738@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 09:34:25 +02:00
Greg Kroah-Hartman
6c71ee3b61 Third set of IIO new device support, features and cleanups for the 4.8 cycle.
New core features
 - Selection of the clock source for IIO timestamps.  This is done per device
   as it makes little sense to have events in one timebase and data timestamped
   on another.  Biggest reason for this is that we currently use a clock
   source which is non monotonic which can result in 'interesting' data sets.
   (Includes export for get_monotonic_corse64 which Thomas Gleixner didn't mind
    in an earlier version.)
 - MAINTAINERS add the git tree to the list for IIO.
 
 New device support + a kind of indirect staging graduation.
 * Broadcom iproc-static-adc
   - new driver
 * mcp4531
   - support for MCP454x, MCP456x, MCP464x and MCP466x potentiometers
 * mpu6050
   - support the IC20608 6 axis motion tracking device
 * st-sensors
   - support the lis3l02dq + drop the lis3l02dq driver from staging.
   The general purpose driver is missing event support, but good to get
   rid of this driver which was rather long in the tooth.
 
 New driver features
 * ak8975
   - Add vid regulator support and refactor handling in general.
   - Allow a delay after enabling regulators.
   - Runtime and system PM.
 * bmg160
   - filter frequency control support.
 * bmp280
   - SPI device support.
   - EOC interrupt support for the BMP085
   - power management support.
   - supply regulator support.
   - reset gpio support
   - dt bindings for reset gpio and regulators.
   - of table to support device tree registration
 * max1363
   - Device tree bindings.
 * mcp4531
   - Device tree bindings.
 * st-pressure
   - temperature channels as part of triggered buffer (previously not due
   probably to alignment issues - see below).
   - lps22hb open drain interrupt support.
   - lps22hb temperature channel support
 
 Cleanups and reworkings.
 * numerous ADC drivers
   - ensure the iio_dev->dev.of_node is set to the parent dev.of_node so
   as to allow client bindings to find the device.
 * ak8975
   - Fix incorrect handling of missing regulator
   - make sure power is down and remove.
 * bmp280
   - read the calibration data only once as it doesn't change.
 * isl29125
   - Use a few macros to make code a touch more readable.
 * mma8452
   - fix a memory leak on error.
   - drop an unecessary bit of return value handling.
 * potentiometer kconfig
   - typo fix.
 * st-pressure
   - drop some uninformative default assignments of elements of the channel
   array structure (aids readability).
 * st-sensors
   - Harden interrupt handling considerably.  These are actually all using
   level interrupts, but at least two known boards have them wired to
   edge only interrupt chips.  Hence a slightly interesting bit of handling
   is needed in which we first allow for the easy option (level triggered) and
   secondly check the status registers before reenabling edge interrupts and
   fall back to a tight loop in the thread until we successfully clear the
   interrupt.  No harm is done if we never succeed in doing so.  It's an odd
   patch that has been through a lot of revisions to reach a consensus on how
   to handle what is basically broken hardware (which the previous defaults
   allowed to kind of work).
   - Fix alignment to defined storagebytes boundaries.
   - Ensure alignment of power of 2 byte boundaries.  This has always in theory
   been part of the ABI of IIO, but we missed a few that snuck in that need
   fixing.  The effect was minor as they were only followed by timestamp
   channels which were correctly aligned,
   - Add some docs to explain the gain calculations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIuBAABCAAYBQJXfBqnERxqaWMyM0BrZXJuZWwub3JnAAoJEFSFNJnE9BaIqjwP
 /0OJbr8kIa1i6+iCqCRCPCixdymd6k9wvjDaKSQoDeamen+8iKOLZNhXJJjOX8hd
 eCRMrCJbvY96Bl2Ll51TCEBb8R1xppCwwYIYylKhF9CL6N2ndapzWY0G4XZb6pc0
 e1JIa6uxynAAEsfplBskk4Ytf5PPHDOWER5WsTmxlZcTTAL9gLxIlii2Du0AmeN/
 tANVzwuvK07i5HHuZfYV2h2+OWDSlm4Y5rvE7t8keWpp6wnZ0XtiIw1WjkpR1OY7
 KiKGKRJMomFlp51hP9IKqc20Dweiaf3lHS7BDggvkB11VxyajQTcjvogxQ0BSPUv
 7PTHHlk8txgEUMqrDWP8x0TL97iNt3hiOZ0/rI3IZdFLC8pnibewnB+uHEGCH3tv
 bqToPtpJHjsIiGlCGVxvt8BRgqT5Qq7JT65hYS6774uFcQiPEvPDI44BDqUxaDUf
 /1WFM23VB4KJpx8JnL+nC8iu6DBnVPDWDKAsjGgc+ljnz3VRcSxWz5P0yMFZRMA2
 mbLiG2yiD4oD/LcI8FeZh9X50Irg09ElAWu07VRymrYMRfCYLXO07o5nZJ0bOqOB
 R+1MToYaHz2g6jJ+KGVC0Ul5EuULzymqH0CMbdjWnaD9AaoPuOKkNfUVBkzRK0t/
 TO/wLHm/qNbk+zGZHQFU15mH1Nn9leEJ/uCdnGqkRo7i
 =FxNN
 -----END PGP SIGNATURE-----

Merge tag 'iio-for-4.8c' of git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-next

Jonathan writes:

Third set of IIO new device support, features and cleanups for the 4.8 cycle.

New core features
- Selection of the clock source for IIO timestamps.  This is done per device
  as it makes little sense to have events in one timebase and data timestamped
  on another.  Biggest reason for this is that we currently use a clock
  source which is non monotonic which can result in 'interesting' data sets.
  (Includes export for get_monotonic_corse64 which Thomas Gleixner didn't mind
   in an earlier version.)
- MAINTAINERS add the git tree to the list for IIO.

New device support + a kind of indirect staging graduation.
* Broadcom iproc-static-adc
  - new driver
* mcp4531
  - support for MCP454x, MCP456x, MCP464x and MCP466x potentiometers
* mpu6050
  - support the IC20608 6 axis motion tracking device
* st-sensors
  - support the lis3l02dq + drop the lis3l02dq driver from staging.
  The general purpose driver is missing event support, but good to get
  rid of this driver which was rather long in the tooth.

New driver features
* ak8975
  - Add vid regulator support and refactor handling in general.
  - Allow a delay after enabling regulators.
  - Runtime and system PM.
* bmg160
  - filter frequency control support.
* bmp280
  - SPI device support.
  - EOC interrupt support for the BMP085
  - power management support.
  - supply regulator support.
  - reset gpio support
  - dt bindings for reset gpio and regulators.
  - of table to support device tree registration
* max1363
  - Device tree bindings.
* mcp4531
  - Device tree bindings.
* st-pressure
  - temperature channels as part of triggered buffer (previously not due
  probably to alignment issues - see below).
  - lps22hb open drain interrupt support.
  - lps22hb temperature channel support

Cleanups and reworkings.
* numerous ADC drivers
  - ensure the iio_dev->dev.of_node is set to the parent dev.of_node so
  as to allow client bindings to find the device.
* ak8975
  - Fix incorrect handling of missing regulator
  - make sure power is down and remove.
* bmp280
  - read the calibration data only once as it doesn't change.
* isl29125
  - Use a few macros to make code a touch more readable.
* mma8452
  - fix a memory leak on error.
  - drop an unecessary bit of return value handling.
* potentiometer kconfig
  - typo fix.
* st-pressure
  - drop some uninformative default assignments of elements of the channel
  array structure (aids readability).
* st-sensors
  - Harden interrupt handling considerably.  These are actually all using
  level interrupts, but at least two known boards have them wired to
  edge only interrupt chips.  Hence a slightly interesting bit of handling
  is needed in which we first allow for the easy option (level triggered) and
  secondly check the status registers before reenabling edge interrupts and
  fall back to a tight loop in the thread until we successfully clear the
  interrupt.  No harm is done if we never succeed in doing so.  It's an odd
  patch that has been through a lot of revisions to reach a consensus on how
  to handle what is basically broken hardware (which the previous defaults
  allowed to kind of work).
  - Fix alignment to defined storagebytes boundaries.
  - Ensure alignment of power of 2 byte boundaries.  This has always in theory
  been part of the ABI of IIO, but we missed a few that snuck in that need
  fixing.  The effect was minor as they were only followed by timestamp
  channels which were correctly aligned,
  - Add some docs to explain the gain calculations.
2016-07-14 12:05:29 +09:00
Linus Torvalds
f97d10454e Merge branches 'perf-urgent-for-linus' and 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf and timer fixes from Ingo Molnar:
 "A fix for a posix CPU timers bug, and a perf printk message fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix bogus kernel printk, again

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix_cpu_timer: Exit early when process has been reaped
2016-07-14 05:44:47 +09:00
Thomas Gleixner
e877bde234 Merge branch 'core/urgent' into smp/hotplug to pick up dependencies 2016-07-13 17:03:30 +02:00
Thomas Gleixner
e1c4cde62b Merge branch 'core/rcu' into smp/hotplug to pick up dependencies 2016-07-13 17:03:17 +02:00
Thomas Gleixner
54f5449677 Merge branch 'timers/core' into smp/hotplug to pick up dependencies 2016-07-13 17:01:51 +02:00
Thomas Gleixner
d60585c576 sched/core: Correct off by one bug in load migration calculation
The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into
account that the function is now called from a thread running on the outgoing
CPU. As a result a cpu unplug leakes a load of 1 into the global load
accounting mechanism.

Fix it by adjusting for the currently running thread which calls
calc_load_migrate().

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: rt@linutronix.de
Cc: shreyas@linux.vnet.ibm.com
Fixes: e9cd8fa4fcfd: ("sched/migration: Move calc_load_migrate() into CPU_DYING")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607121744350.4083@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-13 14:58:20 +02:00
Thomas Gleixner
a7c734140a cpu/hotplug: Keep enough storage space if SMP=n to avoid array out of bounds scribble
Xiaolong Ye reported lock debug warnings triggered by the following commit:

  8de4a0066106 ("perf/x86: Convert the core to the hotplug state machine")

The bug is the following: the cpuhp_bp_states[] array is cut short when
CONFIG_SMP=n, but the dynamically registered callbacks are stored nevertheless
and happily scribble outside of the array bounds...

We need to store them in case that the state is unregistered so we can invoke
the teardown function. That's independent of CONFIG_SMP. Make sure the array
is large enough.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: lkp@01.org
Cc: stable@vger.kernel.org
Cc: tipbuild@zytor.com
Fixes: cff7d378d3fd "cpu/hotplug: Convert to a state machine for the control processor"
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607122144560.4083@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-13 09:29:39 +02:00
Paul Gortmaker
a536a6e13e bpf: make inode code explicitly non-modular
The Kconfig currently controlling compilation of this code is:

init/Kconfig:config BPF_SYSCALL
init/Kconfig:   bool "Enable bpf() system call"

...meaning that it currently is not being built as a module by anyone.

Lets remove the couple traces of modular infrastructure use, so that
when reading the driver there is no doubt it is builtin-only.

Note that MODULE_ALIAS is a no-op for non-modular code.

We replace module.h with init.h since the file does use __init.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 13:52:43 -07:00
Alexander Popov
a1b7b1a57b irqdomain: Fix irq_domain_alloc_irqs_recursive() error handling
If an irq_domain is auto-recursive and irq_domain_alloc_irqs_recursive()
for its parent has returned an error, then do return and avoid calling
irq_domain_free_irqs_recursive() uselessly, because:
- if domain->ops->alloc() had failed for an auto-recursive irq_domain,
   then irq_domain_free_irqs_recursive() had already been called;
- if domain->ops->alloc() had failed for a not auto-recursive irq_domain,
   then there is nothing to free at all.

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1467505448-2850-1-git-send-email-alex.popov@linux.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-11 17:23:48 +02:00
Alexey Dobriyan
2c13ce8f6b posix_cpu_timer: Exit early when process has been reaped
Variable "now" seems to be genuinely used unintialized
if branch

	if (CPUCLOCK_PERTHREAD(timer->it_clock)) {

is not taken and branch

	if (unlikely(sighand == NULL)) {

is taken. In this case the process has been reaped and the timer is marked as
disarmed anyway. So none of the postprocessing of the sample is
required. Return right away.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20160707223911.GA26483@p183.telecom.by
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-11 17:20:12 +02:00
Ingo Molnar
44530d588e Revert "perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86"
This reverts commit 2c95afc1e83d93fac3be6923465e1753c2c53b0a.

Stephane reported the following regression:

 > Since Andi added:
 >
 > commit 2c95afc1e83d93fac3be6923465e1753c2c53b0a
 > Author: Andi Kleen <ak@linux.intel.com>
 > Date:   Thu Jun 9 06:14:38 2016 -0700
 >
 >    perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
 >
 > $ perf stat -e ref-cycles ls
 >   <not counted> ....
 >
 > fails systematically because the ref-cycles is now used by the
 > watchdog and given this is a system-wide pinned event, it monopolizes
 > the fixed counter 2 which is the only counter able to measure this event.

Since the next merge window is near, fix the regression for now
by reverting the commit.

Reported-by: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10 20:58:36 +02:00
Daniel Bristot de Oliveira
748c7201e6 sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set
Currently, a schedule while atomic error prints the stack trace to the
kernel log and the system continue running.

Although it is possible to collect the kernel log messages and analyze
it, often more information are needed. Furthermore, keep the system
running is not always the best choice. For example, when the preempt
count underflows the system will not stop to complain about scheduling
while atomic, so the kernel log can wrap around overwriting the first
stack trace, tuning the analysis even more challenging.

This patch uses the kernel.panic_on_warn sysctl to help out on these
more complex situations.

When kernel.panic_on_warn is set to 1, the kernel will panic() in the
schedule while atomic detection.

The default value of the sysctl is 0, maintaining the current behavior.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Reviewed-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e8f7b80f353aa22c63bd8557208163989af8493d.1464983675.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10 20:17:27 +02:00
Rafael J. Wysocki
4c0b6c10fb PM / hibernate: Image data protection during restoration
Make it possible to protect all pages holding image data during
hibernate image restoration by setting them read-only (so as to
catch attempts to write to those pages after image data have been
stored in them).

This adds overhead to image restoration code (it may cause large
page mappings to be split as a result of page flags changes) and
the errors it protects against should never happen in theory, so
the feature is only active after passing hibernate=protect_image
to the command line of the restore kernel.

Also it only is built if CONFIG_DEBUG_RODATA is set.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-10 02:12:10 +02:00
Rafael J. Wysocki
d5f32af310 PM / hibernate: Add missing braces in __register_nosave_region()
One branch of an if/else statement in __register_nosave_region() is
formatted against the kernel coding style which causes the code to
look slightly odd.  To fix that, add missing braces to it.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-10 01:37:35 +02:00
Rafael J. Wysocki
ef96f639ea PM / hibernate: Clean up comments in snapshot.c
Many comments in kernel/power/snapshot.c do not follow the general
comment formatting rules.  They look odd, some of them are outdated
too, some are hard to parse and generally difficult to understand.

Clean them up to make them easier to comprehend.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-10 01:37:26 +02:00
Rafael J. Wysocki
efd5a85242 PM / hibernate: Clean up function headers in snapshot.c
The formatting of some function headers in kernel/power/snapshot.c
is not consistent with the general kernel coding style and with the
formatting of some other function headers in the same file.

Make all of them follow the same formatting convention.

No functional changes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-10 01:37:20 +02:00
Rafael J. Wysocki
2f88e41a22 PM / hibernate: Add missing braces in hibernate_setup()
Make hibernate_setup() follow the coding style more closely by adding
some missing braces to the if () statement in it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-07-10 01:37:13 +02:00
Zhao Lei
277a13e4f0 sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together
In current code, we can get cpuacct data from several files,
but each file has various limitations.

For example:

 - We can get CPU usage in user and kernel mode via cpuacct.stat,
   but we can't get detailed data about each CPU.

 - We can get each CPU's kernel mode usage in cpuacct.usage_percpu_sys,
   but we can't get user mode usage data at the same time.

This patch introduces cpuacct.usage_all, to show all detailed CPU
accounting data together:

 # cat cpuacct.usage_all
 cpu user system
 0 3809760299 5807968992
 1 3250329855 454612211
 ..

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/7744460969edd7caaf0e903592ee52353ed9bdd6.1466415271.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-09 13:56:15 +02:00
Zhao Lei
8e546bfafb sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show()
In cpuacct_stats_show() we currently we have copies of similar code,
for each cpustat(system/user) variant.

Use a loop instead to consolidate the code. This will also work better
if we extend the CPUACCT_STAT_NSTATS type.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/b0597d4224655e9f333f1a6224ed9654c7d7d36a.1466415271.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-09 13:56:15 +02:00
Zhao Lei
9acacc2ac5 sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums
These two types have similar function, no need to separate them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/436748885270d64363c7dc67167507d486c2057a.1466415271.git.zhaolei@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-09 13:56:15 +02:00
Alexei Starovoitov
606274c5ab bpf: introduce bpf_get_current_task() helper
over time there were multiple requests to access different data
structures and fields of task_struct current, so finally add
the helper to access 'current' as-is. Tracing bpf programs will do
the rest of walking the pointers via bpf_probe_read().
Note that current can be null and bpf program has to deal it with,
but even dumb passing null into bpf_probe_read() is still safe.

Suggested-by: Brendan Gregg <brendan.d.gregg@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 00:00:16 -04:00
Rafael J. Wysocki
63f9ccb895 Merge back earlier suspend/hibernation changes for v4.8. 2016-07-08 23:14:17 +02:00
Linus Torvalds
369da7fc6d Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two load-balancing fixes for cgroups-intense workloads"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix calc_cfs_shares() fixed point arithmetics width confusion
  sched/fair: Fix effective_load() to consistently use smoothed load
2016-07-08 09:04:34 -07:00
Ingo Molnar
9e7f7f5425 Merge branch 'x86/mm' into x86/boot, to pick up dependencies
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08 17:27:47 +02:00
Ingo Molnar
4b4b20852d Merge branch 'timers/fast-wheel' into timers/core 2016-07-07 10:35:28 +02:00
Anna-Maria Gleixner
f00c0afdfa timers: Implement optimization for same expiry time in mod_timer()
The existing optimization for same expiry time in mod_timer() checks whether
the timer expiry time is the same as the new requested expiry time. In the old
timer wheel implementation this does not take the slack batching into account,
neither does the new implementation evaluate whether the new expiry time will
requeue the timer to the same bucket.

To optimize that, we can calculate the resulting bucket and check if the new
expiry time is different from the current expiry time. This calculation
happens outside the base lock held region. If the resulting bucket is the same
we can avoid taking the base lock and requeueing the timer.

If the timer needs to be requeued then we have to check under the base lock
whether the base time has changed between the lockless calculation and taking
the lock. If it has changed we need to recalculate under the lock.

This optimization takes effect for timers which are enqueued into the less
granular wheel levels (1 and above). With a simple test case the functionality
has been verified:

            Before        After
 Match:       5.5%        86.6%
 Requeue:    94.5%        13.4%
 Recalc:                  <0.01%

In the non optimized case the timer is requeued in 94.5% of the cases. With
the index optimization in place the requeue rate drops to 13.4%. The case
where the lockless index calculation has to be redone is less than 0.01%.

With a real world test case (networking) we observed the following changes:

            Before        After
 Match:      97.8%        99.7%
 Requeue:     2.2%         0.3%
 Recalc:                  <0.001%

That means two percent fewer lock/requeue/unlock operations done in one of
the hot path use cases of timers.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.778527749@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:12 +02:00
Anna-Maria Gleixner
ffdf047728 timers: Split out index calculation
For further optimizations we need to seperate index calculation
from queueing. No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.691159619@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:12 +02:00
Thomas Gleixner
4e85876a9d timers: Only wake softirq if necessary
With the wheel forwading in place and with the HZ=1000 4ms folding we can
avoid running the softirq at all.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.607650550@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:11 +02:00
Thomas Gleixner
a683f390b9 timers: Forward the wheel clock whenever possible
The wheel clock is stale when a CPU goes into a long idle sleep. This has the
side effect that timers which are queued end up in the outer wheel levels.
That results in coarser granularity.

To solve this, we keep track of the idle state and forward the wheel clock
whenever possible.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.512039360@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:11 +02:00
Thomas Gleixner
ff00673292 timers/nohz: Remove pointless tick_nohz_kick_tick() function
This was a failed attempt to optimize the timer expiry in idle, which was
disabled and never revisited. Remove the cruft.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.431073782@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:10 +02:00
Anna-Maria Gleixner
236968383c timers: Optimize collect_expired_timers() for NOHZ
After a NOHZ idle sleep the timer wheel must be forwarded to current jiffies.
There might be expired timers so the current code loops and checks the expired
buckets for timers. This can take quite some time for long NOHZ idle periods.

The pending bitmask in the timer base allows us to do a quick search for the
next expiring timer and therefore a fast forward of the base time which
prevents pointless long lasting loops.

For a 3 seconds idle sleep this reduces the catchup time from ~1ms to 5us.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.351296290@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:10 +02:00
Anna-Maria Gleixner
73420fea80 timers: Move __run_timers() function
Move __run_timers() below __next_timer_interrupt() and next_pending_bucket()
in preparation for __run_timers() NOHZ optimization.

No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.271872665@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:09 +02:00
Thomas Gleixner
53bf837b78 timers: Remove set_timer_slack() leftovers
We now have implicit batching in the timer wheel. The slack API is no longer
used, so remove it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrew F. Davis <afd@ti.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jaehoon Chung <jh80.chung@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Nyman <mathias.nyman@intel.com>
Cc: Pali Rohár <pali.rohar@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:09 +02:00
Thomas Gleixner
500462a9de timers: Switch to a non-cascading wheel
The current timer wheel has some drawbacks:

1) Cascading:

   Cascading can be an unbound operation and is completely pointless in most
   cases because the vast majority of the timer wheel timers are canceled or
   rearmed before expiration. (They are used as timeout safeguards, not as
   real timers to measure time.)

2) No fast lookup of the next expiring timer:

   In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
   must fast forward the base time to the current value of jiffies. As we
   have no way to find the next expiring timer fast, the code loops linearly
   and increments the base time one by one and checks for expired timers
   in each step. This causes unbound overhead spikes exactly in the moment
   when we should wake up as fast as possible.

After a thorough analysis of real world data gathered on laptops,
workstations, webservers and other machines (thanks Chris!) I came to the
conclusion that the current 'classic' timer wheel implementation can be
modified to address the above issues.

The vast majority of timer wheel timers is canceled or rearmed before
expiry. Most of them are timeouts for networking and other I/O tasks. The
nature of timeouts is to catch the exception from normal operation (TCP ack
timed out, disk does not respond, etc.). For these kinds of timeouts the
accuracy of the timeout is not really a concern. Timeouts are very often
approximate worst-case values and in case the timeout fires, we already
waited for a long time and performance is down the drain already.

The few timers which actually expire can be split into two categories:

 1) Short expiry times which expect halfways accurate expiry

 2) Long term expiry times are inaccurate today already due to the
    batching which is done for NOHZ automatically and also via the
    set_timer_slack() API.

So for long term expiry timers we can avoid the cascading property and just
leave them in the less granular outer wheels until expiry or
cancelation. Timers which are armed with a timeout larger than the wheel
capacity are no longer cascaded. We expire them with the longest possible
timeout (6+ days). We have not observed such timeouts in our data collection,
but at least we handle them, applying the rule of the least surprise.

To avoid extending the wheel levels for HZ=1000 so we can accomodate the
longest observed timeouts (5 days in the network conntrack code) we reduce the
first level granularity on HZ=1000 to 4ms, which effectively is the same as
the HZ=250 behaviour. From our data analysis there is nothing which relies on
that 1ms granularity and as a side effect we get better batching and timer
locality for the networking code as well.

Contrary to the classic wheel the granularity of the next wheel is not the
capacity of the first wheel. The granularities of the wheels are in the
currently chosen setting 8 times the granularity of the previous wheel.

So for HZ=250 we end up with the following granularity levels:

 Level Offset   Granularity                  Range
     0      0          4 ms                 0 ms -        252 ms
     1     64         32 ms               256 ms -       2044 ms (256ms - ~2s)
     2    128        256 ms              2048 ms -      16380 ms (~2s   - ~16s)
     3    192       2048 ms (~2s)       16384 ms -     131068 ms (~16s  - ~2m)
     4    256      16384 ms (~16s)     131072 ms -    1048572 ms (~2m   - ~17m)
     5    320     131072 ms (~2m)     1048576 ms -    8388604 ms (~17m  - ~2h)
     6    384    1048576 ms (~17m)    8388608 ms -   67108863 ms (~2h   - ~18h)
     7    448    8388608 ms (~2h)    67108864 ms -  536870911 ms (~18h  - ~6d)

That's a worst case inaccuracy of 12.5% for the timers which are queued at the
beginning of a level.

So the new wheel concept addresses the old issues:

1) Cascading is avoided completely

2) By keeping the timers in the bucket until expiry/cancelation we can track
   the buckets which have timers enqueued in a bucket bitmap and therefore can
   look up the next expiring timer very fast and O(1).

A further benefit of the concept is that the slack calculation which is done
on every timer start is no longer necessary because the granularity levels
provide natural batching already.

Our extensive testing with various loads did not show any performance
degradation vs. the current wheel implementation.

This patch does not address the 'fast lookup' issue as we wanted to make sure
that there is no regression introduced by the wheel redesign. The
optimizations are in follow up patches.

This patch contains fixes from Anna-Maria Gleixner and Richard Cochran.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094342.108621834@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:09 +02:00
Thomas Gleixner
494af3ed78 timers: Give a few structs and members proper names
Some of the names in the internal implementation of the timer code
are not longer correct and others are simply too long to type.

Clean it up before we switch the wheel implementation over to
the new scheme.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.948752516@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:08 +02:00
Thomas Gleixner
2b1ecc3d1a signals: Use hrtimer for sigtimedwait()
We've converted most timeout related syscalls to hrtimers, but
sigtimedwait() did not get this treatment.

Convert it so we get a reasonable accuracy and remove the
user space exposure to the timer wheel properties.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Cyril Hrubis <chrubis@suse.cz>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.787164909@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:07 +02:00
Thomas Gleixner
177ec0a0a5 timers: Remove the deprecated mod_timer_pinned() API
We switched all users to initialize the timers as pinned and call
mod_timer(). Remove the now unused timer API function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.706205231@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:35:06 +02:00
Thomas Gleixner
e675447bda timers: Make 'pinned' a timer property
We want to move the timer migration logic from a 'push' to a 'pull' model.

Under the current 'push' model pinned timers are handled via
a runtime API variant: mod_timer_pinned().

The 'pull' model requires us to store the pinned attribute of a timer
in the timer_list structure itself, as a new TIMER_PINNED bit in
timer->flags.

This flag must be set at initialization time and the timer APIs
recognize the flag.

This patch:

 - Implements the new flag and associated new-style initialization
   methods

 - makes mod_timer() recognize new-style pinned timers,

 - and adds some migration helper facility to allow
   step by step conversion of old-style to new-style
   pinned timers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160704094341.049338558@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 10:25:13 +02:00
Ingo Molnar
36e91aa262 Merge branch 'locking/arch-atomic' into locking/core, because the topic is ready
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 09:12:02 +02:00
Wei Yongjun
885885f6b8 locking/static_keys: Fix non static symbol Sparse warning
Fix the following sparse warnings:

  kernel/jump_label.c:473:23: warning:
   symbol 'jump_label_module_nb' was not declared. Should it be static?

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1466183980-8903-1-git-send-email-weiyj_lk@163.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 09:06:46 +02:00
Ingo Molnar
3ebe3bd8fb Merge branch 'perf/urgent' into perf/core, to pick up fixes before merging new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 08:58:23 +02:00
Mark Rutland
2c81a64770 perf/core: Fix pmu::filter_match for SW-led groups
The following commit:

  66eb579e66ec ("perf: allow for PMU-specific event filtering")

added the pmu::filter_match() callback. This was intended to
avoid HW constraints on events from resulting in extremely
pessimistic scheduling.

However, pmu::filter_match() is only called for the leader of each event
group. When the leader is a SW event, we do not filter the groups, and
may fail at pmu::add() time, and when this happens we'll give up on
scheduling any event groups later in the list until they are rotated
ahead of the failing group.

This can result in extremely sub-optimal event scheduling behaviour,
e.g. if running the following on a big.LITTLE platform:

$ taskset -c 0 ./perf stat \
 -e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \
 -e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \
 ls

     <not counted>      context-switches                                              (0.00%)
     <not counted>      armv8_cortex_a57/config=0x11/                                 (0.00%)
                24      context-switches                                              (37.36%)
          57589154      armv8_cortex_a53/config=0x11/                                 (37.36%)

Here the 'a53' event group was always eligible to be scheduled, but
the 'a57' group never eligible to be scheduled, as the task was always
affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57'
group was eligible, but the HW event failed at pmu::add() time,
resulting in ctx_flexible_sched_in giving up on scheduling further
groups with HW events.

One way of avoiding this is to check pmu::filter_match() on siblings
as well as the group leader. If any of these fail their
pmu::filter_match() call, we must skip the entire group before
attempting to add any events.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 66eb579e66ec ("perf: allow for PMU-specific event filtering")
Link: http://lkml.kernel.org/r/1465917041-15339-1-git-send-email-mark.rutland@arm.com
[ Small readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-07 08:57:57 +02:00
Juergen Gross
ecb23dc6f2 xen: add steal_clock support on x86
The pv_time_ops structure contains a function pointer for the
"steal_clock" functionality used only by KVM and Xen on ARM. Xen on x86
uses its own mechanism to account for the "stolen" time a thread wasn't
able to run due to hypervisor scheduling.

Add support in Xen arch independent time handling for this feature by
moving it out of the arm arch into drivers/xen and remove the x86 Xen
hack.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2016-07-06 10:34:48 +01:00
Namhyung Kim
a4a551b8f1 ftrace: Reduce size of function graph entries
Currently ftrace_graph_ent{,_entry} and ftrace_graph_ret{,_entry} struct
can have padding bytes at the end due to alignment in 64-bit data type.
As these data are recorded so frequently, those paddings waste
non-negligible space.  As the ring buffer maintains alignment properly
for each architecture, just to remove the extra padding using 'packed'
attribute.

  ftrace_graph_ent_entry:  24 -> 20
  ftrace_graph_ret_entry:  48 -> 44

Also I moved the 'overrun' field in struct ftrace_graph_ret to minimize
the padding in the middle.

Tested on x86_64 only.

Link: http://lkml.kernel.org/r/1467197808-13578-1-git-send-email-namhyung@kernel.org

Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-07-05 17:28:30 -04:00
Tom Zanussi
7ad8fb61c4 tracing: Have HIST_TRIGGERS select TRACING
The kbuild test robot reported a compile error if HIST_TRIGGERS was
enabled but nothing else that selected TRACING was configured in.

HIST_TRIGGERS should directly select it and not rely on anything else
to do it.

Link: http://lkml.kernel.org/r/57791866.8080505@linux.intel.com

Reported-by: kbuild test robot <fennguang.wu@intel.com>
Fixes: 7ef224d1d0e3a ("tracing: Add 'hist' event trigger command")
Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-07-05 15:49:01 -04:00
Wei Yongjun
67f20b0845 tracing: Using for_each_set_bit() to simplify trace_pid_write()
Using for_each_set_bit() to simplify the code.

Link: http://lkml.kernel.org/r/1467645004-11169-1-git-send-email-weiyj_lk@163.com

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-07-05 11:22:40 -04:00
Jisheng Zhang
5130213721 tick/broadcast-hrtimer: Set name of the ce_broadcast_hrtimer
This is to avoid the "null" name when we either

~ # cat /sys/devices/system/clockevents/broadcast/current_device
(null)

or

~ # cat /proc/timer_list
...
Tick Device: mode:     1
Broadcast device
Clock Event Device: (null)
...

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1467709071-3667-1-git-send-email-jszhang@marvell.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-05 17:02:19 +02:00