23437 Commits

Author SHA1 Message Date
Ingo Molnar
950d8381d9 Merge branch 'linus' into timers/core, to refresh the branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08 14:05:16 +02:00
Arnd Bergmann
f1e4ba5b6a perf, bpf: fix conditional call to bpf_overflow_handler
The newly added bpf_overflow_handler function is only built of both
CONFIG_EVENT_TRACING and CONFIG_BPF_SYSCALL are enabled, but the caller
only checks the latter:

kernel/events/core.c: In function 'perf_event_alloc':
kernel/events/core.c:9106:27: error: 'bpf_overflow_handler' undeclared (first use in this function)

This changes the caller so we also skip this call if CONFIG_EVENT_TRACING
is disabled entirely.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: aa6a5f3cb2b2 ("perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs")
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-06 16:34:14 -07:00
Sebastian Andrzej Siewior
c4544dbc7a kernel/softirq: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160818125731.27256-7-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 18:30:22 +02:00
Sebastian Andrzej Siewior
6731d4f123 slab: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Link: http://lkml.kernel.org/r/20160823125319.abeapfjapf2kfezp@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 18:30:20 +02:00
Richard Weinberger
e6d4989a9a relayfs: Convert to hotplug state machine
Install the callbacks via the state machine. They are installed at run time but
relay_prepare_cpu() does not need to be invoked by the boot CPU because
relay_open() was not yet invoked and there are no pools that need to be created.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20160818125731.27256-3-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 18:30:20 +02:00
Akash Goel
017c59c042 relay: Use per CPU constructs for the relay channel buffer pointers
relay essentially needs to maintain a per CPU array of channel buffer
pointers but it manually creates that array.  Instead its better to use
the per CPU constructs, provided by the kernel, to allocate & access the
array of pointer to channel buffers.

Signed-off-by: Akash Goel <akash.goel@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: http://lkml.kernel.org/r/1470909140-25919-1-git-send-email-akash.goel@intel.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 18:30:19 +02:00
Thomas Gleixner
ee1e714b94 cpu/hotplug: Remove CPU_STARTING and CPU_DYING notifier
All users are converted to state machine, remove CPU_STARTING and the
corresponding CPU_DYING.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160818125731.27256-2-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 18:30:19 +02:00
Thomas Gleixner
677f664653 cpu/hotplug: Make state names consistent
We should have all names in the scheme "[subsys/]facility:state]". Fix the
core to comply.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 16:20:28 +02:00
Alexander Kuleshov
00b992deaa genirq: No need to mask non trigger mode flags before __irq_set_trigger()
Some callers of __irq_set_trigger() masks all flags except trigger mode
flags. This is unnecessary, ase __irq_set_trigger() already does this
before usage of flags.

[ tglx: Moved the flag mask and adjusted comment. Removed the hunk in
  	enable_percpu_irq() as it is required there ]

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Link: http://lkml.kernel.org/r/20160719095408.13778-1-kuleshovmail@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 12:14:12 +02:00
Thomas Gleixner
e8b61b3f2c futex: Add some more function commentry
Add some more comments and reformat existing ones to kernel doc style.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Link: http://lkml.kernel.org/r/1464770609-30168-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-05 17:20:18 +02:00
Punit Agrawal
545d5d657b genirq: Update stale comment for __irq_domain_add
Commit 1bf4ddc46c5d ("irqdomain: Introduce irq_domain_create_{linear,
tree}") introduced the use of fwnode_handle to identify the interrupt
controller when calling __irq_domain_add but missed updating the kernel
doc parameters for the function.

Update this comment. While we are touching this code, also consolidate
the declaration and assignment of of_node.

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Marc Zygnier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1464699409-23113-1-git-send-email-punit.agrawal@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-05 17:11:10 +02:00
Thomas Gleixner
3c1627e999 cpu/hotplug: Replace anon union
Some compilers are unhappy with the anon union in the state array. Replace
it with a named union.

While at it align the state array initializers proper and add the missing
name tags.

Fixes: cf392d10b69e "cpu/hotplug: Add multi instance support"
Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Fenguang Wu <fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: rt@linutronix.de
2016-09-05 15:32:58 +02:00
Tejun Heo
c86d06ba28 PM / QoS: avoid calling cancel_delayed_work_sync() during early boot
of_clk_init() ends up calling into pm_qos_update_request() very early
during boot where irq is expected to stay disabled.
pm_qos_update_request() uses cancel_delayed_work_sync() which
correctly assumes that irq is enabled on invocation and
unconditionally disables and re-enables it.

Gate cancel_delayed_work_sync() invocation with kevented_up() to avoid
enabling irq unexpectedly during early boot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Qiao Zhou <qiaozhou@asrmicro.com>
Link: http://lkml.kernel.org/r/d2501c4c-8e7b-bea3-1b01-000b36b5dfe9@asrmicro.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-09-05 15:07:53 +02:00
Juergen Gross
df8ce9d78a smp: Add function to execute a function synchronously on a CPU
On some hardware models (e.g. Dell Studio 1555 laptop) some hardware
related functions (e.g. SMIs) are to be executed on physical CPU 0
only. Instead of open coding such a functionality multiple times in
the kernel add a service function for this purpose. This will enable
the possibility to take special measures in virtualized environments
like Xen, too.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas_Warzecha@dell.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akataria@vmware.com
Cc: boris.ostrovsky@oracle.com
Cc: chrisw@sous-sol.org
Cc: david.vrabel@citrix.com
Cc: hpa@zytor.com
Cc: jdelvare@suse.com
Cc: jeremy@goop.org
Cc: linux@roeck-us.net
Cc: pali.rohar@gmail.com
Cc: rusty@rustcorp.com.au
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1472453327-19050-4-git-send-email-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:52:39 +02:00
Juergen Gross
47ae4b05d0 virt, sched: Add generic vCPU pinning support
Add generic virtualization support for pinning the current vCPU to a
specified physical CPU. As this operation isn't performance critical
(a very limited set of operations like BIOS calls and SMIs is expected
to need this) just add a hypervisor specific indirection.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas_Warzecha@dell.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akataria@vmware.com
Cc: boris.ostrovsky@oracle.com
Cc: chrisw@sous-sol.org
Cc: david.vrabel@citrix.com
Cc: hpa@zytor.com
Cc: jdelvare@suse.com
Cc: jeremy@goop.org
Cc: linux@roeck-us.net
Cc: pali.rohar@gmail.com
Cc: rusty@rustcorp.com.au
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1472453327-19050-3-git-send-email-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:52:38 +02:00
Josh Poimboeuf
4fa8d299b4 sched/debug: Remove several CONFIG_SCHEDSTATS guards
Clean up the sched code by removing several of the CONFIG_SCHEDSTATS
guards, using schedstat_*() macros where needed.

Code size:

  !CONFIG_SCHEDSTATS defconfig:

      text	   data	    bss	     dec	    hex	filename
  10209818	4368184	1105920	15683922	 ef5152	vmlinux.before.nostats
  10209818	4368184	1105920	15683922	 ef5152	vmlinux.after.nostats

  CONFIG_SCHEDSTATS defconfig:

      text	   data	    bss	    dec	    hex	filename
  10214210	4370040	1105920	15690170	 ef69ba	vmlinux.before.stats
  10214210	4370680	1105920	15690810	 ef6c3a	vmlinux.after.stats

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e51e0ebe5af95ac295de720dd252e7c0d2142e4a.1466184592.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:47 +02:00
Josh Poimboeuf
20e1d4863b sched/debug: Rename 'schedstat_val()' -> 'schedstat_val_or_zero()'
The schedstat_val() macro's behavior is kind of surprising: when
schedstat is runtime disabled, it returns zero.  Rename it to
schedstat_val_or_zero().

There's also a need for a similar macro which doesn't have the 'if
(schedstat_enable())' check, to avoid doing the check twice.  Create a
new 'schedstat_val()' macro for that.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/3bb1d2367d041fee333b0dde17171e709395b675.1466184592.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:46 +02:00
Josh Poimboeuf
ae92882e56 sched/debug: Clean up schedstat macros
The schedstat_*() macros are inconsistent: most of them take a pointer
and a field which the macro combines, whereas schedstat_set() takes the
already combined ptr->field.

The already combined ptr->field argument is actually more intuitive and
easier to use, and there's no reason to require the user to split the
variable up, so convert the macros to use the combined argument.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/54953ca25bb579f3a5946432dee409b0e05222c6.1466184592.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:46 +02:00
Josh Poimboeuf
1a3d027c5a sched/debug: Rename and move enqueue_sleeper()
enqueue_sleeper() doesn't actually enqueue, it just handles some
statistics and tracepoints.  Rename it to update_stats_enqueue_sleeper()
and call it from update_stats_enqueue().

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/fb20b7159dc4d028c406c0e8d5f8c439b741615b.1466184592.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:45 +02:00
Wanpeng Li
61c7aca695 sched/deadline: Fix the intention to re-evalute tick dependency for offline CPU
The dl task will be replenished after dl task timer fire and start a
new period. It will be enqueued and to re-evaluate its dependency on
the tick in order to restart it. However, if the CPU is hot-unplugged,
irq_work_queue will splash since the target CPU is offline.

As a result we get:

    WARNING: CPU: 2 PID: 0 at kernel/irq_work.c:69 irq_work_queue_on+0xad/0xe0
    Call Trace:
     dump_stack+0x99/0xd0
     __warn+0xd1/0xf0
     warn_slowpath_null+0x1d/0x20
     irq_work_queue_on+0xad/0xe0
     tick_nohz_full_kick_cpu+0x44/0x50
     tick_nohz_dep_set_cpu+0x74/0xb0
     enqueue_task_dl+0x226/0x480
     activate_task+0x5c/0xa0
     dl_task_timer+0x19b/0x2c0
     ? push_dl_task.part.31+0x190/0x190

This can be triggered by hot-unplugging the full dynticks CPU which dl
task is running on.

We enqueue the dl task on the offline CPU, because we need to do
replenish for start_dl_timer(). So, as Juri pointed out, we would
need to do is calling replenish_dl_entity() directly, instead of
enqueue_task_dl(). pi_se shouldn't be a problem as the task shouldn't
be boosted if it was throttled.

This patch fixes it by avoiding the whole enqueue+dequeue+enqueue story, by
first migrating (set_task_cpu()) and then doing 1 enqueue.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1472639264-3932-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:45 +02:00
seokhoon.yoon
efca03ecbe schedcore: Remove duplicated init_task's preempt_notifiers init
init_task's preempt_notifiers is initialized twice:

1) sched_init()
   -> INIT_HLIST_HEAD(&init_task.preempt_notifiers)

2) sched_init()
   -> init_idle(current,) <--- current task is init_task at this time
    -> __sched_fork(,current)
     -> INIT_HLIST_HEAD(&p->preempt_notifiers)

I think the first one is unnecessary, so remove it.

Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1471339568-5790-1-git-send-email-iamyooon@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:44 +02:00
Dietmar Eggemann
2665621506 sched/fair: Fix load_above_capacity fixed point arithmetic width
Since commit:

  2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")

we now have two different fixed point units for load.

load_above_capacity has to have 10 bits fixed point unit like PELT,
whereas NICE_0_LOAD has 20 bit fixed point unit on 64-bit kernels.

Fix this by scaling down NICE_0_LOAD when multiplying
load_above_capacity with it.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1470824847-5316-1-git-send-email-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:44 +02:00
Tommaso Cucinotta
d8206bb3ff sched/deadline: Split cpudl_set() into cpudl_set() and cpudl_clear()
These 2 exercise independent code paths and need different arguments.

After this change, you call:

  cpudl_clear(cp, cpu);
  cpudl_set(cp, cpu, dl);

instead of:

  cpudl_set(cp, cpu, 0 /* dl */, 0 /* is_valid */);
  cpudl_set(cp, cpu, dl, 1 /* is_valid */);

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-dl@retis.sssup.it
Link: http://lkml.kernel.org/r/1471184828-12644-4-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:43 +02:00
Tommaso Cucinotta
8e1bc301aa sched/deadline: Make CPU heap faster avoiding real swaps on heapify
This change goes from heapify() ops done by swapping with parent/child
so that the item to fix moves along, to heapify() ops done by just
pulling the parent/child chain by 1 pos, then storing the item to fix
just at the end. On a non-trivial heapify(), this performs roughly half
stores wrt swaps.

This has been measured to achieve up to 10% of speed-up for cpudl_set()
calls, with a randomly generated workload of 1K,10K,100K random heap
insertions and deletions (75% cpudl_set() calls with is_valid=1 and
25% with is_valid=0), and randomly generated cpu IDs, with up to 256
CPUs, as measured on an Intel Core2 Duo.

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-dl@retis.sssup.it
Link: http://lkml.kernel.org/r/1471184828-12644-3-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:43 +02:00
Tommaso Cucinotta
126b3b6842 sched/deadline: Refactor CPU heap code
1. heapify up factored out in new dedicated function heapify_up()
    (avoids repetition of same code)

 2. call to cpudl_change_key() replaced with heapify_up() when
    cpudl_set actually inserts a new node in the heap

 3. cpudl_change_key() replaced with heapify() that heapifies up
    or down as needed.

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Reviewed-by: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-dl@retis.sssup.it
Link: http://lkml.kernel.org/r/1471184828-12644-2-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:42 +02:00
Byungchul Park
97a7142f15 sched/fair: Make update_min_vruntime() more readable
The update_min_vruntime() control flow can be simplified.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: minchan.kim@lge.com
Link: http://lkml.kernel.org/r/1436088829-25768-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:29:41 +02:00
Ingo Molnar
62cc20bcf2 Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:24:11 +02:00
Will Deacon
c9bbdd4830 perf/core: Don't pass PERF_EF_START to the PMU ->start callback
PERF_EF_START is a flag to indicate to the PMU ->add() callback that, as
well as claiming the PMU resources required by the event being added,
it should also start the PMU.

Passing this flag to the ->start() callback doesn't make sense, because
->start() always tries to start the PMU. Remove it.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1471257765-29662-1-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 13:19:18 +02:00
Ingo Molnar
2cc538412a Merge branch 'perf/urgent' into perf/core, to pick up fixed and resolve conflicts
Conflicts:
	kernel/events/core.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 12:09:59 +02:00
Balbir Singh
135e8c9250 sched/core: Fix a race between try_to_wake_up() and a woken up task
The origin of the issue I've seen is related to
a missing memory barrier between check for task->state and
the check for task->on_rq.

The task being woken up is already awake from a schedule()
and is doing the following:

	do {
		schedule()
		set_current_state(TASK_(UN)INTERRUPTIBLE);
	} while (!cond);

The waker, actually gets stuck doing the following in
try_to_wake_up():

	while (p->on_cpu)
		cpu_relax();

Analysis:

The instance I've seen involves the following race:

 CPU1					CPU2

 while () {
   if (cond)
     break;
   do {
     schedule();
     set_current_state(TASK_UN..)
   } while (!cond);
					wakeup_routine()
					  spin_lock_irqsave(wait_lock)
   raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
 }					  try_to_wake_up()
 set_current_state(TASK_RUNNING);	  ..
 list_del(&waiter.list);

CPU2 wakes up CPU1, but before it can get the wait_lock and set
current state to TASK_RUNNING the following occurs:

 CPU3
 wakeup_routine()
 raw_spin_lock_irqsave(wait_lock)
 if (!list_empty)
   wake_up_process()
   try_to_wake_up()
   raw_spin_lock_irqsave(p->pi_lock)
   ..
   if (p->on_rq && ttwu_wakeup())
   ..
   while (p->on_cpu)
     cpu_relax()
   ..

CPU3 tries to wake up the task on CPU1 again since it finds
it on the wait_queue, CPU1 is spinning on wait_lock, but immediately
after CPU2, CPU3 got it.

CPU3 checks the state of p on CPU1, it is TASK_UNINTERRUPTIBLE and
the task is spinning on the wait_lock. Interestingly since p->on_rq
is checked under pi_lock, I've noticed that try_to_wake_up() finds
p->on_rq to be 0. This was the most confusing bit of the analysis,
but p->on_rq is changed under runqueue lock, rq_lock, the p->on_rq
check is not reliable without this fix IMHO. The race is visible
(based on the analysis) only when ttwu_queue() does a remote wakeup
via ttwu_queue_remote. In which case the p->on_rq change is not
done uder the pi_lock.

The result is that after a while the entire system locks up on
the raw_spin_irqlock_save(wait_lock) and the holder spins infintely

Reproduction of the issue:

The issue can be reproduced after a long run on my system with 80
threads and having to tweak available memory to very low and running
memory stress-ng mmapfork test. It usually takes a long time to
reproduce. I am trying to work on a test case that can reproduce
the issue faster, but thats work in progress. I am still testing the
changes on my still in a loop and the tests seem OK thus far.

Big thanks to Benjamin and Nick for helping debug this as well.
Ben helped catch the missing barrier, Nick caught every missing
bit in my theory.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[ Updated comment to clarify matching barriers. Many
  architectures do not have a full barrier in switch_to()
  so that cannot be relied upon. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 11:57:53 +02:00
Peter Zijlstra
5876314875 perf/core: Remove WARN from perf_event_read()
This effectively reverts commit:

  71e7bc2bab77 ("perf/core: Check return value of the perf_event_read() IPI")

... and puts in a comment explaining why we ignore the return value.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 71e7bc2bab77 ("perf/core: Check return value of the perf_event_read() IPI")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-05 11:55:00 +02:00
Linus Torvalds
1c3333600b Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Two fixlet from the timers departement:

   - A fix for scheduler stalls in the tick idle code affecting
     NOHZ_FULL kernels

   - A trivial compile fix"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/nohz: Fix softlockup on scheduler stalls in kvm guest
  clocksource/drivers/atmel-pit: Fix compilation error
2016-09-04 08:43:45 -07:00
Lianwei Wang
01b4115906 cpu/hotplug: Handle unbalanced hotplug enable/disable
When cpu_hotplug_enable() is called unbalanced w/o a preceeding
cpu_hotplug_disable() the code emits a warning, but happily decrements the
disabled counter. This causes the next operations to malfunction.

Prevent the decrement and just emit a warning.

Signed-off-by: Lianwei Wang <lianwei.wang@gmail.com>
Cc: peterz@infradead.org
Cc: linux-pm@vger.kernel.org
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/1465541008-12476-1-git-send-email-lianwei.wang@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 20:37:17 +02:00
Sebastian Frias
f88eecfe2f genirq/generic_chip: Verify irqs_per_chip <= 32
Most (if not all) code here implicitly assumes that the maximum number of
IRQs per chip will be 32, and thus uses 'u32' or 'unsigned long' for many
tasks (for example "struct irq_data" declares its 'mask' field as 'u32',
and "struct irq_chip_generic" declares its 'installed' field as 'unsigned
long')

However, there is no check to verify that irqs_per_chip is <= 32.  Hence,
calling irq_alloc_domain_generic_chips() with a bigger value will result in
unexpected results.

Provide a wrapper with a MAYBE_BUILD_BUG_ON(nrirqs >= 32) to catch such
cases.

[ tglx: Reduced changelog to the essential information ]

Signed-off-by: Sebastian Frias <sf84@laposte.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mason <slash.tmp@free.fr>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/57B31D94.5040701@laposte.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 20:20:59 +02:00
Thomas Gleixner
cf392d10b6 cpu/hotplug: Add multi instance support
This patch adds the ability for a given state to have multiple
instances. Until now all states have a single instance and the startup /
teardown callback use global variables.
A few drivers need to perform a the same callbacks on multiple
"instances". Currently we have three drivers in tree which all have a
global list which they iterate over. With multi instance they support
don't need their private list and the functionality has been moved into
core code. Plus we hold the hotplug lock in core so no cpus comes/goes
while instances are registered and we do rollback in error case :)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/1471024183-12666-3-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 20:05:05 +02:00
Thomas Gleixner
a724632ca0 cpu/hotplug: Rework callback invocation logic
This is preparation for the following patch.
This rework here changes the arguments of cpuhp_invoke_callback(). It
passes now `state' and whether `startup' or `teardown' callback should
be invoked. The callback then is looked up by the function.

The following is a clanup of callers:
- cpuhp_issue_call() has one argument less
- struct cpuhp_cpu_state (which is used by the hotplug thread) gets also
  its callback removed. The decision if it is a single callback
  invocation moved to the `single' variable. Also a `bringup' variable
  has been added to distinguish between startup and teardown callback.
- take_cpu_down() needs to start one step earlier. We always get here
  via CPUHP_TEARDOWN_CPU callback. Before that change cpuhp_ap_states +
  CPUHP_TEARDOWN_CPU pointed to an empty entry because TEARDOWN is saved
  in bp_states for this reason. Now that we use cpuhp_get_step() to
  lookup the state we must explicitly skip it in order not to invoke it
  twice.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/1471024183-12666-2-git-send-email-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 20:05:05 +02:00
Alexei Starovoitov
aa6a5f3cb2 perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
via overflow_handler mechanism.
When program is attached the overflow_handlers become stacked.
The program acts as a filter.
Returning zero from the program means that the normal perf_event_output handler
will not be called and sampling event won't be stored in the ring buffer.

The overflow_handler_context==NULL is an additional safety check
to make sure programs are not attached to hw breakpoints and watchdog
in case other checks (that prevent that now anyway) get accidentally
relaxed in the future.

The program refcnt is incremented in case perf_events are inhereted
when target task is forked.
Similar to kprobe and tracepoint programs there is no ioctl to
detach the program or swap already attached program. The user space
expected to close(perf_event_fd) like it does right now for kprobe+bpf.
That restriction simplifies the code quite a bit.

The invocation of overflow_handler in __perf_event_overflow() is now
done via READ_ONCE, since that pointer can be replaced when the program
is attached while perf_event itself could have been active already.
There is no need to do similar treatment for event->prog, since it's
assigned only once before it's accessed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 10:46:44 -07:00
Alexei Starovoitov
fdc15d388d bpf: perf_event progs should only use preallocated maps
Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
preallocated hash maps, since doing memory allocation
in overflow_handler can crash depending on where nmi got triggered.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 10:46:44 -07:00
Alexei Starovoitov
0515e5999a bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type
Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
correspondingly in uapi/linux/perf_event.h)

The program visible context meta structure is
struct bpf_perf_event_data {
    struct pt_regs regs;
     __u64 sample_period;
};
which is accessible directly from the program:
int bpf_prog(struct bpf_perf_event_data *ctx)
{
  ... ctx->sample_period ...
  ... ctx->regs.ip ...
}

The bpf verifier rewrites the accesses into kernel internal
struct bpf_perf_event_data_kern which allows changing
struct perf_sample_data without affecting bpf programs.
New fields can be added to the end of struct bpf_perf_event_data
in the future.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 10:46:44 -07:00
Alexei Starovoitov
ea2e7ce5d0 bpf: support 8-byte metafield access
The verifier supported only 4-byte metafields in
struct __sk_buff and struct xdp_md. The metafields in upcoming
struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
Teach verifier to recognize 8-byte metafield access.
The patch doesn't affect safety of sockets and xdp programs.
They check for 4-byte only ctx access before these conditions are hit.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 10:46:44 -07:00
Steven Rostedt (Red Hat)
7b2c862501 tracing: Add NMI tracing in hwlat detector
As NMIs can also cause latency when interrupts are disabled, the hwlat
detectory has no way to know if the latency it detects is from an NMI or an
SMI or some other hardware glitch.

As ftrace_nmi_enter/exit() funtions are no longer used (except for sh, which
isn't supported anymore), I converted those to "arch_ftrace_nmi_enter/exit"
and use ftrace_nmi_enter/exit() to check if hwlat detector is tracing or
not, and if so, it calls into the hwlat utility.

Since the hwlat detector only has a single kthread that is spinning with
interrupts disabled, it marks what CPU it is on, and if the NMI callback
happens on that CPU, it records the time spent in that NMI. This is added to
the output that is generated by the hwlat detector as:

 #3     inner/outer(us):    9/9     ts:1470836488.206734548
 #4     inner/outer(us):    0/8     ts:1470836497.140808588
 #5     inner/outer(us):    0/6     ts:1470836499.140825168 nmi-total:5 nmi-count:1
 #6     inner/outer(us):    9/9     ts:1470836501.140841748

All time is still tracked in microseconds.

The NMI information is only shown when an NMI occurred during the sample.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-02 12:47:55 -04:00
Steven Rostedt (Red Hat)
0330f7aa8e tracing: Have hwlat trace migrate across tracing_cpumask CPUs
Instead of having the hwlat detector thread stay on one CPU, have it migrate
across all the CPUs specified by tracing_cpumask. If the user modifies the
thread's CPU affinity, the migration will stop until the next instance that
the tracer is instantiated. The migration happens at the end of each window
(period).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-02 12:47:54 -04:00
Steven Rostedt (Red Hat)
e7c15cd8a1 tracing: Added hardware latency tracer
The hardware latency tracer has been in the PREEMPT_RT patch for some time.
It is used to detect possible SMIs or any other hardware interruptions that
the kernel is unaware of. Note, NMIs may also be detected, but that may be
good to note as well.

The logic is pretty simple. It simply creates a thread that spins on a
single CPU for a specified amount of time (width) within a periodic window
(window). These numbers may be adjusted by their cooresponding names in

   /sys/kernel/tracing/hwlat_detector/

The defaults are window = 1000000 us (1 second)
                 width  =  500000 us (1/2 second)

The loop consists of:

	t1 = trace_clock_local();
	t2 = trace_clock_local();

Where trace_clock_local() is a variant of sched_clock().

The difference of t2 - t1 is recorded as the "inner" timestamp and also the
timestamp  t1 - prev_t2 is recorded as the "outer" timestamp. If either of
these differences are greater than the time denoted in
/sys/kernel/tracing/tracing_thresh then it records the event.

When this tracer is started, and tracing_thresh is zero, it changes to the
default threshold of 10 us.

The hwlat tracer in the PREEMPT_RT patch was originally written by
Jon Masters. I have modified it quite a bit and turned it into a
tracer.

Based-on-code-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-09-02 12:47:51 -04:00
Sebastian Frias
0c228919e0 irqdomain: Mask irq type in irq_domain_xlate_onetwocell()
According to the xlate() callback definition, the 'out_type' parameter
needs to be the "linux irq type".

A mask for such bits exists, IRQ_TYPE_SENSE_MASK, which is correctly
applied in irq_domain_xlate_twocell()

So use it for irq_domain_xlate_onetwocell() as well.

Signed-off-by: Sebastian Frias <sf84@laposte.net>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mason <slash.tmp@free.fr>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/57A05F5D.103@laposte.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 18:06:50 +02:00
Sebastian Frias
ee26c013cd genirq/generic_chip: Add irq_unmap callback
Without this patch irq_domain_disassociate() cannot properly release the
interrupt. In fact, irq_map_generic_chip() checks a bit on 'gc->installed'
but said bit is never cleared, only set.

Commit 088f40b7b027 ("genirq: Generic chip: Add linear irq domain support")
added irq_map_generic_chip() function and also stated "This lacks a removal
function for now".

This commit provides an implementation of an unmap function that can be
called by irq_domain_disassociate().

[ tglx: Made the function static and removed the export as we have neither
  	a prototype nor a modular user. ]

Fixes: 088f40b7b027 ("genirq: Generic chip: Add linear irq domain support")
Signed-off-by: Sebastian Frias <sf84@laposte.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mason <slash.tmp@free.fr>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/579F5C5A.2070507@laposte.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 18:06:49 +02:00
Sebastian Frias
f0c450eaa3 genirq/generic_chip: Get rid of code duplication
irq_map_generic_chip() contains about the same code as
irq_get_domain_generic_chip() except for the return values.

Split out the irq_get_domain_generic_chip() implementation so it can be
reused.

[ tglx: Removed the extra churn in irq_get_domain_generic_chip() callers
  	and massaged changelog ]

Signed-off-by: Sebastian Frias <sf84@laposte.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Mason <slash.tmp@free.fr>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/579F5C69.8070006@laposte.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 18:06:49 +02:00
Thomas Gleixner
48e0fba842 genirq: Remove export of irq_map_generic_chip()
No module users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 18:06:49 +02:00
Thomas Gleixner
fc590c22f9 genirq: Robustify handle_percpu_devid_irq()
The percpu_devid handler is not robust against spurious interrupts. If a
spurious interrupt happens and no action is installed then the handler
crashes with a NULL pointer dereference.

Add a sanity check for this and log the wreckage once in dmesg.

Reported-by: Majun <majun258@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: guohanjun@huawei.com
Cc: dingtianhong@huawei.com
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1609021436160.5647@nanos
2016-09-02 18:06:49 +02:00
Wanpeng Li
08d0725992 tick/nohz: Fix softlockup on scheduler stalls in kvm guest
tick_nohz_start_idle() is prevented to be called if the idle tick can't 
be stopped since commit 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle 
enter"). As a result, after suspend/resume the host machine, full dynticks 
kvm guest will softlockup:

 NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
 Call Trace:
  default_idle+0x31/0x1a0
  arch_cpu_idle+0xf/0x20
  default_idle_call+0x2a/0x50
  cpu_startup_entry+0x39b/0x4d0
  rest_init+0x138/0x140
  ? rest_init+0x5/0x140
  start_kernel+0x4c1/0x4ce
  ? set_init_arg+0x55/0x55
  ? early_idt_handler_array+0x120/0x120
  x86_64_start_reservations+0x24/0x26
  x86_64_start_kernel+0x142/0x14f

In addition, cat /proc/stat | grep cpu in guest or host:

cpu  398 16 5049 15754 5490 0 1 46 0 0
cpu0 206 5 450 0 0 0 1 14 0 0
cpu1 81 0 3937 3149 1514 0 0 9 0 0
cpu2 45 6 332 6052 2243 0 0 11 0 0
cpu3 65 2 328 6552 1732 0 0 11 0 0

The idle and iowait states are weird 0 for cpu0(housekeeping). 

The bug is present in both guest and host kernels, and they both have 
cpu0's idle and iowait states issue, however, host kernel's suspend/resume 
path etc will touch watchdog to avoid the softlockup.

- The watchdog will not be touched in tick_nohz_stop_idle path (need be 
  touched since the scheduler stall is expected) if idle_active flags are 
  not detected.
- The idle and iowait states will not be accounted when exit idle loop 
  (resched or interrupt) if idle start time and idle_active flags are 
  not set. 

This patch fixes it by reverting commit 1f3b0f8243cb934 since can't stop 
idle tick doesn't mean can't be idle.

Fixes: 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter")
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Sanjeev Yadav<sanjeev.yadav@spreadtrum.com>
Cc: Gaurav Jindal<gaurav.jindal@spreadtrum.com>
Cc: stable@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-02 10:25:40 +02:00
Linus Torvalds
b9677faf45 Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "14 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  rapidio/tsi721: fix incorrect detection of address translation condition
  rapidio/documentation/mport_cdev: add missing parameter description
  kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd
  MAINTAINERS: Vladimir has moved
  mm, mempolicy: task->mempolicy must be NULL before dropping final reference
  printk/nmi: avoid direct printk()-s from __printk_nmi_flush()
  treewide: remove references to the now unnecessary DEFINE_PCI_DEVICE_TABLE
  drivers/scsi/wd719x.c: remove last declaration using DEFINE_PCI_DEVICE_TABLE
  mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator
  lib/test_hash.c: fix warning in preprocessor symbol evaluation
  lib/test_hash.c: fix warning in two-dimensional array init
  kconfig: tinyconfig: provide whole choice blocks to avoid warnings
  kexec: fix double-free when failing to relocate the purgatory
  mm, oom: prevent premature OOM killer invocation for high order request
2016-09-01 18:23:22 -07:00