22211 Commits

Author SHA1 Message Date
Rafael J. Wysocki
02fd4e76a6 Merge branches 'acpi-pci', 'acpi-soc' and 'pnp'
* acpi-pci:
  x86/ACPI/PCI: Recognize that Interrupt Line 255 means "not connected"

* acpi-soc:
  i2c: designware: Add device HID for future AMD I2C controller

* pnp:
  PNP / ACPI: add ACPI_RESOURCE_TYPE_SERIAL_BUS as a valid type
2016-03-14 14:20:57 +01:00
Thomas Gleixner
d10ef6f938 cpu/hotplug: Document states better
Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-12 20:57:38 +01:00
Anna-Maria Gleixner
22aceb3176 workqueue: Fix comment for work_on_cpu()
Function is processed in thread context, not in user context.

Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-11 12:39:01 -05:00
Thomas Gleixner
2a58c527bb cpu/hotplug: Fix smpboot thread ordering
Commit 931ef163309e moved the smpboot thread park/unpark invocation to the
state machine. The move of the unpark invocation was premature as it depends
on work in progress patches.

As a result cpu down can fail, because rcu synchronization in takedown_cpu()
eventually requires a functional softirq thread. I never encountered the
problem in testing, but 0day testing managed to provide a reliable reproducer.

Remove the smpboot_threads_park() call from the state machine for now and put
it back into the original place after the rcu synchronization.

I'm embarrassed as I knew about the dependency and still managed to get it
wrong. Hotplug induced brain melt seems to be the only sensible explanation
for that.

Fixes: 931ef163309e "cpu/hotplug: Unpark smpboot threads from the state machine"
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-03-10 21:21:59 +01:00
Rafael J. Wysocki
a5acbfbd70 Merge branch 'pm-cpufreq-governor' into pm-cpufreq 2016-03-10 20:46:03 +01:00
Rafael J. Wysocki
adaf9fcd13 cpufreq: Move scheduler-related code to the sched directory
Create cpufreq.c under kernel/sched/ and move the cpufreq code
related to the scheduler to that file and to sched.h.

Redefine cpufreq_update_util() as a static inline function to avoid
function calls at its call sites in the scheduler code (as suggested
by Peter Zijlstra).

Also move the definition of struct update_util_data and declaration
of cpufreq_set_update_util_data() from include/linux/cpufreq.h to
include/linux/sched.h.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2016-03-10 20:44:47 +01:00
Quan Nguyen
52b2a05fa7 genirq: Export IRQ functions for module use
Export irq_chip_*_parent(), irq_domain_create_hierarchy(),
irq_domain_set_hwirq_and_chip(), irq_domain_reset_irq_data(),
irq_domain_alloc/free_irqs_parent()

So gpio drivers can be built as modules. First user: gpio-xgene-sb

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Cc: Phong Vo <pvo@apm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: patches@apm.com
Cc: Loc Ho <lho@apm.com>
Cc: Keyur Chudgar <kchudgar@apm.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Link: https://lists.01.org/pipermail/kbuild-all/2016-February/017914.html
Link: http://lkml.kernel.org/r/1457017012-10628-1-git-send-email-qnguyen@apm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-10 16:00:35 +01:00
Davidlohr Bueso
38460a2178 locking/csd_lock: Use smp_cond_acquire() in csd_lock_wait()
We can micro-optimize this call and mildly relax the
barrier requirements by relying on ctrl + rmb, keeping
the acquire semantics. In addition, this is pretty much
the now standard for busy-waiting under such restraints.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1457574936-19065-3-git-send-email-dbueso@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-10 10:28:35 +01:00
Davidlohr Bueso
90d1098478 locking/csd_lock: Explicitly inline csd_lock*() helpers
While the compiler tends to already to it for us (except for
csd_unlock), make it explicit. These helpers mainly deal with
the ->flags, are short-lived  and can be called, for example,
from smp_call_function_many().

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Link: http://lkml.kernel.org/r/1457574936-19065-2-git-send-email-dbueso@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-10 10:28:35 +01:00
Ingo Molnar
6cbe9e4a22 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-10 10:28:27 +01:00
Alexei Starovoitov
cdc4e47da8 bpf: avoid copying junk bytes in bpf_get_current_comm()
Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
the result is typically passed to print("%s", buf) and extra bytes
after zero don't cause any harm.
In bpf the result of bpf_get_current_comm() is used as the part of
map key and was causing spurious hash map mismatches.
Use strlcpy() to guarantee zero-terminated string.
bpf verifier checks that output buffer is zero-initialized,
so even for short task names the output buffer don't have junk bytes.
Note it's not a security concern, since kprobe+bpf is root only.

Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
Reported-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09 23:27:30 -05:00
Alexei Starovoitov
b8cdc05173 bpf: bpf_stackmap_copy depends on CONFIG_PERF_EVENTS
0-day bot reported build error:
kernel/built-in.o: In function `map_lookup_elem':
>> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to `bpf_stackmap_copy'
when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not.
Add weak definition to resolve it.
This code path in map_lookup_elem() is never taken
when CONFIG_PERF_EVENTS is not set.

Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09 23:26:51 -05:00
Ard Biesheuvel
ac343e882a memremap: check pfn validity before passing to pfn_to_page()
In memremap's helper function try_ram_remap(), we dereference a struct
page pointer that was derived from a PFN that is known to be covered by
a 'System RAM' iomem region, and is thus assumed to be a 'valid' PFN,
i.e., a PFN that has a struct page associated with it and is covered by
the kernel direct mapping.

However, the assumption that there is a 1:1 relation between the System
RAM iomem region and the kernel direct mapping is not universally valid
on all architectures, and on ARM and arm64, 'System RAM' may include
regions for which pfn_valid() returns false.

Generally speaking, both __va() and pfn_to_page() should only ever be
called on PFNs/physical addresses for which pfn_valid() returns true, so
add that check to try_ram_remap().

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Mark Rutland
e1b77c9298 sched/kasan: remove stale KASAN poison after hotplug
Functions which the compiler has instrumented for KASAN place poison on
the stack shadow upon entry and remove this poision prior to returning.

In the case of CPU hotplug, CPUs exit the kernel a number of levels deep
in C code.  Any instrumented functions on this critical path will leave
portions of the stack shadow poisoned.

When a CPU is subsequently brought back into the kernel via a different
path, depending on stackframe, layout calls to instrumented functions
may hit this stale poison, resulting in (spurious) KASAN splats to the
console.

To avoid this, clear any stale poison from the idle thread for a CPU
prior to bringing a CPU online.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Dan Williams
5f29a77cd9 mm: fix mixed zone detection in devm_memremap_pages
The check for whether we overlap "System RAM" needs to be done at
section granularity.  For example a system with the following mapping:

    100000000-37bffffff : System RAM
    37c000000-837ffffff : Persistent Memory

...is unable to use devm_memremap_pages() as it would result in two
zones colliding within a given section.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Dan Williams
d77a117e68 list: kill list_force_poison()
Given we have uninitialized list_heads being passed to list_add() it
will always be the case that those uninitialized values randomly trigger
the poison value.  Especially since a list_add() operation will seed the
stack with the poison value for later stack allocations to trip over.

For example, see these two false positive reports:

  list_add attempted on force-poisoned entry
  WARNING: at lib/list_debug.c:34
  [..]
  NIP [c00000000043c390] __list_add+0xb0/0x150
  LR [c00000000043c38c] __list_add+0xac/0x150
  Call Trace:
    __list_add+0xac/0x150 (unreliable)
    __down+0x4c/0xf8
    down+0x68/0x70
    xfs_buf_lock+0x4c/0x150 [xfs]

  list_add attempted on force-poisoned entry(0000000000000500),
   new->next == d0000000059ecdb0, new->prev == 0000000000000500
  WARNING: at lib/list_debug.c:33
  [..]
  NIP [c00000000042db78] __list_add+0xa8/0x140
  LR [c00000000042db74] __list_add+0xa4/0x140
  Call Trace:
    __list_add+0xa4/0x140 (unreliable)
    rwsem_down_read_failed+0x6c/0x1a0
    down_read+0x58/0x60
    xfs_log_commit_cil+0x7c/0x600 [xfs]

Fixes: commit 5c2c2587b132 ("mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Tested-by: Eryu Guan <eguan@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-09 15:43:42 -08:00
Petr Mladek
f995b5f720 livepatch: Fix the error message about unresolvable ambiguity
klp_find_callback() stops the search when sympos is not defined and
a second symbol of the same name is found. It means that the current
error message about the unresolvable ambiguity always prints "(2 matches)".

Let's remove this information. The total number of occurrences is
not much helpful. The author of the patch still must put a non-trivial
effort into searching the right position in the object file.

[jkosina@suse.cz: fixed grammar as suggested by Josh]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-03-09 21:58:41 +01:00
Toshi Kani
8095d0f225 resource: Export insert_resource and remove_resource
insert_resource() and remove_resouce() are called by producers
of resources, such as FW modules and bus drivers.  These modules
may be implemented as loadable modules.

Export insert_resource() and remove_resouce() so that they can
be called from such modules.

link: https://lkml.org/lkml/2016/3/8/872
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:07:20 -08:00
Toshi Kani
ff3cc952d3 resource: Add remove_resource interface
insert_resource() and insert_resource_conflict() are called
by resource producers to insert a new resource.  When there
is any conflict, they move conflicting resources down to the
children of the new resource.  There is no destructor of these
interfaces, however.

Add remove_resource(), which removes a resource previously
inserted by insert_resource() or insert_resource_conflict(),
and moves the children up to where they were before.

__release_resource() is changed to have @release_child, so
that this function can be used for remove_resource() as well.

Also add comments to clarify that these functions are intended
for producers of resources to avoid any confusion with
request/release_resource() for consumers.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:07:20 -08:00
Toshi Kani
4e0d8f7eff resource: Change __request_region to inherit from immediate parent
__request_region() sets 'flags' of a new resource from @parent
as it inherits the parent's attribute.  When a target resource
has a conflict, this function inserts the new resource entry
under the conflicted entry by updating @parent.  In this case,
the new resource entry needs to inherit attribute from the updated
parent.  This conflict is a typical case since __request_region()
is used to allocate a new resource from a specific resource range.

For instance, request_mem_region() calls __request_region() with
@parent set to &iomem_resource, which is the root entry of the
whole iomem range.  When this request results in inserting a new
entry "DEV-A" under "BUS-1", "DEV-A" needs to inherit from the
immediate parent "BUS-1" as it holds specific attribute for the
range.

root (&iomem_resource)
 :
 + "BUS-1"
    + "DEV-A"

Change __request_region() to set 'flags' and 'desc' of a new entry
from the immediate parent.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-09 11:07:20 -08:00
Rafael J. Wysocki
34e2c555f3 cpufreq: Add mechanism for registering utilization update callbacks
Introduce a mechanism by which parts of the cpufreq subsystem
("setpolicy" drivers or the core) can register callbacks to be
executed from cpufreq_update_util() which is invoked by the
scheduler's update_load_avg() on CPU utilization changes.

This allows the "setpolicy" drivers to dispense with their timers
and do all of the computations they need and frequency/voltage
adjustments in the update_load_avg() code path, among other things.

The update_load_avg() changes were suggested by Peter Zijlstra.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
2016-03-09 14:39:19 +01:00
Chen Fan
e237a55184 x86/ACPI/PCI: Recognize that Interrupt Line 255 means "not connected"
Per the x86-specific footnote to PCI spec r3.0, sec 6.2.4, the value 255 in
the Interrupt Line register means "unknown" or "no connection."
Previously, when we couldn't derive an IRQ from the _PRT, we fell back to
using the value from Interrupt Line as an IRQ.  It's questionable whether
we should do that at all, but the spec clearly suggests we shouldn't do it
for the value 255 on x86.

Calling request_irq() with IRQ 255 may succeed, but the driver won't
receive any interrupts.  Or, if IRQ 255 is shared with another device, it
may succeed, and the driver's ISR will be called at random times when the
*other* device interrupts.  Or it may fail if another device is using IRQ
255 with incompatible flags.  What we *want* is for request_irq() to fail
predictably so the driver can fall back to polling.

On x86, assume 255 in the Interrupt Line means the INTx line is not
connected.  In that case, set dev->irq to IRQ_NOTCONNECTED so request_irq()
will fail gracefully with -ENOTCONN.

We found this problem on a system where Secure Boot firmware assigned
Interrupt Line 255 to an i801_smbus device and another device was already
using MSI-X IRQ 255.  This was in v3.10, where i801_probe() fails if
request_irq() fails:

  i801_smbus 0000:00:1f.3: enabling device (0140 -> 0143)
  i801_smbus 0000:00:1f.3: can't derive routing for PCI INT C
  i801_smbus 0000:00:1f.3: PCI INT C: no GSI
  genirq: Flags mismatch irq 255. 00000080 (i801_smbus) vs. 00000000 (megasa)
  CPU: 0 PID: 2487 Comm: kworker/0:1 Not tainted 3.10.0-229.el7.x86_64 #1
  Hardware name: FUJITSU PRIMEQUEST 2800E2/D3736, BIOS PRIMEQUEST 2000 Serie5
  Call Trace:
    dump_stack+0x19/0x1b
    __setup_irq+0x54a/0x570
    request_threaded_irq+0xcc/0x170
    i801_probe+0x32f/0x508 [i2c_i801]
    local_pci_probe+0x45/0xa0
  i801_smbus 0000:00:1f.3: Failed to allocate irq 255: -16
  i801_smbus: probe of 0000:00:1f.3 failed with error -16

After aeb8a3d16ae0 ("i2c: i801: Check if interrupts are disabled"),
i801_probe() will fall back to polling if request_irq() fails.  But we
still need this patch because request_irq() may succeed or fail depending
on other devices in the system.  If request_irq() fails, i801_smbus will
work by falling back to polling, but if it succeeds, i801_smbus won't work
because it expects interrupts that it may not receive.

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-09 01:23:35 +01:00
Alexei Starovoitov
557c0c6e7d bpf: convert stackmap to pre-allocation
It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.

The call_rcu is no longer feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when number of actual
stacks is significantly less than user requested max_entries.
Since elements are no longer freed into slub, we can push elements into
freelist immediately and let them be recycled.
However the very unlikley race between user space map_lookup() and
program-side recycling is possible:
     cpu0                          cpu1
     ----                          ----
user does lookup(stackidX)
starts copying ips into buffer
                                   delete(stackidX)
                                   calls bpf_get_stackid()
				   which recyles the element and
                                   overwrites with new stack trace

To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
Now we can move memset(,0) of left-over element value from critical
path of bpf_get_stackid() into slow-path of user space lookup.
Also disallow lookup() from bpf program, since it's useless and
program shouldn't be messing with collected stack trace.

Note that similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.

Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Alexei Starovoitov
823707b68d bpf: check for reserved flag bits in array and stack maps
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Alexei Starovoitov
6c90598174 bpf: pre-allocate hash map elements
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.

The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work

At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.

Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.

While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.

Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.

Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.

Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.

Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida	2.1M
pcpu_ida nolock	2.3M
bt		2.4M
kmalloc		1.8M
hlist+spinlock	2.3M
pcpu_freelist	2.6M

4 cpu:
pcpu_ida	1.5M
pcpu_ida nolock	1.8M
bt w/smp_align	1.7M
bt no/smp_align	1.1M
kmalloc		0.7M
hlist+spinlock	0.2M
pcpu_freelist	2.0M

8 cpu:
pcpu_ida	0.7M
bt w/smp_align	0.8M
kmalloc		0.4M
pcpu_freelist	1.5M

32 cpu:
kmalloc		0.13M
pcpu_freelist	0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.

bt is a variant of block/blk-mq-tag.c simlified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist

hlist+spinlock is the simplest free list with single spinlock.
As expeceted it has very bad scaling in SMP.

kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Alexei Starovoitov
e19494edab bpf: introduce percpu_freelist
Introduce simple percpu_freelist to keep single list of elements
spread across per-cpu singly linked lists.

/* push element into the list */
void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);

/* pop element from the list */
struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);

The object is pushed to the current cpu list.
Pop first trying to get the object from the current cpu list,
if it's empty goes to the neigbour cpu list.

For bpf program usage pattern the collision rate is very low,
since programs push and pop the objects typically on the same cpu.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Alexei Starovoitov
b121d1e74d bpf: prevent kprobe+bpf deadlocks
if kprobe is placed within update or delete hash map helpers
that hold bucket spin lock and triggered bpf program is trying to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending existing recursion prevention mechanism.

Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursive check and ok as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:30 -05:00
David S. Miller
810813c47a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 12:34:12 -05:00
Tejun Heo
f6d635ad34 cgroup: implement cgroup_subsys->implicit_on_dfl
Some controllers, perf_event for now and possibly freezer in the
future, don't really make sense to control explicitly through
"cgroup.subtree_control".  For example, the primary role of perf_event
is identifying the cgroups of tasks; however, because the controller
also keeps a small amount of state per cgroup, it can't be replaced
with simple cgroup membership tests.

This patch implements cgroup_subsys->implicit_on_dfl flag.  When set,
the controller is implicitly enabled on all cgroups on the v2
hierarchy so that utility type controllers such as perf_event can be
enabled and function transparently.

An implicit controller doesn't show up in "cgroup.controllers" or
"cgroup.subtree_control", is exempt from no internal process rule and
can be stolen from the default hierarchy even if there are non-root
csses.

v2: Reimplemented on top of the recent updates to css handling and
    subsystem rebinding.  Rebinding implicit subsystems is now a
    simple matter of exempting it from the busy subsystem check.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00
Tejun Heo
e4857982f4 cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
Migration can be multi-target on the default hierarchy when a
controller is enabled - processes belonging to each child cgroup have
to be moved to the child cgroup itself to refresh css association.

This isn't a problem for cgroup_migrate_add_src() as each source
css_set still maps to single source and target cgroups; however,
cgroup_migrate_prepare_dst() is called once after all source css_sets
are added and thus might not have a single destination cgroup.  This
is currently worked around by specifying NULL for @dst_cgrp and using
the source's default cgroup as destination as the only multi-target
migration in use is self-targetting.  While this works, it's subtle
and clunky.

As all taget cgroups are already specified while preparing the source
css_sets, this clunkiness can easily be removed by recording the
target cgroup in each source css_set.  This patch adds
css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and
used by cgroup_migrate_prepare_dst().  This also makes migration code
ready for arbitrary multi-target migration.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00
Tejun Heo
37ff9f8f47 cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
On the default hierarchy, a migration can be multi-source and/or
multi-destination.  cgroup_taskest_migrate() used to incorrectly
assume single destination cgroup but the bug has been fixed by
1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration
from subtree_control enabling").

Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used
to determine which subsystems are affected or which cgroup_root the
migration is taking place in.  As such, @dst_cgrp is misleading.  This
patch replaces @dst_cgrp with @root.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:26 -05:00
Tejun Heo
6c694c8825 cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
cgroup_migrate_prepare_dst() verifies whether the destination cgroup
is allowable; however, the test doesn't really belong there.  It's too
deep and common in the stack and as a result the test itself is gated
by another test.

Separate the test out into cgroup_may_migrate_to() and update
cgroup_attach_task() and cgroup_transfer_tasks() to perform the test
directly.  This doesn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:25 -05:00
Tejun Heo
58cdb1ceb1 cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
cgroup_update_dfl_csses() should move each task in the subtree to
self; however, it was incorrectly calling cgroup_migrate_add_src()
with the root of the subtree as @dst_cgrp.  Fortunately,
cgroup_migrate_add_src() currently uses @dst_cgrp only to determine
the hierarchy and the bug doesn't cause any actual breakages.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2016-03-08 11:51:25 -05:00
Chunyu Hu
1cf8067b54 tracing: Fix typoes in code comment and printk in trace_nop.c
echo nop > /sys/kernel/debug/tracing/options/current_tracer
echo 1 > /sys/kernel/debug/tracing/options/test_nop_accept
echo 0 > /sys/kernel/debug/tracing/options/test_nop_accept
echo 1 > /sys/kernel/debug/tracing/options/test_nop_refuse

Before the fix, the dmesg is a bit ugly since a align issue.

[  191.973081] nop_test_accept flag set to 1: we accept. Now cat trace_options to see the result
[  195.156942] nop_test_refuse flag set to 1: we refuse.Now cat trace_options to see the result

After the fix, the dmesg will show aligned log for nop_test_refuse and nop_test_accept.

[ 2718.032413] nop_test_refuse flag set to 1: we refuse. Now cat trace_options to see the result
[ 2734.253360] nop_test_accept flag set to 1: we accept. Now cat trace_options to see the result

Link: http://lkml.kernel.org/r/1457444222-8654-2-git-send-email-chuhu@redhat.com

Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:23:57 -05:00
Steven Rostedt (Red Hat)
353206f5ca tracing: Use flags instead of bool in trigger structure
gcc isn't known for handling bool in structures. Instead of using bool, use
an integer mask and use bit flags instead.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:36 -05:00
Tom Zanussi
a88e1cfb1d tracing: Add an unreg_all() callback to trigger commands
Add a new unreg_all() callback that can be used to remove all
command-specific triggers from an event and arrange to have it called
whenever a trigger file is opened with O_TRUNC set.

Commands that don't want truncate semantics, or existing commands that
don't implement this function simply do nothing and their triggers
remain intact.

Link: http://lkml.kernel.org/r/2b7d62854d01f28c19185e1bbb8f826f385edfba.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:35 -05:00
Tom Zanussi
a5863dae84 tracing: Add needs_rec flag to event triggers
Add a new needs_rec flag for triggers that require unconditional
access to trace records in order to function.

Normally a trigger requires access to the contents of a trace record
only if it has a filter associated with it (since filters need the
contents of a record in order to make a filtering decision).  Some
types of triggers, such as 'hist' triggers, require access to trace
record contents independent of the presence of filters, so add a new
flag for those triggers.

Link: http://lkml.kernel.org/r/7be8fa38f9b90fdb6c47ca0f98d20a07b9fd512b.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:34 -05:00
Tom Zanussi
104f281044 tracing: Add a per-event-trigger 'paused' field
Add a simple per-trigger 'paused' flag, allowing individual triggers
to pause.  We could leave it to individual triggers that need this
functionality to do it themselves, but we also want to allow other
events to control pausing, so add it to the trigger data.

Link: http://lkml.kernel.org/r/fed37e4879684d7dcc57fe00ce0cbf170032b06d.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:33 -05:00
Tom Zanussi
dbfeaa7aba tracing: Add get_syscall_name()
Add a utility function to grab the syscall name from the syscall
metadata, given a syscall id.

Link: http://lkml.kernel.org/r/be26a8dfe3f15e16a837799f1c1e2b4d62742843.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:32 -05:00
Tom Zanussi
c4a5923055 tracing: Add event record param to trigger_ops.func()
Some triggers may need access to the trace event, so pass it in.  Also
fix up the existing trigger funcs and their callers.

Link: http://lkml.kernel.org/r/543e31e9fc445ef61077421ab219033401c39846.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:31 -05:00
Tom Zanussi
ab4bf00892 tracing: Make event trigger functions available
Make various event trigger utility functions available outside of
trace_events_trigger.c so that new triggers can be defined outside of
that file.

Link: http://lkml.kernel.org/r/4a40c1695dd43cac6cd475d72e13ffe30ba84bff.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:30 -05:00
Tom Zanussi
4ef56902fb tracing: Make ftrace_event_field checking functions available
Make is_string_field() and is_function_field() accessible outside of
trace_event_filters.c for other users of ftrace_event_fields.

Link: http://lkml.kernel.org/r/2d3f00d3311702e556e82eed7754bae6f017939f.1449767187.git.tom.zanussi@linux.intel.com

Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:29 -05:00
Chunyu Hu
d39cdd2036 tracing: Make tracer_flags use the right set_flag callback
When I was updating the ftrace_stress test of ltp. I encountered
a strange phenomemon, excute following steps:

echo nop > /sys/kernel/debug/tracing/current_tracer
echo 0 > /sys/kernel/debug/tracing/options/funcgraph-cpu
bash: echo: write error: Invalid argument

check dmesg:
[ 1024.903855] nop_test_refuse flag set to 0: we refuse.Now cat trace_options to see the result

The reason is that the trace option test will randomly setup trace
option under tracing/options no matter what the current_tracer is.
but the set_tracer_option is always using the set_flag callback
from the current_tracer. This patch adds a pointer to tracer_flags
and make it point to the tracer it belongs to. When the option is
setup, the set_flag of the right tracer will be used no matter
what the the current_tracer is.

And the old dummy_tracer_flags is used for all the tracers which
doesn't have a tracer_flags, having issue to use it to save the
pointer of a tracer. So remove it and use dynamic dummy tracer_flags
for tracers needing a dummy tracer_flags, as a result, there are no
tracers sharing tracer_flags, so remove the check code.

And save the current tracer to trace_option_dentry seems not good as
it may waste mem space when mount the debug/trace fs more than one time.

Link: http://lkml.kernel.org/r/1457444222-8654-1-git-send-email-chuhu@redhat.com

Signed-off-by: Chunyu Hu <chuhu@redhat.com>
[ Fixed up function tracer options to work with the change ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-03-08 11:19:08 -05:00
Jianyu Zhan
29b75eb2d5 futex: Replace barrier() in unqueue_me() with READ_ONCE()
Commit e91467ecd1ef ("bug in futex unqueue_me") introduced a barrier() in
unqueue_me() to prevent the compiler from rereading the lock pointer which
might change after a check for NULL.

Replace the barrier() with a READ_ONCE() for the following reasons:

1) READ_ONCE() is a weaker form of barrier() that affects only the specific
   load operation, while barrier() is a general compiler level memory barrier.
   READ_ONCE() was not available at the time when the barrier was added.

2) Aside of that READ_ONCE() is descriptive and self explainatory while a
   barrier without comment is not clear to the casual reader.

No functional change.

[ tglx: Massaged changelog ]

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Darren Hart <dvhart@linux.intel.com>
Cc: dave@stgolabs.net
Cc: peterz@infradead.org
Cc: linux@rasmusvillemoes.dk
Cc: akpm@linux-foundation.org
Cc: fengguang.wu@intel.com
Cc: bigeasy@linutronix.de
Link: http://lkml.kernel.org/r/1457314344-5685-1-git-send-email-nasa4836@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-03-08 17:04:02 +01:00
Ingo Molnar
1f25184656 Merge branch 'timers/core-v9' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/nohz
Pull nohz enhancements from Frederic Weisbecker:

"Currently in nohz full configs, the tick dependency is checked
 asynchronously by nohz code from interrupt and context switch for each
 concerned subsystem with a set of function provided by these. Such
 functions are made of many conditions and details that can be heavyweight
 as they are called on fastpath: sched_can_stop_tick(),
 posix_cpu_timer_can_stop_tick(), perf_event_can_stop_tick()...

 Thomas suggested a few months ago to make that tick dependency check
 synchronous. Instead of checking subsystems details from each interrupt
 to guess if the tick can be stopped, every subsystem that may have a tick
 dependency should set itself a flag specifying the state of that
 dependency. This way we can verify if we can stop the tick with a single
 lightweight mask check on fast path.

 This conversion from a pull to a push model to implement tick dependency
 is the core feature of this patchset that is split into:

  * Nohz wide kick simplification
  * Improve nohz tracing
  * Introduce tick dependency mask
  * Migrate scheduler, posix timers, perf events and sched clock tick
    dependencies to the tick dependency mask."

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 13:17:54 +01:00
Ingo Molnar
fe36d8912c Merge branch 'linus' into irq/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 12:26:07 +01:00
Chris Friesen
f9c904b761 sched/cputime: Fix steal_account_process_tick() to always return jiffies
The callers of steal_account_process_tick() expect it to return
whether a jiffy should be considered stolen or not.

Currently the return value of steal_account_process_tick() is in
units of cputime, which vary between either jiffies or nsecs
depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN.

If cputime has nsecs granularity and there is a tiny amount of
stolen time (a few nsecs, say) then we will consider the entire
tick stolen and will not account the tick on user/system/idle,
causing /proc/stats to show invalid data.

The fix is to change steal_account_process_tick() to accumulate
the stolen time and only account it once it's worth a jiffy.

(Thanks to Frederic Weisbecker for suggestions to fix a bug in my
first version of the patch.)

Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.ca
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 12:24:56 +01:00
Luca Abeni
72f9f3fdc9 sched/deadline: Remove dl_new from struct sched_dl_entity
The dl_new field of struct sched_dl_entity is currently used to
identify new deadline tasks, so that their deadline and runtime
can be properly initialised.

However, these tasks can be easily identified by checking if
their deadline is smaller than the current time when they switch
to SCHED_DEADLINE. So, dl_new can be removed by introducing this
check in switched_to_dl(); this allows to simplify the
SCHED_DEADLINE code.

Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1457350024-7825-2-git-send-email-luca.abeni@unitn.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 12:24:55 +01:00
Alexander Shishkin
927a557085 perf/core: Fix perf_sched_count derailment
The error path in perf_event_open() is such that asking for a sampling
event on a PMU that doesn't generate interrupts will end up in dropping
the perf_sched_count even though it hasn't been incremented for this
event yet.

Given a sufficient amount of these calls, we'll end up disabling
scheduler's jump label even though we'd still have active events in the
system, thereby facilitating the arrival of the infernal regions upon us.

I'm fixing this by moving account_event() inside perf_event_alloc().

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1456917854-29427-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 12:18:31 +01:00
Ingo Molnar
6436257b49 time/timekeeping: Work around false positive GCC warning
Newer GCC versions trigger the following warning:

  kernel/time/timekeeping.c: In function ‘get_device_system_crosststamp’:
  kernel/time/timekeeping.c:987:5: warning: ‘clock_was_set_seq’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    if (discontinuity) {
     ^
  kernel/time/timekeeping.c:1045:15: note: ‘clock_was_set_seq’ was declared here
    unsigned int clock_was_set_seq;
                 ^

GCC clearly is unable to recognize that the 'do_interp' boolean tracks
the initialization status of 'clock_was_set_seq'.

The GCC version used was:

  gcc version 5.3.1 20151207 (Red Hat 5.3.1-2) (GCC)

Work it around by initializing clock_was_set_seq to 0. Compilers that
are able to recognize the code flow will eliminate the unnecessary
initialization.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-08 11:09:53 +01:00