25710 Commits

Author SHA1 Message Date
John Stultz
0bcdc0987c time: Fix ktime_get_raw() incorrect base accumulation
In comqit fc6eead7c1e2 ("time: Clean up CLOCK_MONOTONIC_RAW time
handling"), the following code got mistakenly added to the update of the
raw timekeeper:

 /* Update the monotonic raw base */
 seconds = tk->raw_sec;
 nsec = (u32)(tk->tkr_raw.xtime_nsec >> tk->tkr_raw.shift);
 tk->tkr_raw.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);

Which adds the raw_sec value and the shifted down raw xtime_nsec to the
base value.

But the read function adds the shifted down tk->tkr_raw.xtime_nsec value
another time, The result of this is that ktime_get_raw() users (which are
all internal users) see the raw time move faster then it should (the rate
at which can vary with the current size of tkr_raw.xtime_nsec), which has
resulted in at least problems with graphics rendering performance.

The change tried to match the monotonic base update logic:

 seconds = (u64)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec);
 nsec = (u32) tk->wall_to_monotonic.tv_nsec;
 tk->tkr_mono.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);

Which adds the wall_to_monotonic.tv_nsec value, but not the
tk->tkr_mono.xtime_nsec value to the base.

To fix this, simplify the tkr_raw.base accumulation to only accumulate the
raw_sec portion, and do not include the tkr_raw.xtime_nsec portion, which
will be added at read time.

Fixes: fc6eead7c1e2 ("time: Clean up CLOCK_MONOTONIC_RAW time handling")
Reported-and-tested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Daniel Mentz <danielmentz@google.com>
Link: http://lkml.kernel.org/r/1503701824-1645-1-git-send-email-john.stultz@linaro.org
2017-08-26 16:06:12 +02:00
Ingo Molnar
413d63d71b Merge branch 'linus' into x86/mm to pick up fixes and to fix conflicts
Conflicts:
	arch/x86/kernel/head64.c
	arch/x86/mm/mmap.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-26 09:19:13 +02:00
Eric Biggers
2b7e8665b4 fork: fix incorrect fput of ->exe_file causing use-after-free
Commit 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for
write killable") made it possible to kill a forking task while it is
waiting to acquire its ->mmap_sem for write, in dup_mmap().

However, it was overlooked that this introduced an new error path before
a reference is taken on the mm_struct's ->exe_file.  Since the
->exe_file of the new mm_struct was already set to the old ->exe_file by
the memcpy() in dup_mm(), it was possible for the mmput() in the error
path of dup_mm() to drop a reference to ->exe_file which was never
taken.

This caused the struct file to later be freed prematurely.

Fix it by updating mm_init() to NULL out the ->exe_file, in the same
place it clears other things like the list of mmaps.

This bug was found by syzkaller.  It can be reproduced using the
following C program:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *mmap_thread(void *_arg)
    {
        for (;;) {
            mmap(NULL, 0x1000000, PROT_READ,
                 MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
        }
    }

    static void *fork_thread(void *_arg)
    {
        usleep(rand() % 10000);
        fork();
    }

    int main(void)
    {
        fork();
        fork();
        fork();
        for (;;) {
            if (fork() == 0) {
                pthread_t t;

                pthread_create(&t, NULL, mmap_thread, NULL);
                pthread_create(&t, NULL, fork_thread, NULL);
                usleep(rand() % 10000);
                syscall(__NR_exit_group, 0);
            }
            wait(NULL);
        }
    }

No special kernel config options are needed.  It usually causes a NULL
pointer dereference in __remove_shared_vm_struct() during exit, or in
dup_mmap() (which is usually inlined into copy_process()) during fork.
Both are due to a vm_area_struct's ->vm_file being used after it's
already been freed.

Google Bug Id: 64772007

Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com
Fixes: 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>	[v4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-25 16:12:46 -07:00
Jiri Slaby
30d6e0a419 futex: Remove duplicated code and fix undefined behaviour
There is code duplicated over all architecture's headers for
futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr,
and comparison of the result.

Remove this duplication and leave up to the arches only the needed
assembly which is now in arch_futex_atomic_op_inuser.

This effectively distributes the Will Deacon's arm64 fix for undefined
behaviour reported by UBSAN to all architectures. The fix was done in
commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with
FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.

And as suggested by Thomas, check for negative oparg too, because it was
also reported to cause undefined behaviour report.

Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
remove pointless access_ok() checks") as access_ok there returns true.
We introduce it back to the helper for the sake of simplicity (it gets
optimized away anyway).

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390]
Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com> [core/arm64]
Cc: linux-mips@linux-mips.org
Cc: Rich Felker <dalias@libc.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: peterz@infradead.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: sparclinux@vger.kernel.org
Cc: Jonas Bonn <jonas@southpole.se>
Cc: linux-s390@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: linux-hexagon@vger.kernel.org
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: linux-snps-arc@lists.infradead.org
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-xtensa@linux-xtensa.org
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: openrisc@lists.librecores.org
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Stafford Horne <shorne@gmail.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Richard Henderson <rth@twiddle.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-parisc@vger.kernel.org
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: linux-alpha@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20170824073105.3901-1-jslaby@suse.cz
2017-08-25 22:49:59 +02:00
Thomas Gleixner
b33394ba5c genirq/proc: Avoid uninitalized variable warning
kernel/irq/proc.c: In function ‘show_irq_affinity’:
include/linux/cpumask.h:24:29: warning: ‘mask’ may be used uninitialized in this function [-Wmaybe-uninitialized]
 #define cpumask_bits(maskp) ((maskp)->bits)

gcc is silly, but admittedly it can't know that this won't be called with
anything else than the enumerated constants.

Shut up the warning by creating a default clause.

Fixes: 6bc6d4abd22e ("genirq/proc: Use the the accessor to report the effective affinity
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-08-25 22:40:26 +02:00
Dan Carpenter
20c4d49c0f irqdomain: Prevent potential NULL pointer dereference in irq_domain_push_irq()
This code generates a Smatch warning:

  kernel/irq/irqdomain.c:1511 irq_domain_push_irq()
  warn: variable dereferenced before check 'root_irq_data' (see line 1508)

irq_get_irq_data() can return a NULL pointer, but the code dereferences
the returned pointer before checking it.

Move the NULL pointer check before the dereference.

[ tglx: Rewrote changelog to be precise and conforming to the instructions
  	in submitting-patches and added a Fixes tag. Sigh! ]

Fixes: 495c38d3001f ("irqdomain: Add irq_domain_{push,pop}_irq() functions")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Daney <david.daney@cavium.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20170825121409.6rfv4vt6ztz2oqkt@mwanda
2017-08-25 22:40:26 +02:00
kbuild test robot
ce8bdd6957 genirq: Fix semicolon.cocci warnings
kernel/irq/proc.c:69:2-3: Unneeded semicolon

Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Fixes: 0d3f54257dc3 ("genirq: Introduce effective affinity mask")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: kbuild-all@01.org
Link: http://lkml.kernel.org/r/20170822075053.GA93890@lkp-hsx02
2017-08-25 22:40:25 +02:00
Peter Zijlstra
bbdacdfed2 sched/debug: Optimize sched_domain sysctl generation
Currently we unconditionally destroy all sysctl bits and regenerate
them after we've rebuild the domains (even if that rebuild is a
no-op).

And since we unconditionally (re)build the sysctl for all possible
CPUs, onlining all CPUs gets us O(n^2) time. Instead change this to
only rebuild the bits for CPUs we've actually installed new domains
on.

Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:12:20 +02:00
Peter Zijlstra
09e0dd8e0f sched/topology: Avoid pointless rebuild
Fix partition_sched_domains() to try and preserve the existing machine
wide domain instead of unconditionally destroying it. We do this by
attempting to allocate the new single domain, only when that fails to
we reuse the fallback_doms.

When using fallback_doms we need to first destroy and then recreate
because both the old and new could be backed by it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ofer Levi(SW) <oferle@mellanox.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:12:20 +02:00
Peter Zijlstra
77d1dfda0e sched/topology, cpuset: Avoid spurious/wrong domain rebuilds
When disabling cpuset.sched_load_balance we expect to be able to online
CPUs without generating sched_domains. However this is currently
completely broken.

What happens is that we generate the sched_domains and then destroy
them. This is because of the spurious 'default' domain build in
cpuset_update_active_cpus(). That builds a single machine wide domain
and then schedules a work to build the 'real' domains. The work then
finds there are _no_ domains and destroys the lot again.

Furthermore, if there actually were cpusets, building the machine wide
domain is actively wrong, because it would allow tasks to 'escape' their
cpuset. Also I don't think its needed, the scheduler really should
respect the active mask.

Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:12:20 +02:00
Peter Zijlstra
a090c4f2cd sched/topology: Improve comments
Mike provided a better comment for destroy_sched_domain() ...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:12:19 +02:00
Shu Wang
213c5a459a sched/topology: Fix memory leak in __sdt_alloc()
Found this issue by kmemleak: the 'sg' and 'sgc' pointers from
__sdt_alloc() might be leaked as each domain holds many groups' ref,
but in destroy_sched_domain(), it only declined the first group ref.

Onlining and offlining a CPU can trigger this leak, and cause OOM.

Reproducer for my 6 CPUs machine:

  while true
  do
      echo 0 > /sys/devices/system/cpu/cpu5/online;
      echo 1 > /sys/devices/system/cpu/cpu5/online;
  done

  unreferenced object 0xffff88007d772a80 (size 64):
    comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
    hex dump (first 32 bytes):
      c0 22 77 7d 00 88 ff ff 02 00 00 00 01 00 00 00  ."w}............
      40 2a 77 7d 00 88 ff ff 00 00 00 00 00 00 00 00  @*w}............
    backtrace:
      [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
      [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
      [<ffffffff810d94a8>] build_sched_domains+0x1e8/0xf20
      [<ffffffff810da674>] partition_sched_domains+0x304/0x360
      [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
      [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
      [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
      [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
      [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
      [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
      [<ffffffff810af5d9>] kthread+0x109/0x140
      [<ffffffff81770e45>] ret_from_fork+0x25/0x30
      [<ffffffffffffffff>] 0xffffffffffffffff

  unreferenced object 0xffff88007d772a40 (size 64):
    comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
    hex dump (first 32 bytes):
      03 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00  ................
      00 04 00 00 00 00 00 00 4f 3c fc ff 00 00 00 00  ........O<......
    backtrace:
      [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
      [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
      [<ffffffff810da16d>] build_sched_domains+0xead/0xf20
      [<ffffffff810da674>] partition_sched_domains+0x304/0x360
      [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
      [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
      [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
      [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
      [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
      [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
      [<ffffffff810af5d9>] kthread+0x109/0x140
      [<ffffffff81770e45>] ret_from_fork+0x25/0x30
      [<ffffffffffffffff>] 0xffffffffffffffff

Reported-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Shu Wang <shuwang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Chunyu Hu <chuhu@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: liwang@redhat.com
Link: http://lkml.kernel.org/r/1502351536-9108-1-git-send-email-shuwang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:12:19 +02:00
Ingo Molnar
3a9ff4fd04 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:07:13 +02:00
Peter Zijlstra
e6f3faa734 locking/lockdep: Fix workqueue crossrelease annotation
The new completion/crossrelease annotations interact unfavourable with
the extant flush_work()/flush_workqueue() annotations.

The problem is that when a single work class does:

  wait_for_completion(&C)

and

  complete(&C)

in different executions, we'll build dependencies like:

  lock_map_acquire(W)
  complete_acquire(C)

and

  lock_map_acquire(W)
  complete_release(C)

which results in the dependency chain: W->C->W, which lockdep thinks
spells deadlock, even though there is no deadlock potential since
works are ran concurrently.

One possibility would be to change the work 'lock' to recursive-read,
but that would mean hitting a lockdep limitation on recursive locks.
Also, unconditinoally switching to recursive-read here would fail to
detect the actual deadlock on single-threaded workqueues, which do
have a problem with this.

For now, forcefully disregard these locks for crossrelease.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: byungchul.park@lge.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:06:33 +02:00
Peter Zijlstra
a1d14934ea workqueue/lockdep: 'Fix' flush_work() annotation
The flush_work() annotation as introduced by commit:

  e159489baa71 ("workqueue: relax lockdep annotation on flush_work()")

hits on the lockdep problem with recursive read locks.

The situation as described is:

Work W1:                Work W2:        Task:

ARR(Q)                  ARR(Q)		flush_workqueue(Q)
A(W1)                   A(W2)             A(Q)
  flush_work(W2)			  R(Q)
    A(W2)
    R(W2)
    if (special)
      A(Q)
    else
      ARR(Q)
    R(Q)

where: A - acquire, ARR - acquire-read-recursive, R - release.

Where under 'special' conditions we want to trigger a lock recursion
deadlock, but otherwise allow the flush_work(). The allowing is done
by using recursive read locks (ARR), but lockdep is broken for
recursive stuff.

However, there appears to be no need to acquire the lock if we're not
'special', so if we remove the 'else' clause things become much
simpler and no longer need the recursion thing at all.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boqun.feng@gmail.com
Cc: byungchul.park@lge.com
Cc: david@fromorbit.com
Cc: johannes@sipsolutions.net
Cc: oleg@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:06:32 +02:00
Ingo Molnar
10c9850cb2 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:51 +02:00
Jesper Dangaard Brouer
d0618410ec tracing, perf: Adjust code layout in get_recursion_context()
In an XDP redirect applications using tracepoint xdp:xdp_redirect to
diagnose TX overrun, I noticed perf_swevent_get_recursion_context()
was consuming 2% CPU. This was reduced to 1.85% with this simple
change.

Looking at the annotated asm code, it was clear that the unlikely case
in_nmi() test was chosen (by the compiler) as the most likely
event/branch.  This small adjustment makes the compiler (GCC version
7.1.1 20170622 (Red Hat 7.1.1-3)) put in_nmi() as an unlikely branch.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/150342256382.16595.986861478681783732.stgit@firesoul
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:18 +02:00
Oleg Nesterov
1d953111b6 perf/core: Don't report zero PIDs for exiting tasks
The exiting/dead task has no PIDs and in this case perf_event_pid/tid()
return zero, change them to return -1 to distinguish this case from
idle threads.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170822155928.GA6892@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:17 +02:00
Will Deacon
d9a50b0256 perf/aux: Ensure aux_wakeup represents most recent wakeup index
The aux_watermark member of struct ring_buffer represents the period (in
terms of bytes) at which wakeup events should be generated when data is
written to the aux buffer in non-snapshot mode. On hardware that cannot
generate an interrupt when the aux_head reaches an arbitrary wakeup index
(such as ARM SPE), the aux_head sampled from handle->head in
perf_aux_output_{skip,end} may in fact be past the wakeup index. This
can lead to wakeup slowly falling behind the head. For example, consider
the case where hardware can only generate an interrupt on a page-boundary
and the aux buffer is initialised as follows:

  // Buffer size is 2 * PAGE_SIZE
  rb->aux_head = rb->aux_wakeup = 0
  rb->aux_watermark = PAGE_SIZE / 2

following the first perf_aux_output_begin call, the handle is
initialised with:

  handle->head = 0
  handle->size = 2 * PAGE_SIZE
  handle->wakeup = PAGE_SIZE / 2

and the hardware will be programmed to generate an interrupt at
PAGE_SIZE.

When the interrupt is raised, the hardware head will be at PAGE_SIZE,
so calling perf_aux_output_end(handle, PAGE_SIZE) puts the ring buffer
into the following state:

  rb->aux_head = PAGE_SIZE
  rb->aux_wakeup = PAGE_SIZE / 2
  rb->aux_watermark = PAGE_SIZE / 2

and then the next call to perf_aux_output_begin will result in:

  handle->head = handle->wakeup = PAGE_SIZE

for which the semantics are unclear and, for a smaller aux_watermark
(e.g. PAGE_SIZE / 4), then the wakeup would in fact be behind head at
this point.

This patch fixes the problem by rounding down the aux_head (as sampled
from the handle) to the nearest aux_watermark boundary when updating
rb->aux_wakeup, therefore taking into account any overruns by the
hardware.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1502900297-21839-2-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:16 +02:00
Will Deacon
2ab346cfb0 perf/aux: Make aux_{head,wakeup} ring_buffer members long
The aux_head and aux_wakeup members of struct ring_buffer are defined
using the local_t type, despite the fact that they are only accessed via
the perf_aux_output_*() functions, which cannot race with each other for a
given ring buffer.

This patch changes the type of the members to long, so we can avoid
using the local_*() API where it isn't needed.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1502900297-21839-1-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:15 +02:00
Ingo Molnar
290d9bf281 Merge branch 'perf/urgent' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:01:05 +02:00
Mark Rutland
64aee2a965 perf/core: Fix group {cpu,task} validation
Regardless of which events form a group, it does not make sense for the
events to target different tasks and/or CPUs, as this leaves the group
inconsistent and impossible to schedule. The core perf code assumes that
these are consistent across (successfully intialised) groups.

Core perf code only verifies this when moving SW events into a HW
context. Thus, we can violate this requirement for pure SW groups and
pure HW groups, unless the relevant PMU driver happens to perform this
verification itself. These mismatched groups subsequently wreak havoc
elsewhere.

For example, we handle watchpoints as SW events, and reserve watchpoint
HW on a per-CPU basis at pmu::event_init() time to ensure that any event
that is initialised is guaranteed to have a slot at pmu::add() time.
However, the core code only checks the group leader's cpu filter (via
event_filter_match()), and can thus install follower events onto CPUs
violating thier (mismatched) CPU filters, potentially installing them
into a CPU without sufficient reserved slots.

This can be triggered with the below test case, resulting in warnings
from arch backends.

  #define _GNU_SOURCE
  #include <linux/hw_breakpoint.h>
  #include <linux/perf_event.h>
  #include <sched.h>
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			   int group_fd, unsigned long flags)
  {
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
  }

  char watched_char;

  struct perf_event_attr wp_attr = {
	.type = PERF_TYPE_BREAKPOINT,
	.bp_type = HW_BREAKPOINT_RW,
	.bp_addr = (unsigned long)&watched_char,
	.bp_len = 1,
	.size = sizeof(wp_attr),
  };

  int main(int argc, char *argv[])
  {
	int leader, ret;
	cpu_set_t cpus;

	/*
	 * Force use of CPU0 to ensure our CPU0-bound events get scheduled.
	 */
	CPU_ZERO(&cpus);
	CPU_SET(0, &cpus);
	ret = sched_setaffinity(0, sizeof(cpus), &cpus);
	if (ret) {
		printf("Unable to set cpu affinity\n");
		return 1;
	}

	/* open leader event, bound to this task, CPU0 only */
	leader = perf_event_open(&wp_attr, 0, 0, -1, 0);
	if (leader < 0) {
		printf("Couldn't open leader: %d\n", leader);
		return 1;
	}

	/*
	 * Open a follower event that is bound to the same task, but a
	 * different CPU. This means that the group should never be possible to
	 * schedule.
	 */
	ret = perf_event_open(&wp_attr, 0, 1, leader, 0);
	if (ret < 0) {
		printf("Couldn't open mismatched follower: %d\n", ret);
		return 1;
	} else {
		printf("Opened leader/follower with mismastched CPUs\n");
	}

	/*
	 * Open as many independent events as we can, all bound to the same
	 * task, CPU0 only.
	 */
	do {
		ret = perf_event_open(&wp_attr, 0, 0, -1, 0);
	} while (ret >= 0);

	/*
	 * Force enable/disble all events to trigger the erronoeous
	 * installation of the follower event.
	 */
	printf("Opened all events. Toggling..\n");
	for (;;) {
		prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);
		prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0);
	}

	return 0;
  }

Fix this by validating this requirement regardless of whether we're
moving events.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhou Chengming <zhouchengming1@huawei.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1498142498-15758-1-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:00:34 +02:00
Eric Biggers
3fd8712707 strparser: initialize all callbacks
commit bbb03029a899 ("strparser: Generalize strparser") added more
function pointers to 'struct strp_callbacks'; however, kcm_attach() was
not updated to initialize them.  This could cause the ->lock() and/or
->unlock() function pointers to be set to garbage values, causing a
crash in strp_work().

Fix the bug by moving the callback structs into static memory, so
unspecified members are zeroed.  Also constify them while we're at it.

This bug was found by syzkaller, which encountered the following splat:

    IP: 0x55
    PGD 3b1ca067
    P4D 3b1ca067
    PUD 3b12f067
    PMD 0

    Oops: 0010 [#1] SMP KASAN
    Dumping ftrace buffer:
       (ftrace buffer empty)
    Modules linked in:
    CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-20170811 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Workqueue: kstrp strp_work
    task: ffff88006bb0e480 task.stack: ffff88006bb10000
    RIP: 0010:0x55
    RSP: 0018:ffff88006bb17540 EFLAGS: 00010246
    RAX: dffffc0000000000 RBX: ffff88006ce4bd60 RCX: 0000000000000000
    RDX: 1ffff1000d9c97bd RSI: 0000000000000000 RDI: ffff88006ce4bc48
    RBP: ffff88006bb17558 R08: ffffffff81467ab2 R09: 0000000000000000
    R10: ffff88006bb17438 R11: ffff88006bb17940 R12: ffff88006ce4bc48
    R13: ffff88003c683018 R14: ffff88006bb17980 R15: ffff88003c683000
    FS:  0000000000000000(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000055 CR3: 000000003c145000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098
     worker_thread+0x223/0x1860 kernel/workqueue.c:2233
     kthread+0x35e/0x430 kernel/kthread.c:231
     ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
    Code:  Bad RIP value.
    RIP: 0x55 RSP: ffff88006bb17540
    CR2: 0000000000000055
    ---[ end trace f0e4920047069cee ]---

Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and
CONFIG_AF_KCM=y):

    #include <linux/bpf.h>
    #include <linux/kcm.h>
    #include <linux/types.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static const struct bpf_insn bpf_insns[3] = {
        { .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) */
        { .code = 0x95 }, /* BPF_EXIT_INSN() */
    };

    static const union bpf_attr bpf_attr = {
        .prog_type = 1,
        .insn_cnt = 2,
        .insns = (uintptr_t)&bpf_insns,
        .license = (uintptr_t)"",
    };

    int main(void)
    {
        int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD,
                             &bpf_attr, sizeof(bpf_attr));
        int inet_fd = socket(AF_INET, SOCK_STREAM, 0);
        int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0);

        ioctl(kcm_fd, SIOCKCMATTACH,
              &(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd });
    }

Fixes: bbb03029a899 ("strparser: Generalize strparser")
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 21:57:50 -07:00
Linus Torvalds
415be6c256 Various bug fixes:
- Two small memory leaks in error paths.
 
  - A missed return error code on an error path.
 
  - A fix to check the tracing ring buffer CPU when it doesn't
    exist (caused by setting maxcpus on the command line that is less
    than the actual number of CPUs, and then onlining them manually).
 
  - A fix to have the reset of boot tracers called by lateinit_sync()
    instead of just lateinit(). As some of the tracers register via
    lateinit(), and if the clear happens before the tracer is registered,
    it will never start even though it was told to via the kernel command
    line.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEEQEw9Eu0DdyUUkuUUybkF8mrZjcsFAlme4DwUHHJvc3RlZHRA
 Z29vZG1pcy5vcmcACgkQybkF8mrZjcssNwf+Itap7Mtbk48wJYNqfjk1pzyiOcYV
 WM88EOBFM46dttVN6cBs2uUmtdvmX/g52RtsHzG6ZbwxzLE+tIGbSO2plGoknOyD
 lro5CSHT2j3bu0enqkxfznDUT0PNrELEaYBoMK0yhMsXm0v+XqHUxkIqb19Ubuo+
 ORPBShZghJtAiEBFArV1nXBW1kzrIFJwjymdF2ccqUlg+XxtPS1wgnZPIOjCa8ia
 YM4bX3aTUh4LiUUvS7FlJsrwjB+JFOHdXu1Vg140CvJEon1a+bW4Jx88MxoN6zrp
 xmFlXm/8MLWz27GO11IkveH01mSrdP67bKIIx8v2ybPBbwsW0Msb2HfZkw==
 =rUJQ
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Various bug fixes:

   - Two small memory leaks in error paths.

   - A missed return error code on an error path.

   - A fix to check the tracing ring buffer CPU when it doesn't exist
     (caused by setting maxcpus on the command line that is less than
     the actual number of CPUs, and then onlining them manually).

   - A fix to have the reset of boot tracers called by lateinit_sync()
     instead of just lateinit(). As some of the tracers register via
     lateinit(), and if the clear happens before the tracer is
     registered, it will never start even though it was told to via the
     kernel command line"

* tag 'trace-v4.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix freeing of filter in create_filter() when set_str is false
  tracing: Fix kmemleak in tracing_map_array_free()
  ftrace: Check for null ret_stack on profile function graph entry function
  ring-buffer: Have ring_buffer_alloc_read_page() return error on offline CPU
  tracing: Missing error code in tracer_alloc_buffers()
  tracing: Call clear_boot_tracer() at lateinit_sync
2017-08-24 14:08:22 -07:00
Waiman Long
1c08c22c87 cpuset: Fix incorrect memory_pressure control file mapping
The memory_pressure control file was incorrectly set up without
a private value (0, by default). As a result, this control
file was treated like memory_migrate on read. By adding back the
FILE_MEMORY_PRESSURE private value, the correct memory pressure value
will be returned.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7dbdb199d3bf ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
Cc: stable@vger.kernel.org # v4.4+
2017-08-24 09:42:28 -07:00
Steven Rostedt (VMware)
8b0db1a5bd tracing: Fix freeing of filter in create_filter() when set_str is false
Performing the following task with kmemleak enabled:

 # cd /sys/kernel/tracing/events/irq/irq_handler_entry/
 # echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger
 # echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger
 # echo scan > /sys/kernel/debug/kmemleak
 # cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8800b9290308 (size 32):
  comm "bash", pid 1114, jiffies 4294848451 (age 141.139s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff81cef5aa>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff81357938>] kmem_cache_alloc_trace+0x158/0x290
    [<ffffffff81261c09>] create_filter_start.constprop.28+0x99/0x940
    [<ffffffff812639c9>] create_filter+0xa9/0x160
    [<ffffffff81263bdc>] create_event_filter+0xc/0x10
    [<ffffffff812655e5>] set_trigger_filter+0xe5/0x210
    [<ffffffff812660c4>] event_enable_trigger_func+0x324/0x490
    [<ffffffff812652e2>] event_trigger_write+0x1a2/0x260
    [<ffffffff8138cf87>] __vfs_write+0xd7/0x380
    [<ffffffff8138f421>] vfs_write+0x101/0x260
    [<ffffffff8139187b>] SyS_write+0xab/0x130
    [<ffffffff81cfd501>] entry_SYSCALL_64_fastpath+0x1f/0xbe
    [<ffffffffffffffff>] 0xffffffffffffffff

The function create_filter() is passed a 'filterp' pointer that gets
allocated, and if "set_str" is true, it is up to the caller to free it, even
on error. The problem is that the pointer is not freed by create_filter()
when set_str is false. This is a bug, and it is not up to the caller to free
the filter on error if it doesn't care about the string.

Link: http://lkml.kernel.org/r/1502705898-27571-2-git-send-email-chuhu@redhat.com

Cc: stable@vger.kernel.org
Fixes: 38b78eb85 ("tracing: Factorize filter creation")
Reported-by: Chunyu Hu <chuhu@redhat.com>
Tested-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-24 10:07:38 -04:00
Chunyu Hu
475bb3c69a tracing: Fix kmemleak in tracing_map_array_free()
kmemleak reported the below leak when I was doing clear of the hist
trigger. With this patch, the kmeamleak is gone.

unreferenced object 0xffff94322b63d760 (size 32):
  comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
  hex dump (first 32 bytes):
    00 01 00 00 04 00 00 00 08 00 00 00 ff 00 00 00  ................
    10 00 00 00 00 00 00 00 80 a8 7a f2 31 94 ff ff  ..........z.1...
  backtrace:
    [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff9e424cba>] kmem_cache_alloc_trace+0xca/0x1d0
    [<ffffffff9e377736>] tracing_map_array_alloc+0x26/0x140
    [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
    [<ffffffff9e38b935>] create_hist_data+0x535/0x750
    [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
    [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
    [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
    [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
    [<ffffffff9e450b85>] SyS_write+0x55/0xc0
    [<ffffffff9e203857>] do_syscall_64+0x67/0x150
    [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff9431f27aa880 (size 128):
  comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
  hex dump (first 32 bytes):
    00 00 8c 2a 32 94 ff ff 00 f0 8b 2a 32 94 ff ff  ...*2......*2...
    00 e0 8b 2a 32 94 ff ff 00 d0 8b 2a 32 94 ff ff  ...*2......*2...
  backtrace:
    [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff9e425348>] __kmalloc+0xe8/0x220
    [<ffffffff9e3777c1>] tracing_map_array_alloc+0xb1/0x140
    [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
    [<ffffffff9e38b935>] create_hist_data+0x535/0x750
    [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
    [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
    [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
    [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
    [<ffffffff9e450b85>] SyS_write+0x55/0xc0
    [<ffffffff9e203857>] do_syscall_64+0x67/0x150
    [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
    [<ffffffffffffffff>] 0xffffffffffffffff

Link: http://lkml.kernel.org/r/1502705898-27571-1-git-send-email-chuhu@redhat.com

Cc: stable@vger.kernel.org
Fixes: 08d43a5fa063 ("tracing: Add lock-free tracing_map")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-24 10:05:51 -04:00
Steven Rostedt (VMware)
a8f0f9e499 ftrace: Check for null ret_stack on profile function graph entry function
There's a small race when function graph shutsdown and the calling of the
registered function graph entry callback. The callback must not reference
the task's ret_stack without first checking that it is not NULL. Note, when
a ret_stack is allocated for a task, it stays allocated until the task exits.
The problem here, is that function_graph is shutdown, and a new task was
created, which doesn't have its ret_stack allocated. But since some of the
functions are still being traced, the callbacks can still be called.

The normal function_graph code handles this, but starting with commit
8861dd303c ("ftrace: Access ret_stack->subtime only in the function
profiler") the profiler code references the ret_stack on function entry, but
doesn't check if it is NULL first.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=196611

Cc: stable@vger.kernel.org
Fixes: 8861dd303c ("ftrace: Access ret_stack->subtime only in the function profiler")
Reported-by: lilydjwg@gmail.com
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-24 10:04:01 -04:00
Nicholas Piggin
2fe59f507a timers: Fix excessive granularity of new timers after a nohz idle
When a timer base is idle, it is forwarded when a new timer is added
to ensure that granularity does not become excessive. When not idle,
the timer tick is expected to increment the base.

However there are several problems:

- If an existing timer is modified, the base is forwarded only after
  the index is calculated.

- The base is not forwarded by add_timer_on.

- There is a window after a timer is restarted from a nohz idle, after
  it is marked not-idle and before the timer tick on this CPU, where a
  timer may be added but the ancient base does not get forwarded.

These result in excessive granularity (a 1 jiffy timeout can blow out
to 100s of jiffies), which cause the rcu lockup detector to trigger,
among other things.

Fix this by keeping track of whether the timer base has been idle
since it was last run or forwarded, and if so then forward it before
adding a new timer.

There is still a case where mod_timer optimises the case of a pending
timer mod with the same expiry time, where the timer can see excessive
granularity relative to the new, shorter interval. A comment is added,
but it's not changed because it is an important fastpath for
networking.

This has been tested and found to fix the RCU softlockup messages.

Testing was also done with tracing to measure requested versus
achieved wakeup latencies for all non-deferrable timers in an idle
system (with no lockup watchdogs running). Wakeup latency relative to
absolute latency is calculated (note this suffers from round-up skew
at low absolute times) and analysed:

             max     avg      std
upstream   506.0    1.20     4.68
patched      2.0    1.08     0.15

The bug was noticed due to the lockup detector Kconfig changes
dropping it out of people's .configs and resulting in larger base
clk skew When the lockup detectors are enabled, no CPU can go idle for
longer than 4 seconds, which limits the granularity errors.
Sub-optimal timer behaviour is observable on a smaller scale in that
case:

	     max     avg      std
upstream     9.0    1.05     0.19
patched      2.0    1.04     0.11

Fixes: Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: David Miller <davem@davemloft.net>
Cc: dzickus@redhat.com
Cc: sfr@canb.auug.org.au
Cc: mpe@ellerman.id.au
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: linuxarm@huawei.com
Cc: abdhalee@linux.vnet.ibm.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: akpm@linux-foundation.org
Cc: paulmck@linux.vnet.ibm.com
Cc: torvalds@linux-foundation.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170822084348.21436-1-npiggin@gmail.com
2017-08-24 11:40:18 +02:00
Ingo Molnar
93da8b221d Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-24 10:12:33 +02:00
Daniel Borkmann
a5e2da6e97 bpf: netdev is never null in __dev_map_flush
No need to test for it in fast-path, every dev in bpf_dtab_netdev
is guaranteed to be non-NULL, otherwise dev_map_update_elem() will
fail in the first place.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23 22:43:40 -07:00
Edward Cree
8e9cd9ce90 bpf/verifier: document liveness analysis
The liveness tracking algorithm is quite subtle; add comments to explain it.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23 22:38:08 -07:00
Edward Cree
1b688a19a9 bpf/verifier: remove varlen_map_value_access flag
The optimisation it does is broken when the 'new' register value has a
 variable offset and the 'old' was constant.  I broke it with my pointer
 types unification (see Fixes tag below), before which the 'new' value
 would have type PTR_TO_MAP_VALUE_ADJ and would thus not compare equal;
 other changes in that patch mean that its original behaviour (ignore
 min/max values) cannot be restored.
Tests on a sample set of cilium programs show no change in count of
 processed instructions.

Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23 22:38:08 -07:00
Edward Cree
63f45f8406 bpf/verifier: when pruning a branch, ignore its write marks
The fact that writes occurred in reaching the continuation state does
 not screen off its reads from us, because we're not really its parent.
So detect 'not really the parent' in do_propagate_liveness, and ignore
 write marks in that case.

Fixes: dc503a8ad984 ("bpf/verifier: track liveness for pruning")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23 22:38:07 -07:00
Christoph Hellwig
74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Peter Zijlstra
c5a94a618e workqueue: Use TASK_IDLE
Workqueues don't use signals, it (ab)uses TASK_INTERRUPTIBLE to avoid
increasing the loadavg numbers. We've 'recently' introduced TASK_IDLE
for this case:

  80ed87c8a9ca ("sched/wait: Introduce TASK_NOLOAD and TASK_IDLE")

use it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-08-23 06:30:35 -07:00
Marc Zyngier
0abce64a55 genirq: Let irq_set_vcpu_affinity() iterate over hierarchy
When assigning an interrupt to a vcpu, it is not unlikely that
the level of the hierarchy implementing irq_set_vcpu_affinity
is not the top level (think a generic MSI domain on top of a
virtualization aware interrupt controller).

In such a case, let's iterate over the hierarchy until we find
an irqchip implementing it.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2017-08-23 11:09:14 +01:00
Daniel Borkmann
af4d045cee bpf: minor cleanups for dev_map
Some minor code cleanups, while going over it I also noticed that
we're accounting the bitmap only for one CPU currently, so fix that
up as well.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 21:26:29 -07:00
Martijn Coenen
9e18d0c82f ANDROID: binder: add hwbinder,vndbinder to BINDER_DEVICES.
These will be required going forward.

Signed-off-by: Martijn Coenen <maco@android.com>
Cc: stable <stable@vger.kernel.org> # 4.11+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-22 18:43:23 -07:00
Daniel Borkmann
33ba43ed0a bpf: fix map value attribute for hash of maps
Currently, iproute2's BPF ELF loader works fine with array of maps
when retrieving the fd from a pinned node and doing a selfcheck
against the provided map attributes from the object file, but we
fail to do the same for hash of maps and thus refuse to get the
map from pinned node.

Reason is that when allocating hash of maps, fd_htab_map_alloc() will
set the value size to sizeof(void *), and any user space map creation
requests are forced to set 4 bytes as value size. Thus, selfcheck
will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from
object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID
returns the value size used to create the map.

Fix it by handling it the same way as we do for array of maps, which
means that we leave value size at 4 bytes and in the allocation phase
round up value size to 8 bytes. alloc_htab_elem() needs an adjustment
in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem()
calling into htab_map_update_elem() with the pointer of the map
pointer as value. Unlike array of maps where we just xchg(), we're
using the generic htab_map_update_elem() callback also used from helper
calls, which published the key/value already on return, so we need
to ensure to memcpy() the right size.

Fixes: bcc6b1b7ebf8 ("bpf: Add hash of maps support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 16:32:02 -07:00
Daniel Borkmann
cd36c3a21a bpf: fix map value attribute for hash of maps
Currently, iproute2's BPF ELF loader works fine with array of maps
when retrieving the fd from a pinned node and doing a selfcheck
against the provided map attributes from the object file, but we
fail to do the same for hash of maps and thus refuse to get the
map from pinned node.

Reason is that when allocating hash of maps, fd_htab_map_alloc() will
set the value size to sizeof(void *), and any user space map creation
requests are forced to set 4 bytes as value size. Thus, selfcheck
will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from
object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID
returns the value size used to create the map.

Fix it by handling it the same way as we do for array of maps, which
means that we leave value size at 4 bytes and in the allocation phase
round up value size to 8 bytes. alloc_htab_elem() needs an adjustment
in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem()
calling into htab_map_update_elem() with the pointer of the map
pointer as value. Unlike array of maps where we just xchg(), we're
using the generic htab_map_update_elem() callback also used from helper
calls, which published the key/value already on return, so we need
to ensure to memcpy() the right size.

Fixes: bcc6b1b7ebf8 ("bpf: Add hash of maps support")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 16:31:00 -07:00
David S. Miller
e2a7c34fb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-21 17:06:42 -07:00
Oleg Nesterov
dd1c1f2f20 pids: make task_tgid_nr_ns() safe
This was reported many times, and this was even mentioned in commit
52ee2dfdd4f5 ("pids: refactor vnr/nr_ns helpers to make them safe") but
somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
not safe because task->group_leader points to nowhere after the exiting
task passes exit_notify(), rcu_read_lock() can not help.

We really need to change __unhash_process() to nullify group_leader,
parent, and real_parent, but this needs some cleanups.  Until then we
can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
fix the problem.

Reported-by: Troy Kensinger <tkensinger@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-21 12:47:31 -07:00
Ingo Molnar
94edf6f3c2 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Removal of spin_unlock_wait()
 - SRCU updates
 - Torture-test updates
 - Documentation updates
 - Miscellaneous fixes
 - CPU-hotplug fixes
 - Miscellaneous non-RCU fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-21 09:45:19 +02:00
Daniel Borkmann
274043c6c9 bpf: fix double free from dev_map_notification()
In the current code, dev_map_free() can still race with dev_map_notification().
In dev_map_free(), we remove dtab from the list of dtabs after we purged
all entries from it. However, we don't do xchg() with NULL or the like,
so the entry at that point is still pointing to the device. If a unregister
notification comes in at the same time, we therefore risk a double-free,
since the pointer is still present in the map, and then pushed again to
__dev_map_entry_free().

All this is completely unnecessary. Just remove the dtab from the list
right before the synchronize_rcu(), so all outstanding readers from the
notifier list have finished by then, thus we don't need to deal with this
corner case anymore and also wouldn't need to nullify dev entires. This is
fine because we iterate over the map releasing all entries and therefore
dev references anyway.

Fixes: 4cc7b9544b9a ("bpf: devmap fix mutex in rcu critical section")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-20 19:45:54 -07:00
Linus Torvalds
e46db8d2ef Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "Two fixes for the perf subsystem:

   - Fix an inconsistency of RDPMC mm struct tagging across exec() which
     causes RDPMC to fault.

   - Correct the timestamp mechanics across IOC_DISABLE/ENABLE which
     causes incorrect timestamps and total time calculations"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix time on IOC_ENABLE
  perf/x86: Fix RDPMC vs. mm_struct tracking
2017-08-20 09:20:57 -07:00
Linus Torvalds
9dae41a238 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A pile of smallish changes all over the place:

   - Add a missing ISB in the GIC V1 driver

   - Remove an ACPI version check in the GIC V3 ITS driver

   - Add the missing irq_pm_shutdown function for BRCMSTB-L2 to avoid
     spurious wakeups

   - Remove the artifical limitation of ITS instances to the number of
     NUMA nodes which prevents utilizing the ITS hardware correctly

   - Prevent a infinite parsing loop in the GIC-V3 ITS/MSI code

   - Honour the force affinity argument in the GIC-V3 driver which is
     required to make perf work correctly

   - Correctly report allocation failures in GIC-V2/V3 to avoid using
     half allocated and initialized interrupts.

   - Fixup checks against nr_cpu_ids in the generic IPI code"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/ipi: Fixup checks against nr_cpu_ids
  genirq: Restore trigger settings in irq_modify_status()
  MAINTAINERS: Remove Jason Cooper's irqchip git tree
  irqchip/gic-v3-its-platform-msi: Fix msi-parent parsing loop
  irqchip/gic-v3-its: Allow GIC ITS number more than MAX_NUMNODES
  irqchip: brcmstb-l2: Define an irq_pm_shutdown function
  irqchip/gic: Ensure we have an ISB between ack and ->handle_irq
  irqchip/gic-v3-its: Remove ACPICA version check for ACPI NUMA
  irqchip/gic-v3: Honor forced affinity setting
  irqchip/gic-v3: Report failures in gic_irq_domain_alloc
  irqchip/gic-v2: Report failures in gic_irq_domain_alloc
  irqchip/atmel-aic: Remove root argument from ->fixup() prototype
  irqchip/atmel-aic: Fix unbalanced refcount in aic_common_rtc_irq_fixup()
  irqchip/atmel-aic: Fix unbalanced of_node_put() in aic_common_irq_fixup()
2017-08-20 09:07:56 -07:00
Linus Torvalds
e18a5ebc2d Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull watchdog fix from Thomas Gleixner:
 "A fix for the hardlockup watchdog to prevent false positives with
  extreme Turbo-Modes which make the perf/NMI watchdog fire faster than
  the hrtimer which is used to verify.

  Slightly larger than the minimal fix, which just would increase the
  hrtimer frequency, but comes with extra overhead of more watchdog
  timer interrupts and thread wakeups for all users.

  With this change we restrict the overhead to the extreme Turbo-Mode
  systems"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kernel/watchdog: Prevent false positives with turbo modes
2017-08-20 08:54:30 -07:00
Thomas Gleixner
4e2a809703 Merge branch 'fortglx/4.14/time' of https://git.linaro.org/people/john.stultz/linux into timers/core
Pull timekeepig updates from John Stultz

 - kselftest improvements

 - Use the proper timekeeper in the debug code

 - Prevent accessing an unavailable wakeup source in the alarmtimer sysfs
   interface.
2017-08-20 11:46:46 +02:00
Alexey Dobriyan
8fbbe2d7cc genirq/ipi: Fixup checks against nr_cpu_ids
Valid CPU ids are [0, nr_cpu_ids-1] inclusive.

Fixes: 3b8e29a82dd1 ("genirq: Implement ipi_send_mask/single()")
Fixes: f9bce791ae2a ("genirq: Add a new function to get IPI reverse mapping")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170819095751.GB27864@avx2
2017-08-20 10:49:05 +02:00