27750 Commits

Author SHA1 Message Date
Ingo Molnar
371b326908 Linux 4.17-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlr4xw8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNYoH/1d5zyMpVJVUKZ0K
 LuEctCGby1PjSvSOhmMuxFVagFAqfBJXmwWTeohLfLG48r/Yk0AsZQ5HH13/8baj
 k/T8UgUvKZKustndCRp+joQ3Pa1ZpcIFaWRvB8pKFCefJ/F/Lj4B4X1HYI7vLq0K
 /ZBXUdy3ry0lcVuypnaARYAb2O7l/nyZIjZ3FhiuyymWe7Jpo+G7VK922LOMSX/y
 VYFZCWa8nxN+yFhO0ao9X5k7ggIiUrEBtbfNrk19VtAn0hx+OYKW2KfJK/eHNey/
 CKrOT+KAxU8VU29AEIbYzlL3yrQmULcEoIDiqJ/6m5m6JwsEbP6EqQHs0TiuQFpq
 A0MO9rw=
 =yjUP
 -----END PGP SIGNATURE-----

Merge tag 'v4.17-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-15 08:10:50 +02:00
Richard Guy Briggs
c0b0ae8a87 audit: use inline function to set audit context
Recognizing that the audit context is an internal audit value, use an
access function to set the audit context pointer for the task
rather than reaching directly into the task struct to set it.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: merge fuzz in audit.h]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-14 17:45:21 -04:00
Song Liu
bae77c5eb5 bpf: enable stackmap with build_id in nmi context
Currently, we cannot parse build_id in nmi context because of
up_read(&current->mm->mmap_sem), this makes stackmap with build_id
less useful. This patch enables parsing build_id in nmi by putting
the up_read() call in irq_work. To avoid memory allocation in nmi
context, we use per cpu variable for the irq_work. As a result, only
one irq_work per cpu is allowed. If the irq_work is in-use, we
fallback to only report ips.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-14 23:29:45 +02:00
Richard Guy Briggs
cdfb6b341f audit: use inline function to get audit context
Recognizing that the audit context is an internal audit value, use an
access function to retrieve the audit context pointer for the task
rather than reaching directly into the task struct to get it.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: merge fuzz in auditsc.c and selinuxfs.c, checkpatch.pl fixes]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-14 17:24:18 -04:00
Richard Guy Briggs
f0b752168d audit: convert sessionid unset to a macro
Use a macro, "AUDIT_SID_UNSET", to replace each instance of
initialization and comparison to an audit session ID.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-14 15:56:35 -04:00
Christoph Hellwig
0eb0b63c1d block: consistently use GFP_NOIO instead of __GFP_NORECLAIM
Same numerical value (for now at least), but a much better documentation
of intent.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:18 -06:00
Frederic Weisbecker
0f6f47bacb softirq/core: Turn default irq_cpustat_t to standard per-cpu
In order to optimize and consolidate softirq mask accesses, let's
convert the default irq_cpustat_t implementation to per-CPU standard API.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1525786706-22846-5-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:25:27 +02:00
Ingo Molnar
4b96583869 Linux 4.17-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlr4xw8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNYoH/1d5zyMpVJVUKZ0K
 LuEctCGby1PjSvSOhmMuxFVagFAqfBJXmwWTeohLfLG48r/Yk0AsZQ5HH13/8baj
 k/T8UgUvKZKustndCRp+joQ3Pa1ZpcIFaWRvB8pKFCefJ/F/Lj4B4X1HYI7vLq0K
 /ZBXUdy3ry0lcVuypnaARYAb2O7l/nyZIjZ3FhiuyymWe7Jpo+G7VK922LOMSX/y
 VYFZCWa8nxN+yFhO0ao9X5k7ggIiUrEBtbfNrk19VtAn0hx+OYKW2KfJK/eHNey/
 CKrOT+KAxU8VU29AEIbYzlL3yrQmULcEoIDiqJ/6m5m6JwsEbP6EqQHs0TiuQFpq
 A0MO9rw=
 =yjUP
 -----END PGP SIGNATURE-----

Merge tag 'v4.17-rc5' into irq/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:22:59 +02:00
Tetsuo Handa
8cc05c71ba locking/lockdep: Move sanity check to inside lockdep_print_held_locks()
Calling lockdep_print_held_locks() on a running thread is considered unsafe.

Since all callers should follow that rule and the sanity check is not heavy,
this patch moves the sanity check to inside lockdep_print_held_locks().

As a side effect of this patch, the number of locks held by running threads
will be printed as well. This change will be preferable when we want to
know which threads might be relevant to a problem but are unable to print
any clues because that thread is running.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1523011279-8206-2-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 09:15:02 +02:00
Tetsuo Handa
0f736a52e4 locking/lockdep: Use for_each_process_thread() for debug_show_all_locks()
debug_show_all_locks() tries to grab the tasklist_lock for two seconds, but
calling while_each_thread() without tasklist_lock held is not safe.

See the following commit for more information:

  4449a51a7c281602 ("vm_is_stack: use for_each_thread() rather then buggy while_each_thread()")

Change debug_show_all_locks() from "do_each_thread()/while_each_thread()
with possibility of missing tasklist_lock" to "for_each_process_thread()
with RCU", and add a call to touch_all_softlockup_watchdogs() like
show_state_filter() does.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1523011279-8206-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 09:15:02 +02:00
Rohit Jain
943d355d7f sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu()
In the following commit:

  247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted vCPUs")

... we distinguish between idle_cpu() when the vCPU is not running for
scheduling threads.

However, the idle_cpu() function is used in other places for
actually checking whether the state of the CPU is idle or not.

Hence split the use of that function based on the desired return value,
by introducing the available_idle_cpu() function.

This fixes a (slight) regression in that initial vCPU commit, because
some code paths (like the load-balancer) don't care and shouldn't care
if the vCPU is preempted or not, they just want to know if there's any
tasks on the CPU.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dhaval.giani@oracle.com
Cc: linux-kernel@vger.kernel.org
Cc: matt@codeblueprint.co.uk
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Link: http://lkml.kernel.org/r/1525883988-10356-1-git-send-email-rohit.k.jain@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 09:12:26 +02:00
Mel Gorman
1378447598 sched/numa: Stagger NUMA balancing scan periods for new threads
Threads share an address space and each can change the protections of the
same address space to trap NUMA faults. This is redundant and potentially
counter-productive as any thread doing the update will suffice. Potentially
only one thread is required but that thread may be idle or it may not have
any locality concerns and pick an unsuitable scan rate.

This patch uses independent scan period but they are staggered based on
the number of address space users when the thread is created.  The intent
is that threads will avoid scanning at the same time and have a chance
to adapt their scan rate later if necessary. This reduces the total scan
activity early in the lifetime of the threads.

The different in headline performance across a range of machines and
workloads is marginal but the system CPU usage is reduced as well as overall
scan activity.  The following is the time reported by NAS Parallel Benchmark
using unbound openmp threads and a D size class:

			      4.17.0-rc1             4.17.0-rc1
				 vanilla           stagger-v1r1
	Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
	Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
	Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
	Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
	Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
	Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
	Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
	Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)

Note it's not a universal win but we have no prior knowledge of which
thread matters but the number of threads created often exceeds the size
of the node when the threads are not bound. However, there is a reducation
of overall system CPU usage:

				    4.17.0-rc1             4.17.0-rc1
				       vanilla           stagger-v1r1
	sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
	sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
	sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
	sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
	sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
	sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
	sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
	sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)

NUMA scan activity is also reduced:

	NUMA alloc local               1042828     1342670
	NUMA base PTE updates        140481138    93577468
	NUMA huge PMD updates           272171      180766
	NUMA page range updates      279832690   186129660
	NUMA hint faults               1395972     1193897
	NUMA hint local faults          877925      855053
	NUMA hint local percent             62          71
	NUMA pages migrated           12057909     9158023

Similar observations are made for other thread-intensive workloads. System
CPU usage is lower even though the headline gains in performance tend to be
small. For example, specjbb 2005 shows almost no difference in performance
but scan activity is reduced by a third on a 4-socket box. I didn't find
a workload (thread intensive or otherwise) that suffered badly.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 09:12:24 +02:00
Ingo Molnar
dfd5c3ea64 Linux 4.17-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlr4xw8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNYoH/1d5zyMpVJVUKZ0K
 LuEctCGby1PjSvSOhmMuxFVagFAqfBJXmwWTeohLfLG48r/Yk0AsZQ5HH13/8baj
 k/T8UgUvKZKustndCRp+joQ3Pa1ZpcIFaWRvB8pKFCefJ/F/Lj4B4X1HYI7vLq0K
 /ZBXUdy3ry0lcVuypnaARYAb2O7l/nyZIjZ3FhiuyymWe7Jpo+G7VK922LOMSX/y
 VYFZCWa8nxN+yFhO0ao9X5k7ggIiUrEBtbfNrk19VtAn0hx+OYKW2KfJK/eHNey/
 CKrOT+KAxU8VU29AEIbYzlL3yrQmULcEoIDiqJ/6m5m6JwsEbP6EqQHs0TiuQFpq
 A0MO9rw=
 =yjUP
 -----END PGP SIGNATURE-----

Merge tag 'v4.17-rc5' into sched/core, to pick up fixes and dependencies

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 09:02:14 +02:00
Linus Torvalds
66e1c94db3 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/pti updates from Thomas Gleixner:
 "A mixed bag of fixes and updates for the ghosts which are hunting us.

  The scheduler fixes have been pulled into that branch to avoid
  conflicts.

   - A set of fixes to address a khread_parkme() race which caused lost
     wakeups and loss of state.

   - A deadlock fix for stop_machine() solved by moving the wakeups
     outside of the stopper_lock held region.

   - A set of Spectre V1 array access restrictions. The possible
     problematic spots were discuvered by Dan Carpenters new checks in
     smatch.

   - Removal of an unused file which was forgotten when the rest of that
     functionality was removed"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Remove unused file
  perf/x86/cstate: Fix possible Spectre-v1 indexing for pkg_msr
  perf/x86/msr: Fix possible Spectre-v1 indexing in the MSR driver
  perf/x86: Fix possible Spectre-v1 indexing for x86_pmu::event_map()
  perf/x86: Fix possible Spectre-v1 indexing for hw_perf_event cache_*
  perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]
  sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
  sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
  sched/core: Introduce set_special_state()
  kthread, sched/wait: Fix kthread_parkme() completion issue
  kthread, sched/wait: Fix kthread_parkme() wait-loop
  sched/fair: Fix the update of blocked load when newly idle
  stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
2018-05-13 10:53:08 -07:00
Linus Torvalds
86a4ac433b Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "Revert the new NUMA aware placement approach which turned out to
  create more problems than it solved"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
2018-05-13 10:46:53 -07:00
Marc Zyngier
0be8153cbc genirq/msi: Allow level-triggered MSIs to be exposed by MSI providers
So far, MSIs have been used to signal edge-triggered interrupts, as
a write is a good model for an edge (you can't "unwrite" something).
On the other hand, routing zillions of wires in an SoC because you
need level interrupts is a bit extreme.

People have come up with a variety of schemes to support this, which
involves sending two messages: one to signal the interrupt, and one
to clear it. Since the kernel cannot represent this, we've ended up
with side-band mechanisms that are pretty awful.

Instead, let's acknoledge the requirement, and ensure that, under the
right circumstances, the irq_compose_msg and irq_write_msg can take
as a parameter an array of two messages instead of a pointer to a
single one. We also add some checking that the compose method only
clobbers the second message if the MSI domain has been created with
the MSI_FLAG_LEVEL_CAPABLE flags.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <robh@kernel.org>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lkml.kernel.org/r/20180508121438.11301-2-marc.zyngier@arm.com
2018-05-13 15:58:59 +02:00
Chen Lin
ed772ec873 timer_list: Remove unused function pointer typedef
Remove the 'printf_fn_t' typedef as it is not used.

Signed-off-by: Chen Lin <chen45464546@163.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: sboyd@kernel.org
Cc: john.stultz@linaro.org
Link: https://lkml.kernel.org/r/1526053649-24229-1-git-send-email-chen45464546@163.com
2018-05-13 15:56:01 +02:00
Mauro Carvalho Chehab
bf9c96bec7 timers: Adjust a kernel-doc comment
Those three warnings can easily solved by using :: to indicate a
code block:

	./kernel/time/timer.c:1259: WARNING: Unexpected indentation.
	./kernel/time/timer.c:1261: WARNING: Unexpected indentation.
	./kernel/time/timer.c:1262: WARNING: Block quote ends without a blank line; unexpected unindent.

While here, align the lines at the block.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Linux Doc Mailing List <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: https://lkml.kernel.org/r/f02e6a0ce27f3b5e33415d92d07a40598904b3ee.1525684985.git.mchehab%2Bsamsung@kernel.org
2018-05-13 15:55:43 +02:00
Sudeep Holla
1332a90558 tick: Prefer a lower rating device only if it's CPU local device
Checking the equality of cpumask for both new and old tick device doesn't
ensure that it's CPU local device. This will cause issue if a low rating
clockevent tick device is registered first followed by the registration
of higher rating clockevent tick device.

In such case, clockevents_released list will never get emptied as both
the devices get selected as preferred one and we will loop forever in
clockevents_notify_released.

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: https://lkml.kernel.org/r/1525881728-4858-1-git-send-email-sudeep.holla@arm.com
2018-05-13 15:07:41 +02:00
Mel Gorman
789ba28013 Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
This reverts commit 7347fc87dfe6b7315e74310ee1243dc222c68086.

Srikar Dronamra pointed out that while the commit in question did show
a performance improvement on ppc64, it did so at the cost of disabling
active CPU migration by automatic NUMA balancing which was not the intent.
The issue was that a serious flaw in the logic failed to ever active balance
if SD_WAKE_AFFINE was disabled on scheduler domains. Even when it's enabled,
the logic is still bizarre and against the original intent.

Investigation showed that fixing the patch in either the way he suggested,
using the correct comparison for jiffies values or introducing a new
numa_migrate_deferred variable in task_struct all perform similarly to a
revert with a mix of gains and losses depending on the workload, machine
and socket count.

The original intent of the commit was to handle a problem whereby
wake_affine, idle balancing and automatic NUMA balancing disagree on the
appropriate placement for a task. This was particularly true for cases where
a single task was a massive waker of tasks but where wake_wide logic did
not apply.  This was particularly noticeable when a futex (a barrier) woke
all worker threads and tried pulling the wakees to the waker nodes. In that
specific case, it could be handled by tuning MPI or openMP appropriately,
but the behavior is not illogical and was worth attempting to fix. However,
the approach was wrong. Given that we're at rc4 and a fix is not obvious,
it's better to play safe, revert this commit and retry later.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: ggherdovich@suse.cz
Cc: hpa@zytor.com
Cc: matt@codeblueprint.co.uk
Cc: mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-12 08:37:56 +02:00
Linus Torvalds
f0ab773f5c Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  rbtree: include rcu.h
  scripts/faddr2line: fix error when addr2line output contains discriminator
  ocfs2: take inode cluster lock before moving reflinked inode from orphan dir
  mm, oom: fix concurrent munlock and oom reaper unmap, v3
  mm: migrate: fix double call of radix_tree_replace_slot()
  proc/kcore: don't bounds check against address 0
  mm: don't show nr_indirectly_reclaimable in /proc/vmstat
  mm: sections are not offlined during memory hotremove
  z3fold: fix reclaim lock-ups
  init: fix false positives in W+X checking
  lib/find_bit_benchmark.c: avoid soft lockup in test_find_first_bit()
  KASAN: prohibit KASAN+STRUCTLEAK combination
  MAINTAINERS: update Shuah's email address
2018-05-11 18:04:12 -07:00
David S. Miller
b2d6cee117 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The bpf syscall and selftests conflicts were trivial
overlapping changes.

The r8169 change involved moving the added mdelay from 'net' into a
different function.

A TLS close bug fix overlapped with the splitting of the TLS state
into separate TX and RX parts.  I just expanded the tests in the bug
fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf
== X".

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 20:53:22 -04:00
Jeffrey Hugo
ae646f0b9c init: fix false positives in W+X checking
load_module() creates W+X mappings via __vmalloc_node_range() (from
layout_and_allocate()->move_module()->module_alloc()) by using
PAGE_KERNEL_EXEC.  These mappings are later cleaned up via
"call_rcu_sched(&freeinit->rcu, do_free_init)" from do_init_module().

This is a problem because call_rcu_sched() queues work, which can be run
after debug_checkwx() is run, resulting in a race condition.  If hit,
the race results in a nasty splat about insecure W+X mappings, which
results in a poor user experience as these are not the mappings that
debug_checkwx() is intended to catch.

This issue is observed on multiple arm64 platforms, and has been
artificially triggered on an x86 platform.

Address the race by flushing the queued work before running the
arch-defined mark_rodata_ro() which then calls debug_checkwx().

Link: http://lkml.kernel.org/r/1525103946-29526-1-git-send-email-jhugo@codeaurora.org
Fixes: e1a58320a38d ("x86/mm: Warn on W^X mappings")
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Reported-by: Timur Tabi <timur@codeaurora.org>
Reported-by: Jan Glauber <jan.glauber@caviumnetworks.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Linus Torvalds
4bc871984f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Verify lengths of keys provided by the user is AF_KEY, from Kevin
    Easton.

 2) Add device ID for BCM89610 PHY. Thanks to Bhadram Varka.

 3) Add Spectre guards to some ATM code, courtesy of Gustavo A. R.
    Silva.

 4) Fix infinite loop in NSH protocol code. To Eric Dumazet we are most
    grateful for this fix.

 5) Line up /proc/net/netlink headers properly. This fix from YU Bo, we
    do appreciate.

 6) Use after free in TLS code. Once again we are blessed by the
    honorable Eric Dumazet with this fix.

 7) Fix regression in TLS code causing stalls on partial TLS records.
    This fix is bestowed upon us by Andrew Tomt.

 8) Deal with too small MTUs properly in LLC code, another great gift
    from Eric Dumazet.

 9) Handle cached route flushing properly wrt. MTU locking in ipv4, to
    Hangbin Liu we give thanks for this.

10) Fix regression in SO_BINDTODEVIC handling wrt. UDP socket demux.
    Paolo Abeni, he gave us this.

11) Range check coalescing parameters in mlx4 driver, thank you Moshe
    Shemesh.

12) Some ipv6 ICMP error handling fixes in rxrpc, from our good brother
    David Howells.

13) Fix kexec on mlx5 by freeing IRQs in shutdown path. Daniel Juergens,
    you're the best!

14) Don't send bonding RLB updates to invalid MAC addresses. Debabrata
    Benerjee saved us!

15) Uh oh, we were leaking in udp_sendmsg and ping_v4_sendmsg. The ship
    is now water tight, thanks to Andrey Ignatov.

16) IPSEC memory leak in ixgbe from Colin Ian King, man we've got holes
    everywhere!

17) Fix error path in tcf_proto_create, Jiri Pirko what would we do
    without you!

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (92 commits)
  net sched actions: fix refcnt leak in skbmod
  net: sched: fix error path in tcf_proto_create() when modules are not configured
  net sched actions: fix invalid pointer dereferencing if skbedit flags missing
  ixgbe: fix memory leak on ipsec allocation
  ixgbevf: fix ixgbevf_xmit_frame()'s return type
  ixgbe: return error on unsupported SFP module when resetting
  ice: Set rq_last_status when cleaning rq
  ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg
  mlxsw: core: Fix an error handling path in 'mlxsw_core_bus_device_register()'
  bonding: send learning packets for vlans on slave
  bonding: do not allow rlb updates to invalid mac
  net/mlx5e: Err if asked to offload TC match on frag being first
  net/mlx5: E-Switch, Include VF RDMA stats in vport statistics
  net/mlx5: Free IRQs in shutdown path
  rxrpc: Trace UDP transmission failure
  rxrpc: Add a tracepoint to log ICMP/ICMP6 and error messages
  rxrpc: Fix the min security level for kernel calls
  rxrpc: Fix error reception on AF_INET6 sockets
  rxrpc: Fix missing start of call timeout
  qed: fix spelling mistake: "taskelt" -> "tasklet"
  ...
2018-05-11 14:14:46 -07:00
Linus Torvalds
c110a8b792 Working on some new updates to trace filtering, I noticed that the
regex_match_front() test was updated to be limited to the size
 of the pattern instead of the full test string. But as the test string
 is not guaranteed to be nul terminated, it still needs to consider
 the size of the test string.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWvWzNRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhiPAP9bmOzqT3YK+dF19pLJCrmjyF95Wh85
 /10xaH3G1Q5e8AEA3ZXQqVNEGnaEs2uO/c5yvTP6/k1WEfGuTqTO5IH2hwI=
 =cKB5
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Working on some new updates to trace filtering, I noticed that the
  regex_match_front() test was updated to be limited to the size of the
  pattern instead of the full test string.

  But as the test string is not guaranteed to be nul terminated, it
  still needs to consider the size of the test string"

* tag 'trace-v4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix regex_match_front() to not over compare the test string
2018-05-11 13:04:35 -07:00
Linus Torvalds
41e3e10823 Power management fixes for 4.17-rc5
- Restore device_may_wakeup() check in pci_enable_wake() removed
    inadvertently during the 4.13 cycle to prevent systems from
    drawing excessive power when suspended or off, among other
    things (Rafael Wysocki).
 
  - Fix pci_dev_run_wake() to properly handle devices that only can
    signal PME# when in the D3cold power state (Kai Heng Feng).
 
  - Fix the schedutil cpufreq governor to avoid using UINT_MAX
    as the new CPU frequency in some cases due to a missing check
    (Rafael Wysocki).
 
  - Remove a stale comment regarding worker kthreads from the
    schedutil cpufreq governor (Juri Lelli).
 
  - Fix a copy-paste mistake in the intel_pstate driver documentation
    (Juri Lelli).
 
  - Fix a typo in the system sleep states documentation (Jonathan
    Neuschäfer).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJa9ZxLAAoJEILEb/54YlRxosQQAIoRa353q55oy3hNUKzybOY0
 z2MtQjjgDQsRKKFe8hbfjLy0QnSQCUASW8LaHpfDBqeO8ZR2TwRwR7H8b3dUpZj9
 ehsOrzNNnOlj1rSAbRaUfPJU1fA8HDoWcfwaKHwUVYXr9zwZTFv2x4UTJ2+bmOx9
 UdCI0Jl2aKtBSe+SPGNiSewQ3oLD3LYcv9VV/sTJ1XP0Wmwr0SoikzDIiJCo+lo1
 gXvQlM7ngxKtt02k4XUYEUjt49TrjWjLNQrAXVvFI7kn1KRlkzLl1E1g299/DxRw
 CSTboeDOkaKGJP84YmvdEUBp+IF1bQ8JwPe/Q/8i5+1MvBnvLgXOPlqpLAKAVjxr
 NBI7aAb83Q0aAecx0ioPVET9EDQ+AVrCj20PnitURfy1nl059knNwrvSnqCw1uLD
 JGVY2z4mm4zI2LlaUWKCK0PLTgucRZIU8HUiiBsI2u42KmG3EdfoDzvNUsxcZ146
 5Q+asEKTJoqltJfxwgQGaix7xXC75JVE65ICWB29ba3RddFZ7r4pu+pTg7yEsrpX
 98p3CPmQjbVbX5wcs9l0H0lYrOCEZj4saDHsmQ+62fQRu9VhxeSHmWBykOM9/k2j
 TRpRJK59BeeUMRtf1676B/uKevfuuT8seSXWtQwyWZc+Z+ZTJq/WKxVN7iV6/F21
 95RVu+yL1bhNKDjzJhyG
 =bCt1
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix two PCI power management regressions from the 4.13 cycle and
  one cpufreq schedutil governor bug introduced during the 4.12 cycle,
  drop a stale comment from the schedutil code and fix two mistakes in
  docs.

  Specifics:

   - Restore device_may_wakeup() check in pci_enable_wake() removed
     inadvertently during the 4.13 cycle to prevent systems from drawing
     excessive power when suspended or off, among other things (Rafael
     Wysocki).

   - Fix pci_dev_run_wake() to properly handle devices that only can
     signal PME# when in the D3cold power state (Kai Heng Feng).

   - Fix the schedutil cpufreq governor to avoid using UINT_MAX as the
     new CPU frequency in some cases due to a missing check (Rafael
     Wysocki).

   - Remove a stale comment regarding worker kthreads from the schedutil
     cpufreq governor (Juri Lelli).

   - Fix a copy-paste mistake in the intel_pstate driver documentation
     (Juri Lelli).

   - Fix a typo in the system sleep states documentation (Jonathan
     Neuschäfer)"

* tag 'pm-4.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PCI / PM: Check device_may_wakeup() in pci_enable_wake()
  PCI / PM: Always check PME wakeup capability for runtime wakeup support
  cpufreq: schedutil: Avoid using invalid next_freq
  cpufreq: schedutil: remove stale comment
  PM: docs: intel_pstate: fix Active Mode w/o HWP paragraph
  PM: docs: sleep-states: Fix a typo ("includig")
2018-05-11 09:49:02 -07:00
Steven Rostedt (VMware)
dc432c3d7f tracing: Fix regex_match_front() to not over compare the test string
The regex match function regex_match_front() in the tracing filter logic,
was fixed to test just the pattern length from testing the entire test
string. That is, it went from strncmp(str, r->pattern, len) to
strcmp(str, r->pattern, r->len).

The issue is that str is not guaranteed to be nul terminated, and if r->len
is greater than the length of str, it can access more memory than is
allocated.

The solution is to add a simple test if (len < r->len) return 0.

Cc: stable@vger.kernel.org
Fixes: 285caad415f45 ("tracing/filters: Fix MATCH_FRONT_ONLY filter matching")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-05-11 10:56:42 -04:00
Jann Horn
0a0b987344 compat: fix 4-byte infoleak via uninitialized struct field
Commit 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to
native counterparts") removed the memset() in compat_get_timex().  Since
then, the compat adjtimex syscall can invoke do_adjtimex() with an
uninitialized ->tai.

If do_adjtimex() doesn't write to ->tai (e.g.  because the arguments are
invalid), compat_put_timex() then copies the uninitialized ->tai field
to userspace.

Fix it by adding the memset() back.

Fixes: 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to native counterparts")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-10 17:51:58 -07:00
Al Viro
4c392e6591 powerpc/syscalls: switch rtas(2) to SYSCALL_DEFINE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[mpe: Update sys_ni.c for s/ppc_rtas/sys_rtas/]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-05-10 23:25:14 +10:00
Doug Berger
2ef7c01c0c PM / wakeup: Only update last time for active wakeup sources
When wakelock support was added, the wakeup_source_add() function
was updated to set the last_time value of the wakeup source. This
has the unintended side effect of producing confusing output from
pm_print_active_wakeup_sources() when a wakeup source is added
prior to a sleep that is blocked by a different wakeup source.

The function pm_print_active_wakeup_sources() will search for the
most recently active wakeup source when no active source is found.
If a wakeup source is added after a different wakeup source blocks
the system from going to sleep it may have a later last_time value
than the blocking source and be output as the last active wakeup
source even if it has never actually been active.

It looks to me like the change to wakeup_source_add() was made to
prevent the wakelock garbage collection from accidentally dropping
a wakelock during the narrow window between adding the wakelock to
the wakelock list in wakelock_lookup_add() and the activation of
the wakeup source in pm_wake_lock().

This commit changes the behavior so that only the last_time of the
wakeup source used by a wakelock is initialized prior to adding it
to the wakeup source list. This preserves the meaning of the
last_time value as the last time the wakeup source was active and
allows a wakeup source that has never been active to have a
last_time value of 0.

Fixes: b86ff9820fd5 (PM / Sleep: Add user space interface for manipulating wakeup sources, v3)
Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-10 11:39:19 +02:00
Jakub Kicinski
0d83003256 bpf: xdp: allow offloads to store into rx_queue_index
It's fairly easy for offloaded XDP programs to select the RX queue
packets go to.  We need a way of expressing this in the software.
Allow write to the rx_queue_index field of struct xdp_md for
device-bound programs.

Skip convert_ctx_access callback entirely for offloads.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-09 18:04:36 +02:00
Martin KaFai Lau
62dab84c81 bpf: btf: Add struct bpf_btf_info
During BPF_OBJ_GET_INFO_BY_FD on a btf_fd, the current bpf_attr's
info.info is directly filled with the BTF binary data.  It is
not extensible.  In this case, we want to add BTF ID.

This patch adds "struct bpf_btf_info" which has the BTF ID as
one of its member.  The BTF binary data itself is exposed through
the "btf" and "btf_size" members.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-09 17:25:13 +02:00
Martin KaFai Lau
78958fca7e bpf: btf: Introduce BTF ID
This patch gives an ID to each loaded BTF.  The ID is allocated by
the idr like the existing prog-id and map-id.

The bpf_put(map->btf) is moved to __bpf_map_put() so that the
userspace can stop seeing the BTF ID ASAP when the last BTF
refcnt is gone.

It also makes BTF accessible from userspace through the
1. new BPF_BTF_GET_FD_BY_ID command.  It is limited to CAP_SYS_ADMIN
   which is inline with the BPF_BTF_LOAD cmd and the existing
   BPF_[MAP|PROG]_GET_FD_BY_ID cmd.
2. new btf_id (and btf_key_id + btf_value_id) in "struct bpf_map_info"

Once the BTF ID handler is accessible from userspace, freeing a BTF
object has to go through a rcu period.  The BPF_BTF_GET_FD_BY_ID cmd
can then be done under a rcu_read_lock() instead of taking
spin_lock.
[Note: A similar rcu usage can be done to the existing
       bpf_prog_get_fd_by_id() in a follow up patch]

When processing the BPF_BTF_GET_FD_BY_ID cmd,
refcount_inc_not_zero() is needed because the BTF object
could be already in the rcu dead row .  btf_get() is
removed since its usage is currently limited to btf.c
alone.  refcount_inc() is used directly instead.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-09 17:25:13 +02:00
Martin KaFai Lau
82e9697250 bpf: btf: Avoid WARN_ON when CONFIG_REFCOUNT_FULL=y
If CONFIG_REFCOUNT_FULL=y, refcount_inc() WARN when refcount is 0.
When creating a new btf, the initial btf->refcnt is 0 and
triggered the following:

[   34.855452] refcount_t: increment on 0; use-after-free.
[   34.856252] WARNING: CPU: 6 PID: 1857 at lib/refcount.c:153 refcount_inc+0x26/0x30
....
[   34.868809] Call Trace:
[   34.869168]  btf_new_fd+0x1af6/0x24d0
[   34.869645]  ? btf_type_seq_show+0x200/0x200
[   34.870212]  ? lock_acquire+0x3b0/0x3b0
[   34.870726]  ? security_capable+0x54/0x90
[   34.871247]  __x64_sys_bpf+0x1b2/0x310
[   34.871761]  ? __ia32_sys_bpf+0x310/0x310
[   34.872285]  ? bad_area_access_error+0x310/0x310
[   34.872894]  do_syscall_64+0x95/0x3f0

This patch uses refcount_set() instead.

Reported-by: Yonghong Song <yhs@fb.com>
Tested-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-09 17:25:13 +02:00
Rafael J. Wysocki
97739501f2 cpufreq: schedutil: Avoid using invalid next_freq
If the next_freq field of struct sugov_policy is set to UINT_MAX,
it shouldn't be used for updating the CPU frequency (this is a
special "invalid" value), but after commit b7eaf1aab9f8 (cpufreq:
schedutil: Avoid reducing frequency of busy CPUs prematurely) it
may be passed as the new frequency to sugov_update_commit() in
sugov_update_single().

Fix that by adding an extra check for the special UINT_MAX value
of next_freq to sugov_update_single().

Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
Reported-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: 4.12+ <stable@vger.kernel.org> # 4.12+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-09 12:21:17 +02:00
Juri Lelli
a744490f12 cpufreq: schedutil: remove stale comment
After commit 794a56ebd9a57 (sched/cpufreq: Change the worker kthread to
SCHED_DEADLINE) schedutil kthreads are "ignored" for a clock frequency
selection point of view, so the potential corner case for RT tasks is not
possible at all now.

Remove the stale comment mentioning it.

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-05-09 12:20:24 +02:00
Tyler Hicks
326bee0286 seccomp: Don't special case audited processes when logging
Seccomp logging for "handled" actions such as RET_TRAP, RET_TRACE, or
RET_ERRNO can be very noisy for processes that are being audited. This
patch modifies the seccomp logging behavior to treat processes that are
being inspected via the audit subsystem the same as processes that
aren't under inspection. Handled actions will no longer be logged just
because the process is being inspected. Since v4.14, applications have
the ability to request logging of handled actions by using the
SECCOMP_FILTER_FLAG_LOG flag when loading seccomp filters.

With this patch, the logic for deciding if an action will be logged is:

  if action == RET_ALLOW:
    do not log
  else if action not in actions_logged:
    do not log
  else if action == RET_KILL:
    log
  else if action == RET_LOG:
    log
  else if filter-requests-logging:
    log
  else:
    do not log

Reported-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-08 02:04:23 -04:00
Tyler Hicks
ea6eca7785 seccomp: Audit attempts to modify the actions_logged sysctl
The decision to log a seccomp action will always be subject to the
value of the kernel.seccomp.actions_logged sysctl, even for processes
that are being inspected via the audit subsystem, in an upcoming patch.
Therefore, we need to emit an audit record on attempts at writing to the
actions_logged sysctl when auditing is enabled.

This patch updates the write handler for the actions_logged sysctl to
emit an audit record on attempts to write to the sysctl. Successful
writes to the sysctl will result in a record that includes a normalized
list of logged actions in the "actions" field and a "res" field equal to
1. Unsuccessful writes to the sysctl will result in a record that
doesn't include the "actions" field and has a "res" field equal to 0.

Not all unsuccessful writes to the sysctl are audited. For example, an
audit record will not be emitted if an unprivileged process attempts to
open the sysctl file for reading since that access control check is not
part of the sysctl's write handler.

Below are some example audit records when writing various strings to the
actions_logged sysctl.

Writing "not-a-real-action", when the kernel.seccomp.actions_logged
sysctl previously was "kill_process kill_thread trap errno trace log",
emits this audit record:

 type=CONFIG_CHANGE msg=audit(1525392371.454:120): op=seccomp-logging
 actions=? old-actions=kill_process,kill_thread,trap,errno,trace,log
 res=0

If you then write "kill_process kill_thread errno trace log", this audit
record is emitted:

 type=CONFIG_CHANGE msg=audit(1525392401.645:126): op=seccomp-logging
 actions=kill_process,kill_thread,errno,trace,log
 old-actions=kill_process,kill_thread,trap,errno,trace,log res=1

If you then write "log log errno trace kill_process kill_thread", which
is unordered and contains the log action twice, it results in the same
actions value as the previous record:

 type=CONFIG_CHANGE msg=audit(1525392436.354:132): op=seccomp-logging
 actions=kill_process,kill_thread,errno,trace,log
 old-actions=kill_process,kill_thread,errno,trace,log res=1

If you then write an empty string to the sysctl, this audit record is
emitted:

 type=CONFIG_CHANGE msg=audit(1525392494.413:138): op=seccomp-logging
 actions=(none) old-actions=kill_process,kill_thread,errno,trace,log
 res=1

No audit records are generated when reading the actions_logged sysctl.

Suggested-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-08 02:03:28 -04:00
Tyler Hicks
beb44acaf0 seccomp: Configurable separator for the actions_logged string
The function that converts a bitmask of seccomp actions that are
allowed to be logged is currently only used for constructing the display
string for the kernel.seccomp.actions_logged sysctl. That string wants a
space character to be used for the separator between actions.

A future patch will make use of the same function for building a string
that will be sent to the audit subsystem for tracking modifications to
the kernel.seccomp.actions_logged sysctl. That string will need to use a
comma as a separator. This patch allows the separator character to be
configurable to meet both needs.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-08 02:02:25 -04:00
Tyler Hicks
d013db0294 seccomp: Separate read and write code for actions_logged sysctl
Break the read and write paths of the kernel.seccomp.actions_logged
sysctl into separate functions to maintain readability. An upcoming
change will need to audit writes, but not reads, of this sysctl which
would introduce too many conditional code paths on whether or not the
'write' parameter evaluates to true.

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2018-05-08 02:01:09 -04:00
David S. Miller
01adc4851a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Minor conflict, a CHECK was placed into an if() statement
in net-next, whilst a newline was added to that CHECK
call in 'net'.  Thanks to Daniel for the merge resolution.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-07 23:35:08 -04:00
Andy Shevchenko
cc659e76f3 rdmacg: Convert to use match_string() helper
The new helper returns index of the matching string in an array.
We are going to use it here.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-05-07 09:27:26 -07:00
Linus Torvalds
fe282c609d Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clocksource fixes from Thomas Gleixner:
 "The recent addition of the early TSC clocksource breaks on machines
  which have an unstable TSC because in case that TSC is disabled, then
  the clocksource selection logic falls back to the early TSC which is
  obviously bogus.

  That also unearthed a few robustness issues in the clocksource
  derating code which are addressed as well"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Rework stale comment
  clocksource: Consistent de-rate when marking unstable
  x86/tsc: Fix mark_tsc_unstable()
  clocksource: Initialize cs->wd_list
  clocksource: Allow clocksource_mark_unstable() on unregistered clocksources
  x86/tsc: Always unregister clocksource_tsc_early
2018-05-06 05:35:23 -10:00
Ingo Molnar
12e2c41148 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05 10:01:34 +02:00
Linus Torvalds
4b293907d3 Some of the files in the tracing directory show file mode 0444
when they are writable by root. To fix the confusion, they should
 be 0644. Note, either case root can still write to them.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCWuyBchQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmDLAQDyddL4DS480WXv3t3I/ZPwjHVuI4qS
 cPUsAsjn3Xs9wAD+O6/rE8SL/Q2tUIWlWk9wC4YpGqEoR6R3x98qpnGP3gA=
 =L/Kw
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.17-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Some of the files in the tracing directory show file mode 0444 when
  they are writable by root. To fix the confusion, they should be 0644.
  Note, either case root can still write to them.

  Zhengyuan asked why I never applied that patch (the first one is from
  2014!). I simply forgot about it. /me lowers head in shame"

* tag 'trace-v4.17-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix the file mode of stack tracer
  ftrace: Have set_graph_* files have normal file modes
2018-05-04 20:57:28 -10:00
Peter Zijlstra
4411ec1d19 perf/core: Fix possible Spectre-v1 indexing for ->aux_pages[]
> kernel/events/ring_buffer.c:871 perf_mmap_to_page() warn: potential spectre issue 'rb->aux_pages'

Userspace controls @pgoff through the fault address. Sanitize the
array index before doing the array dereference.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05 08:37:27 +02:00
Peter Zijlstra
354d779307 sched/autogroup: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
> kernel/sched/autogroup.c:230 proc_sched_autogroup_set_nice() warn: potential spectre issue 'sched_prio_to_weight'

Userspace controls @nice, sanitize the array index.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05 08:34:42 +02:00
Peter Zijlstra
7281c8dec8 sched/core: Fix possible Spectre-v1 indexing for sched_prio_to_weight[]
> kernel/sched/core.c:6921 cpu_weight_nice_write_s64() warn: potential spectre issue 'sched_prio_to_weight'

Userspace controls @nice, so sanitize the value before using it to
index an array.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-05 08:32:36 +02:00
David S. Miller
2ba5622fba Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-05-05

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Sanitize attr->{prog,map}_type from bpf(2) since used as an array index
   to retrieve prog/map specific ops such that we prevent potential out of
   bounds value under speculation, from Mark and Daniel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-04 19:50:10 -04:00
Thomas Gleixner
8bf37d8c06 seccomp: Move speculation migitation control to arch code
The migitation control is simpler to implement in architecture code as it
avoids the extra function call to check the mode. Aside of that having an
explicit seccomp enabled mode in the architecture mitigations would require
even more workarounds.

Move it into architecture code and provide a weak function in the seccomp
code. Remove the 'which' argument as this allows the architecture to decide
which mitigations are relevant for seccomp.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-05-05 00:51:44 +02:00