24417 Commits

Author SHA1 Message Date
Deepa Dinamani
d2e3e0ca5d time: Change k_clock clock_getres() to use timespec64
struct timespec is not y2038 safe on 32 bit machines.  Replace uses of
struct timespec with struct timespec64 in the kernel. The syscall
interfaces themselves will be changed in a separate series.

The clock_getres() interface has also been changed to use timespec64 even
though this particular interface is not affected by the y2038 problem. This
helps verification for internal kernel code for y2038 readiness by getting
rid of time_t/ timeval/ timespec completely.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: y2038@lists.linaro.org
Cc: john.stultz@linaro.org
Cc: arnd@arndb.de
Link: http://lkml.kernel.org/r/1490555058-4603-5-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:55 +02:00
Deepa Dinamani
3c9c12f4b4 time: Change k_clock clock_get() to use timespec64
struct timespec is not y2038 safe on 32 bit machines.  Replace uses of
struct timespec with struct timespec64 in the kernel.

The syscall interfaces themselves will be changed in a separate series.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: y2038@lists.linaro.org
Cc: john.stultz@linaro.org
Cc: arnd@arndb.de
Link: http://lkml.kernel.org/r/1490555058-4603-4-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:55 +02:00
Deepa Dinamani
d340266e19 time: Change posix clocks ops interfaces to use timespec64
struct timespec is not y2038 safe on 32 bit machines.

The posix clocks apis use struct timespec directly and through struct
itimerspec.

Replace the posix clock interfaces to use struct timespec64 and struct
itimerspec64 instead.  Also fix up their implementations accordingly.

Note that the clock_getres() interface has also been changed to use
timespec64 even though this particular interface is not affected by the
y2038 problem. This helps verification for internal kernel code for y2038
readiness by getting rid of time_t/ timeval/ timespec.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: arnd@arndb.de
Cc: y2038@lists.linaro.org
Cc: netdev@vger.kernel.org
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: john.stultz@linaro.org
Link: http://lkml.kernel.org/r/1490555058-4603-3-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:54 +02:00
Deepa Dinamani
2ac00f17b2 time: Delete do_sys_setimeofday()
struct timespec is not y2038 safe on 32 bit machines and needs to be
replaced with struct timespec64.

do_sys_timeofday() is just a wrapper function.  Replace all calls to this
function with direct calls to do_sys_timeofday64() instead and delete
do_sys_timeofday().

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: y2038@lists.linaro.org
Cc: john.stultz@linaro.org
Cc: arnd@arndb.de
Cc: linux-alpha@vger.kernel.org
Link: http://lkml.kernel.org/r/1490555058-4603-2-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 21:49:54 +02:00
Matthias Kaehlcke
d170fe7dd9 genirq: Use cpumask_available() for check of cpumask variable
This fixes the following clang warning when CONFIG_CPUMASK_OFFSTACK=n:

kernel/irq/manage.c:839:28: error: address of array
'desc->irq_common_data.affinity' will always evaluate to 'true'
[-Werror,-Wpointer-bool-conversion]

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Cc: Grant Grundler <grundler@chromium.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Hackmann <ghackmann@google.com>
Cc: Michael Davidson <md@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20170412182030.83657-2-mka@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-14 20:49:27 +02:00
Peter Zijlstra
94ffac5d84 futex: Fix small (and harmless looking) inconsistencies
During (post-commit) review Darren spotted a few minor things. One
(harmless AFAICT) type inconsistency and a comment that wasn't as
clear as hoped.

Reported-by: Darren Hart (VMWare) <dvhart@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:30:07 +02:00
Thomas Gleixner
97181f9bd5 futex: Avoid freeing an active timer
Alexander reported a hrtimer debug_object splat:

  ODEBUG: free active (active state 0) object type: hrtimer hint: hrtimer_wakeup (kernel/time/hrtimer.c:1423)

  debug_object_free (lib/debugobjects.c:603)
  destroy_hrtimer_on_stack (kernel/time/hrtimer.c:427)
  futex_lock_pi (kernel/futex.c:2740)
  do_futex (kernel/futex.c:3399)
  SyS_futex (kernel/futex.c:3447 kernel/futex.c:3415)
  do_syscall_64 (arch/x86/entry/common.c:284)
  entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249)

Which was caused by commit:

  cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")

... losing the hrtimer_cancel() in the shuffle. Where previously the
hrtimer_cancel() was done by rt_mutex_slowlock() we now need to do it
manually.

Reported-by: Alexander Levin <alexander.levin@verizon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1704101802370.2906@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:29:53 +02:00
Ingo Molnar
0ba78a95a6 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:29:40 +02:00
Peter Zijlstra
283e2ed399 sched/fair: Move the PELT constants into a generated header
Now that we have a tool to generate the PELT constants in C form,
use its output as a separate header.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:37 +02:00
Peter Zijlstra
bb0bd044e6 sched/fair: Increase PELT accuracy for small tasks
We truncate (and loose) the lower 10 bits of runtime in
___update_load_avg(), this means there's a consistent bias to
under-account tasks. This is esp. significant for small tasks.

Cure this by only forwarding last_update_time to the point we've
actually accounted for, leaving the remainder for the next time.

Reported-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:36 +02:00
Peter Zijlstra
3841cdc310 sched/fair: Fix comments
Historically our periods (or p) argument in PELT denoted the number of
full periods (what is now d2). However recent patches have changed
this to the total decay (previously p+1), leading to a confusing
discrepancy between comments and code.

Try and clarify things by making periods (in code) and p (in comments)
be the same thing (again).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:36 +02:00
Peter Zijlstra
05296e7535 sched/fair: Fix corner case in __accumulate_sum()
Paul noticed that in the (periods >= LOAD_AVG_MAX_N) case in
__accumulate_sum(), the returned contribution value (LOAD_AVG_MAX) is
incorrect.

This is because at this point, the decay_load() on the old state --
the first step in accumulate_sum() -- will not have resulted in 0, and
will therefore result in a sum larger than the maximum value of our
series. Obviously broken.

Note that:

	decay_load(LOAD_AVG_MAX, LOAD_AVG_MAX_N) =

                1   (345 / 32)
	47742 * - ^            = ~27
                2

Not to mention that any further contribution from the d3 segment (our
new period) would also push it over the maximum.

Solve this by noting that we can write our c2 term:

		    p
	c2 = 1024 \Sum y^n
		   n=1

In terms of our maximum value:

		    inf		      inf	  p
	max = 1024 \Sum y^n = 1024 ( \Sum y^n + \Sum y^n + y^0 )
		    n=0		      n=p+1	 n=1

Further note that:

           inf              inf            inf
        ( \Sum y^n ) y^p = \Sum y^(n+p) = \Sum y^n
           n=0              n=0            n=p

Combined that gives us:

		    p
	c2 = 1024 \Sum y^n
		   n=1

		     inf        inf
	   = 1024 ( \Sum y^n - \Sum y^n - y^0 )
		     n=0        n=p+1

	   = max - (max y^(p+1)) - 1024

Further simplify things by dealing with p=0 early on.

Reported-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Cc: linux-kernel@vger.kernel.org
Fixes: a481db34b9be ("sched/fair: Optimize ___update_sched_avg()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-14 10:26:34 +02:00
Keith Busch
3412386b53 irq/affinity: Fix extra vecs calculation
This fixes a math error calculating the extra_vecs. The error assumed
only 1 cpu per vector, but the value needs to account for the actual
number of cpus per vector in order to get the correct remainder for
extra CPU assignment.

Fixes: 7bf8222b9bd0 ("irq/affinity: Fix CPU spread for unbalanced nodes")
Reported-by: Xiaolong Ye <xiaolong.ye@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Link: http://lkml.kernel.org/r/1492104492-19943-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-13 23:41:00 +02:00
Rafael J. Wysocki
39b64aa1c0 cpufreq: schedutil: Reduce frequencies slower
The schedutil governor reduces frequencies too fast in some
situations which cases undesirable performance drops to
appear.

To address that issue, make schedutil reduce the frequency slower by
setting it to the average of the value chosen during the previous
iteration of governor computations and the new one coming from its
frequency selection formula.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=194963
Reported-by: John <john.ettedgui@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2017-04-13 03:46:40 +02:00
Linus Torvalds
b9b3322f13 Merge branch 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit
Pull audit fix from Paul Moore:
 "One more small audit fix, this should be the last for v4.11.

  Seth Forshee noticed a problem where the audit retry queue wasn't
  being flushed properly when audit was enabled and the audit daemon
  wasn't running; this patches fixes the problem (see the commit
  description for more details on the change).

  Both Seth and I have tested this and everything looks good"

* 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
  audit: make sure we don't let the retry queue grow without bounds
2017-04-12 00:02:33 -07:00
Linus Torvalds
06ea4c38bc Merge branch 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "This contains fixes for two long standing subtle bugs:

   - kthread_bind() on a new kthread binds it to specific CPUs and
     prevents userland from messing with the affinity or cgroup
     membership. Unfortunately, for cgroup membership, there's a window
     between kthread creation and kthread_bind*() invocation where the
     kthread can be moved into a non-root cgroup by userland.

     Depending on what controllers are in effect, this can assign the
     kthread unexpected attributes. For example, in the reported case,
     workqueue workers ended up in a non-root cpuset cgroups and had
     their CPU affinities overridden. This broke workqueue invariants
     and led to workqueue stalls.

     Fixed by closing the window between kthread creation and
     kthread_bind() as suggested by Oleg.

   - There was a bug in cgroup mount path which could allow two
     competing mount attempts to attach the same cgroup_root to two
     different superblocks.

     This was caused by mishandling return value from kernfs_pin_sb().

     Fixed"

* 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: avoid attaching a cgroup root to two different superblocks
  cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
2017-04-11 23:38:16 -07:00
Johannes Berg
96a94cc515 bpf: reference may_access_skb() from __bpf_prog_run()
It took me quite some time to figure out how this was linked,
so in order to save the next person the effort of finding it
add a comment in __bpf_prog_run() that indicates what exactly
determines that a program can access the ctx == skb.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 10:54:27 -04:00
NeilBrown
717a94b5fc sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags()
It is not safe for one thread to modify the ->flags
of another thread as there is no locking that can protect
the update.

So tsk_restore_flags(), which takes a task pointer and modifies
the flags, is an invitation to do the wrong thing.

All current users pass "current" as the task, so no developers have
accepted that invitation.  It would be best to ensure it remains
that way.

So rename tsk_restore_flags() to current_restore_flags() and don't
pass in a task_struct pointer.  Always operate on current->flags.

Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-11 09:06:32 +02:00
Ingo Molnar
36a4dfc378 Linux 4.11-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJY6mY1AAoJEHm+PkMAQRiGB14IAImsH28JPjxJVDasMIRPBxVc
 euPPlZgoBieu7sNt+kEsEqdkXuu0MLk6gln0IGxWLeoB2S+u3Tz5LMa2YArVqV9Z
 tWzOnI9auE73P2Pz/tUMOdyMs5tO0PolQxX3uljbULBozOHjHRh13fsXchX2yQvl
 mFeFCDqpPV0KhWRH/ciA8uIHdvYPhMpkKgRtmR8jXL0yzqLp6+2J+Bs8nHG4NNng
 HMVxZPC8jOE/TgWq6k/GmXgxh3H/AideFdHFbLKYnIFJW41ZGOI8a262zq3NmjPd
 lywpVU7O7RMhSITY5PnuR3LpNV8ftw1hz2y6t35unyFK1P02adOSj5GJ3hGdhaQ=
 =Xz5O
 -----END PGP SIGNATURE-----

Merge tag 'v4.11-rc6' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-11 09:05:36 +02:00
Ingo Molnar
84b1e36a6a Linux 4.11-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJY6mY1AAoJEHm+PkMAQRiGB14IAImsH28JPjxJVDasMIRPBxVc
 euPPlZgoBieu7sNt+kEsEqdkXuu0MLk6gln0IGxWLeoB2S+u3Tz5LMa2YArVqV9Z
 tWzOnI9auE73P2Pz/tUMOdyMs5tO0PolQxX3uljbULBozOHjHRh13fsXchX2yQvl
 mFeFCDqpPV0KhWRH/ciA8uIHdvYPhMpkKgRtmR8jXL0yzqLp6+2J+Bs8nHG4NNng
 HMVxZPC8jOE/TgWq6k/GmXgxh3H/AideFdHFbLKYnIFJW41ZGOI8a262zq3NmjPd
 lywpVU7O7RMhSITY5PnuR3LpNV8ftw1hz2y6t35unyFK1P02adOSj5GJ3hGdhaQ=
 =Xz5O
 -----END PGP SIGNATURE-----

Merge tag 'v4.11-rc6' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-11 08:42:47 +02:00
Zefan Li
bfb0b80db5 cgroup: avoid attaching a cgroup root to two different superblocks
Run this:

    touch file0
    for ((; ;))
    {
        mount -t cpuset xxx file0
    }

And this concurrently:

    touch file1
    for ((; ;))
    {
        mount -t cpuset xxx file1
    }

We'll trigger a warning like this:

 ------------[ cut here ]------------
 WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0
 percpu_ref_kill_and_confirm called more than once on css_release!
 CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
 Call Trace:
  dump_stack+0x63/0x84
  __warn+0xd1/0xf0
  warn_slowpath_fmt+0x5f/0x80
  percpu_ref_kill_and_confirm+0x92/0xb0
  cgroup_kill_sb+0x95/0xb0
  deactivate_locked_super+0x43/0x70
  deactivate_super+0x46/0x60
 ...
 ---[ end trace a79f61c2a2633700 ]---

Here's a race:

  Thread A				Thread B

  cgroup1_mount()
    # alloc a new cgroup root
    cgroup_setup_root()
					cgroup1_mount()
					  # no sb yet, returns NULL
					  kernfs_pin_sb()

					  # but succeeds in getting the refcnt,
					  # so re-use cgroup root
					  percpu_ref_tryget_live()
    # alloc sb with cgroup root
    cgroup_do_mount()

  cgroup_kill_sb()
					  # alloc another sb with same root
					  cgroup_do_mount()

					cgroup_kill_sb()

We end up using the same cgroup root for two different superblocks,
so percpu_ref_kill() will be called twice on the same root when the
two superblocks are destroyed.

We should fix to make sure the superblock pinning is really successful.

Cc: stable@vger.kernel.org # 3.16+
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-11 09:00:57 +09:00
Rakib Mullick
30e03acda5 cpuset: Remove cpuset_update_active_cpus()'s parameter.
In cpuset_update_active_cpus(), cpu_online isn't used anymore. Remove
it.

Signed-off-by: Rakib Mullick<rakib.mullick@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-04-11 08:57:54 +09:00
Paul Moore
264d509637 audit: make sure we don't let the retry queue grow without bounds
The retry queue is intended to provide a temporary buffer in the case
of transient errors when communicating with auditd, it is not meant
as a long life queue, that functionality is provided by the hold
queue.

This patch fixes a problem identified by Seth where the retry queue
could grow uncontrollably if an auditd instance did not connect to
the kernel to drain the queues.  This commit fixes this by doing the
following:

* Make sure we always call auditd_reset() if we decide the connection
with audit is really dead.  There were some cases in
kauditd_hold_skb() where we did not reset the connection, this patch
relocates the reset calls to kauditd_thread() so all the error
conditions are caught and the connection reset.  As a side effect,
this means we could move auditd_reset() and get rid of the forward
definition at the top of kernel/audit.c.

* We never checked the status of the auditd connection when
processing the main audit queue which meant that the retry queue
could grow unchecked.  This patch adds a call to auditd_reset()
after the main queue has been processed if auditd is not connected,
the auditd_reset() call will make sure the retry and hold queues are
correctly managed/flushed so that the retry queue remains reasonable.

Cc: <stable@vger.kernel.org> # 4.10.x-: 5b52330bbfe6
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-04-10 11:16:59 -04:00
Liping Zhang
425fffd886 sysctl: report EINVAL if value is larger than UINT_MAX for proc_douintvec
Currently, inputting the following command will succeed but actually the
value will be truncated:

  # echo 0x12ffffffff > /proc/sys/net/ipv4/tcp_notsent_lowat

This is not friendly to the user, so instead, we should report error
when the value is larger than UINT_MAX.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-08 10:27:40 -07:00
Linus Torvalds
62fedca5ce Merge branch 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit
Pull audit cleanup from Paul Moore:
 "A week later than I had hoped, but as promised, here is the audit
  uninline-fix we talked about during the last audit pull request.

  The patch is slightly different than what we originally discussed as
  it made more sense to keep the audit_signal_info() function in
  auditsc.c rather than move it and bunch of other related
  variables/definitions into audit.c/audit.h.

  At some point in the future I need to look at how the audit code is
  organized across kernel/audit*, I suspect we could do things a bit
  better, but it doesn't seem like a -rc release is a good place for
  that ;)

  Regardless, this patch passes our tests without problem and looks good
  for v4.11"

* 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
  audit: move audit_signal_info() into kernel/auditsc.c
2017-04-08 01:37:25 -07:00
bsegall@google.com
5402e97af6 ptrace: fix PTRACE_LISTEN race corrupting task->state
In PT_SEIZED + LISTEN mode STOP/CONT signals cause a wakeup against
__TASK_TRACED.  If this races with the ptrace_unfreeze_traced at the end
of a PTRACE_LISTEN, this can wake the task /after/ the check against
__TASK_TRACED, but before the reset of state to TASK_TRACED.  This
causes it to instead clobber TASK_WAKING, allowing a subsequent wakeup
against TRACED while the task is still on the rq wake_list, corrupting
it.

Oleg said:
 "The kernel can crash or this can lead to other hard-to-debug problems.
  In short, "task->state = TASK_TRACED" in ptrace_unfreeze_traced()
  assumes that nobody else can wake it up, but PTRACE_LISTEN breaks the
  contract. Obviusly it is very wrong to manipulate task->state if this
  task is already running, or WAKING, or it sleeps again"

[akpm@linux-foundation.org: coding-style fixes]
Fixes: 9899d11f ("ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL")
Link: http://lkml.kernel.org/r/xm26y3vfhmkp.fsf_-_@bsegall-linux.mtv.corp.google.com
Signed-off-by: Ben Segall <bsegall@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-08 00:47:48 -07:00
Liping Zhang
5380e5644a sysctl: don't print negative flag for proc_douintvec
I saw some very confusing sysctl output on my system:
  # cat /proc/sys/net/core/xfrm_aevent_rseqth
  -2
  # cat /proc/sys/net/core/xfrm_aevent_etime
  -10
  # cat /proc/sys/net/ipv4/tcp_notsent_lowat
  -4294967295

Because we forget to set the *negp flag in proc_douintvec, so it will
become a garbage value.

Since the value related to proc_douintvec is always an unsigned integer,
so we can set *negp to false explictily to fix this issue.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-07 09:46:44 -07:00
Linus Torvalds
4691f4a6d4 Wei Yongjun fixed a long standing bug in the ring buffer startup test.
If for some unknown reason, the kthread that is created fails to be
 created, the return from kthread_create() is an PTR_ERR and not a NULL.
 The test incorrectly checks for NULL instead of an error.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJY5mOWFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 XcsH/iBX7Kf4ta/0Jo4+sR4+HeDmWNPVBTwlei+dvMfaK1rWDgW6hbwSJg3geUwN
 d2zL/o7uCWbXubO9sjeCX2n+ecUiUcJRheewfdm0KzaPH387ofdUd24yz3DNDNcl
 /yaZMmeApjpHJjJWxoH5TUSF/yliC2FvjHYWxgEx9qhrzldLk/r5qAealj2tKl1Q
 1cgSQEgXf5n6Wg0onBuR2JiMOo3+4lXh+pIpO1Dupalhj7cC91HatDDYrNmGRIWR
 qucf3iQLoD/m88bgpxsRortkQ09NfVJExxzIPliVoYF8VwtzL+77XD81EdgvLdTs
 WP+CAoMFk83fkuXK7Vg1HZZa5zg=
 =Z0D5
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Wei Yongjun fixed a long standing bug in the ring buffer startup test.

  If for some unknown reason, the kthread that is created fails to be
  created, the return from kthread_create() is an PTR_ERR and not a
  NULL. The test incorrectly checks for NULL instead of an error"

* tag 'trace-v4.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Fix return value check in test_ringbuffer()
2017-04-06 13:12:12 -07:00
Linus Torvalds
ea6b1720ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Reject invalid updates to netfilter expectation policies, from Pablo
    Neira Ayuso.

 2) Fix memory leak in nfnl_cthelper, from Jeffy Chen.

 3) Don't do stupid things if we get a neigh_probe() on a neigh entry
    whose ops lack a solicit method. From Eric Dumazet.

 4) Don't transmit packets in r8152 driver when the carrier is off, from
    Hayes Wang.

 5) Fix ipv6 packet type detection in aquantia driver, from Pavel
    Belous.

 6) Don't write uninitialized data into hw registers in bna driver, from
    Arnd Bergmann.

 7) Fix locking in ping_unhash(), from Eric Dumazet.

 8) Make BPF verifier range checks able to understand certain sequences
    emitted by LLVM, from Alexei Starovoitov.

 9) Fix use after free in ipconfig, from Mark Rutland.

10) Fix refcount leak on force commit in openvswitch, from Jarno
    Rajahalme.

11) Fix various overflow checks in AF_PACKET, from Andrey Konovalov.

12) Fix endianness bug in be2net driver, from Suresh Reddy.

13) Don't forget to wake TX queues when processing a timeout, from
    Grygorii Strashko.

14) ARP header on-stack storage is wrong in flow dissector, from Simon
    Horman.

15) Lost retransmit and reordering SNMP stats in TCP can be
    underreported. From Yuchung Cheng.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (82 commits)
  nfp: fix potential use after free on xdp prog
  tcp: fix reordering SNMP under-counting
  tcp: fix lost retransmit SNMP under-counting
  sctp: get sock from transport in sctp_transport_update_pmtu
  net: ethernet: ti: cpsw: fix race condition during open()
  l2tp: fix PPP pseudo-wire auto-loading
  bnx2x: fix spelling mistake in macros HW_INTERRUT_ASSERT_SET_*
  l2tp: take reference on sessions being dumped
  tcp: minimize false-positives on TCP/GRO check
  sctp: check for dst and pathmtu update in sctp_packet_config
  flow dissector: correct size of storage for ARP
  net: ethernet: ti: cpsw: wake tx queues on ndo_tx_timeout
  l2tp: take a reference on sessions used in genetlink handlers
  l2tp: hold session while sending creation notifications
  l2tp: fix duplicate session creation
  l2tp: ensure session can't get removed during pppol2tp_session_ioctl()
  l2tp: fix race in l2tp_recv_common()
  sctp: use right in and out stream cnt
  bpf: add various verifier test cases for self-tests
  bpf, verifier: fix rejection of unaligned access checks for map_value_adj
  ...
2017-04-05 20:17:38 -07:00
Mike Galbraith
def34eaae5 rtmutex: Plug preempt count leak in rt_mutex_futex_unlock()
mark_wakeup_next_waiter() already disables preemption, doing so again
leaves us with an unpaired preempt_disable().

Fixes: 2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/1491379707.6538.2.camel@gmx.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-05 16:59:37 +02:00
Wei Yongjun
62277de758 ring-buffer: Fix return value check in test_ringbuffer()
In case of error, the function kthread_run() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check
should be replaced with IS_ERR().

Link: http://lkml.kernel.org/r/1466184839-14927-1-git-send-email-weiyj_lk@163.com

Cc: stable@vger.kernel.org
Fixes: 6c43e554a ("ring-buffer: Add ring buffer startup selftest")
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-05 09:36:52 -04:00
Keith Busch
7bf8222b9b irq/affinity: Fix CPU spread for unbalanced nodes
The irq_create_affinity_masks routine is responsible for assigning a
number of interrupt vectors to CPUs. The optimal assignemnet will spread
requested vectors to all CPUs, with the fewest CPUs sharing a vector.

The algorithm may fail to assign some vectors to any CPUs if a node's
CPU count is lower than the average number of vectors per node. These
vectors are unusable and create an un-optimal spread.

Recalculate the number of vectors to assign at each node iteration by using
the remaining number of vectors and nodes to be assigned, not exceeding the
number of CPUs in that node. This will guarantee that every CPU is assigned
at least one vector.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/1491247553-7603-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:57:28 +02:00
Peter Zijlstra
19830e5524 rtmutex: Fix more prio comparisons
There was a pure ->prio comparison left in try_to_wake_rt_mutex(),
convert it to use rt_mutex_waiter_less(), noting that greater-or-equal
is not-less (both in kernel priority view).

This necessitated the introduction of cmp_task() which creates a
pointer to an unnamed stack variable of struct rt_mutex_waiter type to
compare against tasks.

With this, we can now also create and employ rt_mutex_waiter_equal().

Reviewed-and-tested-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.455584638@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:07 +02:00
Peter Zijlstra
e0aad5b44f rtmutex: Fix PI chain order integrity
rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
during the PI chain walk, such that the PI chain order isn't messed up
by (asynchronous) task state updates.

Currently rt_mutex_waiter_less() uses task state for deadline tasks;
this is broken, since the task state can, as said above, change
asynchronously, causing the RB tree order to change without actual
tree update -> FAIL.

Fix this by also copying the deadline into the rt_mutex_waiter state
and updating it along with its prio field.

Ideally we would also force PI chain updates whenever DL tasks update
their deadline parameter, but for first approximation this is less
broken than it was.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.403992539@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:06 +02:00
Peter Zijlstra
b91473ff6e sched,tracing: Update trace_sched_pi_setprio()
Pass the PI donor task, instead of a numerical priority.

Numerical priorities are not sufficient to describe state ever since
SCHED_DEADLINE.

Annotate all sched tracepoints that are currently broken; fixing them
will bork userspace. *hate*.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.353599881@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:06 +02:00
Peter Zijlstra
acd58620e4 sched/rtmutex: Refactor rt_mutex_setprio()
With the introduction of SCHED_DEADLINE the whole notion that priority
is a single number is gone, therefore the @prio argument to
rt_mutex_setprio() doesn't make sense anymore.

So rework the code to pass a pi_task instead.

Note this also fixes a problem with pi_top_task caching; previously we
would not set the pointer (call rt_mutex_update_top_task) if the
priority didn't change, this could lead to a stale pointer.

As for the XXX, I think its fine to use pi_task->prio, because if it
differs from waiter->prio, a PI chain update is immenent.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:06 +02:00
Peter Zijlstra
aa2bfe5536 rtmutex: Clean up
Previous patches changed the meaning of the return value of
rt_mutex_slowunlock(); update comments and code to reflect this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.255058238@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:05 +02:00
Xunlei Pang
85e2d4f992 sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
Currently dl tasks will actually return at the very beginning
of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:

    if (waiter->prio == task->prio) {
        if (!detect_deadlock)
            goto out_unlock_pi; // out here
        else
            requeue = false;
    }

As the deadline value of blocked deadline tasks(waiters) without
changing their sched_class(thus prio doesn't change) never changes,
this seems reasonable, but it actually misses the chance of updating
rt_mutex_waiter's "dl_runtime(period)_copy" if a waiter updates its
deadline parameters(dl_runtime, dl_period) or boosted waiter changes
to !deadline class.

Thus, force deadline task not out by adding the !dl_prio() condition.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.206577901@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:05 +02:00
Xunlei Pang
e96a7705e7 sched/rtmutex/deadline: Fix a PI crash for deadline tasks
A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [<ffffffff810b658c>] enqueue_task+0x2c/0x80
    [<ffffffff810ba763>] activate_task+0x23/0x30
    [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
    [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
    [<ffffffff8164e783>] __schedule+0xd3/0x900
    [<ffffffff8164efd9>] schedule+0x29/0x70
    [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
    [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
    [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
    [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
    [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
    [<ffffffff810edd50>] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating pi waiters, while
rt_mutex_get_top_task(), will access them with rq lock held but
not holding pi_lock.

In order to tackle it, we introduce new "pi_top_task" pointer
cached in task_struct, and add new rt_mutex_update_top_task()
to update its value, it can be called by rt_mutex_setprio()
which held both owner's pi_lock and rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under rq lock.

Originally-From: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:05 +02:00
Xunlei Pang
2a1c602994 rtmutex: Deboost before waking up the top waiter
We should deboost before waking the high-priority task, such that we
don't run two tasks with the same "state" (priority, deadline,
sched_class, etc).

In order to make sure the boosting task doesn't start running between
unlock and deboost (due to 'spurious' wakeup), we move the deboost
under the wait_lock, that way its serialized against the wait loop in
__rt_mutex_slowlock().

Doing the deboost early can however lead to priority-inversion if
current would get preempted after the deboost but before waking our
high-prio task, hence we disable preemption before doing deboost, and
enabling it after the wake up is over.

This gets us the right semantic order, but most importantly however;
this change ensures pointer stability for the next patch, where we
have rt_mutex_setprio() cache a pointer to the top-most waiter task.
If we, as before this change, do the wakeup first and then deboost,
this pointer might point into thin air.

[peterz: Changelog + patch munging]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:05 +02:00
Thomas Gleixner
38bffdac07 Merge branch 'sched/core' into locking/core
Required for the rtmutex/sched_deadline patches which depend on both
branches
2017-04-04 11:31:12 +02:00
Linus Torvalds
128c434a70 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "This update provides:

   - make the scheduler clock switch to unstable mode smooth so the
     timestamps stay at microseconds granularity instead of switching to
     tick granularity.

   - unbreak perf test tsc by taking the new offset into account which
     was added in order to proveide better sched clock continuity

   - switching sched clock to unstable mode runs all clock related
     computations which affect the sched clock output itself from a work
     queue. In case of preemption sched clock uses half updated data and
     provides wrong timestamps. Keep the math in the protected context
     and delegate only the static key switch to workqueue context.

   - remove a duplicate header include"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/headers: Remove duplicate #include <linux/sched/debug.h> line
  sched/clock: Fix broken stable to unstable transfer
  sched/clock, x86/perf: Fix "perf test tsc"
  sched/clock: Fix clear_sched_clock_stable() preempt wobbly
2017-04-02 09:25:10 -07:00
Al Viro
bee3f412d6 Merge branch 'parisc-4.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux into uaccess.parisc 2017-04-02 10:33:48 -04:00
Daniel Borkmann
79adffcd64 bpf, verifier: fix rejection of unaligned access checks for map_value_adj
Currently, the verifier doesn't reject unaligned access for map_value_adj
register types. Commit 484611357c19 ("bpf: allow access into map value
arrays") added logic to check_ptr_alignment() extending it from PTR_TO_PACKET
to also PTR_TO_MAP_VALUE_ADJ, but for PTR_TO_MAP_VALUE_ADJ no enforcement
is in place, because reg->id for PTR_TO_MAP_VALUE_ADJ reg types is never
non-zero, meaning, we can cause BPF_H/_W/_DW-based unaligned access for
architectures not supporting efficient unaligned access, and thus worst
case could raise exceptions on some archs that are unable to correct the
unaligned access or perform a different memory access to the actual
requested one and such.

i) Unaligned load with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
   on r0 (map_value_adj):

   0: (bf) r2 = r10
   1: (07) r2 += -8
   2: (7a) *(u64 *)(r2 +0) = 0
   3: (18) r1 = 0x42533a00
   5: (85) call bpf_map_lookup_elem#1
   6: (15) if r0 == 0x0 goto pc+11
    R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
   7: (61) r1 = *(u32 *)(r0 +0)
   8: (35) if r1 >= 0xb goto pc+9
    R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=10 R10=fp
   9: (07) r0 += 3
  10: (79) r7 = *(u64 *)(r0 +0)
    R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R10=fp
  11: (79) r7 = *(u64 *)(r0 +2)
    R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R7=inv R10=fp
  [...]

ii) Unaligned store with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
    on r0 (map_value_adj):

   0: (bf) r2 = r10
   1: (07) r2 += -8
   2: (7a) *(u64 *)(r2 +0) = 0
   3: (18) r1 = 0x4df16a00
   5: (85) call bpf_map_lookup_elem#1
   6: (15) if r0 == 0x0 goto pc+19
    R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
   7: (07) r0 += 3
   8: (7a) *(u64 *)(r0 +0) = 42
    R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
   9: (7a) *(u64 *)(r0 +2) = 43
    R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
  10: (7a) *(u64 *)(r0 -2) = 44
    R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
  [...]

For the PTR_TO_PACKET type, reg->id is initially zero when skb->data
was fetched, it later receives a reg->id from env->id_gen generator
once another register with UNKNOWN_VALUE type was added to it via
check_packet_ptr_add(). The purpose of this reg->id is twofold: i) it
is used in find_good_pkt_pointers() for setting the allowed access
range for regs with PTR_TO_PACKET of same id once verifier matched
on data/data_end tests, and ii) for check_ptr_alignment() to determine
that when not having efficient unaligned access and register with
UNKNOWN_VALUE was added to PTR_TO_PACKET, that we're only allowed
to access the content bytewise due to unknown unalignment. reg->id
was never intended for PTR_TO_MAP_VALUE{,_ADJ} types and thus is
always zero, the only marking is in PTR_TO_MAP_VALUE_OR_NULL that
was added after 484611357c19 via 57a09bf0a416 ("bpf: Detect identical
PTR_TO_MAP_VALUE_OR_NULL registers"). Above tests will fail for
non-root environment due to prohibited pointer arithmetic.

The fix splits register-type specific checks into their own helper
instead of keeping them combined, so we don't run into a similar
issue in future once we extend check_ptr_alignment() further and
forget to add reg->type checks for some of the checks.

Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 12:36:37 -07:00
Daniel Borkmann
fce366a9dd bpf, verifier: fix alu ops against map_value{, _adj} register types
While looking into map_value_adj, I noticed that alu operations
directly on the map_value() resp. map_value_adj() register (any
alu operation on a map_value() register will turn it into a
map_value_adj() typed register) are not sufficiently protected
against some of the operations. Two non-exhaustive examples are
provided that the verifier needs to reject:

 i) BPF_AND on r0 (map_value_adj):

  0: (bf) r2 = r10
  1: (07) r2 += -8
  2: (7a) *(u64 *)(r2 +0) = 0
  3: (18) r1 = 0xbf842a00
  5: (85) call bpf_map_lookup_elem#1
  6: (15) if r0 == 0x0 goto pc+2
   R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
  7: (57) r0 &= 8
  8: (7a) *(u64 *)(r0 +0) = 22
   R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=8 R10=fp
  9: (95) exit

  from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
  9: (95) exit
  processed 10 insns

ii) BPF_ADD in 32 bit mode on r0 (map_value_adj):

  0: (bf) r2 = r10
  1: (07) r2 += -8
  2: (7a) *(u64 *)(r2 +0) = 0
  3: (18) r1 = 0xc24eee00
  5: (85) call bpf_map_lookup_elem#1
  6: (15) if r0 == 0x0 goto pc+2
   R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
  7: (04) (u32) r0 += (u32) 0
  8: (7a) *(u64 *)(r0 +0) = 22
   R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
  9: (95) exit

  from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
  9: (95) exit
  processed 10 insns

Issue is, while min_value / max_value boundaries for the access
are adjusted appropriately, we change the pointer value in a way
that cannot be sufficiently tracked anymore from its origin.
Operations like BPF_{AND,OR,DIV,MUL,etc} on a destination register
that is PTR_TO_MAP_VALUE{,_ADJ} was probably unintended, in fact,
all the test cases coming with 484611357c19 ("bpf: allow access
into map value arrays") perform BPF_ADD only on the destination
register that is PTR_TO_MAP_VALUE_ADJ.

Only for UNKNOWN_VALUE register types such operations make sense,
f.e. with unknown memory content fetched initially from a constant
offset from the map value memory into a register. That register is
then later tested against lower / upper bounds, so that the verifier
can then do the tracking of min_value / max_value, and properly
check once that UNKNOWN_VALUE register is added to the destination
register with type PTR_TO_MAP_VALUE{,_ADJ}. This is also what the
original use-case is solving. Note, tracking on what is being
added is done through adjust_reg_min_max_vals() and later access
to the map value enforced with these boundaries and the given offset
from the insn through check_map_access_adj().

Tests will fail for non-root environment due to prohibited pointer
arithmetic, in particular in check_alu_op(), we bail out on the
is_pointer_value() check on the dst_reg (which is false in root
case as we allow for pointer arithmetic via env->allow_ptr_leaks).

Similarly to PTR_TO_PACKET, one way to fix it is to restrict the
allowed operations on PTR_TO_MAP_VALUE{,_ADJ} registers to 64 bit
mode BPF_ADD. The test_verifier suite runs fine after the patch
and it also rejects mentioned test cases.

Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 12:36:37 -07:00
Linus Torvalds
035f0cd3f8 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This fixes the following issues:

   - memory corruption when kmalloc fails in xts/lrw

   - mark some CCP DMA channels as private

   - fix reordering race in padata

   - regression in omap-rng DT description"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: xts,lrw - fix out-of-bounds write after kmalloc failure
  crypto: ccp - Make some CCP DMA channels private
  padata: avoid race in reordering
  dt-bindings: rng: clocks property on omap_rng not always mandatory
2017-03-31 12:11:32 -07:00
Nicholas Mc Guire
5fc63f9577 timekeeping: Remove pointless conversion to bool
interp_forward is type bool so assignment from a logical operation directly
is sufficient.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1490382215-30505-1-git-send-email-der.herr@hofr.at
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-31 10:26:56 +02:00
Thomas Gleixner
9005615baf Merge branch 'fortglx/4.12/time' of https://git.linaro.org/people/john.stultz/linux into timers/core
Pull timekeeping changes from John Stultz:

 Main changes are the initial steps of Nicoli's work to make the clockevent
 timers be corrected for NTP adjustments. Then a few smaller fixes that
 I've queued, and adding Stephen Boyd to the maintainers list for
 timekeeping.
2017-03-31 09:48:00 +02:00
Chris Wilson
57dd924e54 locking/ww-mutex: Limit stress test to 2 seconds
Use a timeout rather than a fixed number of loops to avoid running for
very long periods, such as under the kbuilder VMs.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170310105733.6444-1-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:49:47 +02:00
Ingo Molnar
c69f203df3 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:48:58 +02:00