25348 Commits

Author SHA1 Message Date
Jan Kara
86ffe245c4 fsnotify: Move object pointer to fsnotify_mark_connector
Move pointer to inode / vfsmount from mark itself to the
fsnotify_mark_connector structure. This is another step on the path
towards decoupling inode / vfsmount lifetime from notification mark
lifetime.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:35 +02:00
Jan Kara
9dd813c15b fsnotify: Move mark list head from object into dedicated structure
Currently notification marks are attached to object (inode or vfsmnt) by
a hlist_head in the object. The list is also protected by a spinlock in
the object. So while there is any mark attached to the list of marks,
the object must be pinned in memory (and thus e.g. last iput() deleting
inode cannot happen). Also for list iteration in fsnotify() to work, we
must hold fsnotify_mark_srcu lock so that mark itself and
mark->obj_list.next cannot get freed. Thus we are required to wait for
response to fanotify events from userspace process with
fsnotify_mark_srcu lock held. That causes issues when userspace process
is buggy and does not reply to some event - basically the whole
notification subsystem gets eventually stuck.

So to be able to drop fsnotify_mark_srcu lock while waiting for
response, we have to pin the mark in memory and make sure it stays in
the object list (as removing the mark waiting for response could lead to
lost notification events for groups later in the list). However we don't
want inode reclaim to block on such mark as that would lead to system
just locking up elsewhere.

This commit is the first in the series that paves way towards solving
these conflicting lifetime needs. Instead of anchoring the list of marks
directly in the object, we anchor it in a dedicated structure
(fsnotify_mark_connector) and just point to that structure from the
object. The following commits will also add spinlock protecting the list
and object pointer to the structure.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:37:34 +02:00
Jan Kara
43471d15df audit_tree: Use mark flags to check whether mark is alive
Currently audit code uses checking of mark->inode to verify whether mark
is still alive. Switch that to checking mark flags as that is more
logical and current way will become unreliable in future.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-10 17:36:32 +02:00
Paul Moore
264d509637 audit: make sure we don't let the retry queue grow without bounds
The retry queue is intended to provide a temporary buffer in the case
of transient errors when communicating with auditd, it is not meant
as a long life queue, that functionality is provided by the hold
queue.

This patch fixes a problem identified by Seth where the retry queue
could grow uncontrollably if an auditd instance did not connect to
the kernel to drain the queues.  This commit fixes this by doing the
following:

* Make sure we always call auditd_reset() if we decide the connection
with audit is really dead.  There were some cases in
kauditd_hold_skb() where we did not reset the connection, this patch
relocates the reset calls to kauditd_thread() so all the error
conditions are caught and the connection reset.  As a side effect,
this means we could move auditd_reset() and get rid of the forward
definition at the top of kernel/audit.c.

* We never checked the status of the auditd connection when
processing the main audit queue which meant that the retry queue
could grow unchecked.  This patch adds a call to auditd_reset()
after the main queue has been processed if auditd is not connected,
the auditd_reset() call will make sure the retry and hold queues are
correctly managed/flushed so that the retry queue remains reasonable.

Cc: <stable@vger.kernel.org> # 4.10.x-: 5b52330bbfe6
Reported-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-04-10 11:16:59 -04:00
Jason A. Donenfeld
07a77929ba padata: free correct variable
The author meant to free the variable that was just allocated, instead
of the one that failed to be allocated, but made a simple typo. This
patch rectifies that.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-04-10 19:17:27 +08:00
Liping Zhang
425fffd886 sysctl: report EINVAL if value is larger than UINT_MAX for proc_douintvec
Currently, inputting the following command will succeed but actually the
value will be truncated:

  # echo 0x12ffffffff > /proc/sys/net/ipv4/tcp_notsent_lowat

This is not friendly to the user, so instead, we should report error
when the value is larger than UINT_MAX.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-08 10:27:40 -07:00
Linus Torvalds
62fedca5ce Merge branch 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit
Pull audit cleanup from Paul Moore:
 "A week later than I had hoped, but as promised, here is the audit
  uninline-fix we talked about during the last audit pull request.

  The patch is slightly different than what we originally discussed as
  it made more sense to keep the audit_signal_info() function in
  auditsc.c rather than move it and bunch of other related
  variables/definitions into audit.c/audit.h.

  At some point in the future I need to look at how the audit code is
  organized across kernel/audit*, I suspect we could do things a bit
  better, but it doesn't seem like a -rc release is a good place for
  that ;)

  Regardless, this patch passes our tests without problem and looks good
  for v4.11"

* 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
  audit: move audit_signal_info() into kernel/auditsc.c
2017-04-08 01:37:25 -07:00
bsegall@google.com
5402e97af6 ptrace: fix PTRACE_LISTEN race corrupting task->state
In PT_SEIZED + LISTEN mode STOP/CONT signals cause a wakeup against
__TASK_TRACED.  If this races with the ptrace_unfreeze_traced at the end
of a PTRACE_LISTEN, this can wake the task /after/ the check against
__TASK_TRACED, but before the reset of state to TASK_TRACED.  This
causes it to instead clobber TASK_WAKING, allowing a subsequent wakeup
against TRACED while the task is still on the rq wake_list, corrupting
it.

Oleg said:
 "The kernel can crash or this can lead to other hard-to-debug problems.
  In short, "task->state = TASK_TRACED" in ptrace_unfreeze_traced()
  assumes that nobody else can wake it up, but PTRACE_LISTEN breaks the
  contract. Obviusly it is very wrong to manipulate task->state if this
  task is already running, or WAKING, or it sleeps again"

[akpm@linux-foundation.org: coding-style fixes]
Fixes: 9899d11f ("ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL")
Link: http://lkml.kernel.org/r/xm26y3vfhmkp.fsf_-_@bsegall-linux.mtv.corp.google.com
Signed-off-by: Ben Segall <bsegall@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-08 00:47:48 -07:00
Liping Zhang
5380e5644a sysctl: don't print negative flag for proc_douintvec
I saw some very confusing sysctl output on my system:
  # cat /proc/sys/net/core/xfrm_aevent_rseqth
  -2
  # cat /proc/sys/net/core/xfrm_aevent_etime
  -10
  # cat /proc/sys/net/ipv4/tcp_notsent_lowat
  -4294967295

Because we forget to set the *negp flag in proc_douintvec, so it will
become a garbage value.

Since the value related to proc_douintvec is always an unsigned integer,
so we can set *negp to false explictily to fix this issue.

Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-07 09:46:44 -07:00
Steven Rostedt (VMware)
0598e4f08e ftrace: Add use of synchronize_rcu_tasks() with dynamic trampolines
The function tracer needs to be more careful than other subsystems when it
comes to freeing data. Especially if that data is actually executable code.
When a single function is traced, a trampoline can be dynamically allocated
which is called to jump to the function trace callback. When the callback is
no longer needed, the dynamic allocated trampoline needs to be freed. This
is where the issues arise. The dynamically allocated trampoline must not be
used again. As function tracing can trace all subsystems, including
subsystems that are used to serialize aspects of freeing (namely RCU), it
must take extra care when doing the freeing.

Before synchronize_rcu_tasks() was around, there was no way for the function
tracer to know that nothing was using the dynamically allocated trampoline
when CONFIG_PREEMPT was enabled. That's because a task could be indefinitely
preempted while sitting on the trampoline. Now with synchronize_rcu_tasks(),
it will wait till all tasks have either voluntarily scheduled (not on the
trampoline) or goes into userspace (not on the trampoline). Then it is safe
to free the trampoline even with CONFIG_PREEMPT set.

Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-07 09:41:51 -04:00
Linus Torvalds
4691f4a6d4 Wei Yongjun fixed a long standing bug in the ring buffer startup test.
If for some unknown reason, the kthread that is created fails to be
 created, the return from kthread_create() is an PTR_ERR and not a NULL.
 The test incorrectly checks for NULL instead of an error.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJY5mOWFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 XcsH/iBX7Kf4ta/0Jo4+sR4+HeDmWNPVBTwlei+dvMfaK1rWDgW6hbwSJg3geUwN
 d2zL/o7uCWbXubO9sjeCX2n+ecUiUcJRheewfdm0KzaPH387ofdUd24yz3DNDNcl
 /yaZMmeApjpHJjJWxoH5TUSF/yliC2FvjHYWxgEx9qhrzldLk/r5qAealj2tKl1Q
 1cgSQEgXf5n6Wg0onBuR2JiMOo3+4lXh+pIpO1Dupalhj7cC91HatDDYrNmGRIWR
 qucf3iQLoD/m88bgpxsRortkQ09NfVJExxzIPliVoYF8VwtzL+77XD81EdgvLdTs
 WP+CAoMFk83fkuXK7Vg1HZZa5zg=
 =Z0D5
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Wei Yongjun fixed a long standing bug in the ring buffer startup test.

  If for some unknown reason, the kthread that is created fails to be
  created, the return from kthread_create() is an PTR_ERR and not a
  NULL. The test incorrectly checks for NULL instead of an error"

* tag 'trace-v4.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Fix return value check in test_ringbuffer()
2017-04-06 13:12:12 -07:00
David S. Miller
6f14f443d3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple cases of overlapping changes (adding code nearby,
a function whose name changes, for example).

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 08:24:51 -07:00
Linus Torvalds
ea6b1720ce Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Reject invalid updates to netfilter expectation policies, from Pablo
    Neira Ayuso.

 2) Fix memory leak in nfnl_cthelper, from Jeffy Chen.

 3) Don't do stupid things if we get a neigh_probe() on a neigh entry
    whose ops lack a solicit method. From Eric Dumazet.

 4) Don't transmit packets in r8152 driver when the carrier is off, from
    Hayes Wang.

 5) Fix ipv6 packet type detection in aquantia driver, from Pavel
    Belous.

 6) Don't write uninitialized data into hw registers in bna driver, from
    Arnd Bergmann.

 7) Fix locking in ping_unhash(), from Eric Dumazet.

 8) Make BPF verifier range checks able to understand certain sequences
    emitted by LLVM, from Alexei Starovoitov.

 9) Fix use after free in ipconfig, from Mark Rutland.

10) Fix refcount leak on force commit in openvswitch, from Jarno
    Rajahalme.

11) Fix various overflow checks in AF_PACKET, from Andrey Konovalov.

12) Fix endianness bug in be2net driver, from Suresh Reddy.

13) Don't forget to wake TX queues when processing a timeout, from
    Grygorii Strashko.

14) ARP header on-stack storage is wrong in flow dissector, from Simon
    Horman.

15) Lost retransmit and reordering SNMP stats in TCP can be
    underreported. From Yuchung Cheng.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (82 commits)
  nfp: fix potential use after free on xdp prog
  tcp: fix reordering SNMP under-counting
  tcp: fix lost retransmit SNMP under-counting
  sctp: get sock from transport in sctp_transport_update_pmtu
  net: ethernet: ti: cpsw: fix race condition during open()
  l2tp: fix PPP pseudo-wire auto-loading
  bnx2x: fix spelling mistake in macros HW_INTERRUT_ASSERT_SET_*
  l2tp: take reference on sessions being dumped
  tcp: minimize false-positives on TCP/GRO check
  sctp: check for dst and pathmtu update in sctp_packet_config
  flow dissector: correct size of storage for ARP
  net: ethernet: ti: cpsw: wake tx queues on ndo_tx_timeout
  l2tp: take a reference on sessions used in genetlink handlers
  l2tp: hold session while sending creation notifications
  l2tp: fix duplicate session creation
  l2tp: ensure session can't get removed during pppol2tp_session_ioctl()
  l2tp: fix race in l2tp_recv_common()
  sctp: use right in and out stream cnt
  bpf: add various verifier test cases for self-tests
  bpf, verifier: fix rejection of unaligned access checks for map_value_adj
  ...
2017-04-05 20:17:38 -07:00
Mike Galbraith
def34eaae5 rtmutex: Plug preempt count leak in rt_mutex_futex_unlock()
mark_wakeup_next_waiter() already disables preemption, doing so again
leaves us with an unpaired preempt_disable().

Fixes: 2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/1491379707.6538.2.camel@gmx.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-05 16:59:37 +02:00
Herbert Xu
c6dc060906 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Merge the crypto tree to resolve conflict between caam changes.
2017-04-05 21:57:07 +08:00
Wei Yongjun
62277de758 ring-buffer: Fix return value check in test_ringbuffer()
In case of error, the function kthread_run() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check
should be replaced with IS_ERR().

Link: http://lkml.kernel.org/r/1466184839-14927-1-git-send-email-weiyj_lk@163.com

Cc: stable@vger.kernel.org
Fixes: 6c43e554a ("ring-buffer: Add ring buffer startup selftest")
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-05 09:36:52 -04:00
Jan Kara
f410ff6554 audit: Abstract hash key handling
Audit tree currently uses inode pointer as a key into the hash table.
Getting that from notification mark will be somewhat more difficult with
coming fsnotify changes. So abstract getting of hash key from the audit
chunk and inode so that we can change the method to obtain a key easily.

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
CC: Paul Moore <paul@paul-moore.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-04-05 09:39:37 +02:00
Alban Crequy
696ced4fb1 tracing/kprobes: expose maxactive for kretprobe in kprobe_events
When a kretprobe is installed on a kernel function, there is a maximum
limit of how many calls in parallel it can catch (aka "maxactive"). A
kernel module could call register_kretprobe() and initialize maxactive
(see example in samples/kprobes/kretprobe_example.c).

But that is not exposed to userspace and it is currently not possible to
choose maxactive when writing to /sys/kernel/debug/tracing/kprobe_events

The default maxactive can be as low as 1 on single-core with a
non-preemptive kernel. This is too low and we need to increase it not
only for recursive functions, but for functions that sleep or resched.

This patch updates the format of the command that can be written to
kprobe_events so that maxactive can be optionally specified.

I need this for a bpf program attached to the kretprobe of
inet_csk_accept, which can sleep for a long time.

This patch includes a basic selftest:

> # ./ftracetest -v  test.d/kprobe/
> === Ftrace unit tests ===
> [1] Kprobe dynamic event - adding and removing	[PASS]
> [2] Kprobe dynamic event - busy event check	[PASS]
> [3] Kprobe dynamic event with arguments	[PASS]
> [4] Kprobes event arguments with types	[PASS]
> [5] Kprobe dynamic event with function tracer	[PASS]
> [6] Kretprobe dynamic event with arguments	[PASS]
> [7] Kretprobe dynamic event with maxactive	[PASS]
>
> # of passed:  7
> # of failed:  0
> # of unresolved:  0
> # of untested:  0
> # of unsupported:  0
> # of xfailed:  0
> # of undefined(test bug):  0

BugLink: https://github.com/iovisor/bcc/issues/1072
Link: http://lkml.kernel.org/r/1491215782-15490-1-git-send-email-alban@kinvolk.io

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-04 10:32:03 -04:00
Petr Mladek
257ab44311 printk: Correctly handle preemption in console_unlock()
Some console drivers code calls console_conditional_schedule()
that looks at @console_may_schedule. The value must be cleared
when the drivers are called from console_unlock() with
interrupts disabled. But rescheduling is fine when the same
code is called, for example, from tty operations where the
console semaphore is taken via console_lock().

This is why @console_may_schedule is cleared before calling console
drivers. The original value is stored to decide if we could sleep
between lines.

Now, @console_may_schedule is not cleared when we call
console_trylock() and jump back to the "again" goto label.
This has become a problem, since the commit 6b97a20d3a7909daa066
("printk: set may_schedule for some of console_trylock() callers").
@console_may_schedule might get enabled now.

There is also the opposite problem. console_lock() can be called
only from preemptive context. It can always enable scheduling in
the console code. But console_trylock() is not able to detect it
when CONFIG_PREEMPT_COUNT is disabled. Therefore we should use the
original @console_may_schedule value after re-acquiring
the console semaphore in console_unlock().

This patch solves both problems by moving the "again" goto label.

Alternative solution was to clear and restore the value around
call_console_drivers(). Then console_conditional_schedule() could
be used also inside console_unlock(). But there was a potential race
with console_flush_on_panic() as reported by Sergey Senozhatsky.
That function should be called only where there is only one CPU
and with interrupts disabled. But better be on the safe side
because stopping CPUs might fail.

Fixes: 6b97a20d3a7909 ("printk: set may_schedule for some of console_trylock() callers")
Link: http://lkml.kernel.org/r/1490372045-22288-1-git-send-email-pmladek@suse.com
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: linux-fbdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-04-04 14:08:01 +02:00
Keith Busch
7bf8222b9b irq/affinity: Fix CPU spread for unbalanced nodes
The irq_create_affinity_masks routine is responsible for assigning a
number of interrupt vectors to CPUs. The optimal assignemnet will spread
requested vectors to all CPUs, with the fewest CPUs sharing a vector.

The algorithm may fail to assign some vectors to any CPUs if a node's
CPU count is lower than the average number of vectors per node. These
vectors are unusable and create an un-optimal spread.

Recalculate the number of vectors to assign at each node iteration by using
the remaining number of vectors and nodes to be assigned, not exceeding the
number of CPUs in that node. This will guarantee that every CPU is assigned
at least one vector.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/1491247553-7603-1-git-send-email-keith.busch@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:57:28 +02:00
Peter Zijlstra
19830e5524 rtmutex: Fix more prio comparisons
There was a pure ->prio comparison left in try_to_wake_rt_mutex(),
convert it to use rt_mutex_waiter_less(), noting that greater-or-equal
is not-less (both in kernel priority view).

This necessitated the introduction of cmp_task() which creates a
pointer to an unnamed stack variable of struct rt_mutex_waiter type to
compare against tasks.

With this, we can now also create and employ rt_mutex_waiter_equal().

Reviewed-and-tested-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.455584638@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:07 +02:00
Peter Zijlstra
e0aad5b44f rtmutex: Fix PI chain order integrity
rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
during the PI chain walk, such that the PI chain order isn't messed up
by (asynchronous) task state updates.

Currently rt_mutex_waiter_less() uses task state for deadline tasks;
this is broken, since the task state can, as said above, change
asynchronously, causing the RB tree order to change without actual
tree update -> FAIL.

Fix this by also copying the deadline into the rt_mutex_waiter state
and updating it along with its prio field.

Ideally we would also force PI chain updates whenever DL tasks update
their deadline parameter, but for first approximation this is less
broken than it was.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.403992539@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:06 +02:00
Peter Zijlstra
b91473ff6e sched,tracing: Update trace_sched_pi_setprio()
Pass the PI donor task, instead of a numerical priority.

Numerical priorities are not sufficient to describe state ever since
SCHED_DEADLINE.

Annotate all sched tracepoints that are currently broken; fixing them
will bork userspace. *hate*.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.353599881@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:06 +02:00
Peter Zijlstra
acd58620e4 sched/rtmutex: Refactor rt_mutex_setprio()
With the introduction of SCHED_DEADLINE the whole notion that priority
is a single number is gone, therefore the @prio argument to
rt_mutex_setprio() doesn't make sense anymore.

So rework the code to pass a pi_task instead.

Note this also fixes a problem with pi_top_task caching; previously we
would not set the pointer (call rt_mutex_update_top_task) if the
priority didn't change, this could lead to a stale pointer.

As for the XXX, I think its fine to use pi_task->prio, because if it
differs from waiter->prio, a PI chain update is immenent.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:06 +02:00
Peter Zijlstra
aa2bfe5536 rtmutex: Clean up
Previous patches changed the meaning of the return value of
rt_mutex_slowunlock(); update comments and code to reflect this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.255058238@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:05 +02:00
Xunlei Pang
85e2d4f992 sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update
Currently dl tasks will actually return at the very beginning
of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:

    if (waiter->prio == task->prio) {
        if (!detect_deadlock)
            goto out_unlock_pi; // out here
        else
            requeue = false;
    }

As the deadline value of blocked deadline tasks(waiters) without
changing their sched_class(thus prio doesn't change) never changes,
this seems reasonable, but it actually misses the chance of updating
rt_mutex_waiter's "dl_runtime(period)_copy" if a waiter updates its
deadline parameters(dl_runtime, dl_period) or boosted waiter changes
to !deadline class.

Thus, force deadline task not out by adding the !dl_prio() condition.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.206577901@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:05 +02:00
Xunlei Pang
e96a7705e7 sched/rtmutex/deadline: Fix a PI crash for deadline tasks
A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [<ffffffff810b658c>] enqueue_task+0x2c/0x80
    [<ffffffff810ba763>] activate_task+0x23/0x30
    [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
    [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
    [<ffffffff8164e783>] __schedule+0xd3/0x900
    [<ffffffff8164efd9>] schedule+0x29/0x70
    [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
    [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
    [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
    [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
    [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
    [<ffffffff810edd50>] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating pi waiters, while
rt_mutex_get_top_task(), will access them with rq lock held but
not holding pi_lock.

In order to tackle it, we introduce new "pi_top_task" pointer
cached in task_struct, and add new rt_mutex_update_top_task()
to update its value, it can be called by rt_mutex_setprio()
which held both owner's pi_lock and rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under rq lock.

Originally-From: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:05 +02:00
Xunlei Pang
2a1c602994 rtmutex: Deboost before waking up the top waiter
We should deboost before waking the high-priority task, such that we
don't run two tasks with the same "state" (priority, deadline,
sched_class, etc).

In order to make sure the boosting task doesn't start running between
unlock and deboost (due to 'spurious' wakeup), we move the deboost
under the wait_lock, that way its serialized against the wait loop in
__rt_mutex_slowlock().

Doing the deboost early can however lead to priority-inversion if
current would get preempted after the deboost but before waking our
high-prio task, hence we disable preemption before doing deboost, and
enabling it after the wake up is over.

This gets us the right semantic order, but most importantly however;
this change ensures pointer stability for the next patch, where we
have rt_mutex_setprio() cache a pointer to the top-most waiter task.
If we, as before this change, do the wakeup first and then deboost,
this pointer might point into thin air.

[peterz: Changelog + patch munging]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-04 11:44:05 +02:00
Thomas Gleixner
38bffdac07 Merge branch 'sched/core' into locking/core
Required for the rtmutex/sched_deadline patches which depend on both
branches
2017-04-04 11:31:12 +02:00
Steven Rostedt (VMware)
b80f0f6c9e ftrace: Have init/main.c call ftrace directly to free init memory
Relying on free_reserved_area() to call ftrace to free init memory proved to
not be sufficient. The issue is that on x86, when debug_pagealloc is
enabled, the init memory is not freed, but simply set as not present. Since
ftrace was uninformed of this, starting function tracing still tries to
update pages that are not present according to the page tables, causing
ftrace to bug, as well as killing the kernel itself.

Instead of relying on free_reserved_area(), have init/main.c call ftrace
directly just before it frees the init memory. Then it needs to use
__init_begin and __init_end to know where the init memory location is.
Looking at all archs (and testing what I can), it appears that this should
work for each of them.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-03 14:04:00 -04:00
Greg Kroah-Hartman
9ccd8d7445 Merge 4.11-rc5 into tty-next
We want the serial fixes in here as well to handle merge issues.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-03 14:18:16 +02:00
Linus Torvalds
128c434a70 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "This update provides:

   - make the scheduler clock switch to unstable mode smooth so the
     timestamps stay at microseconds granularity instead of switching to
     tick granularity.

   - unbreak perf test tsc by taking the new offset into account which
     was added in order to proveide better sched clock continuity

   - switching sched clock to unstable mode runs all clock related
     computations which affect the sched clock output itself from a work
     queue. In case of preemption sched clock uses half updated data and
     provides wrong timestamps. Keep the math in the protected context
     and delegate only the static key switch to workqueue context.

   - remove a duplicate header include"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/headers: Remove duplicate #include <linux/sched/debug.h> line
  sched/clock: Fix broken stable to unstable transfer
  sched/clock, x86/perf: Fix "perf test tsc"
  sched/clock: Fix clear_sched_clock_stable() preempt wobbly
2017-04-02 09:25:10 -07:00
Al Viro
bee3f412d6 Merge branch 'parisc-4.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux into uaccess.parisc 2017-04-02 10:33:48 -04:00
Alexei Starovoitov
1cf1cae963 bpf: introduce BPF_PROG_TEST_RUN command
development and testing of networking bpf programs is quite cumbersome.
Despite availability of user space bpf interpreters the kernel is
the ultimate authority and execution environment.
Current test frameworks for TC include creation of netns, veth,
qdiscs and use of various packet generators just to test functionality
of a bpf program. XDP testing is even more complicated, since
qemu needs to be started with gro/gso disabled and precise queue
configuration, transferring of xdp program from host into guest,
attaching to virtio/eth0 and generating traffic from the host
while capturing the results from the guest.

Moreover analyzing performance bottlenecks in XDP program is
impossible in virtio environment, since cost of running the program
is tiny comparing to the overhead of virtio packet processing,
so performance testing can only be done on physical nic
with another server generating traffic.

Furthermore ongoing changes to user space control plane of production
applications cannot be run on the test servers leaving bpf programs
stubbed out for testing.

Last but not least, the upstream llvm changes are validated by the bpf
backend testsuite which has no ability to test the code generated.

To improve this situation introduce BPF_PROG_TEST_RUN command
to test and performance benchmark bpf programs.

Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 12:45:57 -07:00
Daniel Borkmann
79adffcd64 bpf, verifier: fix rejection of unaligned access checks for map_value_adj
Currently, the verifier doesn't reject unaligned access for map_value_adj
register types. Commit 484611357c19 ("bpf: allow access into map value
arrays") added logic to check_ptr_alignment() extending it from PTR_TO_PACKET
to also PTR_TO_MAP_VALUE_ADJ, but for PTR_TO_MAP_VALUE_ADJ no enforcement
is in place, because reg->id for PTR_TO_MAP_VALUE_ADJ reg types is never
non-zero, meaning, we can cause BPF_H/_W/_DW-based unaligned access for
architectures not supporting efficient unaligned access, and thus worst
case could raise exceptions on some archs that are unable to correct the
unaligned access or perform a different memory access to the actual
requested one and such.

i) Unaligned load with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
   on r0 (map_value_adj):

   0: (bf) r2 = r10
   1: (07) r2 += -8
   2: (7a) *(u64 *)(r2 +0) = 0
   3: (18) r1 = 0x42533a00
   5: (85) call bpf_map_lookup_elem#1
   6: (15) if r0 == 0x0 goto pc+11
    R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
   7: (61) r1 = *(u32 *)(r0 +0)
   8: (35) if r1 >= 0xb goto pc+9
    R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=10 R10=fp
   9: (07) r0 += 3
  10: (79) r7 = *(u64 *)(r0 +0)
    R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R10=fp
  11: (79) r7 = *(u64 *)(r0 +2)
    R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R7=inv R10=fp
  [...]

ii) Unaligned store with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
    on r0 (map_value_adj):

   0: (bf) r2 = r10
   1: (07) r2 += -8
   2: (7a) *(u64 *)(r2 +0) = 0
   3: (18) r1 = 0x4df16a00
   5: (85) call bpf_map_lookup_elem#1
   6: (15) if r0 == 0x0 goto pc+19
    R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
   7: (07) r0 += 3
   8: (7a) *(u64 *)(r0 +0) = 42
    R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
   9: (7a) *(u64 *)(r0 +2) = 43
    R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
  10: (7a) *(u64 *)(r0 -2) = 44
    R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
  [...]

For the PTR_TO_PACKET type, reg->id is initially zero when skb->data
was fetched, it later receives a reg->id from env->id_gen generator
once another register with UNKNOWN_VALUE type was added to it via
check_packet_ptr_add(). The purpose of this reg->id is twofold: i) it
is used in find_good_pkt_pointers() for setting the allowed access
range for regs with PTR_TO_PACKET of same id once verifier matched
on data/data_end tests, and ii) for check_ptr_alignment() to determine
that when not having efficient unaligned access and register with
UNKNOWN_VALUE was added to PTR_TO_PACKET, that we're only allowed
to access the content bytewise due to unknown unalignment. reg->id
was never intended for PTR_TO_MAP_VALUE{,_ADJ} types and thus is
always zero, the only marking is in PTR_TO_MAP_VALUE_OR_NULL that
was added after 484611357c19 via 57a09bf0a416 ("bpf: Detect identical
PTR_TO_MAP_VALUE_OR_NULL registers"). Above tests will fail for
non-root environment due to prohibited pointer arithmetic.

The fix splits register-type specific checks into their own helper
instead of keeping them combined, so we don't run into a similar
issue in future once we extend check_ptr_alignment() further and
forget to add reg->type checks for some of the checks.

Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 12:36:37 -07:00
Daniel Borkmann
fce366a9dd bpf, verifier: fix alu ops against map_value{, _adj} register types
While looking into map_value_adj, I noticed that alu operations
directly on the map_value() resp. map_value_adj() register (any
alu operation on a map_value() register will turn it into a
map_value_adj() typed register) are not sufficiently protected
against some of the operations. Two non-exhaustive examples are
provided that the verifier needs to reject:

 i) BPF_AND on r0 (map_value_adj):

  0: (bf) r2 = r10
  1: (07) r2 += -8
  2: (7a) *(u64 *)(r2 +0) = 0
  3: (18) r1 = 0xbf842a00
  5: (85) call bpf_map_lookup_elem#1
  6: (15) if r0 == 0x0 goto pc+2
   R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
  7: (57) r0 &= 8
  8: (7a) *(u64 *)(r0 +0) = 22
   R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=8 R10=fp
  9: (95) exit

  from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
  9: (95) exit
  processed 10 insns

ii) BPF_ADD in 32 bit mode on r0 (map_value_adj):

  0: (bf) r2 = r10
  1: (07) r2 += -8
  2: (7a) *(u64 *)(r2 +0) = 0
  3: (18) r1 = 0xc24eee00
  5: (85) call bpf_map_lookup_elem#1
  6: (15) if r0 == 0x0 goto pc+2
   R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
  7: (04) (u32) r0 += (u32) 0
  8: (7a) *(u64 *)(r0 +0) = 22
   R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
  9: (95) exit

  from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
  9: (95) exit
  processed 10 insns

Issue is, while min_value / max_value boundaries for the access
are adjusted appropriately, we change the pointer value in a way
that cannot be sufficiently tracked anymore from its origin.
Operations like BPF_{AND,OR,DIV,MUL,etc} on a destination register
that is PTR_TO_MAP_VALUE{,_ADJ} was probably unintended, in fact,
all the test cases coming with 484611357c19 ("bpf: allow access
into map value arrays") perform BPF_ADD only on the destination
register that is PTR_TO_MAP_VALUE_ADJ.

Only for UNKNOWN_VALUE register types such operations make sense,
f.e. with unknown memory content fetched initially from a constant
offset from the map value memory into a register. That register is
then later tested against lower / upper bounds, so that the verifier
can then do the tracking of min_value / max_value, and properly
check once that UNKNOWN_VALUE register is added to the destination
register with type PTR_TO_MAP_VALUE{,_ADJ}. This is also what the
original use-case is solving. Note, tracking on what is being
added is done through adjust_reg_min_max_vals() and later access
to the map value enforced with these boundaries and the given offset
from the insn through check_map_access_adj().

Tests will fail for non-root environment due to prohibited pointer
arithmetic, in particular in check_alu_op(), we bail out on the
is_pointer_value() check on the dst_reg (which is false in root
case as we allow for pointer arithmetic via env->allow_ptr_leaks).

Similarly to PTR_TO_PACKET, one way to fix it is to restrict the
allowed operations on PTR_TO_MAP_VALUE{,_ADJ} registers to 64 bit
mode BPF_ADD. The test_verifier suite runs fine after the patch
and it also rejects mentioned test cases.

Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-01 12:36:37 -07:00
Steven Rostedt (VMware)
5bd84629a7 ftrace: Create separate t_func_next() to simplify the function / hash logic
I noticed that if I use dd to read the set_ftrace_filter file that the first
hash command is repeated.

 # cd /sys/kernel/debug/tracing
 # echo schedule > set_ftrace_filter
 # echo do_IRQ >> set_ftrace_filter
 # echo schedule:traceoff >> set_ftrace_filter
 # echo do_IRQ:traceoff >> set_ftrace_filter

 # cat set_ftrace_filter
 schedule
 do_IRQ
 schedule:traceoff:unlimited
 do_IRQ:traceoff:unlimited

 # dd if=set_ftrace_filter bs=1
 schedule
 do_IRQ
 schedule:traceoff:unlimited
 schedule:traceoff:unlimited
 do_IRQ:traceoff:unlimited
 98+0 records in
 98+0 records out
 98 bytes copied, 0.00265011 s, 37.0 kB/s

This is due to the way t_start() calls t_next() as well as the seq_file
calls t_next() and the state is slightly different between the two. Namely,
t_start() will call t_next() with a local "pos" variable.

By separating out the function listing from t_next() into its own function,
we can have better control of outputting the functions and the hash of
triggers. This simplifies the code.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-31 18:00:45 -04:00
Steven Rostedt (VMware)
43ff926a0c ftrace: Update func_pos in t_start() when all functions are enabled
If all functions are enabled, there's a comment displayed in the file to
denote that:

  # cd /sys/kernel/debug/tracing
  # cat set_ftrace_filter
 #### all functions enabled ####

If a function trigger is set, those are displayed as well:

  # echo schedule:traceoff >> /debug/tracing/set_ftrace_filter
  # cat set_ftrace_filter
 #### all functions enabled ####
 schedule:traceoff:unlimited

But if you read that file with dd, the output can change:

  # dd if=/debug/tracing/set_ftrace_filter bs=1
 #### all functions enabled ####
 32+0 records in
 32+0 records out
 32 bytes copied, 7.0237e-05 s, 456 kB/s

This is because the "pos" variable is updated for the comment, but func_pos
is not. "func_pos" is used by the triggers (or hashes) to know how many
functions were printed and it bases its index from the pos - func_pos.
func_pos should be 1 to count for the comment printed. But since it is not,
t_hash_start() thinks that one trigger was already printed.

The cat gets to t_hash_start() via t_next() and not t_start() which updates
both pos and func_pos.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-31 18:00:37 -04:00
Steven Rostedt (VMware)
2d71d98900 ftrace: Return NULL at end of t_start() instead of calling t_hash_start()
The loop in t_start() of calling t_next() will call t_hash_start() if the
pos is beyond the functions and enters the hash items. There's no reason to
check if p is NULL and call t_hash_start(), as that would be redundant.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-31 18:00:37 -04:00
Steven Rostedt (VMware)
c20489dad1 ftrace: Assign iter->hash to filter or notrace hashes on seq read
Instead of testing if the hash to use is the filter_hash or the notrace_hash
at each iteration, do the test at open, and set the iter->hash to point to
the corresponding filter or notrace hash. Then use that directly instead of
testing which hash needs to be used each iteration.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-31 18:00:36 -04:00
Steven Rostedt (VMware)
c1bc5919f6 ftrace: Clean up __seq_open_private() return check
The return status check of __seq_open_private() is rather strange:

	iter = __seq_open_private();
	if (iter) {
		/* do stuff */
	}

	return iter ? 0 : -ENOMEM;

It makes much more sense to do the return of failure right away:

	iter = __seq_open_private();
	if (!iter)
		return -ENOMEM;

	/* do stuff */

	return 0;

This clean up will make updates to this code a bit nicer.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-31 18:00:35 -04:00
Linus Torvalds
035f0cd3f8 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This fixes the following issues:

   - memory corruption when kmalloc fails in xts/lrw

   - mark some CCP DMA channels as private

   - fix reordering race in padata

   - regression in omap-rng DT description"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: xts,lrw - fix out-of-bounds write after kmalloc failure
  crypto: ccp - Make some CCP DMA channels private
  padata: avoid race in reordering
  dt-bindings: rng: clocks property on omap_rng not always mandatory
2017-03-31 12:11:32 -07:00
Samuel Thibault
2ed2b8621b braille-console: Fix value returned by _braille_console_setup
commit bbeddf52adc1 ("printk: move braille console support into
separate braille.[ch] files") introduced _braille_console_setup()
to outline the braille initialization code.  There was however some
confusion over the value it was supposed to return. commit 2cfe6c4ac7ee
("printk: Fix return of braille_register_console()") tried to fix it
but failed to.

This fixes and documents the returned value according to the use
in printk.c: non-zero return means a parsing error, and thus this
console configuration should be ignored.

Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
Cc: Joe Perches <joe@perches.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-31 15:29:06 +02:00
Nicholas Mc Guire
5fc63f9577 timekeeping: Remove pointless conversion to bool
interp_forward is type bool so assignment from a logical operation directly
is sufficient.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1490382215-30505-1-git-send-email-der.herr@hofr.at
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-31 10:26:56 +02:00
Thomas Gleixner
9005615baf Merge branch 'fortglx/4.12/time' of https://git.linaro.org/people/john.stultz/linux into timers/core
Pull timekeeping changes from John Stultz:

 Main changes are the initial steps of Nicoli's work to make the clockevent
 timers be corrected for NTP adjustments. Then a few smaller fixes that
 I've queued, and adding Stephen Boyd to the maintainers list for
 timekeeping.
2017-03-31 09:48:00 +02:00
Zhou Chengming
72f04b50d6 livepatch: Reduce the time of finding module symbols
It's reported that the time of insmoding a klp.ko for one of our
out-tree modules is too long.

~ time sudo insmod klp.ko
real	0m23.799s
user	0m0.036s
sys	0m21.256s

Then we found the reason: our out-tree module used a lot of static local
variables, so klp.ko has a lot of relocation records which reference the
module. Then for each such entry klp_find_object_symbol() is called to
resolve it, but this function uses the interface kallsyms_on_each_symbol()
even for finding module symbols, so will waste a lot of time on walking
through vmlinux kallsyms table many times.

This patch changes it to use module_kallsyms_on_each_symbol() for modules
symbols. After we apply this patch, the sys time reduced dramatically.

~ time sudo insmod klp.ko
real	0m1.007s
user	0m0.032s
sys	0m0.924s

Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-30 10:41:38 +02:00
Chris Wilson
57dd924e54 locking/ww-mutex: Limit stress test to 2 seconds
Use a timeout rather than a fixed number of loops to avoid running for
very long periods, such as under the kbuilder VMs.

Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170310105733.6444-1-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:49:47 +02:00
Ingo Molnar
c69f203df3 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:48:58 +02:00
Yuyang Du
a481db34b9 sched/fair: Optimize ___update_sched_avg()
The main PELT function ___update_load_avg(), which implements the
accumulation and progression of the geometric average series, is
implemented along the following lines for the scenario where the time
delta spans all 3 possible sections (see figure below):

  1. add the remainder of the last incomplete period
  2. decay old sum
  3. accumulate new sum in full periods since last_update_time
  4. accumulate the current incomplete period
  5. update averages

Or:

            d1          d2           d3
            ^           ^            ^
            |           |            |
          |<->|<----------------->|<--->|
  ... |---x---|------| ... |------|-----x (now)

  load_sum' = (load_sum + weight * scale * d1) * y^(p+1) +	(1,2)

                                        p
	      weight * scale * 1024 * \Sum y^n +		(3)
                                       n=1

	      weight * scale * d3 * y^0				(4)

  load_avg' = load_sum' / LOAD_AVG_MAX				(5)

Where:

 d1 - is the delta part completing the remainder of the last
      incomplete period,
 d2 - is the delta part spannind complete periods, and
 d3 - is the delta part starting the current incomplete period.

We can simplify the code in two steps; the first step is to separate
the first term into new and old parts like:

  (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
					       weight * scale * d1 * y^(p+1)

Once we've done that, its easy to see that all new terms carry the
common factors:

  weight * scale

If we factor those out, we arrive at the form:

  load_sum' = load_sum * y^(p+1) +

	      weight * scale * (d1 * y^(p+1) +

					 p
			        1024 * \Sum y^n +
					n=1

				d3 * y^0)

Which results in a simpler, smaller and faster implementation.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: matt@codeblueprint.co.uk
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:43:41 +02:00
Peter Zijlstra
0ccb977f4c sched/fair: Explicitly generate __update_load_avg() instances
The __update_load_avg() function is an __always_inline because its
used with constant propagation to generate different variants of the
code without having to duplicate it (which would be prone to bugs).

Explicitly instantiate the 3 variants.

Note that most of this is called from rather hot paths, so reducing
branches is good.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:43:40 +02:00