IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
We should deboost before waking the high-priority task, such that we
don't run two tasks with the same "state" (priority, deadline,
sched_class, etc).
In order to make sure the boosting task doesn't start running between
unlock and deboost (due to 'spurious' wakeup), we move the deboost
under the wait_lock, that way its serialized against the wait loop in
__rt_mutex_slowlock().
Doing the deboost early can however lead to priority-inversion if
current would get preempted after the deboost but before waking our
high-prio task, hence we disable preemption before doing deboost, and
enabling it after the wake up is over.
This gets us the right semantic order, but most importantly however;
this change ensures pointer stability for the next patch, where we
have rt_mutex_setprio() cache a pointer to the top-most waiter task.
If we, as before this change, do the wakeup first and then deboost,
this pointer might point into thin air.
[peterz: Changelog + patch munging]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Relying on free_reserved_area() to call ftrace to free init memory proved to
not be sufficient. The issue is that on x86, when debug_pagealloc is
enabled, the init memory is not freed, but simply set as not present. Since
ftrace was uninformed of this, starting function tracing still tries to
update pages that are not present according to the page tables, causing
ftrace to bug, as well as killing the kernel itself.
Instead of relying on free_reserved_area(), have init/main.c call ftrace
directly just before it frees the init memory. Then it needs to use
__init_begin and __init_end to know where the init memory location is.
Looking at all archs (and testing what I can), it appears that this should
work for each of them.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull scheduler fixes from Thomas Gleixner:
"This update provides:
- make the scheduler clock switch to unstable mode smooth so the
timestamps stay at microseconds granularity instead of switching to
tick granularity.
- unbreak perf test tsc by taking the new offset into account which
was added in order to proveide better sched clock continuity
- switching sched clock to unstable mode runs all clock related
computations which affect the sched clock output itself from a work
queue. In case of preemption sched clock uses half updated data and
provides wrong timestamps. Keep the math in the protected context
and delegate only the static key switch to workqueue context.
- remove a duplicate header include"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/headers: Remove duplicate #include <linux/sched/debug.h> line
sched/clock: Fix broken stable to unstable transfer
sched/clock, x86/perf: Fix "perf test tsc"
sched/clock: Fix clear_sched_clock_stable() preempt wobbly
development and testing of networking bpf programs is quite cumbersome.
Despite availability of user space bpf interpreters the kernel is
the ultimate authority and execution environment.
Current test frameworks for TC include creation of netns, veth,
qdiscs and use of various packet generators just to test functionality
of a bpf program. XDP testing is even more complicated, since
qemu needs to be started with gro/gso disabled and precise queue
configuration, transferring of xdp program from host into guest,
attaching to virtio/eth0 and generating traffic from the host
while capturing the results from the guest.
Moreover analyzing performance bottlenecks in XDP program is
impossible in virtio environment, since cost of running the program
is tiny comparing to the overhead of virtio packet processing,
so performance testing can only be done on physical nic
with another server generating traffic.
Furthermore ongoing changes to user space control plane of production
applications cannot be run on the test servers leaving bpf programs
stubbed out for testing.
Last but not least, the upstream llvm changes are validated by the bpf
backend testsuite which has no ability to test the code generated.
To improve this situation introduce BPF_PROG_TEST_RUN command
to test and performance benchmark bpf programs.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the verifier doesn't reject unaligned access for map_value_adj
register types. Commit 484611357c19 ("bpf: allow access into map value
arrays") added logic to check_ptr_alignment() extending it from PTR_TO_PACKET
to also PTR_TO_MAP_VALUE_ADJ, but for PTR_TO_MAP_VALUE_ADJ no enforcement
is in place, because reg->id for PTR_TO_MAP_VALUE_ADJ reg types is never
non-zero, meaning, we can cause BPF_H/_W/_DW-based unaligned access for
architectures not supporting efficient unaligned access, and thus worst
case could raise exceptions on some archs that are unable to correct the
unaligned access or perform a different memory access to the actual
requested one and such.
i) Unaligned load with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
on r0 (map_value_adj):
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = 0x42533a00
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+11
R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
7: (61) r1 = *(u32 *)(r0 +0)
8: (35) if r1 >= 0xb goto pc+9
R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=10 R10=fp
9: (07) r0 += 3
10: (79) r7 = *(u64 *)(r0 +0)
R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R10=fp
11: (79) r7 = *(u64 *)(r0 +2)
R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R1=inv,min_value=0,max_value=10 R7=inv R10=fp
[...]
ii) Unaligned store with !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
on r0 (map_value_adj):
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = 0x4df16a00
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+19
R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
7: (07) r0 += 3
8: (7a) *(u64 *)(r0 +0) = 42
R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
9: (7a) *(u64 *)(r0 +2) = 43
R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
10: (7a) *(u64 *)(r0 -2) = 44
R0=map_value_adj(ks=8,vs=48,id=0),min_value=3,max_value=3 R10=fp
[...]
For the PTR_TO_PACKET type, reg->id is initially zero when skb->data
was fetched, it later receives a reg->id from env->id_gen generator
once another register with UNKNOWN_VALUE type was added to it via
check_packet_ptr_add(). The purpose of this reg->id is twofold: i) it
is used in find_good_pkt_pointers() for setting the allowed access
range for regs with PTR_TO_PACKET of same id once verifier matched
on data/data_end tests, and ii) for check_ptr_alignment() to determine
that when not having efficient unaligned access and register with
UNKNOWN_VALUE was added to PTR_TO_PACKET, that we're only allowed
to access the content bytewise due to unknown unalignment. reg->id
was never intended for PTR_TO_MAP_VALUE{,_ADJ} types and thus is
always zero, the only marking is in PTR_TO_MAP_VALUE_OR_NULL that
was added after 484611357c19 via 57a09bf0a416 ("bpf: Detect identical
PTR_TO_MAP_VALUE_OR_NULL registers"). Above tests will fail for
non-root environment due to prohibited pointer arithmetic.
The fix splits register-type specific checks into their own helper
instead of keeping them combined, so we don't run into a similar
issue in future once we extend check_ptr_alignment() further and
forget to add reg->type checks for some of the checks.
Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking into map_value_adj, I noticed that alu operations
directly on the map_value() resp. map_value_adj() register (any
alu operation on a map_value() register will turn it into a
map_value_adj() typed register) are not sufficiently protected
against some of the operations. Two non-exhaustive examples are
provided that the verifier needs to reject:
i) BPF_AND on r0 (map_value_adj):
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = 0xbf842a00
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+2
R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
7: (57) r0 &= 8
8: (7a) *(u64 *)(r0 +0) = 22
R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=8 R10=fp
9: (95) exit
from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
9: (95) exit
processed 10 insns
ii) BPF_ADD in 32 bit mode on r0 (map_value_adj):
0: (bf) r2 = r10
1: (07) r2 += -8
2: (7a) *(u64 *)(r2 +0) = 0
3: (18) r1 = 0xc24eee00
5: (85) call bpf_map_lookup_elem#1
6: (15) if r0 == 0x0 goto pc+2
R0=map_value(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
7: (04) (u32) r0 += (u32) 0
8: (7a) *(u64 *)(r0 +0) = 22
R0=map_value_adj(ks=8,vs=48,id=0),min_value=0,max_value=0 R10=fp
9: (95) exit
from 6 to 9: R0=inv,min_value=0,max_value=0 R10=fp
9: (95) exit
processed 10 insns
Issue is, while min_value / max_value boundaries for the access
are adjusted appropriately, we change the pointer value in a way
that cannot be sufficiently tracked anymore from its origin.
Operations like BPF_{AND,OR,DIV,MUL,etc} on a destination register
that is PTR_TO_MAP_VALUE{,_ADJ} was probably unintended, in fact,
all the test cases coming with 484611357c19 ("bpf: allow access
into map value arrays") perform BPF_ADD only on the destination
register that is PTR_TO_MAP_VALUE_ADJ.
Only for UNKNOWN_VALUE register types such operations make sense,
f.e. with unknown memory content fetched initially from a constant
offset from the map value memory into a register. That register is
then later tested against lower / upper bounds, so that the verifier
can then do the tracking of min_value / max_value, and properly
check once that UNKNOWN_VALUE register is added to the destination
register with type PTR_TO_MAP_VALUE{,_ADJ}. This is also what the
original use-case is solving. Note, tracking on what is being
added is done through adjust_reg_min_max_vals() and later access
to the map value enforced with these boundaries and the given offset
from the insn through check_map_access_adj().
Tests will fail for non-root environment due to prohibited pointer
arithmetic, in particular in check_alu_op(), we bail out on the
is_pointer_value() check on the dst_reg (which is false in root
case as we allow for pointer arithmetic via env->allow_ptr_leaks).
Similarly to PTR_TO_PACKET, one way to fix it is to restrict the
allowed operations on PTR_TO_MAP_VALUE{,_ADJ} registers to 64 bit
mode BPF_ADD. The test_verifier suite runs fine after the patch
and it also rejects mentioned test cases.
Fixes: 484611357c19 ("bpf: allow access into map value arrays")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed that if I use dd to read the set_ftrace_filter file that the first
hash command is repeated.
# cd /sys/kernel/debug/tracing
# echo schedule > set_ftrace_filter
# echo do_IRQ >> set_ftrace_filter
# echo schedule:traceoff >> set_ftrace_filter
# echo do_IRQ:traceoff >> set_ftrace_filter
# cat set_ftrace_filter
schedule
do_IRQ
schedule:traceoff:unlimited
do_IRQ:traceoff:unlimited
# dd if=set_ftrace_filter bs=1
schedule
do_IRQ
schedule:traceoff:unlimited
schedule:traceoff:unlimited
do_IRQ:traceoff:unlimited
98+0 records in
98+0 records out
98 bytes copied, 0.00265011 s, 37.0 kB/s
This is due to the way t_start() calls t_next() as well as the seq_file
calls t_next() and the state is slightly different between the two. Namely,
t_start() will call t_next() with a local "pos" variable.
By separating out the function listing from t_next() into its own function,
we can have better control of outputting the functions and the hash of
triggers. This simplifies the code.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
If all functions are enabled, there's a comment displayed in the file to
denote that:
# cd /sys/kernel/debug/tracing
# cat set_ftrace_filter
#### all functions enabled ####
If a function trigger is set, those are displayed as well:
# echo schedule:traceoff >> /debug/tracing/set_ftrace_filter
# cat set_ftrace_filter
#### all functions enabled ####
schedule:traceoff:unlimited
But if you read that file with dd, the output can change:
# dd if=/debug/tracing/set_ftrace_filter bs=1
#### all functions enabled ####
32+0 records in
32+0 records out
32 bytes copied, 7.0237e-05 s, 456 kB/s
This is because the "pos" variable is updated for the comment, but func_pos
is not. "func_pos" is used by the triggers (or hashes) to know how many
functions were printed and it bases its index from the pos - func_pos.
func_pos should be 1 to count for the comment printed. But since it is not,
t_hash_start() thinks that one trigger was already printed.
The cat gets to t_hash_start() via t_next() and not t_start() which updates
both pos and func_pos.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The loop in t_start() of calling t_next() will call t_hash_start() if the
pos is beyond the functions and enters the hash items. There's no reason to
check if p is NULL and call t_hash_start(), as that would be redundant.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of testing if the hash to use is the filter_hash or the notrace_hash
at each iteration, do the test at open, and set the iter->hash to point to
the corresponding filter or notrace hash. Then use that directly instead of
testing which hash needs to be used each iteration.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The return status check of __seq_open_private() is rather strange:
iter = __seq_open_private();
if (iter) {
/* do stuff */
}
return iter ? 0 : -ENOMEM;
It makes much more sense to do the return of failure right away:
iter = __seq_open_private();
if (!iter)
return -ENOMEM;
/* do stuff */
return 0;
This clean up will make updates to this code a bit nicer.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull crypto fixes from Herbert Xu:
"This fixes the following issues:
- memory corruption when kmalloc fails in xts/lrw
- mark some CCP DMA channels as private
- fix reordering race in padata
- regression in omap-rng DT description"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: xts,lrw - fix out-of-bounds write after kmalloc failure
crypto: ccp - Make some CCP DMA channels private
padata: avoid race in reordering
dt-bindings: rng: clocks property on omap_rng not always mandatory
interp_forward is type bool so assignment from a logical operation directly
is sufficient.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Cc: "Christopher S. Hall" <christopher.s.hall@intel.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1490382215-30505-1-git-send-email-der.herr@hofr.at
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull timekeeping changes from John Stultz:
Main changes are the initial steps of Nicoli's work to make the clockevent
timers be corrected for NTP adjustments. Then a few smaller fixes that
I've queued, and adding Stephen Boyd to the maintainers list for
timekeeping.
It's reported that the time of insmoding a klp.ko for one of our
out-tree modules is too long.
~ time sudo insmod klp.ko
real 0m23.799s
user 0m0.036s
sys 0m21.256s
Then we found the reason: our out-tree module used a lot of static local
variables, so klp.ko has a lot of relocation records which reference the
module. Then for each such entry klp_find_object_symbol() is called to
resolve it, but this function uses the interface kallsyms_on_each_symbol()
even for finding module symbols, so will waste a lot of time on walking
through vmlinux kallsyms table many times.
This patch changes it to use module_kallsyms_on_each_symbol() for modules
symbols. After we apply this patch, the sys time reduced dramatically.
~ time sudo insmod klp.ko
real 0m1.007s
user 0m0.032s
sys 0m0.924s
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Use a timeout rather than a fixed number of loops to avoid running for
very long periods, such as under the kbuilder VMs.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170310105733.6444-1-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The main PELT function ___update_load_avg(), which implements the
accumulation and progression of the geometric average series, is
implemented along the following lines for the scenario where the time
delta spans all 3 possible sections (see figure below):
1. add the remainder of the last incomplete period
2. decay old sum
3. accumulate new sum in full periods since last_update_time
4. accumulate the current incomplete period
5. update averages
Or:
d1 d2 d3
^ ^ ^
| | |
|<->|<----------------->|<--->|
... |---x---|------| ... |------|-----x (now)
load_sum' = (load_sum + weight * scale * d1) * y^(p+1) + (1,2)
p
weight * scale * 1024 * \Sum y^n + (3)
n=1
weight * scale * d3 * y^0 (4)
load_avg' = load_sum' / LOAD_AVG_MAX (5)
Where:
d1 - is the delta part completing the remainder of the last
incomplete period,
d2 - is the delta part spannind complete periods, and
d3 - is the delta part starting the current incomplete period.
We can simplify the code in two steps; the first step is to separate
the first term into new and old parts like:
(load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
weight * scale * d1 * y^(p+1)
Once we've done that, its easy to see that all new terms carry the
common factors:
weight * scale
If we factor those out, we arrive at the form:
load_sum' = load_sum * y^(p+1) +
weight * scale * (d1 * y^(p+1) +
p
1024 * \Sum y^n +
n=1
d3 * y^0)
Which results in a simpler, smaller and faster implementation.
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: matt@codeblueprint.co.uk
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The __update_load_avg() function is an __always_inline because its
used with constant propagation to generate different variants of the
code without having to duplicate it (which would be prone to bugs).
Explicitly instantiate the 3 variants.
Note that most of this is called from rather hot paths, so reducing
branches is good.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJY2C9qAAoJEHm+PkMAQRiGaBQIAIGzdlZ6ImiP6zoukrRv7qUr
44ITm0lsBiL85QGedhQQL+Y9UqwUmlqgFqnH0Gr8YHNbLJWXzdjGbl5aVo4KjASq
104NLUDXtPww/xZdH4wJMzhuwucYwZOUyDOjOr0ak3cGxOE2xjNjHMZXxWUf20GO
EpRr6WhV1DUAvAdjdNa9KlcOjMluNpMLLyL1CFLjrkkArrWAyqOURKHAb6ZLghfv
iZV1qJTVPyYGpnlI3kuEgu2GuDjxqpoNLSr3wHyEHm/pBPEl7MX6zPbzcegBV8TY
cRRlXo4notdsuknmSNcj0hHuTQvw1kl7BhieLKVsnCyCIM6jjX4TSQZFutmbzwM=
=5iRl
-----END PGP SIGNATURE-----
Backmerge tag 'v4.11-rc4' into drm-next
Linux 4.11-rc4
The i915 GVT team need the rc4 code to base some more code on.
We switched from "struct task_struct"->security to "struct cred"->security
in Linux 2.6.29. But not all LSM modules were happy with that change.
TOMOYO LSM module is an example which want to use per "struct task_struct"
security blob, for TOMOYO's security context is defined based on "struct
task_struct" rather than "struct cred". AppArmor LSM module is another
example which want to use it, for AppArmor is currently abusing the cred
a little bit to store the change_hat and setexeccon info. Although
security_task_free() hook was revived in Linux 3.4 because Yama LSM module
wanted to release per "struct task_struct" security blob,
security_task_alloc() hook and "struct task_struct"->security field were
not revived. Nowadays, we are getting proposals of lightweight LSM modules
which want to use per "struct task_struct" security blob.
We are already allowing multiple concurrent LSM modules (up to one fully
armored module which uses "struct cred"->security field or exclusive hooks
like security_xfrm_state_pol_flow_match(), plus unlimited number of
lightweight modules which do not use "struct cred"->security nor exclusive
hooks) as long as they are built into the kernel. But this patch does not
implement variable length "struct task_struct"->security field which will
become needed when multiple LSM modules want to use "struct task_struct"->
security field. Although it won't be difficult to implement variable length
"struct task_struct"->security field, let's think about it after we merged
this patch.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Tested-by: Djalal Harouni <tixxdz@gmail.com>
Acked-by: José Bollo <jobol@nonadev.net>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: James Morris <james.l.morris@oracle.com>
Cc: José Bollo <jobol@nonadev.net>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Commit 5b52330bbfe6 ("audit: fix auditd/kernel connection state
tracking") made inlining audit_signal_info() a bit pointless as
it was always calling into auditd_test_task() so let's remove the
inline function in kernel/audit.h and convert __audit_signal_info()
in kernel/auditsc.c into audit_signal_info().
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Use BUG_ON() rather than an explicit if followed by BUG() for
improved readability and also consistency.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Signed-off-by: Tejun Heo <tj@kernel.org>
When it is determined that the clock is actually unstable, and
we switch from stable to unstable, the __clear_sched_clock_stable()
function is eventually called.
In this function we set gtod_offset so the following holds true:
sched_clock() + raw_offset == ktime_get_ns() + gtod_offset
But instead of getting the latest timestamps, we use the last values
from scd, so instead of sched_clock() we use scd->tick_raw, and
instead of ktime_get_ns() we use scd->tick_gtod.
However, later, when we use gtod_offset sched_clock_local() we do not
add it to scd->tick_gtod to calculate the correct clock value when we
determine the boundaries for min/max clocks.
This can result in tick granularity sched_clock() values, so fix it.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Fixes: 5680d8094ffa ("sched/clock: Provide better clock continuity")
Link: http://lkml.kernel.org/r/1490214265-899964-2-git-send-email-pasha.tatashin@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the child domain prefers tasks to go siblings, the local group could
end up pulling tasks to itself even if the local group is almost equally
loaded as the source group.
Lets assume a 4 core,smt==2 machine running 5 thread ebizzy workload.
Everytime, local group has capacity and source group has atleast 2 threads,
local group tries to pull the task. This causes the threads to constantly
move between different cores. This is even more profound if the cores have
more threads, like in Power 8, smt 8 mode.
Fix this by only allowing local group to pull a task, if the source group
has more number of tasks than the local group.
Here are the relevant perf stat numbers of a 22 core,smt 8 Power 8 machine.
Without patch:
Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
1,440 context-switches # 0.001 K/sec ( +- 1.26% )
366 cpu-migrations # 0.000 K/sec ( +- 5.58% )
3,933 page-faults # 0.002 K/sec ( +- 11.08% )
Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
6,287 context-switches # 0.001 K/sec ( +- 3.65% )
3,776 cpu-migrations # 0.001 K/sec ( +- 4.84% )
5,702 page-faults # 0.001 K/sec ( +- 9.36% )
Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
8,776 context-switches # 0.001 K/sec ( +- 0.73% )
2,790 cpu-migrations # 0.000 K/sec ( +- 0.98% )
10,540 page-faults # 0.001 K/sec ( +- 3.12% )
With patch:
Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
1,133 context-switches # 0.001 K/sec ( +- 4.72% )
123 cpu-migrations # 0.000 K/sec ( +- 3.42% )
3,858 page-faults # 0.002 K/sec ( +- 8.52% )
Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
2,169 context-switches # 0.000 K/sec ( +- 6.19% )
189 cpu-migrations # 0.000 K/sec ( +- 12.75% )
5,917 page-faults # 0.001 K/sec ( +- 8.09% )
Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
5,333 context-switches # 0.001 K/sec ( +- 5.91% )
506 cpu-migrations # 0.000 K/sec ( +- 3.35% )
10,792 page-faults # 0.001 K/sec ( +- 7.75% )
Which show that in these workloads CPU migrations get reduced significantly.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit 383776fa7527 ("locking/lockdep: Handle statically initialized
PER_CPU locks properly") we try to collapse per-cpu locks into a single
class by giving them all the same key. For this key we choose the canonical
address of the per-cpu object, which would be the offset into the per-cpu
area.
This has two problems:
- there is a case where we run !0 lock->key through static_obj() and
expect this to pass; it doesn't for canonical pointers.
- 0 is a valid canonical address.
Cure both issues by redefining the canonical address as the address of the
per-cpu variable on the boot CPU.
Since I didn't want to rely on CPU0 being the boot-cpu, or even existing at
all, track the boot CPU in a variable.
Fixes: 383776fa7527 ("locking/lockdep: Handle statically initialized PER_CPU locks properly")
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-mm@kvack.org
Cc: wfg@linux.intel.com
Cc: kernel test robot <fengguang.wu@intel.com>
Cc: LKP <lkp@01.org>
Link: http://lkml.kernel.org/r/20170320114108.kbvcsuepem45j5cr@hirez.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull audit fix from Paul Moore:
"We've got an audit fix, and unfortunately it is big.
While I'm not excited that we need to be sending you something this
large during the -rcX phase, it does fix some very real, and very
tangled, problems relating to locking, backlog queues, and the audit
daemon connection.
This code has passed our testsuite without problem and it has held up
to my ad-hoc stress tests (arguably better than the existing code),
please consider pulling this as fix for the next v4.11-rcX tag"
* 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
audit: fix auditd/kernel connection state tracking
llvm can optimize the 'if (ptr > data_end)' checks to be in the order
slightly different than the original C code which will confuse verifier.
Like:
if (ptr + 16 > data_end)
return TC_ACT_SHOT;
// may be followed by
if (ptr + 14 > data_end)
return TC_ACT_SHOT;
while llvm can see that 'ptr' is valid for all 16 bytes,
the verifier could not.
Fix verifier logic to account for such case and add a test.
Reported-by: Huapeng Zhou <hzhou@fb.com>
Fixes: 969bf05eb3ce ("bpf: direct packet access")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently trace_handle_return() looks like this:
static inline enum print_line_t trace_handle_return(struct trace_seq *s)
{
return trace_seq_has_overflowed(s) ?
TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
}
Where trace_seq_overflowed(s) is:
static inline bool trace_seq_has_overflowed(struct trace_seq *s)
{
return s->full || seq_buf_has_overflowed(&s->seq);
}
And seq_buf_has_overflowed(&s->seq) is:
static inline bool
seq_buf_has_overflowed(struct seq_buf *s)
{
return s->len > s->size;
}
Making trace_handle_return() into:
return (s->full || (s->seq->len > s->seq->size)) ?
TRACE_TYPE_PARTIAL_LINE :
TRACE_TYPE_HANDLED;
One would think this is not an issue to keep as an inline. But because this
is used in the TRACE_EVENT() macro, it is extended for every tracepoint in
the system. Taking a look at a single tracepoint x86_irq_vector (was the
first one I randomly chosen). As trace_handle_return is used in the
TRACE_EVENT() macro of trace_raw_output_##call() we disassemble
trace_raw_output_x86_irq_vector and do a diff:
- is the original
+ is the out-of-line code
I removed identical lines that were different just due to different
addresses.
--- /tmp/irq-vec-orig 2017-03-16 09:12:48.569384851 -0400
+++ /tmp/irq-vec-ool 2017-03-16 09:13:39.378153385 -0400
@@ -6,27 +6,23 @@
53 push %rbx
48 89 fb mov %rdi,%rbx
4c 8b a7 c0 20 00 00 mov 0x20c0(%rdi),%r12
e8 f7 72 13 00 callq ffffffff81155c80 <trace_raw_output_prep>
83 f8 01 cmp $0x1,%eax
74 05 je ffffffff8101e993 <trace_raw_output_x86_irq_vector+0x23>
5b pop %rbx
41 5c pop %r12
5d pop %rbp
c3 retq
41 8b 54 24 08 mov 0x8(%r12),%edx
- 48 8d bb 98 10 00 00 lea 0x1098(%rbx),%rdi
+ 48 81 c3 98 10 00 00 add $0x1098,%rbx
- 48 c7 c6 7b 8a a0 81 mov $0xffffffff81a08a7b,%rsi
+ 48 c7 c6 ab 8a a0 81 mov $0xffffffff81a08aab,%rsi
- e8 c5 85 13 00 callq ffffffff81156f70 <trace_seq_printf>
=== here's the start of the main difference ===
+ 48 89 df mov %rbx,%rdi
+ e8 62 7e 13 00 callq ffffffff81156810 <trace_seq_printf>
- 8b 93 b8 20 00 00 mov 0x20b8(%rbx),%edx
- 31 c0 xor %eax,%eax
- 85 d2 test %edx,%edx
- 75 11 jne ffffffff8101e9c8 <trace_raw_output_x86_irq_vector+0x58>
- 48 8b 83 a8 20 00 00 mov 0x20a8(%rbx),%rax
- 48 39 83 a0 20 00 00 cmp %rax,0x20a0(%rbx)
- 0f 93 c0 setae %al
+ 48 89 df mov %rbx,%rdi
+ e8 4a c5 12 00 callq ffffffff8114af00 <trace_handle_return>
5b pop %rbx
- 0f b6 c0 movzbl %al,%eax
=== end ===
41 5c pop %r12
5d pop %rbp
c3 retq
If you notice, the original has 22 bytes of text more than the out of line
version. As this is for every TRACE_EVENT() defined in the system, this can
become quite large.
text data bss dec hex filename
8690305 5450490 1298432 15439227 eb957b vmlinux-orig
8681725 5450490 1298432 15430647 eb73f7 vmlinux-handle
This change has a total of 8580 bytes in savings.
$ objdump -dr /tmp/vmlinux-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
324
That's 324 tracepoints. But this does not include modules (which contain
many more tracepoints). For an allyesconfig build:
$ objdump -dr vmlinux-allyes-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
1401
That's 1401 tracepoints giving us:
text data bss dec hex filename
137920629 140221067 53264384 331406080 13c0db00 vmlinux-allyes-orig
137827709 140221067 53264384 331313160 13bf7008 vmlinux-allyes-handle
92920 bytes in savings!!!
Link: http://lkml.kernel.org/r/20170315021431.13107-2-andi@firstfloor.org
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Adding a hook into free_reserve_area() that informs ftrace that boot up init
text is being free, lets ftrace safely remove those init functions from its
records, which keeps ftrace from trying to modify text that no longer
exists.
Note, this still does not allow for tracing .init text of modules, as
modules require different work for freeing its init code.
Link: http://lkml.kernel.org/r/1488502497.7212.24.camel@linux.intel.com
Cc: linux-mm@kvack.org
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Requested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Register the function tracer right after the tracing buffers are initialized
in early boot up. This will allow function tracing to begin early if it is
enabled via the kernel command line.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As tracing can now be enabled very early in boot up, even before some
critical system services (like scheduling), do not run the tracer selftests
until after early_initcall() is performed. If a tracer is registered before
such time, it is saved off in a list and the test is run when the system is
able to handle more diverse functions.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Create an early_trace_init() function that will initialize the buffers and
allow for ealier use of trace_printk(). This will also allow for future work
to have function tracing start earlier at boot up.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There is no need to always call blocking console_lock() in
console_cpu_notify(), it's quite possible that console_sem can
be locked by other CPU on the system, either already printing
or soon to begin printing the messages. console_lock() in this
case can simply block CPU hotplug for unknown period of time
(console_unlock() is time unbound). Not that hotplug is very
fast, but still, with other CPUs being online and doing
printk() console_cpu_notify() can stuck.
Use console_trylock() instead and opt-out if console_sem is
already acquired from another CPU, since that CPU will do
the printing for us.
Link: http://lkml.kernel.org/r/20170121104729.8585-1-sergey.senozhatsky@gmail.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
This patch fix spelling typos found in
Documentation/output/xml/driver-api/basics.xml.
It is because the xml file was generated from comments in source,
so I had to fix the comments.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Under extremely heavy uses of padata, crashes occur, and with list
debugging turned on, this happens instead:
[87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
__list_add+0xae/0x130
[87487.301868] list_add corruption. prev->next should be next
(ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
[87487.339011] [<ffffffff9a53d075>] dump_stack+0x68/0xa3
[87487.342198] [<ffffffff99e119a1>] ? console_unlock+0x281/0x6d0
[87487.345364] [<ffffffff99d6b91f>] __warn+0xff/0x140
[87487.348513] [<ffffffff99d6b9aa>] warn_slowpath_fmt+0x4a/0x50
[87487.351659] [<ffffffff9a58b5de>] __list_add+0xae/0x130
[87487.354772] [<ffffffff9add5094>] ? _raw_spin_lock+0x64/0x70
[87487.357915] [<ffffffff99eefd66>] padata_reorder+0x1e6/0x420
[87487.361084] [<ffffffff99ef0055>] padata_do_serial+0xa5/0x120
padata_reorder calls list_add_tail with the list to which its adding
locked, which seems correct:
spin_lock(&squeue->serial.lock);
list_add_tail(&padata->list, &squeue->serial.list);
spin_unlock(&squeue->serial.lock);
This therefore leaves only place where such inconsistency could occur:
if padata->list is added at the same time on two different threads.
This pdata pointer comes from the function call to
padata_get_next(pd), which has in it the following block:
next_queue = per_cpu_ptr(pd->pqueue, cpu);
padata = NULL;
reorder = &next_queue->reorder;
if (!list_empty(&reorder->list)) {
padata = list_entry(reorder->list.next,
struct padata_priv, list);
spin_lock(&reorder->lock);
list_del_init(&padata->list);
atomic_dec(&pd->reorder_objects);
spin_unlock(&reorder->lock);
pd->processed++;
goto out;
}
out:
return padata;
I strongly suspect that the problem here is that two threads can race
on reorder list. Even though the deletion is locked, call to
list_entry is not locked, which means it's feasible that two threads
pick up the same padata object and subsequently call list_add_tail on
them at the same time. The fix is thus be hoist that lock outside of
that block.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
- Make intel_pstate use one set of global P-state limits in the
active mode regardless of the scaling_governor settings for
individual CPUs instead of switching back and forth between two
of them in a way that is hard to control (Rafael Wysocki).
- Drop a useless function from intel_pstate to prevent it from
modifying the maximum supported frequency value unexpectedly
which may confuse the cpufreq core (Rafael Wysocki).
- Fix the cpufreq core to restore policy limits on CPU online so
that the limits are not reset over system suspend/resume, among
other things (Viresh Kumar).
- Fix the initialization of the schedutil cpufreq governor to
make the IO-wait boosting mechanism in it actually work on
systems with one CPU per cpufreq policy (Rafael Wysocki).
- Add a sanity check to the cpuidle core to prevent crashes from
happening if the architecture code initialization fails to set
up things as expected (Vaidyanathan Srinivasan).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJY1HLGAAoJEILEb/54YlRxjAsQAIFcYfJKosA8IlmKcricR/WH
CqBMCH9S6Y5YsggrFjcj3lVJbf5P43/h3U++O+/97lJsevfp4inCChpvVSQIWX4v
xzk2v+5Ms8ROqVSLy34yTaB0ysCC//J5FvdQLsj0Zw9W/8yvi0DosPfeiAD7sYeb
4qkJK3+yv5sLtZ41FmdVYabhC5KHQSAV6p6X+KOZnFV8cm+8TfOSERhStXASMTGc
tvDpjIjPA1GLpHYdOK4UQ+Er1Hgwk2fNX7eXrpHh7QCQx4eZEN+g7DAC95Ify9Am
gkTFc5eUfOFKU5KMshdQh6gnfoNaKi4d3E/ahmnU+KQuyKiy4KMyTNcuUBszSaDM
ZTm0GooseV0UajaLH08BNbfpqDsiKc2fm1qkCQkXxpGjs80/bYn5gK+fpvkziq9x
210Wc7XTWf7JfmPs0d3gZekaohHtqJVigCuA4dXH6kvbDvbDcfKOza6rqNmpSFrQ
ifWH6M12Ut/G5NfwihhTRhoKkeQjqHFgikNC8BjF2Myem20026Vr6MKZrubDAlkq
VWP7lT2zNSs1btsNqDrA9+wejwK8OwwrpfZOx3hbYL6Q+u/AuljIJ79aRz8ROZcE
jQZeOKprmlAaDIASdxIM4yjwzSQE0l/CaHteHfKaddmcWTrtKR+CEfU0ODzYWoK0
dcQwyMF3tOToj7BjWqKM
=jy4C
-----END PGP SIGNATURE-----
Merge tag 'pm-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"One of these is an intel_pstate regression fix and it is not a small
change, but it mostly removes code that shouldn't be there. That code
was acquired by mistake and has been a source of constant pain since
then, so the time has come to get rid of it finally. We have not seen
problems with this change in the lab, so fingers crossed.
The rest is more usual: one more intel_pstate commit removing useless
code, a cpufreq core fix to make it restore policy limits on CPU
online (which prevents the limits from being reset over system
suspend/resume), a schedutil cpufreq governor initialization fix to
make it actually work as advertised on all systems and an extra sanity
check in the cpuidle core to prevent crashes from happening if the
arch code messes things up.
Specifics:
- Make intel_pstate use one set of global P-state limits in the
active mode regardless of the scaling_governor settings for
individual CPUs instead of switching back and forth between two of
them in a way that is hard to control (Rafael Wysocki).
- Drop a useless function from intel_pstate to prevent it from
modifying the maximum supported frequency value unexpectedly which
may confuse the cpufreq core (Rafael Wysocki).
- Fix the cpufreq core to restore policy limits on CPU online so that
the limits are not reset over system suspend/resume, among other
things (Viresh Kumar).
- Fix the initialization of the schedutil cpufreq governor to make
the IO-wait boosting mechanism in it actually work on systems with
one CPU per cpufreq policy (Rafael Wysocki).
- Add a sanity check to the cpuidle core to prevent crashes from
happening if the architecture code initialization fails to set up
things as expected (Vaidyanathan Srinivasan)"
* tag 'pm-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Restore policy min/max limits on CPU online
cpuidle: Validate cpu_dev in cpuidle_add_sysfs()
cpufreq: intel_pstate: Fix policy data management in passive mode
cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
cpufreq: intel_pstate: One set of global limits in active mode
sugov_update_commit() calls trace_cpu_frequency() to record the
current CPU frequency if it has not changed in the fast switch case
to prevent utilities from getting confused (they may report that the
CPU is idle if the frequency has not been recorded for too long, for
example).
However, that may cause the tracepoint to be triggered quite often
for no real reason (if the frequency doesn't change, we will not
modify the last update time stamp and governor computations may
run again shortly when that happens), so don't do that (arguably, it
is done to work around a utilities bug anyway).
That allows code duplication in sugov_update_commit() to be reduced
somewhat too.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Conflicts:
drivers/net/ethernet/broadcom/genet/bcmmii.c
drivers/net/hyperv/netvsc.c
kernel/bpf/hashtab.c
Almost entirely overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
On systems with a large number of CPUs, running sysrq-<q> can cause
watchdog timeouts. There are two slow sections of code in the sysrq-<q>
path in timer_list.c.
1. print_active_timers() - This function is called by print_cpu() and
contains a slow goto loop. On a machine with hundreds of CPUs, this
loop took approximately 100ms for the first CPU in a NUMA node.
(Subsequent CPUs in the same node ran much quicker.) The total time
to print all of the CPUs is ultimately long enough to trigger the
soft lockup watchdog.
2. print_tickdevice() - This function outputs a large amount of textual
information. This function also took approximately 100ms per CPU.
Since sysrq-<q> is not a performance critical path, there should be no
harm in touching the nmi watchdog in both slow sections above. Touching
it in just one location was insufficient on systems with hundreds of
CPUs as occasional timeouts were still observed during testing.
This issue was observed on an Oracle T7 machine with 128 CPUs, but I
anticipate it may affect other systems with similarly large numbers of
CPUs.
Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The scheduler clock framework may not use the correct timeout for the clock
wrap. This happens when a new clock driver calls sched_clock_register()
after the kernel called sched_clock_postinit(). In this case the clock wrap
timeout is too long thus sched_clock_poll() is called too late and the clock
already wrapped.
On my ARM system the scheduler was no longer scheduling any other task than
the idle task because the sched_clock() wrapped.
Signed-off-by: David Engraf <david.engraf@sysgo.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
A clockevent device's rate should be configured before or at registration
and changed afterwards through clockevents_update_freq() only.
For the configuration at registration, we already have
clockevents_config_and_register().
Right now, there are no clockevents_config() users outside of the
clockevents core.
To mitigiate the risk of drivers errorneously reconfiguring their rates
through clockevents_config() *after* device registration, make
clockevents_config() static.
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Pull networking fixes from David Miller:
1) Several netfilter fixes from Pablo and the crew:
- Handle fragmented packets properly in netfilter conntrack, from
Florian Westphal.
- Fix SCTP ICMP packet handling, from Ying Xue.
- Fix big-endian bug in nftables, from Liping Zhang.
- Fix alignment of fake conntrack entry, from Steven Rostedt.
2) Fix feature flags setting in fjes driver, from Taku Izumi.
3) Openvswitch ipv6 tunnel source address not set properly, from Or
Gerlitz.
4) Fix jumbo MTU handling in amd-xgbe driver, from Thomas Lendacky.
5) sk->sk_frag.page not released properly in some cases, from Eric
Dumazet.
6) Fix RTNL deadlocks in nl80211, from Johannes Berg.
7) Fix erroneous RTNL lockdep splat in crypto, from Herbert Xu.
8) Cure improper inflight handling during AF_UNIX GC, from Andrey
Ulanov.
9) sch_dsmark doesn't write to packet headers properly, from Eric
Dumazet.
10) Fix SCM_TIMESTAMPING_OPT_STATS handling in TCP, from Soheil Hassas
Yeganeh.
11) Add some IDs for Motorola qmi_wwan chips, from Tony Lindgren.
12) Fix nametbl deadlock in tipc, from Ying Xue.
13) GRO and LRO packets not counted correctly in mlx5 driver, from Gal
Pressman.
14) Fix reset of internal PHYs in bcmgenet, from Doug Berger.
15) Fix hashmap allocation handling, from Alexei Starovoitov.
16) nl_fib_input() needs stronger netlink message length checking, from
Eric Dumazet.
17) Fix double-free of sk->sk_filter during sock clone, from Daniel
Borkmann.
18) Fix RX checksum offloading in aquantia driver, from Pavel Belous.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
net:ethernet:aquantia: Fix for RX checksum offload.
amd-xgbe: Fix the ECC-related bit position definitions
sfc: cleanup a condition in efx_udp_tunnel_del()
Bluetooth: btqcomsmd: fix compile-test dependency
inet: frag: release spinlock before calling icmp_send()
tcp: initialize icsk_ack.lrcvtime at session start time
genetlink: fix counting regression on ctrl_dumpfamily()
socket, bpf: fix sk_filter use after free in sk_clone_lock
ipv4: provide stronger user input validation in nl_fib_input()
bpf: fix hashmap extra_elems logic
enic: update enic maintainers
net: bcmgenet: remove bcmgenet_internal_phy_setup()
ipv6: make sure to initialize sockc.tsflags before first use
fjes: Do not load fjes driver if extended socket device is not power on.
fjes: Do not load fjes driver if system does not have extended socket device.
net/mlx5e: Count LRO packets correctly
net/mlx5e: Count GSO packets correctly
net/mlx5: Increase number of max QPs in default profile
net/mlx5e: Avoid supporting udp tunnel port ndo for VF reps
net/mlx5e: Use the proper UAPI values when offloading TC vlan actions
...
When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
chain code will (falsely) report a deadlock and BUG.
The problem is that it hold hb->lock (now an rt_mutex) while doing
task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when
interleaved just right with futex_unlock_pi() leads it to believe to see an
AB-BA deadlock.
Task1 (holds rt_mutex, Task2 (does FUTEX_LOCK_PI)
does FUTEX_UNLOCK_PI)
lock hb->lock
lock rt_mutex (as per start_proxy)
lock hb->lock
Which is a trivial AB-BA.
It is not an actual deadlock, because it won't be holding hb->lock by the
time it actually blocks on the rt_mutex, but the chainwalk code doesn't
know that and it would be a nightmare to handle this gracefully.
To avoid this problem, do the same as in futex_unlock_pi() and drop
hb->lock after acquiring wait_lock. This still fully serializes against
futex_unlock_pi(), since adding to the wait_list does the very same lock
dance, and removing it holds both locks.
Aside of solving the RT problem this makes the lock and unlock mechanism
symetric and reduces the hb->lock held time.
Reported-and-tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: juri.lelli@arm.com
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>