IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
- Fix handling of throttled parents in enqueue_task_fair() completely. The
recent fix overlooked a corner case where the first iteration terminates
do a entiry being on rq which makes the list management incomplete and
later triggers the assertion which checks for completeness.
- Fix a similar problem in unthrottle_cfs_rq().
- Show the correct uclamp values in procfs which prints the effective
value twice instead of requested and effective.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7Ki3QTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVA5D/9d3ajxbUD7Qzyr9LW4dQ/GLdVrLC+h
rav3KskFzF+v9qyEz5j6ZUEe6ZtGyqk3n+xlVCoBjEZonygaZ3A9rjn5p6p+junR
ueUHQM9BYOQ2qX/rnQubpgzphfT2BdNXtrT0aWCQdXjbDwTJcDhS8AMjDAOnPntc
Pj7brhP/lPtFF2JubBshlJGWCHdriALOJGyFw+FftBq6yotsFel/EH9i/5++JSFX
3DmcFZXJMIf7cRk9xWzmP2QoOkyV6KU9FD9zlRxpvLOskT7qJ7HXaCgFLHl/HZPD
Ukqmy8Ua8hGbKseqfuyz9vV/xJazdiEyqPOSnd8wJzk5umw2rplknOk2qpJ+Infv
MLjnfgrjNtBvgCq4lHnmGeTvdjktLPuPusQIVjZzqis6RryZiNsNspuE5QPZP4Fh
87/rfYQ7VoAGTRmeC9+t4X0BSnpkT4KkQvWCHtocmzl4nuym0cgfZxOCrnRW+NEN
LeDzgujT8uRaDyeZcTwg6pysfxR2Kod6mWomnT9t17ZD93EaY/FEPUL+g8+VbDyD
mAu1BiENX3DL5erZKBHcvHbniYVkcZ/l/cgX6o2wcFLuREaEEbHgsXYHdLqZd9QY
eNwbldtPlagy+f2jla84O1/fFcM1za/R6V9Mbb+/86xJeNu2R/+Vv0FiOqpmHW24
G9UW+hQcQ+ZxgA==
=akes
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
"A set of fixes for the scheduler:
- Fix handling of throttled parents in enqueue_task_fair() completely.
The recent fix overlooked a corner case where the first iteration
terminates due to an entity already being on the runqueue which
makes the list management incomplete and later triggers the
assertion which checks for completeness.
- Fix a similar problem in unthrottle_cfs_rq().
- Show the correct uclamp values in procfs which prints the effective
value twice instead of requested and effective"
* tag 'sched-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
sched/debug: Fix requested task uclamp values shown in procfs
sched/fair: Fix enqueue_task_fair() warning some more
If uboot passes a blank string to console_setup then it results in
a trashed memory. Ultimately, the kernel crashes during freeing up
the memory.
This fix checks if there is a blank parameter being
passed to console_setup from uboot. In case it detects that
the console parameter is blank then it doesn't setup the serial
device and it gracefully exits.
Link: https://lore.kernel.org/r/20200522065306.83-1-shreyas.joshi@biamp.com
Signed-off-by: Shreyas Joshi <shreyas.joshi@biamp.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
[pmladek@suse.com: Better format the commit message and code, remove unnecessary brackets.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Currently, when considering the branches that may be taken for a jump
instruction if the register being compared is a pointer the verifier
assumes both branches may be taken. But, if the jump instruction
is comparing if a pointer is NULL we have this information in the
verifier encoded in the reg->type so we can do better in these cases.
Specifically, these two common cases can be handled.
* If the instruction is BPF_JEQ and we are comparing against a
zero value. This test is 'if ptr == 0 goto +X' then using the
type information in reg->type we can decide if the ptr is not
null. This allows us to avoid pushing both branches onto the
stack and instead only use the != 0 case. For example
PTR_TO_SOCK and PTR_TO_SOCK_OR_NULL encode the null pointer.
Note if the type is PTR_TO_SOCK_OR_NULL we can not learn anything.
And also if the value is non-zero we learn nothing because it
could be any arbitrary value a different pointer for example
* If the instruction is BPF_JNE and ware comparing against a zero
value then a similar analysis as above can be done. The test in
asm looks like 'if ptr != 0 goto +X'. Again using the type
information if the non null type is set (from above PTR_TO_SOCK)
we know the jump is taken.
In this patch we extend is_branch_taken() to consider this extra
information and to return only the branch that will be taken. This
resolves a verifier issue reported with C code like the following.
See progs/test_sk_lookup_kern.c in selftests.
sk = bpf_sk_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);
bpf_printk("sk=%d\n", sk ? 1 : 0);
if (sk)
bpf_sk_release(sk);
return sk ? TC_ACT_OK : TC_ACT_UNSPEC;
In the above the bpf_printk() will resolve the pointer from
PTR_TO_SOCK_OR_NULL to PTR_TO_SOCK. Then the second test guarding
the release will cause the verifier to walk both paths resulting
in the an unreleased sock reference. See verifier/ref_tracking.c
in selftests for an assembly version of the above.
After the above additional logic is added the C code above passes
as expected.
Reported-by: Andrey Ignatov <rdna@fb.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159009164651.6313.380418298578070501.stgit@john-Precision-5820-Tower
The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from
kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related
code. Also, move AF_XDP struct definitions, and function declarations
only used by AF_XDP internals into net/xdp/xsk.h.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
Highlights of this series:
* Remove serialization of sending RPC/RDMA Replies
* Convert the TCP socket send path to use xdr_buf::bvecs (pre-requisite for
RPC-on-TLS)
* Fix svcrdma backchannel sendto return code
* Convert a number of dprintk call sites to use tracepoints
* Fix the "suggest braces around empty body in an 'else' statement" warning
Userspace libraries, e.g. glibc's dprintf(), perform a SEEK_CUR operation
over any file descriptor requested to make sure the current position isn't
pointing to junk due to previous manipulation of that same fd. And whenever
that fd doesn't have support for such operation, the userspace code expects
-ESPIPE to be returned.
However, when the fd in question references the /dev/kmsg interface, the
current kernel code state returns -EINVAL instead, causing an unexpected
behavior in userspace: in the case of glibc, when -ESPIPE is returned it
gets ignored and the call completes successfully, while returning -EINVAL
forces dprintf to fail without performing any action over that fd:
if (_IO_SEEKOFF (fp, (off64_t)0, _IO_seek_cur, _IOS_INPUT|_IOS_OUTPUT) ==
_IO_pos_BAD && errno != ESPIPE)
return NULL;
With this patch we make sure to return the correct value when SEEK_CUR is
requested over kmsg and also add some kernel doc information to formalize
this behavior.
Link: https://lore.kernel.org/r/20200317103344.574277-1-bmeneg@redhat.com
Cc: linux-kernel@vger.kernel.org
Cc: rostedt@goodmis.org,
Cc: David.Laight@ACULAB.COM
Signed-off-by: Bruno Meneguele <bmeneg@redhat.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
There is a typo in comment, fix it.
Signed-off-by: Ethon Paul <ethp@qq.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
In some cases we need to have an IRQ domain created out of software node.
One of such cases is DesignWare GPIO driver when it's instantiated from
half-baked ACPI table (alas, we can't fix it for devices which are few years
on market) and thus using software nodes to quirk this. But the driver
is using IRQ domains based on per GPIO port firmware nodes, which are in
the above case software ones. This brings a warning message to be printed
[ 73.957183] irq: Invalid fwnode type for irqdomain
and creates an anonymous IRQ domain without a debugfs entry.
Allowing software nodes to be valid for IRQ domains rids us of the warning
and debugs gets correctly populated.
% ls -1 /sys/kernel/debug/irq/domains/
...
intel-quark-dw-apb-gpio:portA
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[maz: refactored commit message]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200520164927.39090-3-andriy.shevchenko@linux.intel.com
Now that __irq_domain_add() is able to better deals with generic
fwnodes, there is no need to special-case ACPI anymore.
Get rid of the special treatment for ACPI.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200520164927.39090-2-andriy.shevchenko@linux.intel.com
__irq_domain_add() relies in some places on the fact that the fwnode
can be only of type OF. This prevents refactoring of the code to support
other types of fwnode. Make it less OF-dependent by switching it
to use the fwnode directly where it makes sense.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200520164927.39090-1-andriy.shevchenko@linux.intel.com
As discussed in [0], it's dangerous to allow mapping BPF map, that's meant to
be frozen and is read-only on BPF program side, because that allows user-space
to actually store a writable view to the page even after it is frozen. This is
exacerbated by BPF verifier making a strong assumption that contents of such
frozen map will remain unchanged. To prevent this, disallow mapping
BPF_F_RDONLY_PROG mmap()'able BPF maps as writable, ever.
[0] https://lore.kernel.org/bpf/CAEf4BzYGWYhXdp6BJ7_=9OQPJxQpgug080MMjdSB72i9R+5c6g@mail.gmail.com/
Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/bpf/20200519053824.1089415-1-andriin@fb.com
Some table unregister actions seem to be initiated by the kernel to
garbage collect unused tables that are not initiated by any userspace
actions. It was found to be necessary to add the subject credentials to
cover this case to reveal the source of these actions. A sample record:
The uid, auid, tty, ses and exe fields have not been included since they
are in the SYSCALL record and contain nothing useful in the non-user
context.
Here are two sample orphaned records:
type=NETFILTER_CFG msg=audit(2020-05-20 12:14:36.505:5) : table=filter family=ipv4 entries=0 op=register pid=1 subj=kernel comm=swapper/0
type=NETFILTER_CFG msg=audit(2020-05-20 12:15:27.701:301) : table=nat family=bridge entries=0 op=unregister pid=30 subj=system_u:system_r:kernel_t:s0 comm=kworker/u4:1
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
It is almost possible to use the result of prepare_exec_creds with no
modifications during exec. Update prepare_exec_creds to initialize
the suid and the fsuid to the euid, and the sgid and the fsgid to the
egid. This is all that is needed to handle the common case of exec
when nothing special like a setuid exec is happening.
That this preserves the existing behavior of exec can be verified
by examing bprm_fill_uid and cap_bprm_set_creds.
This change makes it clear that the later parts of exec that
update bprm->cred are just need to handle special cases such
as setuid exec and change of domains.
Link: https://lkml.kernel.org/r/871rng22dm.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
And replace the arcane return value convention with a simple bool
where true means success and false means failure.
[AV: braino fix folded in]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
tracing_pipe_buf_ops has identical ops to default_pipe_buf_ops, so use
that instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge our uaccess-ppc topic branch. It is based on the uaccess topic
branch that we're sharing with Viro.
This includes the addition of user_[read|write]_access_begin(), as
well as some powerpc specific changes to our uaccess routines that
would conflict badly if merged separately.
Elsewhere in the file, the function trace_kprobe_has_same_kprobe uses
a trace_probe_event.probes object as the second argument of
list_for_each_entry, ie as a list head, while the list_for_each_entry
iterates over the list fields of the trace_probe structures, making
them the list elements. So, exchange the arguments on the list_add
call to put the list head in the second argument.
Since both list_head structures were just initialized, this problem
did not cause any loss of information.
Link: https://lkml.kernel.org/r/1588879808-24488-1-git-send-email-Julia.Lawall@inria.fr
Fixes: 60d53e2c3b75 ("tracing/probe: Split trace_event related data from trace_probe")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When an anomaly is detected in the function call modification
code, ftrace_bug() is called to disable function tracing as well
as give some warn and information that may help debug the problem.
But currently, we call FTRACE_WARN_ON_ONCE() first in ftrace_bug(),
so when panic_on_warn is set, we can't see the debugging information
here. Call FTRACE_WARN_ON_ONCE() at the end of ftrace_bug() to ensure
that the debugging information is displayed first.
after this patch, the dmesg looks like:
------------[ ftrace bug ]------------
ftrace failed to modify
[<ffff800010081004>] bcm2835_handle_irq+0x4/0x58
actual: 1f:20:03:d5
Setting ftrace call site to call ftrace function
ftrace record flags: 80000001
(1)
expected tramp: ffff80001009d6f0
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1635 at kernel/trace/ftrace.c:2078 ftrace_bug+0x204/0x238
Kernel panic - not syncing: panic_on_warn set ...
CPU: 2 PID: 1635 Comm: sh Not tainted 5.7.0-rc5-00033-gb922183867f5 #14
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x1b0
show_stack+0x20/0x30
dump_stack+0xc0/0x10c
panic+0x16c/0x368
__warn+0x120/0x160
report_bug+0xc8/0x160
bug_handler+0x28/0x98
brk_handler+0x70/0xd0
do_debug_exception+0xcc/0x1ac
el1_sync_handler+0xe4/0x120
el1_sync+0x7c/0x100
ftrace_bug+0x204/0x238
Link: https://lkml.kernel.org/r/20200515100828.7091-1-cj.chengjian@huawei.com
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200507185804.GA15036@embeddedor
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200511201227.GA14041@embeddedor
When users write some huge number into cpu.cfs_quota_us or
cpu.rt_runtime_us, overflow might happen during to_ratio() shifts of
schedulable checks.
to_ratio() could be altered to avoid unnecessary internal overflow, but
min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still
be needed. Set a cap MAX_BW for cfs_quota_us and rt_runtime_us to
prevent overflow.
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200425105248.60093-1-changhuaixin@linux.alibaba.com
The user_mode(task_pt_regs(tsk)) always return true for
user thread, and false for kernel thread. So it means that
the cpuacct.usage_sys is the time that kernel thread uses
not the time that thread uses in the kernel mode. We can
try get_irq_regs() first, if it is NULL, then we can fall
back to task_pt_regs().
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200420070453.76815-1-songmuchun@bytedance.com
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200507192141.GA16183@embeddedor
update_tg_cfs_*() propagate the impact of the attach/detach of an entity
down into the cfs_rq hierarchy and must keep the sync with the current pelt
window.
Even if we can't sync child cfs_rq and its group se, we can sync the group
se and its parent cfs_rq with current position in the PELT window. In fact,
we must keep them sync in order to stay also synced with others entities
and group entities that are already attached to the cfs_rq.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200506155301.14288-1-vincent.guittot@linaro.org
The cpuacct_charge() and cpuacct_account_field() are called with
rq->lock held, and this means preemption(and IRQs) are indeed
disabled, so it is safe to use __this_cpu_*() to allow for better
code-generation.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200507031039.32615-1-songmuchun@bytedance.com
enqueue_task_fair jumps to enqueue_throttle label when cfs_rq_of(se) is
throttled which means that se can't be NULL in such case and we can move
the label after the if (!se) statement. Futhermore, the latter can be
removed because se is always NULL when reaching this point.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200513135502.4672-1-vincent.guittot@linaro.org
Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair()
are quite close and follow the same sequence for enqueuing an entity in the
cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as
enqueue_task_fair(). This fixes a problem already faced with the latter and
add an optimization in the last for_each_sched_entity loop.
Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
Reported-by Tao Zhou <zohooouoto@zoho.com.cn>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org
The intention of commit 96e74ebf8d59 ("sched/debug: Add task uclamp
values to SCHED_DEBUG procfs") was to print requested and effective
task uclamp values. The requested values printed are read from p->uclamp,
which holds the last effective values. Fix this by printing the values
from p->uclamp_req.
Fixes: 96e74ebf8d59 ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs")
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/1589115401-26391-1-git-send-email-pkondeti@codeaurora.org
sched/fair: Fix enqueue_task_fair warning some more
The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
did not fully resolve the issues with the rq->tmp_alone_branch !=
&rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
the first for_each_sched_entity loop exits due to on_rq, having incompletely
updated the list. In this case the second for_each_sched_entity loop can
further modify se. The later code to fix up the list management fails to do
what is needed because se does not point to the sched_entity which broke out
of the first loop. The list is not fixed up because the throttled parent was
already added back to the list by a task enqueue in a parallel child hierarchy.
Address this by calling list_add_leaf_cfs_rq if there are throttled parents
while doing the second for_each_sched_entity loop.
Fixes: fe61468b2cb ("sched/fair: Fix enqueue_task_fair warning")
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200512135222.GC2201@lorien.usersys.redhat.com
As stated in 983695fa6765 ("bpf: fix unconnected udp hooks"), the objective
for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
transparent to applications. In Cilium we make use of these hooks [0] in
order to enable E-W load balancing for existing Kubernetes service types
for all Cilium managed nodes in the cluster. Those backends can be local
or remote. The main advantage of this approach is that it operates as close
as possible to the socket, and therefore allows to avoid packet-based NAT
given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.
This also allows to expose NodePort services on loopback addresses in the
host namespace, for example. As another advantage, this also efficiently
blocks bind requests for applications in the host namespace for exposed
ports. However, one missing item is that we also need to perform reverse
xlation for inet{,6}_getname() hooks such that we can return the service
IP/port tuple back to the application instead of the remote peer address.
The vast majority of applications does not bother about getpeername(), but
in a few occasions we've seen breakage when validating the peer's address
since it returns unexpectedly the backend tuple instead of the service one.
Therefore, this trivial patch allows to customise and adds a getpeername()
as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
to address this situation.
Simple example:
# ./cilium/cilium service list
ID Frontend Service Type Backend
1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80
Before; curl's verbose output example, no getpeername() reverse xlation:
# curl --verbose 1.2.3.4
* Rebuilt URL to: 1.2.3.4/
* Trying 1.2.3.4...
* TCP_NODELAY set
* Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
> GET / HTTP/1.1
> Host: 1.2.3.4
> User-Agent: curl/7.58.0
> Accept: */*
[...]
After; with getpeername() reverse xlation:
# curl --verbose 1.2.3.4
* Rebuilt URL to: 1.2.3.4/
* Trying 1.2.3.4...
* TCP_NODELAY set
* Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
> GET / HTTP/1.1
> Host: 1.2.3.4
> User-Agent: curl/7.58.0
> Accept: */*
[...]
Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
peer to the context similar as in inet{,6}_getname() fashion, but API-wise
this is suboptimal as it always enforces programs having to test for ctx->peer
which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
Similarly, the checked return code is on tnum_range(1, 1), but if a use case
comes up in future, it can easily be changed to return an error code instead.
Helper and ctx member access is the same as with connect/sendmsg/etc hooks.
[0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
Make it possible to reduce the attack surface in case the snapshot
device is not to be used from userspace.
Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Hibernation concurrency handling is currently delegated to user.c,
where it's also used for regulating the access to the snapshot device.
In the prospective of making user.c a separate configuration option,
such mutual exclusion is brought into hibernate.c and made available
through accessor helpers hereby introduced.
Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add handling for loss of notifications by having read() insert a
loss-notification message after it has read the pipe buffer that was last
in the ring when the loss occurred.
Lossage can come about either by running out of notification descriptors or
by running out of space in the pipe ring.
Signed-off-by: David Howells <dhowells@redhat.com>
Allow a buffer to be marked such that read() must return the entire buffer
in one go or return ENOBUFS. Multiple buffers can be amalgamated into a
single read, but a short read will occur if the next "whole" buffer won't
fit.
This is useful for watch queue notifications to make sure we don't split a
notification across multiple reads, especially given that we need to
fabricate an overrun record under some circumstances - and that isn't in
the buffers.
Signed-off-by: David Howells <dhowells@redhat.com>
Make it possible to have a general notification queue built on top of a
standard pipe. Notifications are 'spliced' into the pipe and then read
out. splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex. This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.
The way the notification queue is used is:
(1) An application opens a pipe with a special flag and indicates the
number of messages it wishes to be able to queue at once (this can
only be set once):
pipe2(fds, O_NOTIFICATION_PIPE);
ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
(2) The application then uses poll() and read() as normal to extract data
from the pipe. read() will return multiple notifications if the
buffer is big enough, but it will not split a notification across
buffers - rather it will return a short read or EMSGSIZE.
Notification messages include a length in the header so that the
caller can split them up.
Each message has a header that describes it:
struct watch_notification {
__u32 type:24;
__u32 subtype:8;
__u32 info;
};
The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.
Supplementary data, such as the key ID that generated an event, can be
attached in additional slots. The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte bounary.
Signed-off-by: David Howells <dhowells@redhat.com>
Instrumentation is forbidden in the .noinstr.text section. Make kprobes
respect this.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20200505134100.179862032@linutronix.de
Same as rcu_is_watching() but without the preempt_disable/enable() pair
inside the function. It is merked noinstr so it ends up in the
non-instrumentable text section.
This is useful for non-preemptible code especially in the low level entry
section. Using rcu_is_watching() there results in a call to the
preempt_schedule_notrace() thunk which triggers noinstr section warnings in
objtool.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200512213810.518709291@linutronix.de
Interrupts and exceptions invoke rcu_irq_enter() on entry and need to
invoke rcu_irq_exit() before they either return to the interrupted code or
invoke the scheduler due to preemption.
The general assumption is that RCU idle code has to have preemption
disabled so that a return from interrupt cannot schedule. So the return
from interrupt code invokes rcu_irq_exit() and preempt_schedule_irq().
If there is any imbalance in the rcu_irq/nmi* invocations or RCU idle code
had preemption enabled then this goes unnoticed until the CPU goes idle or
some other RCU check is executed.
Provide rcu_irq_exit_preempt() which can be invoked from the
interrupt/exception return code in case that preemption is enabled. It
invokes rcu_irq_exit() and contains a few sanity checks in case that
CONFIG_PROVE_RCU is enabled to catch such issues directly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134904.364456424@linutronix.de
The rcu_nmi_enter_common() and rcu_nmi_exit_common() functions take an
"irq" parameter that indicates whether these functions have been invoked from
an irq handler (irq==true) or an NMI handler (irq==false).
However, recent changes have applied notrace to a few critical functions
such that rcu_nmi_enter_common() and rcu_nmi_exit_common() many now rely on
in_nmi(). Note that in_nmi() works no differently than before, but rather
that tracing is now prohibited in code regions where in_nmi() would
incorrectly report NMI state.
Therefore remove the "irq" parameter and inline rcu_nmi_enter_common() and
rcu_nmi_exit_common() into rcu_nmi_enter() and rcu_nmi_exit(),
respectively.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.617130349@linutronix.de
These functions are invoked from context tracking and other places in the
low level entry code. Move them into the .noinstr.text section to exclude
them from instrumentation.
Mark the places which are safe to invoke traceable functions with
instrumentation_begin/end() so objtool won't complain.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lkml.kernel.org/r/20200505134100.575356107@linutronix.de
SuperH is the last remaining user of arch_ftrace_nmi_{enter,exit}(),
remove it from the generic code and into the SuperH code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200505134101.248881738@linutronix.de
These functions are called {early,late} in nmi_{enter,exit} and should
not be traced or probed. They are also puny, so 'inline' them.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.048523500@linutronix.de
It happens early in nmi_enter(), no tracing, probing or other funnies
allowed. Specifically as nmi_enter() will be used in do_debug(), which
would cause recursive exceptions when kprobed.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.139720912@linutronix.de
There is plenty of space in the printk_context variable. Reserve one byte
there for the NMI context to be on the safe side.
It should never overflow. The BUG_ON(in_nmi() == NMI_MASK) in nmi_enter()
will trigger much earlier.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.681374113@linutronix.de
Force inlining and prevent instrumentation of all sorts by marking the
functions which are invoked from low level entry code with 'noinstr'.
Split the irqflags tracking into two parts. One which does the heavy
lifting while RCU is watching and the final one which can be invoked after
RCU is turned off.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.484532537@linutronix.de