IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
If an element is freed via RCU then recursion into BPF instrumentation
functions is not a concern. The element is already detached from the map
and the RCU callback does not hold any locks on which a kprobe, perf event
or tracepoint attached BPF program could deadlock.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.259118710@linutronix.de
Similar to __bpf_trace_run this is redundant because __bpf_trace_run() is
invoked from a trace point via __DO_TRACE() which already disables
preemption _before_ invoking any of the functions which are attached to a
trace point.
Remove it and add a cant_sleep() check.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.059995527@linutronix.de
trace_call_bpf() no longer disables preemption on its own.
All callers of this function has to do it explicitly.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
All callers are built in. No point to export this.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
__bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.
This is redundant because __bpf_trace_run() is invoked from a trace point
via __DO_TRACE() which already disables preemption _before_ invoking any of
the functions which are attached to a trace point.
Remove it and add a cant_sleep() check.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.847220186@linutronix.de
The comment where the bucket lock is acquired says:
/* bpf_map_update_elem() can be called in_irq() */
which is not really helpful and aside of that it does not explain the
subtle details of the hash bucket locks expecially in the context of BPF
and perf, kprobes and tracing.
Add a comment at the top of the file which explains the protection scopes
and the details how potential deadlocks are prevented.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.755793061@linutronix.de
Aside of the general unsafety of run-time map allocation for
instrumentation type programs RT enabled kernels have another constraint:
The instrumentation programs are invoked with preemption disabled, but the
memory allocator spinlocks cannot be acquired in atomic context because
they are converted to 'sleeping' spinlocks on RT.
Therefore enforce map preallocation for these programs types when RT is
enabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.648784007@linutronix.de
The assumption that only programs attached to perf NMI events can deadlock
on memory allocators is wrong. Assume the following simplified callchain:
kmalloc() from regular non BPF context
cache empty
freelist empty
lock(zone->lock);
tracepoint or kprobe
BPF()
update_elem()
lock(bucket)
kmalloc()
cache empty
freelist empty
lock(zone->lock); <- DEADLOCK
There are other ways which do not involve locking to create wreckage:
kmalloc() from regular non BPF context
local_irq_save();
...
obj = slab_first();
kprobe()
BPF()
update_elem()
lock(bucket)
kmalloc()
local_irq_save();
...
obj = slab_first(); <- Same object as above ...
So preallocation _must_ be enforced for all variants of intrusive
instrumentation.
Unfortunately immediate enforcement would break backwards compatibility, so
for now such programs still are allowed to run, but a one time warning is
emitted in dmesg and the verifier emits a warning in the verifier log as
well so developers are made aware about this and can fix their programs
before the enforcement becomes mandatory.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.540542802@linutronix.de
Rework the flushing of proc to use a list of directory inodes that
need to be flushed.
The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.
This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.
This flushing remains an optimization. The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died. When unused the shrinker will
eventually be able to remove these dentries.
There is a case in de_thread where proc_flush_pid can be
called early for a given pid. Which winds up being
safe (if suboptimal) as this is just an optiimization.
Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.
So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid. Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place. This removes a small race
where a dentry could recreated.
As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.
The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.
v2: Initialize pid->inodes. I somehow failed to get that
initialization into the initial version of the patch. A boot
failure was reported by "kernel test robot <lkp@intel.com>", and
failure to initialize that pid->inodes matches all of the reported
symptoms.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
task_numa_find_cpu() can scan a node multiple times. Minimally it scans to
gather statistics and later to find a suitable target. In some cases, the
second scan will simply pick an idle CPU if the load is not imbalanced.
This patch caches information on an idle core while gathering statistics
and uses it immediately if load is not imbalanced to avoid a second scan
of the node runqueues. Preference is given to an idle core rather than an
idle SMT sibling to avoid packing HT siblings due to linearly scanning the
node cpumask.
As a side-effect, even when the second scan is necessary, the importance
of using select_idle_sibling is much reduced because information on idle
CPUs is cached and can be reused.
Note that this patch actually makes is harder to move to an idle CPU
as multiple tasks can race for the same idle CPU due to a race checking
numa_migrate_on. This is addressed in the next patch.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-11-mgorman@techsingularity.net
Now that runnable_load_avg has been removed, we can replace it by a new
signal that will highlight the runnable pressure on a cfs_rq. This signal
track the waiting time of tasks on rq and can help to better define the
state of rqs.
At now, only util_avg is used to define the state of a rq:
A rq with more that around 80% of utilization and more than 1 tasks is
considered as overloaded.
But the util_avg signal of a rq can become temporaly low after that a task
migrated onto another rq which can bias the classification of the rq.
When tasks compete for the same rq, their runnable average signal will be
higher than util_avg as it will include the waiting time and we can use
this signal to better classify cfs_rqs.
The new runnable_avg will track the runnable time of a task which simply
adds the waiting time to the running time. The runnable _avg of cfs_rq
will be the /Sum of se's runnable_avg and the runnable_avg of group entity
will follow the one of the rq similarly to util_avg.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-9-mgorman@techsingularity.net
The walk through the cgroup hierarchy during the enqueue/dequeue of a task
is split in 2 distinct parts for throttled cfs_rq without any added value
but making code less readable.
Change the code ordering such that everything related to a cfs_rq
(throttled or not) will be done in the same loop.
In addition, the same steps ordering is used when updating a cfs_rq:
- update_load_avg
- update_cfs_group
- update *h_nr_running
This reordering enables the use of h_nr_running in PELT algorithm.
No functional and performance changes are expected and have been noticed
during tests.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: "Dietmar Eggemann <dietmar.eggemann@arm.com>"
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/20200224095223.13361-5-mgorman@techsingularity.net
Commit 219ca39427 ("audit: use union for audit_field values since
they are mutually exclusive") combined a number of separate fields in
the audit_field struct into a single union. Generally this worked
just fine because they are generally mutually exclusive.
Unfortunately in audit_data_to_entry() the overlap can be a problem
when a specific error case is triggered that causes the error path
code to attempt to cleanup an audit_field struct and the cleanup
involves attempting to free a stored LSM string (the lsm_str field).
Currently the code always has a non-NULL value in the
audit_field.lsm_str field as the top of the for-loop transfers a
value into audit_field.val (both .lsm_str and .val are part of the
same union); if audit_data_to_entry() fails and the audit_field
struct is specified to contain a LSM string, but the
audit_field.lsm_str has not yet been properly set, the error handling
code will attempt to free the bogus audit_field.lsm_str value that
was set with audit_field.val at the top of the for-loop.
This patch corrects this by ensuring that the audit_field.val is only
set when needed (it is cleared when the audit_field struct is
allocated with kcalloc()). It also corrects a few other issues to
ensure that in case of error the proper error code is returned.
Cc: stable@vger.kernel.org
Fixes: 219ca39427 ("audit: use union for audit_field values since they are mutually exclusive")
Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull irq fixes from Thomas Gleixner:
"Two fixes for the irq core code which are follow ups to the recent MSI
fixes:
- The WARN_ON which was put into the MSI setaffinity callback for
paranoia reasons actually triggered via a callchain which escaped
when all the possible ways to reach that code were analyzed.
The proc/irq/$N/*affinity interfaces have a quirk which came in
when ALPHA moved to the generic interface: In case that the written
affinity mask does not contain any online CPU it calls into ALPHAs
magic auto affinity setting code.
A few years later this mechanism was also made available to x86 for
no good reasons and in a way which circumvents all sanity checks
for interrupts which cannot have their affinity set from process
context on X86 due to the way the X86 interrupt delivery works.
It would be possible to make this work properly, but there is no
point in doing so. If the interrupt is not yet started then the
affinity setting has no effect and if it is started already then it
is already assigned to an online CPU so there is no point to
randomly move it to some other CPU. Just return EINVAL as the code
has done before that change forever.
- The new MSI quirk bit in the irq domain flags turned out to be
already occupied, which escaped the author and the reviewers
because the already in use bits were 0,6,2,3,4,5 listed in that
order.
That bit 6 was simply overlooked because the ordering was straight
forward linear otherwise. So the new bit ended up being a
duplicate.
Fix it up by switching the oddball 6 to the obvious 1"
* tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/irqdomain: Make sure all irq domain flags are distinct
genirq/proc: Reject invalid affinity masks (again)
Pull s390 fixes from Vasily Gorbik:
- Remove ieee_emulation_warnings sysctl which is a dead code.
- Avoid triggering rebuild of the kernel during make install.
- Enable protected virtualization guest support in default configs.
- Fix cio_ignore seq_file .next function to increase position index.
And use kobj_to_dev instead of container_of in cio code.
- Fix storage block address lists to contain absolute addresses in qdio
code.
- Few clang warnings and spelling fixes.
* tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/qdio: fill SBALEs with absolute addresses
s390/qdio: fill SL with absolute addresses
s390: remove obsolete ieee_emulation_warnings
s390: make 'install' not depend on vmlinux
s390/kaslr: Fix casts in get_random
s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range
s390/pkey/zcrypt: spelling s/crytp/crypt/
s390/cio: use kobj_to_dev() API
s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST
s390/cio: cio_ignore_proc_seq_next should increase position index
Daniel Borkmann says:
====================
pull-request: bpf-next 2020-02-21
The following pull-request contains BPF updates for your *net-next* tree.
We've added 25 non-merge commits during the last 4 day(s) which contain
a total of 33 files changed, 2433 insertions(+), 161 deletions(-).
The main changes are:
1) Allow for adding TCP listen sockets into sock_map/hash so they can be used
with reuseport BPF programs, from Jakub Sitnicki.
2) Add a new bpf_program__set_attach_target() helper for adding libbpf support
to specify the tracepoint/function dynamically, from Eelco Chaudron.
3) Add bpf_read_branch_records() BPF helper which helps use cases like profile
guided optimizations, from Daniel Xu.
4) Enable bpf_perf_event_read_value() in all tracing programs, from Song Liu.
5) Relax BTF mandatory check if only used for libbpf itself e.g. to process
BTF defined maps, from Andrii Nakryiko.
6) Move BPF selftests -mcpu compilation attribute from 'probe' to 'v3' as it has
been observed that former fails in envs with low memlock, from Yonghong Song.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 736b46027e ("net: Add ID (if needed) to sock_reuseport and expose
reuseport_lock") has introduced lazy generation of reuseport group IDs that
survive group resize.
By comparing the identifier we check if BPF reuseport program is not trying
to select a socket from a BPF map that belongs to a different reuseport
group than the one the packet is for.
Because SOCKARRAY used to be the only BPF map type that can be used with
reuseport BPF, it was possible to delay the generation of reuseport group
ID until a socket from the group was inserted into BPF map for the first
time.
Now that SOCK{MAP,HASH} can be used with reuseport BPF we have two options,
either generate the reuseport ID on map update, like SOCKARRAY does, or
allocate an ID from the start when reuseport group gets created.
This patch takes the latter approach to keep sockmap free of calls into
reuseport code. This streamlines the reuseport_id access as its lifetime
now matches the longevity of reuseport object.
The cost of this simplification, however, is that we allocate reuseport IDs
for all SO_REUSEPORT users. Even those that don't use SOCKARRAY in their
setups. With the way identifiers are currently generated, we can have at
most S32_MAX reuseport groups, which hopefully is sufficient. If we ever
get close to the limit, we can switch an u64 counter like sk_cookie.
Another change is that we now always call into SOCKARRAY logic to unlink
the socket from the map when unhashing or closing the socket. Previously we
did it only when at least one socket from the group was in a BPF map.
It is worth noting that this doesn't conflict with sockmap tear-down in
case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport
group. sockmap tear-down happens first:
prot->unhash
`- tcp_bpf_unhash
|- tcp_bpf_remove
| `- while (sk_psock_link_pop(psock))
| `- sk_psock_unlink
| `- sock_map_delete_from_link
| `- __sock_map_delete
| `- sock_map_unref
| `- sk_psock_put
| `- sk_psock_drop
| `- rcu_assign_sk_user_data(sk, NULL)
`- inet_unhash
`- reuseport_detach_sock
`- bpf_sk_reuseport_detach
`- WRITE_ONCE(sk->sk_user_data, NULL)
Suggested-by: Martin Lau <kafai@fb.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-10-jakub@cloudflare.com
SOCKMAP & SOCKHASH now support storing references to listening
sockets. Nothing keeps us from using these map types a collection of
sockets to select from in BPF reuseport programs. Whitelist the map types
with the bpf_sk_select_reuseport helper.
The restriction that the socket has to be a member of a reuseport group
still applies. Sockets in SOCKMAP/SOCKHASH that don't have sk_reuseport_cb
set are not a valid target and we signal it with -EINVAL.
The main benefit from this change is that, in contrast to
REUSEPORT_SOCKARRAY, SOCK{MAP,HASH} don't impose a restriction that a
listening socket can be just one BPF map at the same time.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-9-jakub@cloudflare.com
Pull networking fixes from David Miller:
1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
Cong Wang.
2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
is finished, from Magnus Karlsson.
3) Set table field properly to RT_TABLE_COMPAT when necessary, from
Jethro Beekman.
4) NLA_STRING attributes are not necessary NULL terminated, deal wiht
that in IFLA_ALT_IFNAME. From Eric Dumazet.
5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.
6) Handle mtu==0 devices properly in wireguard, from Jason A.
Donenfeld.
7) Fix several lockdep warnings in bonding, from Taehee Yoo.
8) Fix cls_flower port blocking, from Jason Baron.
9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.
10) Fix RDMA race in qede driver, from Michal Kalderon.
11) Fix several false lockdep warnings by adding conditions to
list_for_each_entry_rcu(), from Madhuparna Bhowmik.
12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.
13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.
14) Hey, variables declared in switch statement before any case
statements are not initialized. I learn something every day. Get
rids of this stuff in several parts of the networking, from Kees
Cook.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
bnxt_en: Improve device shutdown method.
net: netlink: cap max groups which will be considered in netlink_bind()
net: thunderx: workaround BGX TX Underflow issue
ionic: fix fw_status read
net: disable BRIDGE_NETFILTER by default
net: macb: Properly handle phylink on at91rm9200
s390/qeth: fix off-by-one in RX copybreak check
s390/qeth: don't warn for napi with 0 budget
s390/qeth: vnicc Fix EOPNOTSUPP precedence
openvswitch: Distribute switch variables for initialization
net: ip6_gre: Distribute switch variables for initialization
net: core: Distribute switch variables for initialization
udp: rehash on disconnect
net/tls: Fix to avoid gettig invalid tls record
bpf: Fix a potential deadlock with bpf_map_do_batch
bpf: Do not grab the bucket spinlock by default on htab batch ops
ice: Wait for VF to be reset/ready before configuration
ice: Don't tell the OS that link is going down
ice: Don't reject odd values of usecs set by user
...
Currently, if rcu_barrier() returns too soon, the test waits 100ms and
then does another instance of the test. However, if rcu_barrier() were
to have waited for more than 100ms too short a time, this could cause
the test's rcu_head structures to be reused while they were still on
RCU's callback lists. This can result in knock-on errors that obscure
the original rcu_barrier() test failure.
This commit therefore adds code that attempts to wait until all of
the test's callbacks have been invoked. Of course, if RCU completely
lost track of the corresponding rcu_head structures, this wait could be
forever. This commit therefore also complains if this attempted recovery
takes more than one second, and it also gives up when the test ends.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_torture_barrier_cbs() posts callbacks from whatever CPU
it is running on, which means that all these kthreads might well be
posting from the same CPU, which would drastically reduce the effectiveness
of this test. This commit therefore uses IPIs to make the callbacks be
posted from the corresponding CPU (given by local variable myid).
If the IPI fails (which can happen if the target CPU is offline or does
not exist at all), the callback is posted on whatever CPU is currently
running.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
During changes to kfree_rcu() code, we often check the amount of free
memory. As an alternative to checking this manually, this commit adds a
measurement in the test itself. It measures four times during the test
for available memory, digitally filters these measurements to produce a
running average with a weight of 0.5, and compares this digitally filtered
value with the amount of available memory at the beginning of the test.
Something like the following is printed at the end of the run:
Total time taken by all kfree'ers: 6369738407 ns, loops: 10000, batches: 764, memory footprint: 216MB
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcutorture global variable rcu_torture_current is accessed locklessly,
so it must use the RCU pointer load/store primitives. This commit
therefore adds several that were missed.
This data race was reported by KCSAN. Not appropriate for backporting due
to failure being unlikely and due to this being used only by rcutorture.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcutorture rcu_torture_count and rcu_torture_batch per-CPU variables
are read locklessly, so this commit adds the READ_ONCE() to a load in
order to avoid various types of compiler vandalism^Woptimization.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely and due to this being rcutorture.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_fwd_cb_nodelay variable suppresses excessively long read-side
delays while carrying out an rcutorture forward-progress test. As such,
it is accessed both by readers and updaters, and most of the accesses
therefore use *_ONCE(). Except for one in rcu_read_delay(), which this
commit fixes.
This data race was reported by KCSAN. Not appropriate for backporting
due to this being rcutorture.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The ->rtort_pipe_count field in the rcu_torture structure checks for
too-short grace periods, and is therefore read by rcutorture's readers
while being updated by rcutorture's writers. This commit therefore
adds the needed READ_ONCE() and WRITE_ONCE() invocations.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely and due to this being rcutorture.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In theory, RCU-hotplug operations are supposed to work as soon as there
is more than one CPU online. However, in practice, in normal production
there is no way to make them happen until userspace is up and running.
Besides which, on smaller systems, rcutorture doesn't start doing hotplug
operations until 30 seconds after the start of boot, which on most
systems also means the better part of 30 seconds after the end of boot.
This commit therefore provides a new torture.disable_onoff_at_boot kernel
boot parameter that suppresses CPU-hotplug torture operations until
about the time that init is spawned.
Of course, if you know of a need for boottime CPU-hotplug operations,
then you should avoid passing this argument to any of the torture tests.
You might also want to look at the splats linked to below.
Link: https://lore.kernel.org/lkml/20191206185208.GA25636@paulmck-ThinkPad-P72/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In normal production, an excessively long wait on a grace period
(synchronize_rcu(), for example) at boottime is often just as bad
as at any other time. In fact, given the desire for fast boot, any
sort of long wait at boot is a bad idea. However, heavy rcutorture
testing on large hyperthreaded systems can generate such long waits
during boot as a matter of course. This commit therefore causes
the rcupdate.rcu_cpu_stall_suppress_at_boot kernel boot parameter to
suppress reporting of bootime bad-sequence warning due to excessively
long grace-period waits.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
In normal production, an RCU CPU stall warning at boottime is often
just as bad as at any other time. In fact, given the desire for fast
boot, any sort of long-term stall at boot is a bad idea. However,
heavy rcutorture testing on large hyperthreaded systems can generate
boottime RCU CPU stalls as a matter of course. This commit therefore
provides a kernel boot parameter that suppresses reporting of boottime
RCU CPU stall warnings and similarly of rcutorture writer stalls.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
During boot, CPU hotplug is often disabled, for example by PCI probing.
On large systems that take substantial time to boot, this can result
in spurious RCU_HOTPLUG errors. This commit therefore forgives any
boottime -EBUSY CPU-hotplug failures by adjusting counters to pretend
that the corresponding attempt never happened. A non-splat record
of the failed attempt is emitted to the console with the added string
"(-EBUSY forgiven during boot)".
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Additional rcutorture aggression can result in, believe it or not,
boot times in excess of three minutes on large hyperthreaded systems.
This is long enough for rcutorture to decide to do some callback flooding,
which seems a bit excessive given that userspace cannot have started
until long after boot, and it is userspace that does the real-world
callback flooding. Worse yet, because Tiny RCU lacks forward-progress
functionality, the looping-in-the-kernel tests can also be problematic
during early boot.
This commit therefore causes rcutorture to hold off on callback
flooding until about the time that init is spawned, and the same
for looping-in-the-kernel tests for Tiny RCU.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Some larger systems can take in excess of 50 seconds to complete their
early boot initcalls prior to spawing init. This does not in any way
help the forward-progress judgments of built-in rcutorture (when
rcutorture is built as a module, the insmod or modprobe command normally
cannot happen until some time after boot completes). This commit
therefore suppresses such complaints until about the time that init
is spawned.
This also includes a fix to a stupid error located by kbuild test robot.
[ paulmck: Apply kbuild test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Fix to nohz_full slow-expediting recovery logic, per bpetkov. ]
[ paulmck: Restrict splat to CONFIG_PREEMPT_RT=y kernels and simplify. ]
Tested-by: Borislav Petkov <bp@alien8.de>
A read of the srcu_struct structure's ->srcu_gp_seq field should not
need READ_ONCE() when that structure's ->lock is held. Except that this
lock is not always held when updating this field. This commit therefore
acquires the lock around updates and removes a now-unneeded READ_ONCE().
This data race was reported by KCSAN.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
[ paulmck: Switch from READ_ONCE() to lock per Peter Zilstra question. ]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
The srcu_struct structure's ->srcu_idx field is accessed locklessly,
so reads must use READ_ONCE(). This commit therefore adds the needed
READ_ONCE() invocation where it was missed.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The srcu_struct structure's ->srcu_gp_seq_needed_exp field is accessed
locklessly, so updates must use WRITE_ONCE(). This commit therefore
adds the needed WRITE_ONCE() invocations.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The srcu_node structure's ->srcu_gp_seq_needed_exp field is accessed
locklessly, so updates must use WRITE_ONCE(). This commit therefore
adds the needed WRITE_ONCE() invocations.
This data race was reported by KCSAN. Not appropriate for backporting
due to failure being unlikely.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>