36302 Commits

Author SHA1 Message Date
Jens Axboe
e22bc9b481 kernel: make IO threads unfreezable by default
The io-wq threads were already marked as no-freeze, but the manager was
not. On resume, we perpetually have signal_pending() being true, and
hence the manager will loop and spin 100% of the time.

Just mark the tasks created by create_io_thread() as PF_NOFREEZE by
default, and remove any knowledge of it in io-wq and io_uring.

Reported-by: Kevin Locke <kevin@kevinlocke.name>
Tested-by: Kevin Locke <kevin@kevinlocke.name>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-10 07:28:43 -07:00
Edmundo Carmona Antoranz
13c2235b2b sched: Remove unnecessary variable from schedule_tail()
Since 565790d28b1 (sched: Fix balance_callback(), 2020-05-11), there
is no longer a need to reuse the result value of the call to finish_task_switch()
inside schedule_tail(), therefore the variable used to hold that value
(rq) is no longer needed.

Signed-off-by: Edmundo Carmona Antoranz <eantoranz@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210306210739.1370486-1-eantoranz@gmail.com
2021-03-10 09:51:49 +01:00
Clement Courbet
1e17fb8edc sched: Optimize __calc_delta()
A significant portion of __calc_delta() time is spent in the loop
shifting a u64 by 32 bits. Use `fls` instead of iterating.

This is ~7x faster on benchmarks.

The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on x86 we get an additional factor 2 in speed without dedicated
implementation.

On GCC, the asm versions of `fls` are about the same speed as the
builtin. On Clang, the versions that use fls are more than twice as
slow as the builtin. This is because the way the `fls` function is
written, clang puts the value in memory:
https://godbolt.org/z/EfMbYe. This bug is filed at
https://bugs.llvm.org/show_bug.cgi?idI406.

```
name                                   cpu/op
BM_Calc<__calc_delta_loop>             9.57ms Â=B112%
BM_Calc<__calc_delta_generic_fls>      2.36ms Â=B113%
BM_Calc<__calc_delta_asm_fls>          2.45ms Â=B113%
BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms Â=B112%
BM_Calc<__calc_delta_asm_fls64>        2.46ms Â=B113%
BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms Â=B115%
BM_Calc<__calc_delta_builtin>          1.32ms Â=B111%
```

Signed-off-by: Clement Courbet <courbet@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com
2021-03-10 09:51:49 +01:00
David S. Miller
c1acda9807 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-03-09

The following pull-request contains BPF updates for your *net-next* tree.

We've added 90 non-merge commits during the last 17 day(s) which contain
a total of 114 files changed, 5158 insertions(+), 1288 deletions(-).

The main changes are:

1) Faster bpf_redirect_map(), from Björn.

2) skmsg cleanup, from Cong.

3) Support for floating point types in BTF, from Ilya.

4) Documentation for sys_bpf commands, from Joe.

5) Support for sk_lookup in bpf_prog_test_run, form Lorenz.

6) Enable task local storage for tracing programs, from Song.

7) bpf_for_each_map_elem() helper, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-09 18:07:05 -08:00
Linus Torvalds
05a59d7979 Merge git://git.kernel.org:/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix transmissions in dynamic SMPS mode in ath9k, from Felix Fietkau.

 2) TX skb error handling fix in mt76 driver, also from Felix.

 3) Fix BPF_FETCH atomic in x86 JIT, from Brendan Jackman.

 4) Avoid double free of percpu pointers when freeing a cloned bpf prog.
    From Cong Wang.

 5) Use correct printf format for dma_addr_t in ath11k, from Geert
    Uytterhoeven.

 6) Fix resolve_btfids build with older toolchains, from Kun-Chuan
    Hsieh.

 7) Don't report truncated frames to mac80211 in mt76 driver, from
    Lorenzop Bianconi.

 8) Fix watcdog timeout on suspend/resume of stmmac, from Joakim Zhang.

 9) mscc ocelot needs NET_DEVLINK selct in Kconfig, from Arnd Bergmann.

10) Fix sign comparison bug in TCP_ZEROCOPY_RECEIVE getsockopt(), from
    Arjun Roy.

11) Ignore routes with deleted nexthop object in mlxsw, from Ido
    Schimmel.

12) Need to undo tcp early demux lookup sometimes in nf_nat, from
    Florian Westphal.

13) Fix gro aggregation for udp encaps with zero csum, from Daniel
    Borkmann.

14) Make sure to always use imp*_ndo_send when necessaey, from Jason A.
    Donenfeld.

15) Fix TRSCER masks in sh_eth driver from Sergey Shtylyov.

16) prevent overly huge skb allocationsd in qrtr, from Pavel Skripkin.

17) Prevent rx ring copnsumer index loss of sync in enetc, from Vladimir
    Oltean.

18) Make sure textsearch copntrol block is large enough, from Wilem de
    Bruijn.

19) Revert MAC changes to r8152 leading to instability, from Hates Wang.

20) Advance iov in 9p even for empty reads, from Jissheng Zhang.

21) Double hook unregister in nftables, from PabloNeira Ayuso.

22) Fix memleak in ixgbe, fropm Dinghao Liu.

23) Avoid dups in pkt scheduler class dumps, from Maximilian Heyne.

24) Various mptcp fixes from Florian Westphal, Paolo Abeni, and Geliang
    Tang.

25) Fix DOI refcount bugs in cipso, from Paul Moore.

26) One too many irqsave in ibmvnic, from Junlin Yang.

27) Fix infinite loop with MPLS gso segmenting via virtio_net, from
    Balazs Nemeth.

* git://git.kernel.org:/pub/scm/linux/kernel/git/netdev/net: (164 commits)
  s390/qeth: fix notification for pending buffers during teardown
  s390/qeth: schedule TX NAPI on QAOB completion
  s390/qeth: improve completion of pending TX buffers
  s390/qeth: fix memory leak after failed TX Buffer allocation
  net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0
  net: check if protocol extracted by virtio_net_hdr_set_proto is correct
  net: dsa: xrs700x: check if partner is same as port in hsr join
  net: lapbether: Remove netif_start_queue / netif_stop_queue
  atm: idt77252: fix null-ptr-dereference
  atm: uPD98402: fix incorrect allocation
  atm: fix a typo in the struct description
  net: qrtr: fix error return code of qrtr_sendmsg()
  mptcp: fix length of ADD_ADDR with port sub-option
  net: bonding: fix error return code of bond_neigh_init()
  net: enetc: allow hardware timestamping on TX queues with tc-etf enabled
  net: enetc: set MAC RX FIFO to recommended value
  net: davicom: Use platform_get_irq_optional()
  net: davicom: Fix regulator not turned off on driver removal
  net: davicom: Fix regulator not turned off on failed probe
  net: dsa: fix switchdev objects on bridge master mistakenly being applied on ports
  ...
2021-03-09 17:15:56 -08:00
Björn Töpel
ee75aef23a bpf, xdp: Restructure redirect actions
The XDP_REDIRECT implementations for maps and non-maps are fairly
similar, but obviously need to take different code paths depending on
if the target is using a map or not. Today, the redirect targets for
XDP either uses a map, or is based on ifindex.

Here, the map type and id are added to bpf_redirect_info, instead of
the actual map. Map type, map item/ifindex, and the map_id (if any) is
passed to xdp_do_redirect().

For ifindex-based redirect, used by the bpf_redirect() XDP BFP helper,
a special map type/id are used. Map type of UNSPEC together with map id
equal to INT_MAX has the special meaning of an ifindex based
redirect. Note that valid map ids are 1 inclusive, INT_MAX exclusive
([1,INT_MAX[).

In addition to making the code easier to follow, using explicit type
and id in bpf_redirect_info has a slight positive performance impact
by avoiding a pointer indirection for the map type lookup, and instead
use the cacheline for bpf_redirect_info.

Since the actual map is not passed via bpf_redirect_info anymore, the
map lookup is only done in the BPF helper. This means that the
bpf_clear_redirect_map() function can be removed. The actual map item
is RCU protected.

The bpf_redirect_info flags member is not used by XDP, and not
read/written any more. The map member is only written to when
required/used, and not unconditionally.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210308112907.559576-3-bjorn.topel@gmail.com
2021-03-10 01:06:34 +01:00
Björn Töpel
e6a4750ffe bpf, xdp: Make bpf_redirect_map() a map operation
Currently the bpf_redirect_map() implementation dispatches to the
correct map-lookup function via a switch-statement. To avoid the
dispatching, this change adds bpf_redirect_map() as a map
operation. Each map provides its bpf_redirect_map() version, and
correct function is automatically selected by the BPF verifier.

A nice side-effect of the code movement is that the map lookup
functions are now local to the map implementation files, which removes
one additional function call.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210308112907.559576-2-bjorn.topel@gmail.com
2021-03-10 01:06:34 +01:00
Marco Elver
bd0ccc4afc kcsan: Add missing license and copyright headers
Adds missing license and/or copyright headers for KCSAN source files.

Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:27:43 -08:00
Marco Elver
f6a1491403 kcsan: Switch to KUNIT_CASE_PARAM for parameterized tests
Since KUnit now support parameterized tests via KUNIT_CASE_PARAM, update
KCSAN's test to switch to it for parameterized tests. This simplifies
parameterized tests and gets rid of the "parameters in case name"
workaround (hack).

At the same time, we can increase the maximum number of threads used,
because on systems with too few CPUs, KUnit allows us to now stop at the
maximum useful threads and not unnecessarily execute redundant test
cases with (the same) limited threads as had been the case before.

Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:27:43 -08:00
Marco Elver
a146fed56f kcsan: Make test follow KUnit style recommendations
Per recently added KUnit style recommendations at
Documentation/dev-tools/kunit/style.rst, make the following changes to
the KCSAN test:

	1. Rename 'kcsan-test.c' to 'kcsan_test.c'.

	2. Rename suite name 'kcsan-test' to 'kcsan'.

	3. Rename CONFIG_KCSAN_TEST to CONFIG_KCSAN_KUNIT_TEST and
	   default to KUNIT_ALL_TESTS.

Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:27:43 -08:00
Marco Elver
e36299efe7 kcsan, debugfs: Move debugfs file creation out of early init
Commit 56348560d495 ("debugfs: do not attempt to create a new file
before the filesystem is initalized") forbids creating new debugfs files
until debugfs is fully initialized.  This means that KCSAN's debugfs
file creation, which happened at the end of __init(), no longer works.
And was apparently never supposed to work!

However, there is no reason to create KCSAN's debugfs file so early.
This commit therefore moves its creation to a late_initcall() callback.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: stable <stable@vger.kernel.org>
Fixes: 56348560d495 ("debugfs: do not attempt to create a new file before the filesystem is initalized")
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:27:43 -08:00
Stephen Zhang
0a27fff30a rcutorture: Replace rcu_torture_stall string with %s
This commit replaces a hard-coded "rcu_torture_stall" string in a
pr_alert() format with "%s" and __func__.

Signed-off-by: Stephen Zhang <stephenzhangzsd@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:22:28 -08:00
Stephen Zhang
4ac9de07b2 torture: Replace torture_init_begin string with %s
This commit replaces a hard-coded "torture_init_begin" string in
a pr_alert() format with "%s" and __func__.

Signed-off-by: Stephen Zhang <stephenzhangzsd@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:22:28 -08:00
Paul E. McKenney
a434dd10cd rcu-tasks: Add block comment laying out RCU Tasks Trace design
This commit adds a block comment that gives a high-level overview of
how RCU tasks trace grace periods progress.  It also adds a note about
how exiting tasks are handled, plus it gives an overview of the memory
ordering.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
[ paulmck: Fix commit log per Mathieu Desnoyers feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:22:02 -08:00
Lukas Bulwahn
85b8699428 rcu-tasks: Rectify kernel-doc for struct rcu_tasks
The command 'find ./kernel/rcu/ | xargs ./scripts/kernel-doc -none'
reported an issue with the kernel-doc of struct rcu_tasks.

This commit rectifies the kernel-doc, such that no issues remain for
./kernel/rcu/.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:22:02 -08:00
Paul E. McKenney
7308e02404 rcu: Make rcu_read_unlock_special() expedite strict grace periods
In kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, every grace
period is an expedited grace period.  However, rcu_read_unlock_special()
does not treat them that way, instead allowing the deferred quiescent
state to be reported whenever.  This commit therefore adds a check of
this Kconfig option that causes rcu_read_unlock_special() to treat all
grace periods as expedited for CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:21:41 -08:00
Paul E. McKenney
5e59fba573 rcutorture: Fix testing of RCU priority boosting
Currently, rcutorture refuses to test RCU priority boosting in
CONFIG_HOTPLUG_CPU=y kernels, which are the only kind normally built on
x86 these days.  This commit therefore updates rcutorture's tests of RCU
priority boosting to make them safe for CPU hotplug.  However, these tests
will fail unless TIMER_SOFTIRQ runs at realtime priority, which does not
happen in current mainline.  This commit therefore also refuses to test
RCU priority boosting except in kernels built with CONFIG_PREEMPT_RT=y.

While in the area, this commt adds some debug output at boost-fail time
that helps diagnose the cause of the failure, for example, failing to
run TIMER_SOFTIRQ at realtime priority.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:21:41 -08:00
Paul E. McKenney
39bbfc62cc rcu: Expedite deboost in case of deferred quiescent state
Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time.  However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time.  During this time, a low-priority process
might be incorrectly running at a high real-time priority level.

Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task.  This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting.  Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.

This approach assumes that kernels build for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.

While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
function's local variables.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:21:40 -08:00
Frederic Weisbecker
55adc3e1c8 rcu/nocb: Rename nocb_gp_update_state to nocb_gp_update_state_deoffloading
The name nocb_gp_update_state() is unenlightening, so this commit changes
it to nocb_gp_update_state_deoffloading().  This function now does what
its name says, updates state and returns true if the CPU corresponding to
the specified rcu_data structure is in the process of being de-offloaded.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:20:22 -08:00
Frederic Weisbecker
ec711bc12c rcu/nocb: Only (re-)initialize segcblist when needed on CPU up
At the start of a CPU-hotplug operation, the incoming CPU's callback
list can be in a number of states:

1.	Disabled and empty.  This is the case when the boot CPU has
	not invoked call_rcu(), when a non-boot CPU first comes online,
	and when a non-offloaded CPU comes back online.  In this case,
	it is both necessary and permissible to initialize ->cblist.
	Because either the CPU is currently running with interrupts
	disabled (boot CPU) or is not yet running at all (other CPUs),
	it is not necessary to acquire ->nocb_lock.

	In this case, initialization is required.

2.	Disabled and non-empty.  This cannot occur, because early boot
	call_rcu() invocations enable the callback list before enqueuing
	their callback.

3.	Enabled, whether empty or not.	In this case, the callback
	list has already been initialized.  This case occurs when the
	boot CPU has executed an early boot call_rcu() and also when
	an offloaded CPU comes back online.  In both cases, there is
	no need to initialize the callback list: In the boot-CPU case,
	the CPU has not (yet) gone offline, and in the offloaded case,
	the rcuo kthreads are taking care of business.

	Because it is not necessary to initialize the callback list,
	it is also not necessary to acquire ->nocb_lock.

Therefore, checking if the segcblist is enabled suffices.  This commit
therefore initializes the callback list at rcutree_prepare_cpu() time
only if that list is disabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:20:22 -08:00
Frederic Weisbecker
8a682b3974 rcu/nocb: Avoid confusing double write of rdp->nocb_cb_sleep
The nocb_cb_wait() function first sets the rdp->nocb_cb_sleep flag to
true by after invoking the callbacks, and then sets it back to false if
it finds more callbacks that are ready to invoke.

This is confusing and will become unsafe if this flag is ever read
locklessly.  This commit therefore writes it only once, based on the
state after both callback invocation and checking.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:20:21 -08:00
Frederic Weisbecker
64305db285 rcu/nocb: Forbid NOCB toggling on offline CPUs
It makes no sense to de-offload an offline CPU because that CPU will never
invoke any remaining callbacks.  It also makes little sense to offload an
offline CPU because any pending RCU callbacks were migrated when that CPU
went offline.  Yes, it is in theory possible to use a number of tricks
to permit offloading and deoffloading offline CPUs in certain cases, but
in practice it is far better to have the simple and deterministic rule
"Toggling the offload state of an offline CPU is forbidden".

For but one example, consider that an offloaded offline CPU might have
millions of callbacks queued.  Best to just say "no".

This commit therefore forbids toggling of the offloaded state of
offline CPUs.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:20:21 -08:00
Frederic Weisbecker
5de2e5bb80 rcu/nocb: Comment the reason behind BH disablement on batch processing
This commit explains why softirqs need to be disabled while invoking
callbacks, even when callback processing has been offloaded.  After
all, invoking callbacks concurrently is one thing, but concurrently
invoking the same callback is quite another.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:20:20 -08:00
Frederic Weisbecker
3820b513a2 rcu/nocb: Detect unsafe checks for offloaded rdp
Provide CONFIG_PROVE_RCU sanity checks to ensure we are always reading
the offloaded state of an rdp in a safe and stable way and prevent from
its value to be changed under us. We must either hold the barrier mutex,
the cpu-hotplug lock (read or write) or the nocb lock.
Local non-preemptible reads are also safe. NOCB kthreads and timers have
their own means of synchronization against the offloaded state updaters.

Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:20:20 -08:00
Paul E. McKenney
0d3dd2c8ea rcutorture: Add crude tests for mem_dump_obj()
This commit adds a few crude tests for mem_dump_obj() to rcutorture
runs.  Just to prevent bitrot, you understand!

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:18:46 -08:00
Uladzislau Rezki (Sony)
686fe1bf6b rcuscale: Add kfree_rcu() single-argument scale test
The single-argument variant of kfree_rcu() is currently not
tested by any member of the rcutoture test suite.  This
commit therefore adds rcuscale code to test it.  This
testing is controlled by two new boolean module parameters,
kfree_rcu_test_single and kfree_rcu_test_double.  If one
is set and the other not, only the corresponding variant
is tested, otherwise both are tested, with the variant to
be tested determined randomly on each invocation.

Both of these module parameters are initialized to false,
so setting either to true will test only that variant.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:18:07 -08:00
Uladzislau Rezki (Sony)
ee6ddf5847 kvfree_rcu: Use same set of GFP flags as does single-argument
Running an rcuscale stress-suite can lead to "Out of memory" of a
system. This can happen under high memory pressure with a small amount
of physical memory.

For example, a KVM test configuration with 64 CPUs and 512 megabytes
can result in OOM when running rcuscale with below parameters:

../kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig CONFIG_NR_CPUS=64 \
--bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 \
  rcuscale.kfree_loops=10000 torture.disable_onoff_at_boot" --trust-make

<snip>
[   12.054448] kworker/1:1H invoked oom-killer: gfp_mask=0x2cc0(GFP_KERNEL|__GFP_NOWARN), order=0, oom_score_adj=0
[   12.055303] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510
[   12.055416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
[   12.056485] Workqueue: events_highpri fill_page_cache_func
[   12.056485] Call Trace:
[   12.056485]  dump_stack+0x57/0x6a
[   12.056485]  dump_header+0x4c/0x30a
[   12.056485]  ? del_timer_sync+0x20/0x30
[   12.056485]  out_of_memory.cold.47+0xa/0x7e
[   12.056485]  __alloc_pages_slowpath.constprop.123+0x82f/0xc00
[   12.056485]  __alloc_pages_nodemask+0x289/0x2c0
[   12.056485]  __get_free_pages+0x8/0x30
[   12.056485]  fill_page_cache_func+0x39/0xb0
[   12.056485]  process_one_work+0x1ed/0x3b0
[   12.056485]  ? process_one_work+0x3b0/0x3b0
[   12.060485]  worker_thread+0x28/0x3c0
[   12.060485]  ? process_one_work+0x3b0/0x3b0
[   12.060485]  kthread+0x138/0x160
[   12.060485]  ? kthread_park+0x80/0x80
[   12.060485]  ret_from_fork+0x22/0x30
[   12.062156] Mem-Info:
[   12.062350] active_anon:0 inactive_anon:0 isolated_anon:0
[   12.062350]  active_file:0 inactive_file:0 isolated_file:0
[   12.062350]  unevictable:0 dirty:0 writeback:0
[   12.062350]  slab_reclaimable:2797 slab_unreclaimable:80920
[   12.062350]  mapped:1 shmem:2 pagetables:8 bounce:0
[   12.062350]  free:10488 free_pcp:1227 free_cma:0
...
[   12.101610] Out of memory and no killable processes...
[   12.102042] Kernel panic - not syncing: System is deadlocked on memory
[   12.102583] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510
[   12.102600] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
<snip>

Because kvfree_rcu() has a fallback path, memory allocation failure is
not the end of the world.  Furthermore, the added overhead of aggressive
GFP settings must be balanced against the overhead of the fallback path,
which is a cache miss for double-argument kvfree_rcu() and a call to
synchronize_rcu() for single-argument kvfree_rcu().  The current choice
of GFP_KERNEL|__GFP_NOWARN can result in longer latencies than a call
to synchronize_rcu(), so less-tenacious GFP flags would be helpful.

Here is the tradeoff that must be balanced:
    a) Minimize use of the fallback path,
    b) Avoid pushing the system into OOM,
    c) Bound allocation latency to that of synchronize_rcu(), and
    d) Leave the emergency reserves to use cases lacking fallbacks.

This commit therefore changes GFP flags from GFP_KERNEL|__GFP_NOWARN to
GFP_KERNEL|__GFP_NORETRY|__GFP_NOMEMALLOC|__GFP_NOWARN.  This combination
leaves the emergency reserves alone and can initiate reclaim, but will
not invoke the OOM killer.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:18:07 -08:00
Uladzislau Rezki (Sony)
3e7ce7a187 kvfree_rcu: Replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY
__GFP_RETRY_MAYFAIL can spend quite a bit of time reclaiming, and this
can be wasted effort given that there is a fallback code path in case
memory allocation fails.

__GFP_NORETRY does perform some light-weight reclaim, but it will fail
under OOM conditions, allowing the fallback to be taken as an alternative
to hard-OOMing the system.

There is a four-way tradeoff that must be balanced:
    1) Minimize use of the fallback path;
    2) Avoid full-up OOM;
    3) Do a light-wait allocation request;
    4) Avoid dipping into the emergency reserves.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:18:07 -08:00
Paul E. McKenney
7ffc9ec8ea kvfree_rcu: Make krc_this_cpu_unlock() use raw_spin_unlock_irqrestore()
The krc_this_cpu_unlock() function does a raw_spin_unlock() immediately
followed by a local_irq_restore().  This commit saves a line of code by
merging them into a raw_spin_unlock_irqrestore().  This transformation
also reduces scheduling latency because raw_spin_unlock_irqrestore()
responds immediately to a reschedule request.  In contrast,
local_irq_restore() does a scheduling-oblivious enabling of interrupts.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:18:07 -08:00
Paul E. McKenney
b01b405092 kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()
This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations
carried out by the single-argument variant of kvfree_rcu(), thus avoiding
this can-sleep code path from dipping into the emergency reserves.

Acked-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:18:07 -08:00
Uladzislau Rezki (Sony)
148e3731d1 kvfree_rcu: Directly allocate page for single-argument case
Single-argument kvfree_rcu() must be invoked from sleepable contexts,
so we can directly allocate pages.  Furthermmore, the fallback in case
of page-allocation failure is the high-latency synchronize_rcu(), so it
makes sense to do these page allocations from the fastpath, and even to
permit limited sleeping within the allocator.

This commit therefore allocates if needed on the fastpath using
GFP_KERNEL|__GFP_RETRY_MAYFAIL.  This also has the beneficial effect
of leaving kvfree_rcu()'s per-CPU caches to the double-argument variant
of kvfree_rcu(), given that the double-argument variant cannot directly
invoke the allocator.

[ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ]
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:18:07 -08:00
Zhouyi Zhou
6494ccb932 rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()
In rcu_nmi_enter(), there is an erroneous instrumentation_end() in the
second branch of the "if" statement.  Oddly enough, "objtool check -f
vmlinux.o" fails to complain because it is unable to correctly cover
all cases.  Instead, objtool visits the third branch first, which marks
following trace_rcu_dyntick() as visited.  This commit therefore removes
the spurious instrumentation_end().

Fixes: 04b25a495bd6 ("rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr")
Reported-by Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:17:35 -08:00
Neeraj Upadhyay
47fcbc8dd6 rcu: Fix CPU-offline trace in rcutree_dying_cpu
The condition in the trace_rcu_grace_period() in rcutree_dying_cpu() is
backwards, so that it uses the string "cpuofl" when the offline CPU is
blocking the current grace period and "cpuofl-bgp" otherwise.  Given that
the "-bgp" stands for "blocking grace period", this is at best misleading.
This commit therefore switches these strings in order to correctly trace
whether the outgoing cpu blocks the current grace period.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:17:35 -08:00
Frederic Weisbecker
d3ad5bbc4d rcu: Remove superfluous rdp fetch
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar<mingo@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:17:35 -08:00
Paul Gortmaker
3e70df91f9 rcu: deprecate "all" option to rcu_nocbs=
With the core bitmap support now accepting "N" as a placeholder for
the end of the bitmap, "all" can be represented as "0-N" and has the
advantage of not being specific to RCU (or any other subsystem).

So deprecate the use of "all" by removing documentation references
to it.  The support itself needs to remain for now, since we don't
know how many people out there are using it currently, but since it
is in an __init area anyway, it isn't worth losing sleep over.

Cc: Yury Norov <yury.norov@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-03-08 14:16:58 -08:00
Greg Kroah-Hartman
69dd4503a7 irqdomain: Remove debugfs_file from struct irq_domain
There's no need to keep around a dentry pointer to a simple file that
debugfs itself can look up when we need to remove it from the system.
So simplify the code by deleting the variable and cleaning up the logic
around the debugfs file.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/YCvYV53ZdzQSWY6w@kroah.com
2021-03-08 20:12:08 +00:00
Tal Lossos
769c18b254 bpf: Change inode_storage's lookup_elem return value from NULL to -EBADF
bpf_fd_inode_storage_lookup_elem() returned NULL when getting a bad FD,
which caused -ENOENT in bpf_map_copy_value. -EBADF error is better than
-ENOENT for a bad FD behaviour.

The patch was partially contributed by CyberArk Software, Inc.

Fixes: 8ea636848aca ("bpf: Implement bpf_local_storage for inodes")
Signed-off-by: Tal Lossos <tallossos@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20210307120948.61414-1-tallossos@gmail.com
2021-03-08 16:08:06 +01:00
Alexei Starovoitov
350a5c4dd2 bpf: Dont allow vmlinux BTF to be used in map_create and prog_load.
The syzbot got FD of vmlinux BTF and passed it into map_create which caused
crash in btf_type_id_size() when it tried to access resolved_ids. The vmlinux
BTF doesn't have 'resolved_ids' and 'resolved_sizes' initialized to save
memory. To avoid such issues disallow using vmlinux BTF in prog_load and
map_create commands.

Fixes: 5329722057d4 ("bpf: Assign ID to vmlinux BTF and return extra info for BTF in GET_OBJ_INFO")
Reported-by: syzbot+8bab8ed346746e7540e8@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210307225248.79031-1-alexei.starovoitov@gmail.com
2021-03-08 13:32:46 +01:00
John Ogness
505a27a734 printk: console: remove unnecessary safe buffer usage
Upon registering a console, safe buffers are activated when setting
up the sequence number to replay the log. However, these are already
protected by @console_sem and @syslog_lock. Remove the unnecessary
safe buffer usage.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-16-john.ogness@linutronix.de
2021-03-08 11:43:38 +01:00
John Ogness
a4f9876532 printk: kmsg_dump: remove _nolock() variants
kmsg_dump_rewind() and kmsg_dump_get_line() are lockless, so there is
no need for _nolock() variants. Remove these functions and switch all
callers of the _nolock() variants.

The functions without _nolock() were chosen because they are already
exported to kernel modules.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-15-john.ogness@linutronix.de
2021-03-08 11:43:35 +01:00
John Ogness
996e966640 printk: remove logbuf_lock
Since the ringbuffer is lockless, there is no need for it to be
protected by @logbuf_lock. Remove @logbuf_lock.

@console_seq, @exclusive_console_stop_seq, @console_dropped are
protected by @console_lock.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-14-john.ogness@linutronix.de
2021-03-08 11:43:32 +01:00
John Ogness
f9f3f02db9 printk: introduce a kmsg_dump iterator
Rather than storing the iterator information in the registered
kmsg_dumper structure, create a separate iterator structure. The
kmsg_dump_iter structure can reside on the stack of the caller, thus
allowing lockless use of the kmsg_dump functions.

Update code that accesses the kernel logs using the kmsg_dumper
structure to use the new kmsg_dump_iter structure. For kmsg_dumpers,
this also means adding a call to kmsg_dump_rewind() to initialize
the iterator.

All this is in preparation for removal of @logbuf_lock.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org> # pstore
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-13-john.ogness@linutronix.de
2021-03-08 11:43:27 +01:00
John Ogness
5f6c7648e5 printk: kmsg_dumper: remove @active field
All 6 kmsg_dumpers do not benefit from the @active flag:

  (provide their own synchronization)
  - arch/powerpc/kernel/nvram_64.c
  - arch/um/kernel/kmsg_dump.c
  - drivers/mtd/mtdoops.c
  - fs/pstore/platform.c

  (only dump on KMSG_DUMP_PANIC, which does not require
  synchronization)
  - arch/powerpc/platforms/powernv/opal-kmsg.c
  - drivers/hv/vmbus_drv.c

The other 2 kmsg_dump users also do not rely on @active:

  (hard-code @active to always be true)
  - arch/powerpc/xmon/xmon.c
  - kernel/debug/kdb/kdb_main.c

Therefore, @active can be removed.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-12-john.ogness@linutronix.de
2021-03-08 11:43:23 +01:00
John Ogness
636babdc06 printk: add syslog_lock
The global variables @syslog_seq, @syslog_partial, @syslog_time
and write access to @clear_seq are protected by @logbuf_lock.
Once @logbuf_lock is removed, these variables will need their
own synchronization method. Introduce @syslog_lock for this
purpose.

@syslog_lock is a raw_spin_lock for now. This simplifies the
transition to removing @logbuf_lock. Once @logbuf_lock and the
safe buffers are removed, @syslog_lock can change to spin_lock.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-11-john.ogness@linutronix.de
2021-03-08 11:43:20 +01:00
John Ogness
35b2b16348 printk: use atomic64_t for devkmsg_user.seq
@user->seq is indirectly protected by @logbuf_lock. Once @logbuf_lock
is removed, @user->seq will be no longer safe from an atomicity point
of view.

In preparation for the removal of @logbuf_lock, change it to
atomic64_t to provide this safety.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-10-john.ogness@linutronix.de
2021-03-08 11:43:18 +01:00
John Ogness
7d7a23a91c printk: use seqcount_latch for clear_seq
kmsg_dump_rewind_nolock() locklessly reads @clear_seq. However,
this is not done atomically. Since @clear_seq is 64-bit, this
cannot be an atomic operation for all platforms. Therefore, use
a seqcount_latch to allow readers to always read a consistent
value.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-9-john.ogness@linutronix.de
2021-03-08 11:43:15 +01:00
John Ogness
cf5b0208fd printk: introduce CONSOLE_LOG_MAX
Instead of using "LOG_LINE_MAX + PREFIX_MAX" for temporary buffer
sizes, introduce CONSOLE_LOG_MAX. This represents the maximum size
that is allowed to be printed to the console for a single record.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-8-john.ogness@linutronix.de
2021-03-08 11:43:12 +01:00
John Ogness
4260e0e551 printk: consolidate kmsg_dump_get_buffer/syslog_print_all code
The logic for finding records to fit into a buffer is the same for
kmsg_dump_get_buffer() and syslog_print_all(). Introduce a helper
function find_first_fitting_seq() to handle this logic.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-7-john.ogness@linutronix.de
2021-03-08 11:43:09 +01:00
John Ogness
726b509770 printk: refactor kmsg_dump_get_buffer()
kmsg_dump_get_buffer() requires nearly the same logic as
syslog_print_all(), but uses different variable names and
does not make use of the ringbuffer loop macros. Modify
kmsg_dump_get_buffer() so that the implementation is as similar
to syslog_print_all() as possible.

A follow-up commit will move this common logic into a
separate helper function.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-6-john.ogness@linutronix.de
2021-03-08 11:43:05 +01:00
John Ogness
bb07b16c44 printk: limit second loop of syslog_print_all
The second loop of syslog_print_all() subtracts lengths that were
added in the first loop. With commit b031a684bfd0 ("printk: remove
logbuf_lock writer-protection of ringbuffer") it is possible that
records are (over)written during syslog_print_all(). This allows the
possibility of the second loop subtracting lengths that were never
added in the first loop.

This situation can result in syslog_print_all() filling the buffer
starting from a later record, even though there may have been room
to fit the earlier record(s) as well.

Fixes: b031a684bfd0 ("printk: remove logbuf_lock writer-protection of ringbuffer")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210303101528.29901-4-john.ogness@linutronix.de
2021-03-08 11:42:50 +01:00