34700 Commits

Author SHA1 Message Date
Joel Fernandes (Google)
53922270d2 rcu/segcblist: Prevent useless GP start if no CBs to accelerate
The rcu_segcblist_accelerate() function returns true iff it is necessary
to request another grace period.  A tracing session showed that this
function unnecessarily requests grace periods.

For example, consider the following sequence of events:
1. Callbacks are queued only on the NEXT segment of CPU A's callback list.
2. CPU A runs RCU_SOFTIRQ, accelerating these callbacks from NEXT to WAIT.
3. Thus rcu_segcblist_accelerate() returns true, requesting grace period N.
4. RCU's grace-period kthread wakes up on CPU B and starts grace period N.
4. CPU A notices the new grace period and invokes RCU_SOFTIRQ.
5. CPU A's RCU_SOFTIRQ again invokes rcu_segcblist_accelerate(), but
   there are no new callbacks.  However, rcu_segcblist_accelerate()
   nevertheless (uselessly) requests a new grace period N+1.

This extra grace period results in additional lock contention and also
additional wakeups, all for no good reason.

This commit therefore adds a check to rcu_segcblist_accelerate() that
prevents the return of true when there are no new callbacks.

This change reduces the number of grace periods (GPs) and wakeups in each
of eleven five-second rcutorture runs as follows:

+----+-------------------+-------------------+
| #  | Number of GPs     | Number of Wakeups |
+====+=========+=========+=========+=========+
| 1  | With    | Without | With    | Without |
+----+---------+---------+---------+---------+
| 2  |      75 |      89 |     113 |     119 |
+----+---------+---------+---------+---------+
| 3  |      62 |      91 |     105 |     123 |
+----+---------+---------+---------+---------+
| 4  |      60 |      79 |      98 |     110 |
+----+---------+---------+---------+---------+
| 5  |      63 |      79 |      99 |     112 |
+----+---------+---------+---------+---------+
| 6  |      57 |      89 |      96 |     123 |
+----+---------+---------+---------+---------+
| 7  |      64 |      85 |      97 |     118 |
+----+---------+---------+---------+---------+
| 8  |      58 |      83 |      98 |     113 |
+----+---------+---------+---------+---------+
| 9  |      57 |      77 |      89 |     104 |
+----+---------+---------+---------+---------+
| 10 |      66 |      82 |      98 |     119 |
+----+---------+---------+---------+---------+
| 11 |      52 |      82 |      83 |     117 |
+----+---------+---------+---------+---------+

The reduction in the number of wakeups ranges from 5% to 40%.

Cc: urezki@gmail.com
[ paulmck: Rework commit log and comment. ]
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-09-03 09:39:59 -07:00
peterz@infradead.org
23870f1227 locking/lockdep: Fix "USED" <- "IN-NMI" inversions
During the LPC RCU BoF Paul asked how come the "USED" <- "IN-NMI"
detector doesn't trip over rcu_read_lock()'s lockdep annotation.

Looking into this I found a very embarrasing typo in
verify_lock_unused():

	-	if (!(class->usage_mask & LOCK_USED))
	+	if (!(class->usage_mask & LOCKF_USED))

fixing that will indeed cause rcu_read_lock() to insta-splat :/

The above typo means that instead of testing for: 0x100 (1 <<
LOCK_USED), we test for 8 (LOCK_USED), which corresponds to (1 <<
LOCK_ENABLED_HARDIRQ).

So instead of testing for _any_ used lock, it will only match any lock
used with interrupts enabled.

The rcu_read_lock() annotation uses .check=0, which means it will not
set any of the interrupt bits and will thus never match.

In order to properly fix the situation and allow rcu_read_lock() to
correctly work, split LOCK_USED into LOCK_USED and LOCK_USED_READ and by
having .read users set USED_READ and test USED, pure read-recursive
locks are permitted.

Fixes: f6f48e180404 ("lockdep: Teach lockdep about "USED" <- "IN-NMI" inversions")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20200902160323.GK1362448@hirez.programming.kicks-ass.net
2020-09-03 11:19:42 +02:00
Yonghong Song
203d7b054f bpf: Avoid iterating duplicated files for task_file iterator
Currently, task_file iterator iterates all files from all tasks.
This may potentially visit a lot of duplicated files if there are
many tasks sharing the same files, e.g., typical pthreads
where these pthreads and the main thread are sharing the same files.

This patch changed task_file iterator to skip a particular task
if that task shares the same files as its group_leader (the task
having the same tgid and also task->tgid == task->pid).
This will preserve the same result, visiting all files from all
tasks, and will reduce runtime cost significantl, e.g., if there are
a lot of pthreads and the process has a lot of open files.

Suggested-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/bpf/20200902023112.1672792-1-yhs@fb.com
2020-09-02 16:40:33 +02:00
Qu Wenruo
14721add58 module: Add more error message for failed kernel module loading
When kernel module loading failed, user space only get one of the
following error messages:

- ENOEXEC
  This is the most confusing one. From corrupted ELF header to bad
  WRITE|EXEC flags check introduced by in module_enforce_rwx_sections()
  all returns this error number.

- EPERM
  This is for blacklisted modules. But mod doesn't do extra explain
  on this error either.

- ENOMEM
  The only error which needs no explain.

This means, if a user got "Exec format error" from modprobe, it provides
no meaningful way for the user to debug, and will take extra time
communicating to get extra info.

So this patch will add extra error messages for -ENOEXEC and -EPERM
errors, allowing user to do better debugging and reporting.

Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2020-09-02 11:18:40 +02:00
David S. Miller
150f29f5e6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-09-01

The following pull-request contains BPF updates for your *net-next* tree.

There are two small conflicts when pulling, resolve as follows:

1) Merge conflict in tools/lib/bpf/libbpf.c between 88a82120282b ("libbpf: Factor
   out common ELF operations and improve logging") in bpf-next and 1e891e513e16
   ("libbpf: Fix map index used in error message") in net-next. Resolve by taking
   the hunk in bpf-next:

        [...]
        scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
        data = elf_sec_data(obj, scn);
        if (!scn || !data) {
                pr_warn("elf: failed to get %s map definitions for %s\n",
                        MAPS_ELF_SEC, obj->path);
                return -EINVAL;
        }
        [...]

2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between
   9647c57b11e5 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for
   better performance") in bpf-next and e20f0dbf204f ("net/mlx5e: RX, Add a prefetch
   command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining
   net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:

        [...]
        xdp_set_data_meta_invalid(xdp);
        xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
        net_prefetch(xdp->data);
        [...]

We've added 133 non-merge commits during the last 14 day(s) which contain
a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).

The main changes are:

1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper
   for tracing to reliably access user memory, from Alexei Starovoitov.

2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.

3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.

4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.

5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.

6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.

7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.

8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.

9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.

10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.

11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.

12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.

13) Various smaller cleanups and improvements all over the place.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-01 13:22:59 -07:00
Björn Töpel
ebc4ecd48c bpf: {cpu,dev}map: Change various functions return type from int to void
The functions bq_enqueue(), bq_flush_to_queue(), and bq_xmit_all() in
{cpu,dev}map.c always return zero. Changing the return type from int
to void makes the code easier to follow.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200901083928.6199-1-bjorn.topel@gmail.com
2020-09-01 15:45:58 +02:00
Jiri Kosina
ead5d1f4d8 Merge branch 'master' into for-next
Sync with Linus' branch in order to be able to apply fixups
of more recent patches.
2020-09-01 14:19:48 +02:00
Steven Rostedt (VMware)
d25e37d89d tracepoint: Optimize using static_call()
Currently the tracepoint site will iterate a vector and issue indirect
calls to however many handlers are registered (ie. the vector is
long).

Using static_call() it is possible to optimize this for the common
case of only having a single handler registered. In this case the
static_call() can directly call this handler. Otherwise, if the vector
is longer than 1, call a function that iterates the whole vector like
the current code.

[peterz: updated to new interface]

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20200818135805.279421092@infradead.org
2020-09-01 09:58:06 +02:00
Peter Zijlstra
a945c8345e static_call: Allow early init
In order to use static_call() to wire up x86_pmu, we need to
initialize earlier, specifically before memory allocation works; copy
some of the tricks from jump_label to enable this.

Primarily we overload key->next to store a sites pointer when there
are no modules, this avoids having to use kmalloc() to initialize the
sites and allows us to run much earlier.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20200818135805.220737930@infradead.org
2020-09-01 09:58:06 +02:00
Peter Zijlstra
5b06fd3bb9 static_call: Handle tail-calls
GCC can turn our static_call(name)(args...) into a tail call, in which
case we get a JMP.d32 into the trampoline (which then does a further
tail-call).

Teach objtool to recognise and mark these in .static_call_sites and
adjust the code patching to deal with this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20200818135805.101186767@infradead.org
2020-09-01 09:58:06 +02:00
Peter Zijlstra
f03c412915 static_call: Add simple self-test for static calls
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200818135804.922581202@infradead.org
2020-09-01 09:58:05 +02:00
Peter Zijlstra
6333e8f73b static_call: Avoid kprobes on inline static_call()s
Similar to how we disallow kprobes on any other dynamic text
(ftrace/jump_label) also disallow kprobes on inline static_call()s.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200818135804.744920586@infradead.org
2020-09-01 09:58:04 +02:00
Josh Poimboeuf
9183c3f9ed static_call: Add inline static call infrastructure
Add infrastructure for an arch-specific CONFIG_HAVE_STATIC_CALL_INLINE
option, which is a faster version of CONFIG_HAVE_STATIC_CALL.  At
runtime, the static call sites are patched directly, rather than using
the out-of-line trampolines.

Compared to out-of-line static calls, the performance benefits are more
modest, but still measurable.  Steven Rostedt did some tracepoint
measurements:

  https://lkml.kernel.org/r/20181126155405.72b4f718@gandalf.local.home

This code is heavily inspired by the jump label code (aka "static
jumps"), as some of the concepts are very similar.

For more details, see the comments in include/linux/static_call.h.

[peterz: simplified interface; merged trampolines]

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20200818135804.684334440@infradead.org
2020-09-01 09:58:04 +02:00
Peter Zijlstra
0db6e3734b jump_label,module: Fix module lifetime for __jump_label_mod_text_reserved()
Nothing ensures the module exists while we're iterating
mod->jump_entries in __jump_label_mod_text_reserved(), take a module
reference to ensure the module sticks around.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20200818135804.504501338@infradead.org
2020-09-01 09:58:04 +02:00
Peter Zijlstra
59cc8e0a90 module: Properly propagate MODULE_STATE_COMING failure
Now that notifiers got unbroken; use the proper interface to handle
notifier errors and propagate them.

There were already MODULE_STATE_COMING notifiers that failed; notably:

 - jump_label_module_notifier()
 - tracepoint_module_notify()
 - bpf_event_notify()

By propagating this error, we fix those users.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20200818135804.444372853@infradead.org
2020-09-01 09:58:04 +02:00
Peter Zijlstra
0340a6b7fb module: Fix up module_notifier return values
While auditing all module notifiers I noticed a whole bunch of fail
wrt the return value. Notifiers have a 'special' return semantics.

As is; NOTIFY_DONE vs NOTIFY_OK is a bit vague; but
notifier_from_errno(0) results in NOTIFY_OK and NOTIFY_DONE has a
comment that says "Don't care".

From this I've used NOTIFY_DONE when the function completely ignores
the callback and notifier_to_error() isn't used.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Robert Richter <rric@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20200818135804.385360407@infradead.org
2020-09-01 09:58:03 +02:00
Peter Zijlstra
70d9329857 notifier: Fix broken error handling pattern
The current notifiers have the following error handling pattern all
over the place:

	int err, nr;

	err = __foo_notifier_call_chain(&chain, val_up, v, -1, &nr);
	if (err & NOTIFIER_STOP_MASK)
		__foo_notifier_call_chain(&chain, val_down, v, nr-1, NULL)

And aside from the endless repetition thereof, it is broken. Consider
blocking notifiers; both calls take and drop the rwsem, this means
that the notifier list can change in between the two calls, making @nr
meaningless.

Fix this by replacing all the __foo_notifier_call_chain() functions
with foo_notifier_call_chain_robust() that embeds the above pattern,
but ensures it is inside a single lock region.

Note: I switched atomic_notifier_call_chain_robust() to use
      the spinlock, since RCU cannot provide the guarantee
      required for the recovery.

Note: software_resume() error handling was broken afaict.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20200818135804.325626653@infradead.org
2020-09-01 09:58:03 +02:00
Barry Song
2281f797f5 mm: cma: use CMA_MAX_NAME to define the length of cma name array
CMA_MAX_NAME should be visible to CMA's users as they might need it to set
the name of CMA areas and avoid hardcoding the size locally.
So this patch moves CMA_MAX_NAME from local header file to include/linux
header file and removes the hardcode in both hugetlb.c and contiguous.c.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-01 09:19:43 +02:00
Barry Song
b7176c261c dma-contiguous: provide the ability to reserve per-numa CMA
Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
coherent DMA buffers to save their command queues and page tables. As
there is only one default CMA in the whole system, SMMUs on nodes other
than node0 will get remote memory. This leads to significant latency.

This patch provides per-numa CMA so that drivers like SMMU can get local
memory. Tests show localizing CMA can decrease dma_unmap latency much.
For instance, before this patch, SMMU on node2  has to wait for more than
560ns for the completion of CMD_SYNC in an empty command queue; with this
patch, it needs 240ns only.

A positive side effect of this patch would be improving performance even
further for those users who are worried about performance more than DMA
security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
drivers can get local coherent DMA buffers.

Also, this patch changes the default CONFIG_CMA_AREAS to 19 in NUMA. As
1+CONFIG_CMA_AREAS should be quite enough for most servers on the market
even they enable both hugetlb_cma and pernuma_cma.
2 numa nodes: 2(hugetlb) + 2(pernuma) + 1(default global cma) = 5
4 numa nodes: 4(hugetlb) + 4(pernuma) + 1(default global cma) = 9
8 numa nodes: 8(hugetlb) + 8(pernuma) + 1(default global cma) = 17

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-09-01 09:19:28 +02:00
Alexei Starovoitov
f56407fa6e bpf: Remove bpf_lsm_file_mprotect from sleepable list.
Technically the bpf programs can sleep while attached to bpf_lsm_file_mprotect,
but such programs need to access user memory. So they're in might_fault()
category. Which means they cannot be called from file_mprotect lsm hook that
takes write lock on mm->mmap_lock.
Adjust the test accordingly.

Also add might_fault() to __bpf_prog_enter_sleepable() to catch such deadlocks early.

Fixes: 1e6c62a88215 ("bpf: Introduce sleepable BPF programs")
Fixes: e68a144547fc ("selftests/bpf: Add sleepable tests")
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200831201651.82447-1-alexei.starovoitov@gmail.com
2020-08-31 23:03:57 +02:00
Alexei Starovoitov
29523c5e67 bpf: Fix build without BPF_LSM.
resolve_btfids doesn't like empty set. Add unused ID when BPF_LSM is off.

Fixes: 1e6c62a88215 ("bpf: Introduce sleepable BPF programs")
Reported-by: Björn Töpel <bjorn.topel@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200831163132.66521-1-alexei.starovoitov@gmail.com
2020-08-31 20:56:10 +02:00
Marco Elver
cd290ec246 kcsan: Use tracing-safe version of prandom
In the core runtime, we must minimize any calls to external library
functions to avoid any kind of recursion. This can happen even though
instrumentation is disabled for called functions, but tracing is
enabled.

Most recently, prandom_u32() added a tracepoint, which can cause
problems for KCSAN even if the rcuidle variant is used. For example:
	kcsan -> prandom_u32() -> trace_prandom_u32_rcuidle ->
	srcu_read_lock_notrace -> __srcu_read_lock -> kcsan ...

While we could disable KCSAN in kcsan_setup_watchpoint(), this does not
solve other unexpected behaviour we may get due recursing into functions
that may not be tolerant to such recursion:
	__srcu_read_lock -> kcsan -> ... -> __srcu_read_lock

Therefore, switch to using prandom_u32_state(), which is uninstrumented,
and does not have a tracepoint.

Link: https://lkml.kernel.org/r/20200821063043.1949509-1-elver@google.com
Link: https://lkml.kernel.org/r/20200820172046.GA177701@elver.google.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-08-30 21:50:13 -07:00
Linus Torvalds
dcc5c6f013 Three interrupt related fixes for X86:
- Move disabling of the local APIC after invoking fixup_irqs() to ensure
    that interrupts which are incoming are noted in the IRR and not ignored.
 
  - Unbreak affinity setting. The rework of the entry code reused the
    regular exception entry code for device interrupts. The vector number is
    pushed into the errorcode slot on the stack which is then lifted into an
    argument and set to -1 because that's regs->orig_ax which is used in
    quite some places to check whether the entry came from a syscall. But it
    was overlooked that orig_ax is used in the affinity cleanup code to
    validate whether the interrupt has arrived on the new target. It turned
    out that this vector check is pointless because interrupts are never
    moved from one vector to another on the same CPU. That check is a
    historical leftover from the time where x86 supported multi-CPU
    affinities, but not longer needed with the now strict single CPU
    affinity. Famous last words ...
 
  - Add a missing check for an empty cpumask into the matrix allocator. The
    affinity change added a warning to catch the case where an interrupt is
    moved on the same CPU to a different vector. This triggers because a
    condition with an empty cpumask returns an assignment from the allocator
    as the allocator uses for_each_cpu() without checking the cpumask for
    being empty. The historical inconsistent for_each_cpu() behaviour of
    ignoring the cpumask and unconditionally claiming that CPU0 is in the
    mask striked again. Sigh.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl9L6WYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRV5D/9dRq/4pn5g1esnzm4GhIr2To3Qp6cl
 s7VswTdN8FmWBqVz79ZVYqj663UpL3pPY1np01ctrxRQLeDVfWcI2BMR5irnny8h
 otORhFysuDUl+yfuomWVbzfQQNJ+VeQVeWKD3cIhD1I3sXqDX5Wpa8n086hYKQXx
 eutVC3+JdzJZFm68xarlLW7h2f1au1eZZFgVnyY+J5KO9Dwm63a4RITdDVk7KV4t
 uKEDza5P9SY+kE9LAGNq8BAEObf9FeMXw0mRM7atRKVsJQQGVk6bgiuaRr01w1+W
 hQCPx/3g6PHFnGgx/KQgHf1jgrZFhXOyIDo6ZeFy+SJGIZRB3n8o5Kjns2l8Pa+K
 2qy1TRoZIsGkwGCi/BM6viLzBikbh/gnGYy/8KTEJdKs8P3ZKHUZVSAB1dpapOWX
 4n+rKoVPnvxgRSeZZo+tgLkvUdh+/9Huyr9vHiYjtbbB8tFvjlkOmrZ6sirHByDy
 jg6TjOJVb1CC/PoW4M7JNfmeKvHQnTACwH6djdVGDLPJspuUsYkPI0Uk0CX21SA3
 45Tuylvl9jT6+vq95Av2RbAiipmSpZ/O1NHV8Paf466SKmhUgG3lv5PHh3xTm1U2
 Be/RbJ75x4Muuw42ttU1LcpcLPcOZRQNEREoNd5UysgYYgWRekBvU+ZRQNW4g2nw
 3JDgJgm0iBUN9w==
 =zIi4
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "Three interrupt related fixes for X86:

   - Move disabling of the local APIC after invoking fixup_irqs() to
     ensure that interrupts which are incoming are noted in the IRR and
     not ignored.

   - Unbreak affinity setting.

     The rework of the entry code reused the regular exception entry
     code for device interrupts. The vector number is pushed into the
     errorcode slot on the stack which is then lifted into an argument
     and set to -1 because that's regs->orig_ax which is used in quite
     some places to check whether the entry came from a syscall.

     But it was overlooked that orig_ax is used in the affinity cleanup
     code to validate whether the interrupt has arrived on the new
     target. It turned out that this vector check is pointless because
     interrupts are never moved from one vector to another on the same
     CPU. That check is a historical leftover from the time where x86
     supported multi-CPU affinities, but not longer needed with the now
     strict single CPU affinity. Famous last words ...

   - Add a missing check for an empty cpumask into the matrix allocator.

     The affinity change added a warning to catch the case where an
     interrupt is moved on the same CPU to a different vector. This
     triggers because a condition with an empty cpumask returns an
     assignment from the allocator as the allocator uses for_each_cpu()
     without checking the cpumask for being empty. The historical
     inconsistent for_each_cpu() behaviour of ignoring the cpumask and
     unconditionally claiming that CPU0 is in the mask struck again.
     Sigh.

  plus a new entry into the MAINTAINER file for the HPE/UV platform"

* tag 'x86-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq/matrix: Deal with the sillyness of for_each_cpu() on UP
  x86/irq: Unbreak interrupt affinity setting
  x86/hotplug: Silence APIC only after all interrupts are migrated
  MAINTAINERS: Add entry for HPE Superdome Flex (UV) maintainers
2020-08-30 12:01:23 -07:00
Linus Torvalds
b69bea8a65 A set of fixes for lockdep, tracing and RCU:
- Prevent recursion by using raw_cpu_* operations
 
   - Fixup the interrupt state in the cpu idle code to be consistent
 
   - Push rcu_idle_enter/exit() invocations deeper into the idle path so
     that the lock operations are inside the RCU watching sections
 
   - Move trace_cpu_idle() into generic code so it's called before RCU goes
     idle.
 
   - Handle raw_local_irq* vs. local_irq* operations correctly
 
   - Move the tracepoints out from under the lockdep recursion handling
     which turned out to be fragile and inconsistent.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl9L5qETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoV/NEADG+h02tj2I4gP7IQ3nVodEzS1+odPI
 orabY5ggH0kn4YIhPB4UtOd5zKZjr3FJs9wEhyhQpV6ZhvFfgaIKiYqfg+Q81aMO
 /BXrfh6jBD2Hu7gaPBnVdkKeh1ehl+w0PhTeJhPBHEEvbGeLUYWwyPNlaKz//VQl
 XCWl7e7o/Uw2UyJ469SCx3z+M2DMNqwdMys/zcqvTLiBdLNCwp4TW5ACzEA0rfHh
 Pepu3eIKnMURyt82QanrOATvT2io9pOOaUh59zeKi2WM8ikwKd/Eho2kXYng6GvM
 GzX4Kn13MsNobZXf9BhqEGICdRkaJqLsXlmBNmbJdSTCn5W2lLZqu2wCEp5VZHCc
 XwMbey8ek+BRskJMqAV4oq2GA8Om9KEYWOOdixyOG0UJCiW5qDowuDYBXTLV7FWj
 XhzLGuHpUF9eKLKokJ7ideLaDcpzwYjHr58pFLQrqPwmjVKWguLeYMg5BhhTiEuV
 wNfiLIGdMNsCpYKhnce3o9paV8+hy1ZveWhNy+/4HaDLoEwI2T62i8R7xxbrcWMg
 sgdAiQG+kVLwSJ13bN+Cz79uLYTIbqGaZHtOXmeIT3jSxBjx5RlXfzocwTHSYrNk
 GuLYHd7+QaemN49Rrf4bPR16Db7ifL32QkUtLBTBLcnos9jM+fcl+BWyqYRxhgDv
 xzDS+vfK8DvRiA==
 =Hgt6
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "A set of fixes for lockdep, tracing and RCU:

   - Prevent recursion by using raw_cpu_* operations

   - Fixup the interrupt state in the cpu idle code to be consistent

   - Push rcu_idle_enter/exit() invocations deeper into the idle path so
     that the lock operations are inside the RCU watching sections

   - Move trace_cpu_idle() into generic code so it's called before RCU
     goes idle.

   - Handle raw_local_irq* vs. local_irq* operations correctly

   - Move the tracepoints out from under the lockdep recursion handling
     which turned out to be fragile and inconsistent"

* tag 'locking-urgent-2020-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep,trace: Expose tracepoints
  lockdep: Only trace IRQ edges
  mips: Implement arch_irqs_disabled()
  arm64: Implement arch_irqs_disabled()
  nds32: Implement arch_irqs_disabled()
  locking/lockdep: Cleanup
  x86/entry: Remove unused THUNKs
  cpuidle: Move trace_cpu_idle() into generic code
  cpuidle: Make CPUIDLE_FLAG_TLB_FLUSHED generic
  sched,idle,rcu: Push rcu_idle deeper into the idle path
  cpuidle: Fixup IRQ state
  lockdep: Use raw_cpu_*() for per-cpu variables
2020-08-30 11:43:50 -07:00
Thomas Gleixner
784a083037 genirq/matrix: Deal with the sillyness of for_each_cpu() on UP
Most of the CPU mask operations behave the same way, but for_each_cpu() and
it's variants ignore the cpumask argument and claim that CPU0 is always in
the mask. This is historical, inconsistent and annoying behaviour.

The matrix allocator uses for_each_cpu() and can be called on UP with an
empty cpumask. The calling code does not expect that this succeeds but
until commit e027fffff799 ("x86/irq: Unbreak interrupt affinity setting")
this went unnoticed. That commit added a WARN_ON() to catch cases which
move an interrupt from one vector to another on the same CPU. The warning
triggers on UP.

Add a check for the cpumask being empty to prevent this.

Fixes: 2f75d9e1c905 ("genirq: Implement bitmap matrix allocator")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
2020-08-30 19:17:28 +02:00
Alexei Starovoitov
07be4c4a3e bpf: Add bpf_copy_from_user() helper.
Sleepable BPF programs can now use copy_from_user() to access user memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-4-alexei.starovoitov@gmail.com
2020-08-28 21:20:33 +02:00
Alexei Starovoitov
1e6c62a882 bpf: Introduce sleepable BPF programs
Introduce sleepable BPF programs that can request such property for themselves
via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
to use helpers like bpf_copy_from_user() that might sleep. At present only
fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
when they are attached to kernel functions that are known to allow sleeping.

The non-sleepable programs are relying on implicit rcu_read_lock() and
migrate_disable() to protect life time of programs, maps that they use and
per-cpu kernel structures used to pass info between bpf programs and the
kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
should not be enclosed in migrate_disable() as well. Therefore
rcu_read_lock_trace is used to protect the life time of sleepable progs.

There are many networking and tracing program types. In many cases the
'struct bpf_prog *' pointer itself is rcu protected within some other kernel
data structure and the kernel code is using rcu_dereference() to load that
program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
Instead sleepable bpf programs are allowed with bpf trampoline only. The
program pointers are hard-coded into generated assembly of bpf trampoline and
synchronize_rcu_tasks_trace() is used to protect the life time of the program.
The same trampoline can hold both sleepable and non-sleepable progs.

When rcu_read_lock_trace is held it means that some sleepable bpf program is
running from bpf trampoline. Those programs can use bpf arrays and preallocated
hash/lru maps. These map types are waiting on programs to complete via
synchronize_rcu_tasks_trace();

Updates to trampoline now has to do synchronize_rcu_tasks_trace() and
synchronize_rcu_tasks() to wait for sleepable progs to finish and for
trampoline assembly to finish.

This is the first step of introducing sleepable progs. Eventually dynamically
allocated hash maps can be allowed and networking program types can become
sleepable too.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
2020-08-28 21:20:33 +02:00
Martin KaFai Lau
134fede4ee bpf: Relax max_entries check for most of the inner map types
Most of the maps do not use max_entries during verification time.
Thus, those map_meta_equal() do not need to enforce max_entries
when it is inserted as an inner map during runtime.  The max_entries
check is removed from the default implementation bpf_map_meta_equal().

The prog_array_map and xsk_map are exception.  Its map_gen_lookup
uses max_entries to generate inline lookup code.  Thus, they will
implement its own map_meta_equal() to enforce max_entries.
Since there are only two cases now, the max_entries check
is not refactored and stays in its own .c file.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011813.1970516-1-kafai@fb.com
2020-08-28 15:41:30 +02:00
Martin KaFai Lau
f4d0525921 bpf: Add map_meta_equal map ops
Some properties of the inner map is used in the verification time.
When an inner map is inserted to an outer map at runtime,
bpf_map_meta_equal() is currently used to ensure those properties
of the inserting inner map stays the same as the verification
time.

In particular, the current bpf_map_meta_equal() checks max_entries which
turns out to be too restrictive for most of the maps which do not use
max_entries during the verification time.  It limits the use case that
wants to replace a smaller inner map with a larger inner map.  There are
some maps do use max_entries during verification though.  For example,
the map_gen_lookup in array_map_ops uses the max_entries to generate
the inline lookup code.

To accommodate differences between maps, the map_meta_equal is added
to bpf_map_ops.  Each map-type can decide what to check when its
map is used as an inner map during runtime.

Also, some map types cannot be used as an inner map and they are
currently black listed in bpf_map_meta_alloc() in map_in_map.c.
It is not unusual that the new map types may not aware that such
blacklist exists.  This patch enforces an explicit opt-in
and only allows a map to be used as an inner map if it has
implemented the map_meta_equal ops.  It is based on the
discussion in [1].

All maps that support inner map has its map_meta_equal points
to bpf_map_meta_equal in this patch.  A later patch will
relax the max_entries check for most maps.  bpf_types.h
counts 28 map types.  This patch adds 23 ".map_meta_equal"
by using coccinelle.  -5 for
	BPF_MAP_TYPE_PROG_ARRAY
	BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
	BPF_MAP_TYPE_STRUCT_OPS
	BPF_MAP_TYPE_ARRAY_OF_MAPS
	BPF_MAP_TYPE_HASH_OF_MAPS

The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
is moved such that the same error is returned.

[1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
2020-08-28 15:41:30 +02:00
Dave Airlie
cbc2e82932 drm-misc-next for 5.10:
UAPI Changes:
 
 Cross-subsystem Changes:
 
 Core Changes:
   - ttm: various cleanups and reworks of the API
 
 Driver Changes:
   - ast: various cleanups
   - gma500: A few fixes, conversion to GPIOd API
   - hisilicon: Change of maintainer, various reworks
   - ingenic: Clock handling and formats support improvements
   - mcde: improvements to the DSI support
   - mgag200: Support G200 desktop cards
   - mxsfb: Support the i.MX7 and i.MX8M and the alpha plane
   - panfrost: support devfreq
   - ps8640: Retrieve the EDID from eDP control, misc improvements
   - tidss: Add a workaround for AM65xx YUV formats handling
   - virtio: a few cleanups, support for virtio-gpu exported resources
   - bridges: Support the chained bridges on more drivers,
     new bridges: Toshiba TC358762, Toshiba TC358775, Lontium LT9611
   - panels: Convert to dev_ based logging, read orientation from the DT,
     various fixes, new panels: Mantix MLAF057WE51-X, Chefree CH101OLHLWH-002,
     Powertip PH800480T013, KingDisplay KD116N21-30NV-A010
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRcEzekXsqa64kGDp7j7w1vZxhRxQUCX0fXGwAKCRDj7w1vZxhR
 xTmMAQDPmfSsBLLNnDxu4++zFrQ7OKmNSHCkVr4nAQ/yg3GVPQEAuRw6qPwPWuV3
 +jEPxaQSSmHOhx/jXfolV1tJaE/FHgA=
 =WYoO
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2020-08-27' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next for 5.10:

UAPI Changes:

Cross-subsystem Changes:

Core Changes:
  - ttm: various cleanups and reworks of the API

Driver Changes:
  - ast: various cleanups
  - gma500: A few fixes, conversion to GPIOd API
  - hisilicon: Change of maintainer, various reworks
  - ingenic: Clock handling and formats support improvements
  - mcde: improvements to the DSI support
  - mgag200: Support G200 desktop cards
  - mxsfb: Support the i.MX7 and i.MX8M and the alpha plane
  - panfrost: support devfreq
  - ps8640: Retrieve the EDID from eDP control, misc improvements
  - tidss: Add a workaround for AM65xx YUV formats handling
  - virtio: a few cleanups, support for virtio-gpu exported resources
  - bridges: Support the chained bridges on more drivers,
    new bridges: Toshiba TC358762, Toshiba TC358775, Lontium LT9611
  - panels: Convert to dev_ based logging, read orientation from the DT,
    various fixes, new panels: Mantix MLAF057WE51-X, Chefree CH101OLHLWH-002,
    Powertip PH800480T013, KingDisplay KD116N21-30NV-A010

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20200827155517.do6emeacetpturli@gilmour.lan
2020-08-28 12:38:06 +10:00
Dan Carpenter
892fc9f683 dma-pool: Fix an uninitialized variable bug in atomic_pool_expand()
The "page" pointer can be used with out being initialized.

Fixes: d7e673ec2c8e ("dma-pool: Only allocate from CMA when in same memory zone")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-08-27 09:22:56 +02:00
Yonghong Song
2921c90d47 bpf: Fix a verifier failure with xor
bpf selftest test_progs/test_sk_assign failed with llvm 11 and llvm 12.
Compared to llvm 10, llvm 11 and 12 generates xor instruction which
is not handled properly in verifier. The following illustrates the
problem:

  16: (b4) w5 = 0
  17: ... R5_w=inv0 ...
  ...
  132: (a4) w5 ^= 1
  133: ... R5_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) ...
  ...
  37: (bc) w8 = w5
  38: ... R5=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
          R8_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) ...
  ...
  41: (bc) w3 = w8
  42: ... R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) ...
  45: (56) if w3 != 0x0 goto pc+1
   ... R3_w=inv0 ...
  46: (b7) r1 = 34
  47: R1_w=inv34 R7=pkt(id=0,off=26,r=38,imm=0)
  47: (0f) r7 += r1
  48: R1_w=invP34 R3_w=inv0 R7_w=pkt(id=0,off=60,r=38,imm=0)
  48: (b4) w9 = 0
  49: R1_w=invP34 R3_w=inv0 R7_w=pkt(id=0,off=60,r=38,imm=0)
  49: (69) r1 = *(u16 *)(r7 +0)
  invalid access to packet, off=60 size=2, R7(id=0,off=60,r=38)
  R7 offset is outside of the packet

At above insn 132, w5 = 0, but after w5 ^= 1, we give a really conservative
value of w5. At insn 45, in reality the condition should be always false.
But due to conservative value for w3, the verifier evaluates it could be
true and this later leads to verifier failure complaining potential
packet out-of-bound access.

This patch implemented proper XOR support in verifier.
In the above example, we have:
  132: R5=invP0
  132: (a4) w5 ^= 1
  133: R5_w=invP1
  ...
  37: (bc) w8 = w5
  ...
  41: (bc) w3 = w8
  42: R3_w=invP1
  ...
  45: (56) if w3 != 0x0 goto pc+1
  47: R3_w=invP1
  ...
  processed 353 insns ...
and the verifier can verify the program successfully.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200825064608.2017937-1-yhs@fb.com
2020-08-26 21:47:32 -07:00
Udip Pant
7e40781cc8 bpf: verifier: Use target program's type for access verifications
This patch adds changes in verifier to make decisions such as granting
of read / write access or enforcement of return code status based on
the program type of the target program while using dynamic program
extension (of type BPF_PROG_TYPE_EXT).

The BPF_PROG_TYPE_EXT type can be used to extend types such as XDP, SKB
and others. Since the BPF_PROG_TYPE_EXT program type on itself is just a
placeholder for those, we need this extended check for those extended
programs to actually work with proper access, while using this option.

Specifically, it introduces following changes:
- may_access_direct_pkt_data:
    allow access to packet data based on the target prog
- check_return_code:
    enforce return code based on the target prog
    (currently, this check is skipped for EXT program)
- check_ld_abs:
    check for 'may_access_skb' based on the target prog
- check_map_prog_compatibility:
    enforce the map compatibility check based on the target prog
- may_update_sockmap:
    allow sockmap update based on the target prog

Some other occurrences of prog->type is left as it without replacing
with the 'resolved' type:
- do_check_common() and check_attach_btf_id():
    already have specific logic to handle the EXT prog type
- jit_subprogs() and bpf_check():
    Not changed for jit compilation or while inferring env->ops

Next few patches in this series include selftests for some of these cases.

Signed-off-by: Udip Pant <udippant@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200825232003.2877030-2-udippant@fb.com
2020-08-26 12:47:56 -07:00
Xu Wang
c072035164 audit: Remove redundant null check
Because kfree_skb already checked NULL skb parameter,
so the additional check is unnecessary, just remove it.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-08-26 09:10:39 -04:00
Boqun Feng
f611e8cf98 lockdep: Take read/write status in consideration when generate chainkey
Currently, the chainkey of a lock chain is a hash sum of the class_idx
of all the held locks, the read/write status are not taken in to
consideration while generating the chainkey. This could result into a
problem, if we have:

	P1()
	{
		read_lock(B);
		lock(A);
	}

	P2()
	{
		lock(A);
		read_lock(B);
	}

	P3()
	{
		lock(A);
		write_lock(B);
	}

, and P1(), P2(), P3() run one by one. And when running P2(), lockdep
detects such a lock chain A -> B is not a deadlock, then it's added in
the chain cache, and then when running P3(), even if it's a deadlock, we
could miss it because of the hit of chain cache. This could be confirmed
by self testcase "chain cached mixed R-L/L-W ".

To resolve this, we use concept "hlock_id" to generate the chainkey, the
hlock_id is a tuple (hlock->class_idx, hlock->read), which fits in a u16
type. With this, the chainkeys are different is the lock sequences have
the same locks but different read/write status.

Besides, since we use "hlock_id" to generate chainkeys, the chain_hlocks
array now store the "hlock_id"s rather than lock_class indexes.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-15-boqun.feng@gmail.com
2020-08-26 12:42:06 +02:00
Boqun Feng
621c9dac0e lockdep: Add recursive read locks into dependency graph
Since we have all the fundamental to handle recursive read locks, we now
add them into the dependency graph.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-13-boqun.feng@gmail.com
2020-08-26 12:42:06 +02:00
Boqun Feng
f08e388857 lockdep: Fix recursive read lock related safe->unsafe detection
Currently, in safe->unsafe detection, lockdep misses the fact that a
LOCK_ENABLED_IRQ_*_READ usage and a LOCK_USED_IN_IRQ_*_READ usage may
cause deadlock too, for example:

	P1                          P2
	<irq disabled>
	write_lock(l1);             <irq enabled>
				    read_lock(l2);
	write_lock(l2);
				    <in irq>
				    read_lock(l1);

Actually, all of the following cases may cause deadlocks:

	LOCK_USED_IN_IRQ_* -> LOCK_ENABLED_IRQ_*
	LOCK_USED_IN_IRQ_*_READ -> LOCK_ENABLED_IRQ_*
	LOCK_USED_IN_IRQ_* -> LOCK_ENABLED_IRQ_*_READ
	LOCK_USED_IN_IRQ_*_READ -> LOCK_ENABLED_IRQ_*_READ

To fix this, we need to 1) change the calculation of exclusive_mask() so
that READ bits are not dropped and 2) always call usage() in
mark_lock_irq() to check usage deadlocks, even when the new usage of the
lock is READ.

Besides, adjust usage_match() and usage_acculumate() to recursive read
lock changes.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-12-boqun.feng@gmail.com
2020-08-26 12:42:05 +02:00
Boqun Feng
68e3056785 lockdep: Adjust check_redundant() for recursive read change
check_redundant() will report redundancy if it finds a path could
replace the about-to-add dependency in the BFS search. With recursive
read lock changes, we certainly need to change the match function for
the check_redundant(), because the path needs to match not only the lock
class but also the dependency kinds. For example, if the about-to-add
dependency @prev -> @next is A -(SN)-> B, and we find a path A -(S*)->
.. -(*R)->B in the dependency graph with __bfs() (for simplicity, we can
also say we find an -(SR)-> path from A to B), we can not replace the
dependency with that path in the BFS search. Because the -(SN)->
dependency can make a strong path with a following -(S*)-> dependency,
however an -(SR)-> path cannot.

Further, we can replace an -(SN)-> dependency with a -(EN)-> path, that
means if we find a path which is stronger than or equal to the
about-to-add dependency, we can report the redundancy. By "stronger", it
means both the start and the end of the path are not weaker than the
start and the end of the dependency (E is "stronger" than S and N is
"stronger" than R), so that we can replace the dependency with that
path.

To make sure we find a path whose start point is not weaker than the
about-to-add dependency, we use a trick: the ->only_xr of the root
(start point) of __bfs() is initialized as @prev-> == 0, therefore if
@prev is E, __bfs() will pick only -(E*)-> for the first dependency,
otherwise, __bfs() can pick -(E*)-> or -(S*)-> for the first dependency.

To make sure we find a path whose end point is not weaker than the
about-to-add dependency, we replace the match function for __bfs()
check_redundant(), we check for the case that either @next is R
(anything is not weaker than it) or the end point of the path is N
(which is not weaker than anything).

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-11-boqun.feng@gmail.com
2020-08-26 12:42:05 +02:00
Boqun Feng
9de0c9bbce lockdep: Support deadlock detection for recursive read locks in check_noncircular()
Currently, lockdep only has limit support for deadlock detection for
recursive read locks.

This patch support deadlock detection for recursive read locks. The
basic idea is:

We are about to add dependency B -> A in to the dependency graph, we use
check_noncircular() to find whether we have a strong dependency path
A -> .. -> B so that we have a strong dependency circle (a closed strong
dependency path):

	 A -> .. -> B -> A

, which doesn't have two adjacent dependencies as -(*R)-> L -(S*)->.

Since A -> .. -> B is already a strong dependency path, so if either
B -> A is -(E*)-> or A -> .. -> B is -(*N)->, the circle A -> .. -> B ->
A is strong, otherwise not. So we introduce a new match function
hlock_conflict() to replace the class_equal() for the deadlock check in
check_noncircular().

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-10-boqun.feng@gmail.com
2020-08-26 12:42:05 +02:00
Boqun Feng
61775ed243 lockdep: Make __bfs(.match) return bool
The "match" parameter of __bfs() is used for checking whether we hit a
match in the search, therefore it should return a boolean value rather
than an integer for better readability.

This patch then changes the return type of the function parameter and the
match functions to bool.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-9-boqun.feng@gmail.com
2020-08-26 12:42:05 +02:00
Boqun Feng
6971c0f345 lockdep: Extend __bfs() to work with multiple types of dependencies
Now we have four types of dependencies in the dependency graph, and not
all the pathes carry real dependencies (the dependencies that may cause
a deadlock), for example:

	Given lock A and B, if we have:

	CPU1			CPU2
	=============		==============
	write_lock(A);		read_lock(B);
	read_lock(B);		write_lock(A);

	(assuming read_lock(B) is a recursive reader)

	then we have dependencies A -(ER)-> B, and B -(SN)-> A, and a
	dependency path A -(ER)-> B -(SN)-> A.

	In lockdep w/o recursive locks, a dependency path from A to A
	means a deadlock. However, the above case is obviously not a
	deadlock, because no one holds B exclusively, therefore no one
	waits for the other to release B, so who get A first in CPU1 and
	CPU2 will run non-blockingly.

	As a result, dependency path A -(ER)-> B -(SN)-> A is not a
	real/strong dependency that could cause a deadlock.

From the observation above, we know that for a dependency path to be
real/strong, no two adjacent dependencies can be as -(*R)-> -(S*)->.

Now our mission is to make __bfs() traverse only the strong dependency
paths, which is simple: we record whether we only have -(*R)-> for the
previous lock_list of the path in lock_list::only_xr, and when we pick a
dependency in the traverse, we 1) filter out -(S*)-> dependency if the
previous lock_list only has -(*R)-> dependency (i.e. ->only_xr is true)
and 2) set the next lock_list::only_xr to true if we only have -(*R)->
left after we filter out dependencies based on 1), otherwise, set it to
false.

With this extension for __bfs(), we now need to initialize the root of
__bfs() properly (with a correct ->only_xr), to do so, we introduce some
helper functions, which also cleans up a little bit for the __bfs() root
initialization code.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-8-boqun.feng@gmail.com
2020-08-26 12:42:04 +02:00
Boqun Feng
3454a36d6a lockdep: Introduce lock_list::dep
To add recursive read locks into the dependency graph, we need to store
the types of dependencies for the BFS later. There are four types of
dependencies:

*	Exclusive -> Non-recursive dependencies: EN
	e.g. write_lock(prev) held and try to acquire write_lock(next)
	or non-recursive read_lock(next), which can be represented as
	"prev -(EN)-> next"

*	Shared -> Non-recursive dependencies: SN
	e.g. read_lock(prev) held and try to acquire write_lock(next) or
	non-recursive read_lock(next), which can be represented as
	"prev -(SN)-> next"

*	Exclusive -> Recursive dependencies: ER
	e.g. write_lock(prev) held and try to acquire recursive
	read_lock(next), which can be represented as "prev -(ER)-> next"

*	Shared -> Recursive dependencies: SR
	e.g. read_lock(prev) held and try to acquire recursive
	read_lock(next), which can be represented as "prev -(SR)-> next"

So we use 4 bits for the presence of each type in lock_list::dep. Helper
functions and macros are also introduced to convert a pair of locks into
lock_list::dep bit and maintain the addition of different types of
dependencies.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-7-boqun.feng@gmail.com
2020-08-26 12:42:04 +02:00
Boqun Feng
bd76eca10d lockdep: Reduce the size of lock_list::distance
lock_list::distance is always not greater than MAX_LOCK_DEPTH (which
is 48 right now), so a u16 will fit. This patch reduces the size of
lock_list::distance to save space, so that we can introduce other fields
to help detect recursive read lock deadlocks without increasing the size
of lock_list structure.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-6-boqun.feng@gmail.com
2020-08-26 12:42:04 +02:00
Boqun Feng
d563bc6ead lockdep: Make __bfs() visit every dependency until a match
Currently, __bfs() will do a breadth-first search in the dependency
graph and visit each lock class in the graph exactly once, so for
example, in the following graph:

	A ---------> B
	|            ^
	|            |
	+----------> C

a __bfs() call starts at A, will visit B through dependency A -> B and
visit C through dependency A -> C and that's it, IOW, __bfs() will not
visit dependency C -> B.

This is OK for now, as we only have strong dependencies in the
dependency graph, so whenever there is a traverse path from A to B in
__bfs(), it means A has strong dependencies to B (IOW, B depends on A
strongly). So no need to visit all dependencies in the graph.

However, as we are going to add recursive-read lock into the dependency
graph, as a result, not all the paths mean strong dependencies, in the
same example above, dependency A -> B may be a weak dependency and
traverse A -> C -> B may be a strong dependency path. And with the old
way of __bfs() (i.e. visiting every lock class exactly once), we will
miss the strong dependency path, which will result into failing to find
a deadlock. To cure this for the future, we need to find a way for
__bfs() to visit each dependency, rather than each class, exactly once
in the search until we find a match.

The solution is simple:

We used to mark lock_class::lockdep_dependency_gen_id to indicate a
class has been visited in __bfs(), now we change the semantics a little
bit: we now mark lock_class::lockdep_dependency_gen_id to indicate _all
the dependencies_ in its lock_{after,before} have been visited in the
__bfs() (note we only take one direction in a __bfs() search). In this
way, every dependency is guaranteed to be visited until we find a match.

Note: the checks in mark_lock_accessed() and lock_accessed() are
removed, because after this modification, we may call these two
functions on @source_entry of __bfs(), which may not be the entry in
"list_entries"

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-5-boqun.feng@gmail.com
2020-08-26 12:42:03 +02:00
Boqun Feng
b11be024de lockdep: Demagic the return value of BFS
__bfs() could return four magic numbers:

	1: search succeeds, but none match.
	0: search succeeds, find one match.
	-1: search fails because of the cq is full.
	-2: search fails because a invalid node is found.

This patch cleans things up by using a enum type for the return value
of __bfs() and its friends, this improves the code readability of the
code, and further, could help if we want to extend the BFS.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-4-boqun.feng@gmail.com
2020-08-26 12:42:03 +02:00
Boqun Feng
e918188611 locking: More accurate annotations for read_lock()
On the archs using QUEUED_RWLOCKS, read_lock() is not always a recursive
read lock, actually it's only recursive if in_interrupt() is true. So
change the annotation accordingly to catch more deadlocks.

Note we used to treat read_lock() as pure recursive read locks in
lib/locking-seftest.c, and this is useful, especially for the lockdep
development selftest, so we keep this via a variable to force switching
lock annotation for read_lock().

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200807074238.1632519-2-boqun.feng@gmail.com
2020-08-26 12:42:02 +02:00
Valentin Schneider
4fc472f121 sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
SD_DEGENERATE_GROUPS_MASK is only useful for sched/topology.c, but still
gets defined for anyone who imports topology.h, leading to a flurry of
unused variable warnings.

Move it out of the header and place it next to the SD degeneration
functions in sched/topology.c.

Fixes: 4ee4ea443a5d ("sched/topology: Introduce SD metaflag for flags needing > 1 groups")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200825133216.9163-2-valentin.schneider@arm.com
2020-08-26 12:41:59 +02:00
Valentin Schneider
8fca9494d4 sched/topology: Move sd_flag_debug out of linux/sched/topology.h
Defining an array in a header imported all over the place clearly is a daft
idea, that still didn't stop me from doing it.

Leave a declaration of sd_flag_debug in topology.h and move its definition
to sched/debug.c.

Fixes: b6e862f38672 ("sched/topology: Define and assign sched_domain flag metadata")
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200825133216.9163-1-valentin.schneider@arm.com
2020-08-26 12:41:59 +02:00
Sebastian Andrzej Siewior
c1cecf884a sched: Cache task_struct::flags in sched_submit_work()
sched_submit_work() is considered to be a hot path. The preempt_disable()
instruction is a compiler barrier and forces the compiler to load
task_struct::flags for the second comparison.
By using a local variable, the compiler can load the value once and keep it in
a register for the second comparison.

Verified on x86-64 with gcc-10.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200819200025.lqvmyefqnbok5i4f@linutronix.de
2020-08-26 12:41:58 +02:00
Jiang Biao
1724b95b92 sched/fair: Simplify the work when reweighting entity
The code in reweight_entity() can be simplified.

For a sched entity on the rq, the entity accounting can be replaced by
cfs_rq instantaneous load updates currently called from within the
entity accounting.

Even though an entity on the rq can't represent a task in
reweight_entity() (a task is always dequeued before calling this
function) and so the numa task accounting and the rq->cfs_tasks list
management of the entity accounting are never called, the redundant
cfs_rq->nr_running decrement/increment will be avoided.

Signed-off-by: Jiang Biao <benbjiang@tencent.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20200811113209.34057-1-benbjiang@tencent.com
2020-08-26 12:41:58 +02:00