32839 Commits

Author SHA1 Message Date
Andrii Nakryiko
62039c30c1 bpf: Initialize storage pointers to NULL to prevent freeing garbage pointer
Local storage array isn't initialized, so if cgroup storage allocation fails
for BPF_CGROUP_STORAGE_SHARED, error handling code will attempt to free
uninitialized pointer for BPF_CGROUP_STORAGE_PERCPU storage type. Avoid this
by always initializing storage pointers to NULLs.

Fixes: 8bad74f9840f ("bpf: extend cgroup bpf core to allow multiple cgroup storage types")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200309222756.1018737-1-andriin@fb.com
2020-03-09 19:56:48 -07:00
Christian Brauner
10dab84caf
pid: make ENOMEM return value more obvious
The alloc_pid() codepath used to be simpler. With the introducation of the
ability to choose specific pids in 49cb2fc42ce4 ("fork: extend clone3() to
support setting a PID") it got more complex. It hasn't been super obvious
that ENOMEM is returned when the pid namespace init process/child subreaper
of the pid namespace has died. As can be seen from multiple attempts to
improve this see e.g. [1] and most recently [2].
We regressed returning ENOMEM in [3] and [2] restored it. Let's add a
comment on top explaining that this is historic and documented behavior and
cannot easily be changed.

[1]: 35f71bc0a09a ("fork: report pid reservation failure properly")
[2]: b26ebfe12f34 ("pid: Fix error return value in some cases")
[3]: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-03-09 23:40:05 +01:00
Thomas Gleixner
8d67743653 futex: Unbreak futex hashing
The recent futex inode life time fix changed the ordering of the futex key
union struct members, but forgot to adjust the hash function accordingly,

As a result the hashing omits the leading 64bit and even hashes beyond the
futex key causing a bad hash distribution which led to a ~100% performance
regression.

Hand in the futex key pointer instead of a random struct member and make
the size calculation based of the struct offset.

Fixes: 8019ad13ef7f ("futex: Fix inode life-time issue")
Reported-by: Rong Chen <rong.a.chen@intel.com>
Decoded-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Rong Chen <rong.a.chen@intel.com>
Link: https://lkml.kernel.org/r/87h7yy90ve.fsf@nanos.tec.linutronix.de
2020-03-09 22:33:09 +01:00
Corey Minyard
b26ebfe12f
pid: Fix error return value in some cases
Recent changes to alloc_pid() allow the pid number to be specified on
the command line.  If set_tid_size is set, then the code scanning the
levels will hard-set retval to -EPERM, overriding it's previous -ENOMEM
value.

After the code scanning the levels, there are error returns that do not
set retval, assuming it is still set to -ENOMEM.

So set retval back to -ENOMEM after scanning the levels.

Fixes: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <areber@redhat.com>
Cc: <stable@vger.kernel.org> # 5.5
Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org
[christian.brauner@ubuntu.com: fixup commit message]
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-03-08 14:22:58 +01:00
Alexander Sverdlin
87f2d1c662 genirq/irqdomain: Check pointer in irq_domain_alloc_irqs_hierarchy()
irq_domain_alloc_irqs_hierarchy() has 3 call sites in the compilation unit
but only one of them checks for the pointer which is being dereferenced
inside the called function. Move the check into the function. This allows
for catching the error instead of the following crash:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
PC is at 0x0
LR is at gpiochip_hierarchy_irq_domain_alloc+0x11f/0x140
...
[<c06c23ff>] (gpiochip_hierarchy_irq_domain_alloc)
[<c0462a89>] (__irq_domain_alloc_irqs)
[<c0462dad>] (irq_create_fwspec_mapping)
[<c06c2251>] (gpiochip_to_irq)
[<c06c1c9b>] (gpiod_to_irq)
[<bf973073>] (gpio_irqs_init [gpio_irqs])
[<bf974048>] (gpio_irqs_exit+0xecc/0xe84 [gpio_irqs])
Code: bad PC value

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200306174720.82604-1-alexander.sverdlin@nokia.com
2020-03-08 13:03:41 +01:00
Thomas Gleixner
acd26bcf36 genirq: Provide interrupt injection mechanism
Error injection mechanisms need a half ways safe way to inject interrupts as
invoking generic_handle_irq() or the actual device interrupt handler
directly from e.g. a debugfs write is not guaranteed to be safe.

On x86 generic_handle_irq() is unsafe due to the hardware trainwreck which
is the base of x86 interrupt delivery and affinity management.

Move the irq debugfs injection code into a separate function which can be
used by error injection code as well.

The implementation prevents at least that state is corrupted, but it cannot
close a very tiny race window on x86 which might result in a stale and not
serviced device interrupt under very unlikely circumstances.

This is explicitly for debugging and testing and not for production use or
abuse in random driver code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.990928309@linutronix.de
2020-03-08 11:06:42 +01:00
Thomas Gleixner
da90921acc genirq: Sanitize state handling in check_irq_resend()
The code sets IRQS_REPLAY unconditionally whether the resend happens or
not. That doesn't have bad side effects right now, but inconsistent state
is always a latent source of problems.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.882129117@linutronix.de
2020-03-08 11:06:41 +01:00
Thomas Gleixner
1f85b1f5e1 genirq: Add return value to check_irq_resend()
In preparation for an interrupt injection interface which can be used
safely by error injection mechanisms. e.g. PCI-E/ AER, add a return value
to check_irq_resend() so errors can be propagated to the caller.

Split out the software resend code so the ugly #ifdef in check_irq_resend()
goes away and the whole thing becomes readable.

Fix up the caller in debugfs. The caller in irq_startup() does not care
about the return value as this is unconditionally invoked for all
interrupts and the resend is best effort anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.775200917@linutronix.de
2020-03-08 11:06:41 +01:00
Thomas Gleixner
c16816acd0 genirq: Add protection against unsafe usage of generic_handle_irq()
In general calling generic_handle_irq() with interrupts disabled from non
interrupt context is harmless. For some interrupt controllers like the x86
trainwrecks this is outright dangerous as it might corrupt state if an
interrupt affinity change is pending.

Add infrastructure which allows to mark interrupts as unsafe and catch such
usage in generic_handle_irq().

Reported-by: sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.590923677@linutronix.de
2020-03-08 11:06:40 +01:00
Thomas Gleixner
a740a423c3 genirq/debugfs: Add missing sanity checks to interrupt injection
Interrupts cannot be injected when the interrupt is not activated and when
a replay is already in progress.

Fixes: 536e2e34bd00 ("genirq/debugfs: Triggering of interrupts from userspace")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200306130623.500019114@linutronix.de
2020-03-08 11:06:40 +01:00
luanshi
b513df6780 irqdomain: Fix function documentation of __irq_domain_alloc_fwnode()
The function got renamed at some point, but the kernel-doc was not updated.

Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1583200125-58806-1-git-send-email-zhangliguang@linux.alibaba.com
2020-03-08 11:02:24 +01:00
Linus Torvalds
5dfcc13902 block-5.6-2020-03-07
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5j8hwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnjID/4/XVrqtVNUzVoVOtkOyxyesBrJVMHEQEpJ
 PZssv835IStw0ENhxQJfGjPaIFc9Ff6PMkeN5KRAlMoEc+NkrJShF3owGf+6Bps7
 rxpblPxaw+CJFa31YBDZVjMCvbVkDm40G5SsJh+xzdIjlWz7MppkkMPdrErPwY8V
 0vnrIc+mKBKfBMZTwVkycYtp17LVgfXguledoWzxM1y47IW5UasKh8jdzhbu8Hvt
 zztdQrigUdb+9XnLGCZIY0JQOyrhJ5zQpZ40FzbvxdYrQZXOoYT8L7iFu/z0Wi7K
 p3a+G+B4WowtLYW78me4Uut5RrHq2XOehSypfujanQlpgXPGjS3TdHT3an2T8XPQ
 NyGsZsn/eLm3btNbhGUd8vqpQy5EmWhqmwvYk9tFAoSFLiLcvCC624b/TCYPL+gk
 3ZiI7mXBMjHnUZ0J/RF6kZWTAZDvr/tE7UZt1f8r1eEr8VDzCNp5Pst+HCVIguYD
 g9eWF8oH6wYoj39UKf1k+vW2GjXGFsnfivObaxhyz03sAPXK2wQlzAe/4jZ24XNr
 TRtOXh97c3CbLAwdUHehlzzdR3U7h0n2KsmrTC5AGmLABmR79s7BJ0+pexuZituO
 LwU8+gpf7AugHTrLg1eNXAmBHW44I1ticXYiWcT4iSPn99kNIhlW+Jb1iTGoiu7n
 nXyS3b5SCw==
 =xwKl
 -----END PGP SIGNATURE-----

Merge tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Here are a few fixes that should go into this release. This contains:

   - Revert of a bad bcache patch from this merge window

   - Removed unused function (Daniel)

   - Fixup for the blktrace fix from Jan from this release (Cengiz)

   - Fix of deeper level bfqq overwrite in BFQ (Carlo)"

* tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
  block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
  blktrace: fix dereference after null check
  Revert "bcache: ignore pending signals when creating gc and allocator thread"
  block: Remove used kblockd_schedule_work_on()
2020-03-07 14:14:38 -06:00
Linus Torvalds
fa883d6afb for-linus-2020-03-07
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXmNvpgAKCRCRxhvAZXjc
 ouFvAQDCzfOx1vcEP/nNhYBP2MPuafKclJcoJggC9rSmIvcLiQD/TI+LyHzplD+m
 MWSu9NZJ6h6qyjKJivja3/bs8DVEewU=
 =4gyS
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux

Pull thread fixes from Christian Brauner:
 "Here are a few hopefully uncontroversial fixes:

   - Use RCU_INIT_POINTER() when initializing rcu protected members in
     task_struct to fix sparse warnings.

   - Add pidfd_fdinfo_test binary to .gitignore file"

* tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
  selftests: pidfd: Add pidfd_fdinfo_test in .gitignore
  exit: Fix Sparse errors and warnings
  fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer()
2020-03-07 08:01:43 -06:00
Dmitry Safonov
eaee41727e sysctl/sysrq: Remove __sysrq_enabled copy
Many embedded boards have a disconnected TTL level serial which can
generate some garbage that can lead to spurious false sysrq detects.

Currently, sysrq can be either completely disabled for serial console
or always disabled (with CONFIG_MAGIC_SYSRQ_SERIAL), since
commit 732dbf3a6104 ("serial: do not accept sysrq characters via serial port")

At Arista, we have such boards that can generate BREAK and random
garbage. While disabling sysrq for serial console would solve
the problem with spurious false sysrq triggers, it's also desirable
to have a way to enable sysrq back.

Having the way to enable sysrq was beneficial to debug lockups with
a manual investigation in field and on the other side preventing false
sysrq detections.

As a preparation to add sysrq_toggle_support() call into uart,
remove a private copy of sysrq_enabled from sysctl - it should reflect
the actual status of sysrq.

Furthermore, the private copy isn't correct already in case
sysrq_always_enabled is true. So, remove __sysrq_enabled and use a
getter-helper sysrq_mask() to check sysrq_key_op enabled status.

Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dmitry Safonov <dima@arista.com>
Link: https://lore.kernel.org/r/20200302175135.269397-2-dima@arista.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-07 09:52:02 +01:00
Ingo Molnar
14533a16c4 thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code
drivers/base/arch_topology.c is only built if CONFIG_GENERIC_ARCH_TOPOLOGY=y,
resulting in such build failures:

  cpufreq_cooling.c:(.text+0x1e7): undefined reference to `arch_set_thermal_pressure'

Move it to sched/core.c instead, and keep it enabled on x86 despite
us not having a arch_scale_thermal_pressure() facility there, to
build-test this thing.

Cc: Thara Gopinath <thara.gopinath@linaro.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 14:26:31 +01:00
Peter Xu
fd3eafda8f sched/core: Remove rq.hrtick_csd_pending
Now smp_call_function_single_async() provides the protection that
we'll return with -EBUSY if the csd object is still pending, then we
don't need the rq.hrtick_csd_pending any more.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191216213125.9536-4-peterx@redhat.com
2020-03-06 13:42:28 +01:00
Peter Xu
5a18ceca63 smp: Allow smp_call_function_single_async() to insert locked csd
Previously we will raise an warning if we want to insert a csd object
which is with the LOCK flag set, and if it happens we'll also wait for
the lock to be released.  However, this operation does not match
perfectly with how the function is named - the name with "_async"
suffix hints that this function should not block, while we will.

This patch changed this behavior by simply return -EBUSY instead of
waiting, at the meantime we allow this operation to happen without
warning the user to change this into a feature when the caller wants
to "insert a csd object, if it's there, just wait for that one".

This is pretty safe because in flush_smp_call_function_queue() for
async csd objects (where csd->flags&SYNC is zero) we'll first do the
unlock then we call the csd->func().  So if we see the csd->flags&LOCK
is true in smp_call_function_single_async(), then it's guaranteed that
csd->func() will be called after this smp_call_function_single_async()
returns -EBUSY.

Update the comment of the function too to refect this.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20191216213125.9536-2-peterx@redhat.com
2020-03-06 13:42:28 +01:00
Qais Yousef
d94a9df490 sched/rt: Remove unnecessary push for unfit tasks
In task_woken_rt() and switched_to_rto() we try trigger push-pull if the
task is unfit.

But the logic is found lacking because if the task was the only one
running on the CPU, then rt_rq is not in overloaded state and won't
trigger a push.

The necessity of this logic was under a debate as well, a summary of
the discussion can be found in the following thread:

  https://lore.kernel.org/lkml/20200226160247.iqvdakiqbakk2llz@e107158-lin.cambridge.arm.com/

Remove the logic for now until a better approach is agreed upon.

Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-6-qais.yousef@arm.com
2020-03-06 12:57:29 +01:00
Qais Yousef
98ca645f82 sched/rt: Allow pulling unfitting task
When implemented RT Capacity Awareness; the logic was done such that if
a task was running on a fitting CPU, then it was sticky and we would try
our best to keep it there.

But as Steve suggested, to adhere to the strict priority rules of RT
class; allow pulling an RT task to unfitting CPU to ensure it gets a
chance to run ASAP.

LINK: https://lore.kernel.org/lkml/20200203111451.0d1da58f@oasis.local.home/
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-5-qais.yousef@arm.com
2020-03-06 12:57:28 +01:00
Qais Yousef
a1bd02e1f2 sched/rt: Optimize cpupri_find() on non-heterogenous systems
By introducing a new cpupri_find_fitness() function that takes the
fitness_fn as an argument and only called when asym_system static key is
enabled.

cpupri_find() is now a wrapper function that calls cpupri_find_fitness()
passing NULL as a fitness_fn, hence disabling the logic that handles
fitness by default.

LINK: https://lore.kernel.org/lkml/c0772fca-0a4b-c88d-fdf2-5715fcf8447b@arm.com/
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-4-qais.yousef@arm.com
2020-03-06 12:57:27 +01:00
Qais Yousef
b28bc1e002 sched/rt: Re-instate old behavior in select_task_rq_rt()
When RT Capacity Aware support was added, the logic in select_task_rq_rt
was modified to force a search for a fitting CPU if the task currently
doesn't run on one.

But if the search failed, and the search was only triggered to fulfill
the fitness request; we could end up selecting a new CPU unnecessarily.

Fix this and re-instate the original behavior by ensuring we bail out
in that case.

This behavior change only affected asymmetric systems that are using
util_clamp to implement capacity aware. None asymmetric systems weren't
affected.

LINK: https://lore.kernel.org/lkml/20200218041620.GD28029@codeaurora.org/
Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-3-qais.yousef@arm.com
2020-03-06 12:57:27 +01:00
Qais Yousef
d9cb236b94 sched/rt: cpupri_find: Implement fallback mechanism for !fit case
When searching for the best lowest_mask with a fitness_fn passed, make
sure we record the lowest_level that returns a valid lowest_mask so that
we can use that as a fallback in case we fail to find a fitting CPU at
all levels.

The intention in the original patch was not to allow a down migration to
unfitting CPU. But this missed the case where we are already running on
unfitting one.

With this change now RT tasks can still move between unfitting CPUs when
they're already running on such CPU.

And as Steve suggested; to adhere to the strict priority rules of RT, if
a task is already running on a fitting CPU but due to priority it can't
run on it, allow it to downmigrate to unfitting CPU so it can run.

Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware")
Link: https://lkml.kernel.org/r/20200302132721.8353-2-qais.yousef@arm.com
Link: https://lore.kernel.org/lkml/20200203142712.a7yvlyo2y3le5cpn@e107158-lin/
2020-03-06 12:57:26 +01:00
Vincent Guittot
5ab297bab9 sched/fair: Fix reordering of enqueue/dequeue_task_fair()
Even when a cgroup is throttled, the group se of a child cgroup can still
be enqueued and its gse->on_rq stays true. When a task is enqueued on such
child, we still have to update the load_avg and increase
h_nr_running of the throttled cfs. Nevertheless, the 1st
for_each_sched_entity() loop is skipped because of gse->on_rq == true and the
2nd loop because the cfs is throttled whereas we have to update both
load_avg with the old h_nr_running and increase h_nr_running in such case.

The same sequence can happen during dequeue when se moves to parent before
breaking in the 1st loop.

Note that the update of load_avg will effectively happen only once in order
to sync up to the throttled time. Next call for updating load_avg will stop
early because the clock stays unchanged.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 6d4d22468dae ("sched/fair: Reorder enqueue/dequeue_task_fair path")
Link: https://lkml.kernel.org/r/20200306084208.12583-1-vincent.guittot@linaro.org
2020-03-06 12:57:25 +01:00
Vincent Guittot
6212437f0f sched/fair: Fix runnable_avg for throttled cfs
When a cfs_rq is throttled, its group entity is dequeued and its running
tasks are removed. We must update runnable_avg with the old h_nr_running
and update group_se->runnable_weight with the new h_nr_running at each
level of the hierarchy.

Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 9f68395333ad ("sched/pelt: Add a new runnable average signal")
Link: https://lkml.kernel.org/r/20200227154115.8332-1-vincent.guittot@linaro.org
2020-03-06 12:57:25 +01:00
Yu Chen
ba4f7bc1de sched/deadline: Make two functions static
Since commit 06a76fe08d4 ("sched/deadline: Move DL related code
from sched/core.c to sched/deadline.c"), DL related code moved to
deadline.c.

Make the following two functions static since they're only used in
deadline.c:

	dl_change_utilization()
	init_dl_rq_bw_ratio()

Signed-off-by: Yu Chen <chen.yu@easystack.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200228100329.16927-1-chen.yu@easystack.cn
2020-03-06 12:57:24 +01:00
Valentin Schneider
38502ab4bf sched/topology: Don't enable EAS on SMT systems
EAS already requires asymmetric CPU capacities to be enabled, and mixing
this with SMT is an aberration, but better be safe than sorry.

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200227191433.31994-2-valentin.schneider@arm.com
2020-03-06 12:57:23 +01:00
Mel Gorman
0621df3154 sched/numa: Acquire RCU lock for checking idle cores during NUMA balancing
Qian Cai reported the following bug:

  The linux-next commit ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a
  migration target instead of comparing tasks") introduced a boot warning,

  [   86.520534][    T1] WARNING: suspicious RCU usage
  [   86.520540][    T1] 5.6.0-rc3-next-20200227 #7 Not tainted
  [   86.520545][    T1] -----------------------------
  [   86.520551][    T1] kernel/sched/fair.c:5914 suspicious rcu_dereference_check() usage!
  [   86.520555][    T1]
  [   86.520555][    T1] other info that might help us debug this:
  [   86.520555][    T1]
  [   86.520561][    T1]
  [   86.520561][    T1] rcu_scheduler_active = 2, debug_locks = 1
  [   86.520567][    T1] 1 lock held by systemd/1:
  [   86.520571][    T1]  #0: ffff8887f4b14848 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x1d2/0x998
  [   86.520594][    T1]
  [   86.520594][    T1] stack backtrace:
  [   86.520602][    T1] CPU: 1 PID: 1 Comm: systemd Not tainted 5.6.0-rc3-next-20200227 #7

task_numa_migrate() checks for idle cores when updating NUMA-related statistics.
This relies on reading a RCU-protected structure in test_idle_cores() via this
call chain

task_numa_migrate
  -> update_numa_stats
    -> numa_idle_core
      -> test_idle_cores

While the locking could be fine-grained, it is more appropriate to acquire
the RCU lock for the entire scan of the domain. This patch removes the
warning triggered at boot time.

Reported-by: Qian Cai <cai@lca.pw>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks")
Link: https://lkml.kernel.org/r/20200227191804.GJ3818@techsingularity.net
2020-03-06 12:57:22 +01:00
Valentin Schneider
76c389ab2b sched/fair: Fix kernel build warning in test_idle_cores() for !SMT NUMA
Building against the tip/sched/core as ff7db0bf24db ("sched/numa: Prefer
using an idle CPU as a migration target instead of comparing tasks") with
the arm64 defconfig (which doesn't have CONFIG_SCHED_SMT set) leads to:

  kernel/sched/fair.c:1525:20: warning: 'test_idle_cores' declared 'static' but never defined [-Wunused-function]
   static inline bool test_idle_cores(int cpu, bool def);
		      ^~~~~~~~~~~~~~~

Rather than define it in its own CONFIG_SCHED_SMT #define island, bunch it
up with test_idle_cores().

Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
[mgorman@techsingularity.net: Edit changelog, minor style change]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks")
Link: https://lkml.kernel.org/r/20200303110258.1092-3-mgorman@techsingularity.net
2020-03-06 12:57:22 +01:00
Thara Gopinath
05289b90c2 sched/fair: Enable tuning of decay period
Thermal pressure follows pelt signals which means the decay period for
thermal pressure is the default pelt decay period. Depending on SoC
characteristics and thermal activity, it might be beneficial to decay
thermal pressure slower, but still in-tune with the pelt signals.  One way
to achieve this is to provide a command line parameter to set a decay
shift parameter to an integer between 0 and 10.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-10-thara.gopinath@linaro.org
2020-03-06 12:57:21 +01:00
Thara Gopinath
467b7d01c4 sched/fair: Update cpu_capacity to reflect thermal pressure
cpu_capacity initially reflects the maximum possible capacity of a CPU.
Thermal pressure on a CPU means this maximum possible capacity is
unavailable due to thermal events. This patch subtracts the average
thermal pressure for a CPU from its maximum possible capacity so that
cpu_capacity reflects the remaining maximum capacity.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-8-thara.gopinath@linaro.org
2020-03-06 12:57:20 +01:00
Thara Gopinath
b4eccf5f8e sched/fair: Enable periodic update of average thermal pressure
Introduce support in scheduler periodic tick and other CFS bookkeeping
APIs to trigger the process of computing average thermal pressure for a
CPU. Also consider avg_thermal.load_avg in others_have_blocked which
allows for decay of pelt signals.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-7-thara.gopinath@linaro.org
2020-03-06 12:57:20 +01:00
Thara Gopinath
765047932f sched/pelt: Add support to track thermal pressure
Extrapolating on the existing framework to track rt/dl utilization using
pelt signals, add a similar mechanism to track thermal pressure. The
difference here from rt/dl utilization tracking is that, instead of
tracking time spent by a CPU running a RT/DL task through util_avg, the
average thermal pressure is tracked through load_avg. This is because
thermal pressure signal is weighted time "delta" capacity unlike util_avg
which is binary. "delta capacity" here means delta between the actual
capacity of a CPU and the decreased capacity a CPU due to a thermal event.

In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. Function update_thermal_load_avg can be called
to do the periodic bookkeeping (accumulate, decay and average) of the
thermal pressure.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200222005213.3873-2-thara.gopinath@linaro.org
2020-03-06 12:57:17 +01:00
Chris Wilson
f1dfdab694 sched/vtime: Prevent unstable evaluation of WARN(vtime->state)
As the vtime is sampled under loose seqcount protection by kcpustat, the
vtime fields may change as the code flows. Where logic dictates a field
has a static value, use a READ_ONCE.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes: 74722bb223d0 ("sched/vtime: Bring up complete kcpustat accessor")
Link: https://lkml.kernel.org/r/20200123180849.28486-1-frederic@kernel.org
2020-03-06 12:57:16 +01:00
Ingo Molnar
1b10d388d0 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 12:49:56 +01:00
Ian Rogers
95ed6c707f perf/cgroup: Order events in RB tree by cgroup id
If one is monitoring 6 events on 20 cgroups the per-CPU RB tree will
hold 120 events. The scheduling in of the events currently iterates
over all events looking to see which events match the task's cgroup or
its cgroup hierarchy. If a task is in 1 cgroup with 6 events, then 114
events are considered unnecessarily.

This change orders events in the RB tree by cgroup id if it is present.
This means scheduling in may go directly to events associated with the
task's cgroup if one is present. The per-CPU iterator storage in
visit_groups_merge is sized sufficent for an iterator per cgroup depth,
where different iterators are needed for the task's cgroup and parent
cgroups. By considering the set of iterators when visiting, the lowest
group_index event may be selected and the insertion order group_index
property is maintained. This also allows event rotation to function
correctly, as although events are grouped into a cgroup, rotation always
selects the lowest group_index event to rotate (delete/insert into the
tree) and the min heap of iterators make it so that the group_index order
is maintained.

Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20190724223746.153620-3-irogers@google.com
2020-03-06 11:57:01 +01:00
Ian Rogers
c2283c9368 perf/cgroup: Grow per perf_cpu_context heap storage
Allow the per-CPU min heap storage to have sufficient space for per-cgroup
iterators.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-6-irogers@google.com
2020-03-06 11:57:00 +01:00
Ian Rogers
836196beb3 perf/core: Add per perf_cpu_context min_heap storage
The storage required for visit_groups_merge's min heap needs to vary in
order to support more iterators, such as when multiple nested cgroups'
events are being visited. This change allows for 2 iterators and doesn't
support growth.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-5-irogers@google.com
2020-03-06 11:57:00 +01:00
Ian Rogers
6eef8a7116 perf/core: Use min_heap in visit_groups_merge()
visit_groups_merge will pick the next event based on when it was
inserted in to the context (perf_event group_index). Events may be per CPU
or for any CPU, but in the future we'd also like to have per cgroup events
to avoid searching all events for the events to schedule for a cgroup.
Introduce a min heap for the events that maintains a property that the
earliest inserted event is always at the 0th element. Initialize the heap
with per-CPU and any-CPU events for the context.

Based-on-work-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-4-irogers@google.com
2020-03-06 11:56:59 +01:00
Peter Zijlstra
98add2af89 perf/cgroup: Reorder perf_cgroup_connect()
Move perf_cgroup_connect() after perf_event_alloc(), such that we can
find/use the PMU's cpu context.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20200214075133.181299-2-irogers@google.com
2020-03-06 11:56:58 +01:00
Peter Zijlstra
2c2366c754 perf/core: Remove 'struct sched_in_data'
We can deduce the ctx and cpuctx from the event, no need to pass them
along. Remove the structure and pass in can_add_hw directly.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 11:56:58 +01:00
Peter Zijlstra
ab6f824cfd perf/core: Unify {pinned,flexible}_sched_in()
Less is more; unify the two very nearly identical function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 11:56:55 +01:00
Ingo Molnar
1941011a8b Merge branch 'perf/urgent' into perf/core, to pick up the latest fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-06 11:56:40 +01:00
Peter Zijlstra
4b39f99c22 futex: Remove {get,drop}_futex_key_refs()
Now that {get,drop}_futex_key_refs() have become a glorified NOP,
remove them entirely.

The only thing get_futex_key_refs() is still doing is an smp_mb(), and
now that we don't need to (ab)use existing atomic ops to obtain them,
we can place it explicitly where we need it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-06 11:06:19 +01:00
Peter Zijlstra
222993395e futex: Remove pointless mmgrap() + mmdrop()
We always set 'key->private.mm' to 'current->mm', getting an extra
reference on 'current->mm' is quite pointless, because as long as the
task is blocked it isn't going to go away.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-06 11:06:18 +01:00
Peter Zijlstra
3867913c45 Merge branch 'locking/urgent' 2020-03-06 11:06:17 +01:00
Peter Zijlstra
8019ad13ef futex: Fix inode life-time issue
As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.

This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.

Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-06 11:06:15 +01:00
Eric Biggers
07b24c7c08 crypto: pcrypt - simplify error handling in pcrypt_create_aead()
Simplify the error handling in pcrypt_create_aead() by taking advantage
of crypto_grab_aead() now handling an ERR_PTR() name and by taking
advantage of crypto_drop_aead() now accepting (as a no-op) a spawn that
hasn't been grabbed yet.

This required also making padata_free_shell() accept a NULL argument.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-03-06 12:28:24 +11:00
KP Singh
3e7c67d90e bpf: Fix bpf_prog_test_run_tracing for !CONFIG_NET
test_run.o is not built when CONFIG_NET is not set and
bpf_prog_test_run_tracing being referenced in bpf_trace.o causes the
linker error:

ld: kernel/trace/bpf_trace.o:(.rodata+0x38): undefined reference to
 `bpf_prog_test_run_tracing'

Add a __weak function in bpf_trace.c to handle this.

Fixes: da00d2f117a0 ("bpf: Add test ops for BPF_PROG_TYPE_TRACING")
Signed-off-by: KP Singh <kpsingh@google.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305220127.29109-1-kpsingh@chromium.org
2020-03-05 15:14:58 -08:00
KP Singh
69191754ff bpf: Remove unnecessary CAP_MAC_ADMIN check
While well intentioned, checking CAP_MAC_ADMIN for attaching
BPF_MODIFY_RETURN tracing programs to "security_" functions is not
necessary as tracing BPF programs already require CAP_SYS_ADMIN.

Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305204955.31123-1-kpsingh@chromium.org
2020-03-05 14:27:22 -08:00
Martin KaFai Lau
849b4d9458 bpf: Do not allow map_freeze in struct_ops map
struct_ops map cannot support map_freeze.  Otherwise, a struct_ops
cannot be unregistered from the subsystem.

Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200305013454.535397-1-kafai@fb.com
2020-03-05 14:15:49 -08:00