34700 Commits

Author SHA1 Message Date
Peter Zijlstra
19a1f5ec69 sched: Fix smp_call_function_single_async() usage for ILB
The recent commit: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
got smp_call_function_single_async() subtly wrong. Even though it will
return -EBUSY when trying to re-use a csd, that condition is not
atomic and still requires external serialization.

The change in kick_ilb() got this wrong.

While on first reading kick_ilb() has an atomic test-and-set that
appears to serialize the use, the matching 'release' is not in the
right place to actually guarantee this serialization.

Rework the nohz_idle_balance() trigger so that the release is in the
IPI callback and thus guarantees the required serialization for the
CSD.

Fixes: 90b5363acd47 ("sched: Clean up scheduler_ipi()")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: mgorman@techsingularity.net
Link: https://lore.kernel.org/r/20200526161907.778543557@infradead.org
2020-05-28 10:54:15 +02:00
Ingo Molnar
58ef57b16d Merge branch 'core/rcu' into sched/core, to pick up dependency
We are going to rely on the loosening of RCU callback semantics,
introduced by this commit:

  806f04e9fd2c: ("rcu: Allow for smp_call_function() running callbacks from idle")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 10:52:53 +02:00
Ingo Molnar
498bdcdb94 Merge branch 'sched/urgent' into sched/core, to pick up fix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 10:52:37 +02:00
Peter Zijlstra
806f04e9fd rcu: Allow for smp_call_function() running callbacks from idle
Current RCU hard relies on smp_call_function() callbacks running from
interrupt context. A pending optimization is going to break that, it
will allow idle CPUs to run the callbacks from the idle loop. This
avoids raising the IPI on the requesting CPU and avoids handling an
exception on the receiving CPU.

Change rcu_is_cpu_rrupt_from_idle() to also accept task context,
provided it is the idle task.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20200527171236.GC706495@hirez.programming.kicks-ass.net
2020-05-28 10:50:12 +02:00
Ingo Molnar
4f470fff67 Linux 5.7-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl7K9iEeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGzTAH/0ifZEG4BQ8x/WlB
 8YLSLE6QQTSXYi25nyExuJbFkkKY5Tik8M2HD/36xwY/HnZOlH9jH6m0ntqZxpaA
 3EU9lr1ct79nCBMYhiJssvz8d9AOZXlyogFW9y2y9pmPjlmUtseZ7yGh1xD465cj
 B5Ty2w2W34cs7zF3og2xn5agOJMtWWXLXZ5mRa9EOquKC5zeYyRicmd0T+plYQD6
 hbRYmxFfDfppVnBCBARPNN0+NU5JJD94H+8bOuf1tl48XNrLiZMOicmtohKNQ6+W
 rZNpJNEGEp7KMtqWH0Nl3hmy3yfZHMwe1DXM/AZDqR7jTHZY4mZ0GEpLyfI9AU4n
 34jVHwU=
 =SmJ9
 -----END PGP SIGNATURE-----

Merge tag 'v5.7-rc7' into WIP.locking/core, to refresh the tree

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 10:30:40 +02:00
Ingo Molnar
0bffedbce9 Linux 5.7-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl7K9iEeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGzTAH/0ifZEG4BQ8x/WlB
 8YLSLE6QQTSXYi25nyExuJbFkkKY5Tik8M2HD/36xwY/HnZOlH9jH6m0ntqZxpaA
 3EU9lr1ct79nCBMYhiJssvz8d9AOZXlyogFW9y2y9pmPjlmUtseZ7yGh1xD465cj
 B5Ty2w2W34cs7zF3og2xn5agOJMtWWXLXZ5mRa9EOquKC5zeYyRicmd0T+plYQD6
 hbRYmxFfDfppVnBCBARPNN0+NU5JJD94H+8bOuf1tl48XNrLiZMOicmtohKNQ6+W
 rZNpJNEGEp7KMtqWH0Nl3hmy3yfZHMwe1DXM/AZDqR7jTHZY4mZ0GEpLyfI9AU4n
 34jVHwU=
 =SmJ9
 -----END PGP SIGNATURE-----

Merge tag 'v5.7-rc7' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28 07:58:12 +02:00
Linus Torvalds
3301f6ae2d Merge branch 'for-5.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:

 - Reverted stricter synchronization for cgroup recursive stats which
   was prepping it for event counter usage which never got merged. The
   change was causing performation regressions in some cases.

 - Restore bpf-based device-cgroup operation even when cgroup1 device
   cgroup is disabled.

 - An out-param init fix.

* 'for-5.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  device_cgroup: Cleanup cgroup eBPF device filter code
  xattr: fix uninitialized out-param
  Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"
2020-05-27 10:58:19 -07:00
Domenico Andreoli
ad1e4f74c0 PM: hibernate: Restrict writes to the resume device
Hibernation via snapshot device requires write permission to the swap
block device, the one that more often (but not necessarily) is used to
store the hibernation image.

With this patch, such permissions are granted iff:

 1) snapshot device config option is enabled
 2) swap partition is used as resume device

In other circumstances the swap device is not writable from userspace.

In order to achieve this, every write attempt to a swap device is
checked against the device configured as part of the uswsusp API [0]
using a pointer to the inode struct in memory. If the swap device being
written was not configured for resuming, the write request is denied.

NOTE: this implementation works only for swap block devices, where the
inode configured by swapon (which sets S_SWAPFILE) is the same used
by SNAPSHOT_SET_SWAP_AREA.

In case of swap file, SNAPSHOT_SET_SWAP_AREA indeed receives the inode
of the block device containing the filesystem where the swap file is
located (+ offset in it) which is never passed to swapon and then has
not set S_SWAPFILE.

As result, the swap file itself (as a file) has never an option to be
written from userspace. Instead it remains writable if accessed directly
from the containing block device, which is always writeable from root.

[0] Documentation/power/userland-swsusp.rst

v2:
 - rename is_hibernate_snapshot_dev() to is_hibernate_resume_dev()
 - fix description so to correctly refer to the resume device

Signed-off-by: Domenico Andreoli <domenico.andreoli@linux.com>
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-05-27 17:55:59 +02:00
Zhang Qiang
342ed2400b workqueue: Remove unnecessary kfree() call in rcu_free_wq()
The data structure member "wq->rescuer" was reset to a null pointer
in one if branch. It was passed to a call of the function "kfree"
in the callback function "rcu_free_wq" (which was eventually executed).
The function "kfree" does not perform more meaningful data processing
for a passed null pointer (besides immediately returning from such a call).
Thus delete this function call which became unnecessary with the referenced
software update.

Fixes: def98c84b6cd ("workqueue: Fix spurious sanity check failures in destroy_workqueue()")

Suggested-by: Markus Elfring <Markus.Elfring@web.de>
Signed-off-by: Zhang Qiang <qiang.zhang@windriver.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-27 09:52:41 -04:00
Dan Williams
3234ac664a /dev/mem: Revoke mappings when a driver claims the region
Close the hole of holding a mapping over kernel driver takeover event of
a given address range.

Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
kernel against scenarios where a /dev/mem user tramples memory that a
kernel driver owns. However, this protection only prevents *new* read(),
write() and mmap() requests. Established mappings prior to the driver
calling request_mem_region() are left alone.

Especially with persistent memory, and the core kernel metadata that is
stored there, there are plentiful scenarios for a /dev/mem user to
violate the expectations of the driver and cause amplified damage.

Teach request_mem_region() to find and shoot down active /dev/mem
mappings that it believes it has successfully claimed for the exclusive
use of the driver. Effectively a driver call to request_mem_region()
becomes a hole-punch on the /dev/mem device.

The typical usage of unmap_mapping_range() is part of
truncate_pagecache() to punch a hole in a file, but in this case the
implementation is only doing the "first half" of a hole punch. Namely it
is just evacuating current established mappings of the "hole", and it
relies on the fact that /dev/mem establishes mappings in terms of
absolute physical address offsets. Once existing mmap users are
invalidated they can attempt to re-establish the mapping, or attempt to
continue issuing read(2) / write(2) to the invalidated extent, but they
will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
block those subsequent accesses.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/159009507306.847224.8502634072429766747.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-27 11:10:05 +02:00
Zefan Li
6b6ebb3474 cgroup: Remove stale comments
- The default root is where we can create v2 cgroups.
- The __DEVEL__sane_behavior mount option has been removed long long ago.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-05-26 13:20:24 -04:00
Thomas Gleixner
07325d4a90 rcu: Provide rcu_irq_exit_check_preempt()
Provide a debug check which can be invoked from exception return to kernel
mode before an attempt is made to schedule. Warn if RCU is not ready for
this.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20200521202117.089709607@linutronix.de
2020-05-26 19:05:11 +02:00
Paul E. McKenney
aaf2bc50df rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()
There will likely be exception handlers that can sleep, which rules
out the usual approach of invoking rcu_nmi_enter() on entry and also
rcu_nmi_exit() on all exit paths.  However, the alternative approach of
just not calling anything can prevent RCU from coaxing quiescent states
from nohz_full CPUs that are looping in the kernel:  RCU must instead
IPI them explicitly.  It would be better to enable the scheduler tick
on such CPUs to interact with RCU in a lighter-weight manner, and this
enabling is one of the things that rcu_nmi_enter() currently does.

What is needed is something that helps RCU coax quiescent states while
not preventing subsequent sleeps.  This commit therefore splits out the
nohz_full scheduler-tick enabling from the rest of the rcu_nmi_enter()
logic into a new function named rcu_irq_enter_check_tick().

[ tglx: Renamed the function and made it a nop when context tracking is off ]
[ mingo: Fixed a CONFIG_NO_HZ_FULL assumption, harmonized and fixed all the
         comment blocks and cleaned up rcu_nmi_enter()/exit() definitions. ]

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20200521202116.996113173@linutronix.de
2020-05-26 19:04:18 +02:00
Jens Axboe
18f855e574 sched/fair: Don't NUMA balance for kthreads
Stefano reported a crash with using SQPOLL with io_uring:

  BUG: kernel NULL pointer dereference, address: 00000000000003b0
  CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
  RIP: 0010:task_numa_work+0x4f/0x2c0
  Call Trace:
   task_work_run+0x68/0xa0
   io_sq_thread+0x252/0x3d0
   kthread+0xf9/0x130
   ret_from_fork+0x35/0x40

which is task_numa_work() oopsing on current->mm being NULL.

The task work is queued by task_tick_numa(), which checks if current->mm is
NULL at the time of the call. But this state isn't necessarily persistent,
if the kthread is using use_mm() to temporarily adopt the mm of a task.

Change the task_tick_numa() check to exclude kernel threads in general,
as it doesn't make sense to attempt ot balance for kthreads anyway.

Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk
2020-05-26 18:34:58 +02:00
Arnd Bergmann
502afe7f04 Qualcomm driver updates for v5.8
This contains a large set of cleanups, bug fixes, general improvements
 and documentation fixes for the RPMH driver. It adds a debugfs mechanism
 for inspecting Command DB. Socinfo got the "soc_id" attribute defines
 and definitions for a various variants of MSM8939.
 
 RPMH, RPMPD and RPMHPD where made possible to build as modules, but RPMH
 had to be reverted due to a compilation issue when tracing is enabled.
 
 RPMHPD gained power-domains for the SM8250 voltage corners.
 
 The SCM driver gained fixes for two build warnings and the SMP2P had an
 unnecessary error print removed.
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEEBd4DzF816k8JZtUlCx85Pw2ZrcUFAl7DZcEbHGJqb3JuLmFu
 ZGVyc3NvbkBsaW5hcm8ub3JnAAoJEAsfOT8Nma3FEwIQAK1TrENzsRjB23fY4pEW
 +hN/SkfMjNPsinmyNOHCo03MQzdFIKUl40aNvaHh3foQXaSG4TW12iot9Ul5nsxn
 /u6dCSzl15FK7pHYj/VQPWTz2WvpANVqsm6G5tf43hBg2TnStbK1AsxgJ6aq47fp
 QHehMbfeKpF/gltEowv1b+H7xwNFY7eqlQ9O9umYm3hUQh3Bl5mI6PdbkazDdO1j
 l/vHuQKkZXRdHtD1BxGBfvhPtM4NWDbOPeWWrw8HRFM5muDPgK9mXMRGDcv+fpHq
 4I3670xiTWK1Mfz9+FRBMoxLkIWT6zXILg9aNzuZMTOLoNRt3s6VNdne6uf3naND
 2Nrcu7t0b+4xWGbdwwpiAFZm8C6l+R1WTCiY4iOJXDychhVvyVwOLzPdZ4i8Pzx9
 a4UJDElu8xj2g++oCcjK816IdNwkA46bM7qVz4mkHRUBMkmK9AU7okBA9HQfbP64
 MrKPaUljyZuy7zonpJJBBDKLDEFa9a1pI8sehU8p+MUldxYB/4f/iGC3KzyDvv4Q
 Uzj6AqdP5pjgIxz98YgpOPPl8pwg1c5OgblNLWoHj4yTaqUzT0ZHIRyQO/O/+3Lg
 HwCrSy+f3+tgzts3DhyVRmOkOXFHfflSdf4rfKtbiq435yB2dz60dt77dJ0jQuBV
 M9MiF+SE9oASUupPhs47kNte
 =tkhb
 -----END PGP SIGNATURE-----

Merge tag 'qcom-drivers-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/drivers

Qualcomm driver updates for v5.8

This contains a large set of cleanups, bug fixes, general improvements
and documentation fixes for the RPMH driver. It adds a debugfs mechanism
for inspecting Command DB. Socinfo got the "soc_id" attribute defines
and definitions for a various variants of MSM8939.

RPMH, RPMPD and RPMHPD where made possible to build as modules, but RPMH
had to be reverted due to a compilation issue when tracing is enabled.

RPMHPD gained power-domains for the SM8250 voltage corners.

The SCM driver gained fixes for two build warnings and the SMP2P had an
unnecessary error print removed.

* tag 'qcom-drivers-for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (42 commits)
  Revert "soc: qcom: rpmh: Allow RPMH driver to be loaded as a module"
  soc: qcom: rpmh-rsc: Remove the pm_lock
  soc: qcom: rpmh-rsc: Simplify locking by eliminating the per-TCS lock
  kernel/cpu_pm: Fix uninitted local in cpu_pm
  soc: qcom: rpmh-rsc: We aren't notified of our own failure w/ NOTIFY_BAD
  soc: qcom: rpmh-rsc: Correctly ignore CPU_CLUSTER_PM notifications
  firmware: qcom_scm-legacy: Replace zero-length array with flexible-array
  soc: qcom: rpmh-rsc: Timeout after 1 second in write_tcs_reg_sync()
  soc: qcom: rpmh-rsc: Factor "tcs_reg_addr" and "tcs_cmd_addr" calculation
  soc: qcom: socinfo: add msm8936/39 and apq8036/39 soc ids
  soc: qcom: aoss: Add SM8250 compatible
  soc: qcom: pdr: Remove impossible error condition
  soc: qcom: rpmh: Dirt can only make you dirtier, not cleaner
  soc: qcom: rpmhpd: Add SM8250 power domains
  firmware: qcom_scm: fix bogous abuse of dma-direct internals
  dt-bindings: soc: qcom: apr: Use generic node names for APR services
  firmware: qcom_scm: Remove unneeded conversion to bool
  soc: qcom: cmd-db: Properly endian swap the slv_id for debugfs
  soc: qcom: cmd-db: Use 5 digits for printing address
  soc: qcom: cmd-db: Cast sizeof() to int to silence field width warning
  ...

Link: https://lore.kernel.org/r/20200519052533.1250024-1-bjorn.andersson@linaro.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2020-05-25 23:19:06 +02:00
Greg Kroah-Hartman
344235f557 Merge 5.7-rc7 into tty-next
We need the tty/serial fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-25 13:22:05 +02:00
Mel Gorman
2ebb177175 sched/core: Offload wakee task activation if it the wakee is descheduling
The previous commit:

  c6e7bd7afaeb: ("sched/core: Optimize ttwu() spinning on p->on_cpu")

avoids spinning on p->on_rq when the task is descheduling, but only if the
wakee is on a CPU that does not share cache with the waker.

This patch offloads the activation of the wakee to the CPU that is about to
go idle if the task is the only one on the runqueue. This potentially allows
the waker task to continue making progress when the wakeup is not strictly
synchronous.

This is very obvious with netperf UDP_STREAM running on localhost. The
waker is sending packets as quickly as possible without waiting for any
reply. It frequently wakes the server for the processing of packets and
when netserver is using local memory, it quickly completes the processing
and goes back to idle. The waker often observes that netserver is on_rq
and spins excessively leading to a drop in throughput.

This is a comparison of 5.7-rc6 against "sched: Optimize ttwu() spinning
on p->on_cpu" and against this patch labeled vanilla, optttwu-v1r1 and
localwakelist-v1r2 respectively.

                                  5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                    vanilla           optttwu-v1r1     localwakelist-v1r2
Hmean     send-64         251.49 (   0.00%)      258.05 *   2.61%*      305.59 *  21.51%*
Hmean     send-128        497.86 (   0.00%)      519.89 *   4.43%*      600.25 *  20.57%*
Hmean     send-256        944.90 (   0.00%)      997.45 *   5.56%*     1140.19 *  20.67%*
Hmean     send-1024      3779.03 (   0.00%)     3859.18 *   2.12%*     4518.19 *  19.56%*
Hmean     send-2048      7030.81 (   0.00%)     7315.99 *   4.06%*     8683.01 *  23.50%*
Hmean     send-3312     10847.44 (   0.00%)    11149.43 *   2.78%*    12896.71 *  18.89%*
Hmean     send-4096     13436.19 (   0.00%)    13614.09 (   1.32%)    15041.09 *  11.94%*
Hmean     send-8192     22624.49 (   0.00%)    23265.32 *   2.83%*    24534.96 *   8.44%*
Hmean     send-16384    34441.87 (   0.00%)    36457.15 *   5.85%*    35986.21 *   4.48%*

Note that this benefit is not universal to all wakeups, it only applies
to the case where the waker often spins on p->on_rq.

The impact can be seen from a "perf sched latency" report generated from
a single iteration of one packet size:

   -----------------------------------------------------------------------------------------------------------------
    Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |
   -----------------------------------------------------------------------------------------------------------------

  vanilla
    netperf:4337          |  21709.193 ms |     2932 | avg:    0.002 ms | max:    0.041 ms | max at:    112.154512 s
    netserver:4338        |  14629.459 ms |  5146990 | avg:    0.001 ms | max: 1615.864 ms | max at:    140.134496 s

  localwakelist-v1r2
    netperf:4339          |  29789.717 ms |     2460 | avg:    0.002 ms | max:    0.059 ms | max at:    138.205389 s
    netserver:4340        |  18858.767 ms |  7279005 | avg:    0.001 ms | max:    0.362 ms | max at:    135.709683 s
   -----------------------------------------------------------------------------------------------------------------

Note that the average wakeup delay is quite small on both the vanilla
kernel and with the two patches applied. However, there are significant
outliers with the vanilla kernel with the maximum one measured as 1615
milliseconds with a vanilla kernel but never worse than 0.362 ms with
both patches applied and a much higher rate of context switching.

Similarly a separate profile of cycles showed that 2.83% of all cycles
were spent in try_to_wake_up() with almost half of the cycles spent
on spinning on p->on_rq. With the two patches, the percentage of cycles
spent in try_to_wake_up() drops to 1.13%

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: valentin.schneider@arm.com
Cc: Hillf Danton <hdanton@sina.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20200524202956.27665-3-mgorman@techsingularity.net
2020-05-25 07:04:10 +02:00
Peter Zijlstra
c6e7bd7afa sched/core: Optimize ttwu() spinning on p->on_cpu
Both Rik and Mel reported seeing ttwu() spend significant time on:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

Attempt to avoid this by queueing the wakeup on the CPU that owns the
p->on_cpu value. This will then allow the ttwu() to complete without
further waiting.

Since we run schedule() with interrupts disabled, the IPI is
guaranteed to happen after p->on_cpu is cleared, this is what makes it
safe to queue early.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: valentin.schneider@arm.com
Cc: Hillf Danton <hdanton@sina.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20200524202956.27665-2-mgorman@techsingularity.net
2020-05-25 07:01:44 +02:00
David S. Miller
13209a8f73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24 13:47:27 -07:00
Linus Torvalds
9e61d12bac A set of fixes for the scheduler:
- Fix handling of throttled parents in enqueue_task_fair() completely. The
    recent fix overlooked a corner case where the first iteration terminates
    do a entiry being on rq which makes the list management incomplete and
    later triggers the assertion which checks for completeness.
 
  - Fix a similar problem in unthrottle_cfs_rq().
 
  - Show the correct uclamp values in procfs which prints the effective
    value twice instead of requested and effective.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7Ki3QTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVA5D/9d3ajxbUD7Qzyr9LW4dQ/GLdVrLC+h
 rav3KskFzF+v9qyEz5j6ZUEe6ZtGyqk3n+xlVCoBjEZonygaZ3A9rjn5p6p+junR
 ueUHQM9BYOQ2qX/rnQubpgzphfT2BdNXtrT0aWCQdXjbDwTJcDhS8AMjDAOnPntc
 Pj7brhP/lPtFF2JubBshlJGWCHdriALOJGyFw+FftBq6yotsFel/EH9i/5++JSFX
 3DmcFZXJMIf7cRk9xWzmP2QoOkyV6KU9FD9zlRxpvLOskT7qJ7HXaCgFLHl/HZPD
 Ukqmy8Ua8hGbKseqfuyz9vV/xJazdiEyqPOSnd8wJzk5umw2rplknOk2qpJ+Infv
 MLjnfgrjNtBvgCq4lHnmGeTvdjktLPuPusQIVjZzqis6RryZiNsNspuE5QPZP4Fh
 87/rfYQ7VoAGTRmeC9+t4X0BSnpkT4KkQvWCHtocmzl4nuym0cgfZxOCrnRW+NEN
 LeDzgujT8uRaDyeZcTwg6pysfxR2Kod6mWomnT9t17ZD93EaY/FEPUL+g8+VbDyD
 mAu1BiENX3DL5erZKBHcvHbniYVkcZ/l/cgX6o2wcFLuREaEEbHgsXYHdLqZd9QY
 eNwbldtPlagy+f2jla84O1/fFcM1za/R6V9Mbb+/86xJeNu2R/+Vv0FiOqpmHW24
 G9UW+hQcQ+ZxgA==
 =akes
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
 "A set of fixes for the scheduler:

   - Fix handling of throttled parents in enqueue_task_fair() completely.

     The recent fix overlooked a corner case where the first iteration
     terminates due to an entity already being on the runqueue which
     makes the list management incomplete and later triggers the
     assertion which checks for completeness.

   - Fix a similar problem in unthrottle_cfs_rq().

   - Show the correct uclamp values in procfs which prints the effective
     value twice instead of requested and effective"

* tag 'sched-urgent-2020-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
  sched/debug: Fix requested task uclamp values shown in procfs
  sched/fair: Fix enqueue_task_fair() warning some more
2020-05-24 10:14:58 -07:00
Shreyas Joshi
48021f9813 printk: handle blank console arguments passed in.
If uboot passes a blank string to console_setup then it results in
a trashed memory. Ultimately, the kernel crashes during freeing up
the memory.

This fix checks if there is a blank parameter being
passed to console_setup from uboot. In case it detects that
the console parameter is blank then it doesn't setup the serial
device and it gracefully exits.

Link: https://lore.kernel.org/r/20200522065306.83-1-shreyas.joshi@biamp.com
Signed-off-by: Shreyas Joshi <shreyas.joshi@biamp.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
[pmladek@suse.com: Better format the commit message and code, remove unnecessary brackets.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-05-22 10:34:34 +02:00
John Fastabend
cac616db39 bpf: Verifier track null pointer branch_taken with JNE and JEQ
Currently, when considering the branches that may be taken for a jump
instruction if the register being compared is a pointer the verifier
assumes both branches may be taken. But, if the jump instruction
is comparing if a pointer is NULL we have this information in the
verifier encoded in the reg->type so we can do better in these cases.
Specifically, these two common cases can be handled.

 * If the instruction is BPF_JEQ and we are comparing against a
   zero value. This test is 'if ptr == 0 goto +X' then using the
   type information in reg->type we can decide if the ptr is not
   null. This allows us to avoid pushing both branches onto the
   stack and instead only use the != 0 case. For example
   PTR_TO_SOCK and PTR_TO_SOCK_OR_NULL encode the null pointer.
   Note if the type is PTR_TO_SOCK_OR_NULL we can not learn anything.
   And also if the value is non-zero we learn nothing because it
   could be any arbitrary value a different pointer for example

 * If the instruction is BPF_JNE and ware comparing against a zero
   value then a similar analysis as above can be done. The test in
   asm looks like 'if ptr != 0 goto +X'. Again using the type
   information if the non null type is set (from above PTR_TO_SOCK)
   we know the jump is taken.

In this patch we extend is_branch_taken() to consider this extra
information and to return only the branch that will be taken. This
resolves a verifier issue reported with C code like the following.
See progs/test_sk_lookup_kern.c in selftests.

 sk = bpf_sk_lookup_tcp(skb, tuple, tuple_len, BPF_F_CURRENT_NETNS, 0);
 bpf_printk("sk=%d\n", sk ? 1 : 0);
 if (sk)
   bpf_sk_release(sk);
 return sk ? TC_ACT_OK : TC_ACT_UNSPEC;

In the above the bpf_printk() will resolve the pointer from
PTR_TO_SOCK_OR_NULL to PTR_TO_SOCK. Then the second test guarding
the release will cause the verifier to walk both paths resulting
in the an unreleased sock reference. See verifier/ref_tracking.c
in selftests for an assembly version of the above.

After the above additional logic is added the C code above passes
as expected.

Reported-by: Andrey Ignatov <rdna@fb.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159009164651.6313.380418298578070501.stgit@john-Precision-5820-Tower
2020-05-21 17:44:25 -07:00
Björn Töpel
d20a1676df xsk: Move xskmap.c to net/xdp/
The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from
kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related
code. Also, move AF_XDP struct definitions, and function declarations
only used by AF_XDP internals into net/xdp/xsk.h.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
J. Bruce Fields
6670ee2ef2 Merge branch 'nfsd-5.8' of git://linux-nfs.org/~cel/cel-2.6 into for-5.8-incoming
Highlights of this series:
* Remove serialization of sending RPC/RDMA Replies
* Convert the TCP socket send path to use xdr_buf::bvecs (pre-requisite for
RPC-on-TLS)
* Fix svcrdma backchannel sendto return code
* Convert a number of dprintk call sites to use tracepoints
* Fix the "suggest braces around empty body in an 'else' statement" warning
2020-05-21 10:58:15 -04:00
Bruno Meneguele
8ece3b3eb5 kernel/printk: add kmsg SEEK_CUR handling
Userspace libraries, e.g. glibc's dprintf(), perform a SEEK_CUR operation
over any file descriptor requested to make sure the current position isn't
pointing to junk due to previous manipulation of that same fd. And whenever
that fd doesn't have support for such operation, the userspace code expects
-ESPIPE to be returned.

However, when the fd in question references the /dev/kmsg interface, the
current kernel code state returns -EINVAL instead, causing an unexpected
behavior in userspace: in the case of glibc, when -ESPIPE is returned it
gets ignored and the call completes successfully, while returning -EINVAL
forces dprintf to fail without performing any action over that fd:

  if (_IO_SEEKOFF (fp, (off64_t)0, _IO_seek_cur, _IOS_INPUT|_IOS_OUTPUT) ==
  _IO_pos_BAD && errno != ESPIPE)
    return NULL;

With this patch we make sure to return the correct value when SEEK_CUR is
requested over kmsg and also add some kernel doc information to formalize
this behavior.

Link: https://lore.kernel.org/r/20200317103344.574277-1-bmeneg@redhat.com
Cc: linux-kernel@vger.kernel.org
Cc: rostedt@goodmis.org,
Cc: David.Laight@ACULAB.COM
Signed-off-by: Bruno Meneguele <bmeneg@redhat.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-05-21 13:32:25 +02:00
Ethon Paul
325606af57 printk: Fix a typo in comment "interator"->"iterator"
There is a typo in comment, fix it.

Signed-off-by: Ethon Paul <ethp@qq.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-05-21 13:31:33 +02:00
Andy Shevchenko
9ed78b05f9 irqdomain: Allow software nodes for IRQ domain creation
In some cases we need to have an IRQ domain created out of software node.

One of such cases is DesignWare GPIO driver when it's instantiated from
half-baked ACPI table (alas, we can't fix it for devices which are few years
on market) and thus using software nodes to quirk this. But the driver
is using IRQ domains based on per GPIO port firmware nodes, which are in
the above case software ones. This brings a warning message to be printed

  [   73.957183] irq: Invalid fwnode type for irqdomain

and creates an anonymous IRQ domain without a debugfs entry.

Allowing software nodes to be valid for IRQ domains rids us of the warning
and debugs gets correctly populated.

  % ls -1 /sys/kernel/debug/irq/domains/
  ...
  intel-quark-dw-apb-gpio:portA

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[maz: refactored commit message]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200520164927.39090-3-andriy.shevchenko@linux.intel.com
2020-05-21 10:53:17 +01:00
Andy Shevchenko
87526603c8 irqdomain: Get rid of special treatment for ACPI in __irq_domain_add()
Now that __irq_domain_add() is able to better deals with generic
fwnodes, there is no need to special-case ACPI anymore.

Get rid of the special treatment for ACPI.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200520164927.39090-2-andriy.shevchenko@linux.intel.com
2020-05-21 10:51:50 +01:00
Andy Shevchenko
181e9d4efa irqdomain: Make __irq_domain_add() less OF-dependent
__irq_domain_add() relies in some places on the fact that the fwnode
can be only of type OF. This prevents refactoring of the code to support
other types of fwnode. Make it less OF-dependent by switching it
to use the fwnode directly where it makes sense.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20200520164927.39090-1-andriy.shevchenko@linux.intel.com
2020-05-21 10:50:30 +01:00
Andrii Nakryiko
dfeb376dd4 bpf: Prevent mmap()'ing read-only maps as writable
As discussed in [0], it's dangerous to allow mapping BPF map, that's meant to
be frozen and is read-only on BPF program side, because that allows user-space
to actually store a writable view to the page even after it is frozen. This is
exacerbated by BPF verifier making a strong assumption that contents of such
frozen map will remain unchanged. To prevent this, disallow mapping
BPF_F_RDONLY_PROG mmap()'able BPF maps as writable, ever.

  [0] https://lore.kernel.org/bpf/CAEf4BzYGWYhXdp6BJ7_=9OQPJxQpgug080MMjdSB72i9R+5c6g@mail.gmail.com/

Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/bpf/20200519053824.1089415-1-andriin@fb.com
2020-05-20 20:21:53 -07:00
Richard Guy Briggs
9d44a121c5 audit: add subj creds to NETFILTER_CFG record to
Some table unregister actions seem to be initiated by the kernel to
garbage collect unused tables that are not initiated by any userspace
actions.  It was found to be necessary to add the subject credentials to
cover this case to reveal the source of these actions.  A sample record:

The uid, auid, tty, ses and exe fields have not been included since they
are in the SYSCALL record and contain nothing useful in the non-user
context.

Here are two sample orphaned records:

  type=NETFILTER_CFG msg=audit(2020-05-20 12:14:36.505:5) : table=filter family=ipv4 entries=0 op=register pid=1 subj=kernel comm=swapper/0

  type=NETFILTER_CFG msg=audit(2020-05-20 12:15:27.701:301) : table=nat family=bridge entries=0 op=unregister pid=30 subj=system_u:system_r:kernel_t:s0 comm=kworker/u4:1

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-05-20 18:09:19 -04:00
Eric W. Biederman
87b047d2be exec: Teach prepare_exec_creds how exec treats uids & gids
It is almost possible to use the result of prepare_exec_creds with no
modifications during exec.  Update prepare_exec_creds to initialize
the suid and the fsuid to the euid, and the sgid and the fsgid to the
egid.  This is all that is needed to handle the common case of exec
when nothing special like a setuid exec is happening.

That this preserves the existing behavior of exec can be verified
by examing bprm_fill_uid and cap_bprm_set_creds.

This change makes it clear that the later parts of exec that
update bprm->cred are just need to handle special cases such
as setuid exec and change of domains.

Link: https://lkml.kernel.org/r/871rng22dm.fsf_-_@x220.int.ebiederm.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-05-20 14:44:21 -05:00
Christoph Hellwig
c928f642c2 fs: rename pipe_buf ->steal to ->try_steal
And replace the arcane return value convention with a simple bool
where true means success and false means failure.

[AV: braino fix folded in]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:14:10 -04:00
Christoph Hellwig
b8d9e7f241 fs: make the pipe_buf_operations ->confirm operation optional
Just return 0 for success if it is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Christoph Hellwig
76887c2567 fs: make the pipe_buf_operations ->steal operation optional
Just return 1 for failure if it is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Christoph Hellwig
6797d97ab9 trace: remove tracing_pipe_buf_ops
tracing_pipe_buf_ops has identical ops to default_pipe_buf_ops, so use
that instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Michael Ellerman
217ba7dcce Merge branch 'topic/uaccess-ppc' into next
Merge our uaccess-ppc topic branch. It is based on the uaccess topic
branch that we're sharing with Viro.

This includes the addition of user_[read|write]_access_begin(), as
well as some powerpc specific changes to our uaccess routines that
would conflict badly if merged separately.
2020-05-20 23:37:33 +10:00
Paolo Bonzini
9d5272f5e3 Merge tag 'noinstr-x86-kvm-2020-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD 2020-05-20 03:40:09 -04:00
Julia Lawall
fc9d276f22 tracing/probe: reverse arguments to list_add
Elsewhere in the file, the function trace_kprobe_has_same_kprobe uses
a trace_probe_event.probes object as the second argument of
list_for_each_entry, ie as a list head, while the list_for_each_entry
iterates over the list fields of the trace_probe structures, making
them the list elements.  So, exchange the arguments on the list_add
call to put the list head in the second argument.

Since both list_head structures were just initialized, this problem
did not cause any loss of information.

Link: https://lkml.kernel.org/r/1588879808-24488-1-git-send-email-Julia.Lawall@inria.fr

Fixes: 60d53e2c3b75 ("tracing/probe: Split trace_event related data from trace_probe")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-19 21:10:50 -04:00
Cheng Jian
c143b7753b ftrace: show debugging information when panic_on_warn set
When an anomaly is detected in the function call modification
code, ftrace_bug() is called to disable function tracing as well
as give some warn and information that may help debug the problem.

But currently, we call FTRACE_WARN_ON_ONCE() first in ftrace_bug(),
so when panic_on_warn is set, we can't see the debugging information
here. Call FTRACE_WARN_ON_ONCE() at the end of ftrace_bug() to ensure
that the debugging information is displayed first.

after this patch, the dmesg looks like:

	------------[ ftrace bug ]------------
	ftrace failed to modify
	[<ffff800010081004>] bcm2835_handle_irq+0x4/0x58
	 actual:   1f:20:03:d5
	Setting ftrace call site to call ftrace function
	ftrace record flags: 80000001
	 (1)
	 expected tramp: ffff80001009d6f0
	------------[ cut here ]------------
	WARNING: CPU: 2 PID: 1635 at kernel/trace/ftrace.c:2078 ftrace_bug+0x204/0x238
	Kernel panic - not syncing: panic_on_warn set ...
	CPU: 2 PID: 1635 Comm: sh Not tainted 5.7.0-rc5-00033-gb922183867f5 #14
	Hardware name: linux,dummy-virt (DT)
	Call trace:
	 dump_backtrace+0x0/0x1b0
	 show_stack+0x20/0x30
	 dump_stack+0xc0/0x10c
	 panic+0x16c/0x368
	 __warn+0x120/0x160
	 report_bug+0xc8/0x160
	 bug_handler+0x28/0x98
	 brk_handler+0x70/0xd0
	 do_debug_exception+0xcc/0x1ac
	 el1_sync_handler+0xe4/0x120
	 el1_sync+0x7c/0x100
	 ftrace_bug+0x204/0x238

Link: https://lkml.kernel.org/r/20200515100828.7091-1-cj.chengjian@huawei.com

Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-05-19 21:08:01 -04:00
Gustavo A. R. Silva
db78538c75 locking/lockdep: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200507185804.GA15036@embeddedor
2020-05-19 20:34:18 +02:00
Gustavo A. R. Silva
c50c75e9b8 perf/core: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200511201227.GA14041@embeddedor
2020-05-19 20:34:16 +02:00
Huaixin Chang
d505b8af58 sched: Defend cfs and rt bandwidth quota against overflow
When users write some huge number into cpu.cfs_quota_us or
cpu.rt_runtime_us, overflow might happen during to_ratio() shifts of
schedulable checks.

to_ratio() could be altered to avoid unnecessary internal overflow, but
min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still
be needed. Set a cap MAX_BW for cfs_quota_us and rt_runtime_us to
prevent overflow.

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200425105248.60093-1-changhuaixin@linux.alibaba.com
2020-05-19 20:34:14 +02:00
Muchun Song
dbe9337109 sched/cpuacct: Fix charge cpuacct.usage_sys
The user_mode(task_pt_regs(tsk)) always return true for
user thread, and false for kernel thread. So it means that
the cpuacct.usage_sys is the time that kernel thread uses
not the time that thread uses in the kernel mode. We can
try get_irq_regs() first, if it is NULL, then we can fall
back to task_pt_regs().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200420070453.76815-1-songmuchun@bytedance.com
2020-05-19 20:34:14 +02:00
Gustavo A. R. Silva
04f5c362ec sched/fair: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200507192141.GA16183@embeddedor
2020-05-19 20:34:14 +02:00
Vincent Guittot
95d685935a sched/pelt: Sync util/runnable_sum with PELT window when propagating
update_tg_cfs_*() propagate the impact of the attach/detach of an entity
down into the cfs_rq hierarchy and must keep the sync with the current pelt
window.

Even if we can't sync child cfs_rq and its group se, we can sync the group
se and its parent cfs_rq with current position in the PELT window. In fact,
we must keep them sync in order to stay also synced with others entities
and group entities that are already attached to the cfs_rq.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200506155301.14288-1-vincent.guittot@linaro.org
2020-05-19 20:34:14 +02:00
Muchun Song
12aa258738 sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()
The cpuacct_charge() and cpuacct_account_field() are called with
rq->lock held, and this means preemption(and IRQs) are indeed
disabled, so it is safe to use __this_cpu_*() to allow for better
code-generation.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200507031039.32615-1-songmuchun@bytedance.com
2020-05-19 20:34:13 +02:00
Vincent Guittot
7d148be69e sched/fair: Optimize enqueue_task_fair()
enqueue_task_fair jumps to enqueue_throttle label when cfs_rq_of(se) is
throttled which means that se can't be NULL in such case and we can move
the label after the if (!se) statement. Futhermore, the latter can be
removed because se is always NULL when reaching this point.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200513135502.4672-1-vincent.guittot@linaro.org
2020-05-19 20:34:13 +02:00
Peter Zijlstra
9013196a46 Merge branch 'sched/urgent' 2020-05-19 20:34:12 +02:00
Vincent Guittot
39f23ce07b sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list
Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair()
are quite close and follow the same sequence for enqueuing an entity in the
cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as
enqueue_task_fair(). This fixes a problem already faced with the latter and
add an optimization in the last for_each_sched_entity loop.

Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
Reported-by Tao Zhou <zohooouoto@zoho.com.cn>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org
2020-05-19 20:34:10 +02:00