29420 Commits

Author SHA1 Message Date
Quentin Perret
1f74de8798 sched/toplogy: Introduce the 'sched_energy_present' static key
In order to make sure Energy Aware Scheduling (EAS) will not impact
systems where no Energy Model is available, introduce a static key
guarding the access to EAS code. Since EAS is enabled on a
per-root-domain basis, the static key is enabled when at least one root
domain meets all conditions for EAS.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: rjw@rjwysocki.net
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-10-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:17:01 +01:00
Quentin Perret
531b5c9f5c sched/topology: Make Energy Aware Scheduling depend on schedutil
Energy Aware Scheduling (EAS) is designed with the assumption that
frequencies of CPUs follow their utilization value. When using a CPUFreq
governor other than schedutil, the chances of this assumption being true
are small, if any. When schedutil is being used, EAS' predictions are at
least consistent with the frequency requests. Although those requests
have no guarantees to be honored by the hardware, they should at least
guide DVFS in the right direction and provide some hope in regards to the
EAS model being accurate.

To make sure EAS is only used in a sane configuration, create a strong
dependency on schedutil being used. Since having sugov compiled-in does
not provide that guarantee, make CPUFreq call a scheduler function on
governor changes hence letting it rebuild the scheduling domains, check
the governors of the online CPUs, and enable/disable EAS accordingly.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-9-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:17:00 +01:00
Quentin Perret
b68a4c0dba sched/topology: Disable EAS on inappropriate platforms
Energy Aware Scheduling (EAS) in its current form is most relevant on
platforms with asymmetric CPU topologies (e.g. Arm big.LITTLE) since
this is where there is a lot of potential for saving energy through
scheduling. This is particularly true since the Energy Model only
includes the active power costs of CPUs, hence not providing enough data
to compare packing-vs-spreading strategies.

As such, disable EAS on root domains where the SD_ASYM_CPUCAPACITY flag
is not set. While at it, disable EAS on systems where the complexity of
the Energy Model is too high since that could lead to unacceptable
scheduling overhead.

All in all, EAS can be used on a root domain if and only if:
  1. an Energy Model is available;
  2. the root domain has an asymmetric CPU capacity topology;
  3. the complexity of the root domain's EM is low enough to keep
     scheduling overheads low.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: rjw@rjwysocki.net
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-8-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:17:00 +01:00
Quentin Perret
011b27bb5d sched/topology: Add lowest CPU asymmetry sched_domain level pointer
Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_asym_cpucapacity, points to the lowest level
at which the SD_ASYM_CPUCAPACITY flag is set. While at it, rename the
sd_asym shortcut to sd_asym_packing to avoid confusions.

Generally speaking, the largest opportunity to save energy via
scheduling comes from a smarter exploitation of heterogeneous platforms
(i.e. big.LITTLE). Consequently, the sd_asym_cpucapacity shortcut will
be used at first as the lowest domain where Energy-Aware Scheduling
(EAS) should be applied. For example, it is possible to apply EAS within
a socket on a multi-socket system, as long as each socket has an
asymmetric topology. Energy-aware cross-sockets wake-up balancing will
only happen when the system is over-utilized, or this_cpu and prev_cpu
are in different sockets.

Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: rjw@rjwysocki.net
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-7-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:16:59 +01:00
Quentin Perret
6aa140fa45 sched/topology: Reference the Energy Model of CPUs when available
The existing scheduling domain hierarchy is defined to map to the cache
topology of the system. However, Energy Aware Scheduling (EAS) requires
more knowledge about the platform, and specifically needs to know about
the span of Performance Domains (PD), which do not always align with
caches.

To address this issue, use the Energy Model (EM) of the system to extend
the scheduler topology code with a representation of the PDs, alongside
the scheduling domains. More specifically, a linked list of PDs is
attached to each root domain. When multiple root domains are in use,
each list contains only the PDs covering the CPUs of its root domain. If
a PD spans over CPUs of multiple different root domains, it will be
duplicated in all lists.

The lists are fully maintained by the scheduler from
partition_sched_domains() in order to cope with hotplug and cpuset
changes. As for scheduling domains, the list are protected by RCU to
ensure safe concurrent updates.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: rjw@rjwysocki.net
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-6-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:16:59 +01:00
Quentin Perret
27871f7a8a PM: Introduce an Energy Model management framework
Several subsystems in the kernel (task scheduler and/or thermal at the
time of writing) can benefit from knowing about the energy consumed by
CPUs. Yet, this information can come from different sources (DT or
firmware for example), in different formats, hence making it hard to
exploit without a standard API.

As an attempt to address this, introduce a centralized Energy Model
(EM) management framework which aggregates the power values provided
by drivers into a table for each performance domain in the system. The
power cost tables are made available to interested clients (e.g. task
scheduler or thermal) via platform-agnostic APIs. The overall design
is represented by the diagram below (focused on Arm-related drivers as
an example, but applicable to any architecture):

     +---------------+  +-----------------+  +-------------+
     | Thermal (IPA) |  | Scheduler (EAS) |  |    Other    |
     +---------------+  +-----------------+  +-------------+
             |                   | em_pd_energy()   |
             |                   | em_cpu_get()     |
             +-----------+       |         +--------+
                         |       |         |
                         v       v         v
                      +---------------------+
                      |                     |
                      |    Energy Model     |
                      |                     |
                      |     Framework       |
                      |                     |
                      +---------------------+
                         ^       ^       ^
                         |       |       | em_register_perf_domain()
              +----------+       |       +---------+
              |                  |                 |
      +---------------+  +---------------+  +--------------+
      |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
      +---------------+  +---------------+  +--------------+
              ^                  ^                 ^
              |                  |                 |
      +--------------+   +---------------+  +--------------+
      | Device Tree  |   |   Firmware    |  |      ?       |
      +--------------+   +---------------+  +--------------+

Drivers (typically, but not limited to, CPUFreq drivers) can register
data in the EM framework using the em_register_perf_domain() API. The
calling driver must provide a callback function with a standardized
signature that will be used by the EM framework to build the power
cost tables of the performance domain. This design should offer a lot of
flexibility to calling drivers which are free of reading information
from any location and to use any technique to compute power costs.
Moreover, the capacity states registered by drivers in the EM framework
are not required to match real performance states of the target. This
is particularly important on targets where the performance states are
not known by the OS.

The power cost coefficients managed by the EM framework are specified in
milli-watts. Although the two potential users of those coefficients (IPA
and EAS) only need relative correctness, IPA specifically needs to
compare the power of CPUs with the power of other components (GPUs, for
example), which are still expressed in absolute terms in their
respective subsystems. Hence, specifying the power of CPUs in
milli-watts should help transitioning IPA to using the EM framework
without introducing new problems by keeping units comparable across
sub-systems.
On the longer term, the EM of other devices than CPUs could also be
managed by the EM framework, which would enable to remove the absolute
unit. However, this is not absolutely required as a first step, so this
extension of the EM framework is left for later.

On the client side, the EM framework offers APIs to access the power
cost tables of a CPU (em_cpu_get()), and to estimate the energy
consumed by the CPUs of a performance domain (em_pd_energy()). Clients
such as the task scheduler can then use these APIs to access the shared
data structures holding the Energy Model of CPUs.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-4-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:16:58 +01:00
Quentin Perret
938e5e4b0d sched/cpufreq: Prepare schedutil for Energy Aware Scheduling
Schedutil requests frequency by aggregating utilization signals from
the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of
them. Since Energy Aware Scheduling (EAS) needs to be able to predict
the frequency requests, it needs to forecast the decisions made by the
governor.

In order to prepare the introduction of EAS, introduce
schedutil_freq_util() to centralize the aforementioned signal
aggregation and make it available to both schedutil and EAS. Since
frequency selection and energy estimation still need to deal with RT and
DL signals slightly differently, schedutil_freq_util() is called with a
different 'type' parameter in those two contexts, and returns an
aggregated utilization signal accordingly. While at it, introduce the
map_util_freq() function which is designed to make schedutil's 25%
margin usable easily for both sugov and EAS.

As EAS will be able to predict schedutil's frequency requests more
accurately than any other governor by design, it'd be sensible to make
sure EAS cannot be used without schedutil. This will be done later, once
EAS has actually been introduced.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: rjw@rjwysocki.net
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-3-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:16:58 +01:00
Quentin Perret
5bd0988be1 sched/topology: Relocate arch_scale_cpu_capacity() to the internal header
By default, arch_scale_cpu_capacity() is only visible from within the
kernel/sched folder. Relocate it to include/linux/sched/topology.h to
make it visible to other clients needing to know about the capacity of
CPUs, such as the Energy Model framework.

This also shrinks the <linux/sched/topology.h> public header.

Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: adharmap@codeaurora.org
Cc: chris.redpath@arm.com
Cc: currojerez@riseup.net
Cc: dietmar.eggemann@arm.com
Cc: edubezval@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: javi.merino@kernel.org
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: patrick.bellasi@arm.com
Cc: pkondeti@codeaurora.org
Cc: rjw@rjwysocki.net
Cc: skannan@codeaurora.org
Cc: smuckle@google.com
Cc: srinivas.pandruvada@linux.intel.com
Cc: thara.gopinath@linaro.org
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Cc: viresh.kumar@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-2-quentin.perret@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:16:58 +01:00
Yangtao Li
9ebc605381 sched/core: Remove unnecessary unlikely() in push_*_task()
WARN_ON() already contains an unlikely(), so it's not necessary to
use WARN_ON(1).

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20181103172602.1917-1-tiny.windzz@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:16:57 +01:00
Vincent Guittot
765d0af19f sched/topology: Remove the ::smt_gain field from 'struct sched_domain'
::smt_gain is used to compute the capacity of CPUs of a SMT core with the
constraint 1 < ::smt_gain < 2 in order to be able to compute number of CPUs
per core. The field has_free_capacity of struct numa_stat, which was the
last user of this computation of number of CPUs per core, has been removed
by:

  2d4056fafa19 ("sched/numa: Remove numa_has_capacity()")

We can now remove this constraint on core capacity and use the defautl value
SCHED_CAPACITY_SCALE for SMT CPUs. With this remove, SCHED_CAPACITY_SCALE
becomes the maximum compute capacity of CPUs on every systems. This should
help to simplify some code and remove fields like rd->max_cpu_capacity

Furthermore, arch_scale_cpu_capacity() is used with a NULL sd in several other
places in the code when it wants the capacity of a CPUs to scale
some metrics like in pelt, deadline or schedutil. In case on SMT, the value
returned is not the capacity of SMT CPUs but default SCHED_CAPACITY_SCALE.

So remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1535548752-4434-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 15:16:57 +01:00
Andrea Parri
80eb865768 sched/fair: Clean up comment in nohz_idle_balance()
Concerning the comment associated to the atomic_fetch_andnot() in
nohz_idle_balance(), Vincent explains [1]:

  "[...] the comment is useless and can be removed [...]  it was
   referring to a line code above the comment that was present in
   a previous iteration of the patchset. This line disappeared in
   final version but the comment has stayed."

So remove the comment.

Vincent also points out that the full ordering associated to the
atomic_fetch_andnot() primitive could be relaxed, but this patch
insists on the current more conservative/fully ordered solution:

"Performance" isn't a concern, stay away from "correctness"/subtle
relaxed (re)ordering if possible..., just make sure not to confuse
the next reader with misleading/out-of-date comments.

[1] http://lkml.kernel.org/r/CAKfTPtBjA-oCBRkO6__npQwL3+HLjzk7riCcPU1R7YdO-EpuZg@mail.gmail.com

Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20181127110110.5533-1-andrea.parri@amarulasolutions.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 14:54:57 +01:00
Bart Van Assche
fe27b0de8d locking/lockdep: Stop using RCU primitives to access 'all_lock_classes'
Due to the previous patch all code that accesses the 'all_lock_classes'
list holds the graph lock. Hence use regular list primitives instead of
their RCU variants to access this list.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: johannes.berg@intel.com
Cc: tj@kernel.org
Link: https://lkml.kernel.org/r/20181207011148.251812-14-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 14:54:56 +01:00
Bart Van Assche
786fa29e9c locking/lockdep: Make concurrent lockdep_reset_lock() calls safe
Since zap_class() removes items from the all_lock_classes list and the
classhash_table, protect all zap_class() calls against concurrent
data structure modifications with the graph lock.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: johannes.berg@intel.com
Cc: tj@kernel.org
Link: https://lkml.kernel.org/r/20181207011148.251812-13-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 14:54:55 +01:00
Bart Van Assche
a66b6922dc locking/lockdep: Remove a superfluous INIT_LIST_HEAD() statement
Initializing a list entry just before it is passed to list_add_tail_rcu()
is not necessary because list_add_tail_rcu() overwrites the next and prev
pointers anyway. Hence remove the INIT_LIST_HEAD() statement.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: johannes.berg@intel.com
Cc: tj@kernel.org
Link: https://lkml.kernel.org/r/20181207011148.251812-12-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 14:54:54 +01:00
Bart Van Assche
2904d9fa45 locking/lockdep: Introduce lock_class_cache_is_registered()
This patch does not change any functionality but makes the
lockdep_reset_lock() function easier to read.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: johannes.berg@intel.com
Cc: tj@kernel.org
Link: https://lkml.kernel.org/r/20181207011148.251812-11-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 14:54:53 +01:00
Bart Van Assche
d35568bdb6 locking/lockdep: Inline __lockdep_init_map()
Since the function __lockdep_init_map() only has one caller, inline it
into its caller. This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: johannes.berg@intel.com
Cc: tj@kernel.org
Link: https://lkml.kernel.org/r/20181207011148.251812-10-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 14:54:51 +01:00
Bart Van Assche
1431a5d2cf locking/lockdep: Declare local symbols static
This patch avoids that sparse complains about a missing declaration for
the lock_classes array when building with CONFIG_DEBUG_LOCKDEP=n.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: johannes.berg@intel.com
Cc: tj@kernel.org
Link: https://lkml.kernel.org/r/20181207011148.251812-9-bvanassche@acm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11 14:54:51 +01:00
Robin Murphy
ad78dee0b6 dma-debug: Batch dma_debug_entry allocation
DMA debug entries are one of those things which aren't that useful
individually - we will always want some larger quantity of them - and
which we don't really need to manage the exact number of - we only care
about having 'enough'. In that regard, the current behaviour of creating
them one-by-one leads to a lot of unwarranted function call overhead and
memory wasted on alignment padding.

Now that we don't have to worry about freeing anything via
dma_debug_resize_entries(), we can optimise the allocation behaviour by
grabbing whole pages at once, which will save considerably on the
aforementioned overheads, and probably offer a little more cache/TLB
locality benefit for traversing the lists under normal operation. This
should also give even less reason for an architecture-level override of
the preallocation size, so make the definition unconditional - if there
is still any desire to change the compile-time value for some platforms
it would be better off as a Kconfig option anyway.

Since freeing a whole page of entries at once becomes enough of a
challenge that it's not really worth complicating dma_debug_init(), we
may as well tweak the preallocation behaviour such that as long as we
manage to allocate *some* pages, we can leave debugging enabled on a
best-effort basis rather than otherwise wasting them.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Qian Cai <cai@lca.pw>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-11 14:32:13 +01:00
Robin Murphy
0cb0e25e42 dma/debug: Remove dma_debug_resize_entries()
With the only caller now gone, we can clean up this part of dma-debug's
exposed internals and make way to tweak the allocation behaviour.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Qian Cai <cai@lca.pw>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-11 14:32:13 +01:00
Robin Murphy
ceb51173b2 dma-debug: Make leak-like behaviour apparent
Now that we can dynamically allocate DMA debug entries to cope with
drivers maintaining excessively large numbers of live mappings, a driver
which *does* actually have a bug leaking mappings (and is not unloaded)
will no longer trigger the "DMA-API: debugging out of memory - disabling"
message until it gets to actual kernel OOM conditions, which means it
could go unnoticed for a while. To that end, let's inform the user each
time the pool has grown to a multiple of its initial size, which should
make it apparent that they either have a leak or might want to increase
the preallocation size.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Qian Cai <cai@lca.pw>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-11 14:31:18 +01:00
Robin Murphy
2b9d9ac02b dma-debug: Dynamically expand the dma_debug_entry pool
Certain drivers such as large multi-queue network adapters can use pools
of mapped DMA buffers larger than the default dma_debug_entry pool of
65536 entries, with the result that merely probing such a device can
cause DMA debug to disable itself during boot unless explicitly given an
appropriate "dma_debug_entries=..." option.

Developers trying to debug some other driver on such a system may not be
immediately aware of this, and at worst it can hide bugs if they fail to
realise that dma-debug has already disabled itself unexpectedly by the
time their code of interest gets to run. Even once they do realise, it
can be a bit of a pain to emprirically determine a suitable number of
preallocated entries to configure, short of massively over-allocating.

There's really no need for such a static limit, though, since we can
quite easily expand the pool at runtime in those rare cases that the
preallocated entries are insufficient, which is arguably the least
surprising and most useful behaviour. To that end, refactor the
prealloc_memory() logic a little bit to generalise it for runtime
reallocations as well.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Qian Cai <cai@lca.pw>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-11 14:31:17 +01:00
Robin Murphy
f737b095c6 dma-debug: Use pr_fmt()
Use pr_fmt() to generate the "DMA-API: " prefix consistently. This
results in it being added to a couple of pr_*() messages which were
missing it before, and for the err_printk() calls moves it to the actual
start of the message instead of somewhere in the middle.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Qian Cai <cai@lca.pw>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-11 14:31:17 +01:00
Robin Murphy
9f191555ba dma-debug: Expose nr_total_entries in debugfs
Expose nr_total_entries in debugfs, so that {num,min}_free_entries
become even more meaningful to users interested in current/maximum
utilisation. This becomes even more relevant once nr_total_entries
may change at runtime beyond just the existing AMD GART debug code.

Suggested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Qian Cai <cai@lca.pw>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-11 14:31:17 +01:00
Daniel Lezcano
108c35a908 sched/cpufreq: Add the SPDX tags
The SPDX tags are not present in cpufreq.c and cpufreq_schedutil.c.

Add them and remove the license descriptions

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-12-11 11:35:25 +01:00
David S. Miller
addb067983 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-12-11

The following pull-request contains BPF updates for your *net-next* tree.

It has three minor merge conflicts, resolutions:

1) tools/testing/selftests/bpf/test_verifier.c

 Take first chunk with alignment_prevented_execution.

2) net/core/filter.c

  [...]
  case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
  case bpf_ctx_range(struct __sk_buff, wire_len):
        return false;
  [...]

3) include/uapi/linux/bpf.h

  Take the second chunk for the two cases each.

The main changes are:

1) Add support for BPF line info via BTF and extend libbpf as well
   as bpftool's program dump to annotate output with BPF C code to
   facilitate debugging and introspection, from Martin.

2) Add support for BPF_ALU | BPF_ARSH | BPF_{K,X} in interpreter
   and all JIT backends, from Jiong.

3) Improve BPF test coverage on archs with no efficient unaligned
   access by adding an "any alignment" flag to the BPF program load
   to forcefully disable verifier alignment checks, from David.

4) Add a new bpf_prog_test_run_xattr() API to libbpf which allows for
   proper use of BPF_PROG_TEST_RUN with data_out, from Lorenz.

5) Extend tc BPF programs to use a new __sk_buff field called wire_len
   for more accurate accounting of packets going to wire, from Petar.

6) Improve bpftool to allow dumping the trace pipe from it and add
   several improvements in bash completion and map/prog dump,
   from Quentin.

7) Optimize arm64 BPF JIT to always emit movn/movk/movk sequence for
   kernel addresses and add a dedicated BPF JIT backend allocator,
   from Ard.

8) Add a BPF helper function for IR remotes to report mouse movements,
   from Sean.

9) Various cleanups in BPF prog dump e.g. to make UAPI bpf_prog_info
   member naming consistent with existing conventions, from Yonghong
   and Song.

10) Misc cleanups and improvements in allowing to pass interface name
    via cmdline for xdp1 BPF example, from Matteo.

11) Fix a potential segfault in BPF sample loader's kprobes handling,
    from Daniel T.

12) Fix SPDX license in libbpf's README.rst, from Andrey.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-10 18:00:43 -08:00
Yonghong Song
11d8b82d22 bpf: rename *_info_cnt to nr_*_info in bpf_prog_info
In uapi bpf.h, currently we have the following fields in
the struct bpf_prog_info:
	__u32 func_info_cnt;
	__u32 line_info_cnt;
	__u32 jited_line_info_cnt;
The above field names "func_info_cnt" and "line_info_cnt"
also appear in union bpf_attr for program loading.

The original intention is to keep the names the same
between bpf_prog_info and bpf_attr
so it will imply what we returned to user space will be
the same as what the user space passed to the kernel.

Such a naming convention in bpf_prog_info is not consistent
with other fields like:
        __u32 nr_jited_ksyms;
        __u32 nr_jited_func_lens;

This patch made this adjustment so in bpf_prog_info
newly introduced *_info_cnt becomes nr_*_info.

Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-10 14:51:45 -08:00
Song Liu
7a5725ddc6 bpf: clean up bpf_prog_get_info_by_fd()
info.nr_jited_ksyms and info.nr_jited_func_lens cannot be 0 in these two
statements, so we don't need to check them.

Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-10 14:48:42 -08:00
Jiong Wang
e434b8cdf7 bpf: relax verifier restriction on BPF_MOV | BPF_ALU
Currently, the destination register is marked as unknown for 32-bit
sub-register move (BPF_MOV | BPF_ALU) whenever the source register type is
SCALAR_VALUE.

This is too conservative that some valid cases will be rejected.
Especially, this may turn a constant scalar value into unknown value that
could break some assumptions of verifier.

For example, test_l4lb_noinline.c has the following C code:

    struct real_definition *dst

1:  if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
2:    return TC_ACT_SHOT;
3:
4:  if (dst->flags & F_IPV6) {

get_packet_dst is responsible for initializing "dst" into valid pointer and
return true (1), otherwise return false (0). The compiled instruction
sequence using alu32 will be:

  412: (54) (u32) r7 &= (u32) 1
  413: (bc) (u32) r0 = (u32) r7
  414: (95) exit

insn 413, a BPF_MOV | BPF_ALU, however will turn r0 into unknown value even
r7 contains SCALAR_VALUE 1.

This causes trouble when verifier is walking the code path that hasn't
initialized "dst" inside get_packet_dst, for which case 0 is returned and
we would then expect verifier concluding line 1 in the above C code pass
the "if" check, therefore would skip fall through path starting at line 4.
Now, because r0 returned from callee has became unknown value, so verifier
won't skip analyzing path starting at line 4 and "dst->flags" requires
dereferencing the pointer "dst" which actually hasn't be initialized for
this path.

This patch relaxed the code marking sub-register move destination. For a
SCALAR_VALUE, it is safe to just copy the value from source then truncate
it into 32-bit.

A unit test also included to demonstrate this issue. This test will fail
before this patch.

This relaxation could let verifier skipping more paths for conditional
comparison against immediate. It also let verifier recording a more
accurate/strict value for one register at one state, if this state end up
with going through exit without rejection and it is used for state
comparison later, then it is possible an inaccurate/permissive value is
better. So the real impact on verifier processed insn number is complex.
But in all, without this fix, valid program could be rejected.

>From real benchmarking on kernel selftests and Cilium bpf tests, there is
no impact on processed instruction number when tests ares compiled with
default compilation options. There is slightly improvements when they are
compiled with -mattr=+alu32 after this patch.

Also, test_xdp_noinline/-mattr=+alu32 now passed verification. It is
rejected before this fix.

Insn processed before/after this patch:

                        default     -mattr=+alu32

Kernel selftest

===
test_xdp.o              371/371      369/369
test_l4lb.o             6345/6345    5623/5623
test_xdp_noinline.o     2971/2971    rejected/2727
test_tcp_estates.o      429/429      430/430

Cilium bpf
===
bpf_lb-DLB_L3.o:        2085/2085     1685/1687
bpf_lb-DLB_L4.o:        2287/2287     1986/1982
bpf_lb-DUNKNOWN.o:      690/690       622/622
bpf_lxc.o:              95033/95033   N/A
bpf_netdev.o:           7245/7245     N/A
bpf_overlay.o:          2898/2898     3085/2947

NOTE:
  - bpf_lxc.o and bpf_netdev.o compiled by -mattr=+alu32 are rejected by
    verifier due to another issue inside verifier on supporting alu32
    binary.
  - Each cilium bpf program could generate several processed insn number,
    above number is sum of them.

v1->v2:
 - Restrict the change on SCALAR_VALUE.
 - Update benchmark numbers on Cilium bpf tests.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-10 09:23:33 -08:00
Steven Rostedt (VMware)
45fe439bc3 fgraph: Add comment to describe ftrace_graph_get_ret_stack
The ret_stack should not be accessed directly via the curr_ret_stack
variable on the task_struct. This is because the ret_stack is going to be
converted into a series of longs and not an array of ret_stack structures.
The way that a ret_stack should be retrieved is via the
ftrace_graph_get_ret_stack structure, but it needs to be documented on how
to use it.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-10 12:22:45 -05:00
Steven Rostedt (VMware)
a0572f687f ftrace: Allow ftrace_replace_code() to be schedulable
The function ftrace_replace_code() is the ftrace engine that does the
work to modify all the nops into the calls to the function callback in
all the functions being traced.

The generic version which is normally called from stop machine, but an
architecture can implement a non stop machine version and still use the
generic ftrace_replace_code(). When an architecture does this,
ftrace_replace_code() may be called from a schedulable context, where
it can allow the code to be preemptible, and schedule out.

In order to allow an architecture to make ftrace_replace_code()
schedulable, a new command flag is added called:

 FTRACE_MAY_SLEEP

Which can be or'd to the command that is passed to
ftrace_modify_all_code() that calls ftrace_replace_code() and will have
it call cond_resched() in the loop that modifies the nops into the
calls to the ftrace trampolines.

Link: http://lkml.kernel.org/r/20181204192903.8193-1-anders.roxell@linaro.org
Link: http://lkml.kernel.org/r/20181205183303.828422192@goodmis.org

Reported-by: Anders Roxell <anders.roxell@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-10 12:22:45 -05:00
Masami Hiramatsu
1ce25e9f6f tracing: Add generic event-name based remove event method
Add a generic method to remove event from dynamic event
list. This is same as other system under ftrace. You
just need to pass the event name with '!', e.g.

  # echo p:new_grp/new_event _do_fork > dynamic_events

This creates an event, and

  # echo '!p:new_grp/new_event _do_fork' > dynamic_events

Or,

  # echo '!p:new_grp/new_event' > dynamic_events

will remove new_grp/new_event event.

Note that this doesn't check the event prefix (e.g. "p:")
strictly, because the "group/event" name must be unique.

Link: http://lkml.kernel.org/r/154140869774.17322.8887303560398645347.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-10 12:22:44 -05:00
Steven Rostedt (VMware)
7e1413edd6 tracing: Consolidate trace_add/remove_event_call back to the nolock functions
The trace_add/remove_event_call_nolock() functions were added to allow
the tace_add/remove_event_call() code be called when the event_mutex
lock was already taken. Now that all callers are done within the
event_mutex, there's no reason to have two different interfaces.

Remove the current wrapper trace_add/remove_event_call()s and rename the
_nolock versions back to the original names.

Link: http://lkml.kernel.org/r/154140866955.17322.2081425494660638846.stgit@devbox

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-10 12:22:10 -05:00
Tetsuo Handa
e80c1a9d5f printk: fix printk_time race.
Since printk_time can be toggled via /sys/module/printk/parameters/time ,
it is not safe to assume that output length does not change across
multiple msg_print_text() calls. If we hit this race, we can observe
failures such as SYSLOG_ACTION_READ_ALL writes more bytes than userspace
has supplied, SYSLOG_ACTION_SIZE_UNREAD returns -EFAULT when succeeded,
SYSLOG_ACTION_READ reads garbage memory or even triggers an kernel oops
at _copy_to_user() due to integer overflow.

To close this race, get a snapshot value of printk_time and pass it to
SYSLOG_ACTION_READ, SYSLOG_ACTION_READ_ALL, SYSLOG_ACTION_SIZE_UNREAD and
kmsg_dump_get_buffer().

Link: http://lkml.kernel.org/r/555af37c-b9e0-f940-cb73-a78eba2d4944@i-love.sakura.ne.jp
To: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-12-10 10:45:59 +01:00
David S. Miller
4cc1feeb6f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several conflicts, seemingly all over the place.

I used Stephen Rothwell's sample resolutions for many of these, if not
just to double check my own work, so definitely the credit largely
goes to him.

The NFP conflict consisted of a bug fix (moving operations
past the rhashtable operation) while chaning the initial
argument in the function call in the moved code.

The net/dsa/master.c conflict had to do with a bug fix intermixing of
making dsa_master_set_mtu() static with the fixing of the tagging
attribute location.

cls_flower had a conflict because the dup reject fix from Or
overlapped with the addition of port range classifiction.

__set_phy_supported()'s conflict was relatively easy to resolve
because Andrew fixed it in both trees, so it was just a matter
of taking the net-next copy.  Or at least I think it was :-)

Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup()
intermixed with changes on how the sdif and caller_net are calculated
in these code paths in net-next.

The remaining BPF conflicts were largely about the addition of the
__bpf_md_ptr stuff in 'net' overlapping with adjustments and additions
to the relevant data structure where the MD pointer macros are used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09 21:43:31 -08:00
Jens Axboe
96f774106e Linux 4.20-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwNpb0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwGwH/00UHnXfxww3ixxz
 zwTVDzptA6SPm6s84yJOWatM5fXhPiAltZaHSYF9lzRzNU71NCq7Frhq3fQUIXKM
 OxqDn9nfSTWcjWTk2q5n2keyRV/KIn67YX7UgqFc1bO/mqtVjEgNWaMyblhI+e9E
 giu1ZXayHr43jK1cDOmGExZubXUq7Vsc9TOlrd+d2SwIqeEP7TCMrPhnHDwCNvX2
 UU5dtANpVzGtHaBcr37wJj+L8kODCc0f+PQ3g2ar5jTHst5SLlHp2u0AMRnUmgdi
 VkGx+mu/uk8mtwUqMIMqhplklVoqK6LTeLqsY5Xt32SKruw9UqyJGdphLjW2QP/g
 MkmA1lI=
 =7kaD
 -----END PGP SIGNATURE-----

Merge tag 'v4.20-rc6' into for-4.21/block

Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
two corruption fixes. We're going to be overhauling the direct dispatch
path, and we need to do that on top of the changes we made for that
in mainline.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-09 17:45:40 -07:00
Linus Torvalds
d48f782e4f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "A decent batch of fixes here. I'd say about half are for problems that
  have existed for a while, and half are for new regressions added in
  the 4.20 merge window.

   1) Fix 10G SFP phy module detection in mvpp2, from Baruch Siach.

   2) Revert bogus emac driver change, from Benjamin Herrenschmidt.

   3) Handle BPF exported data structure with pointers when building
      32-bit userland, from Daniel Borkmann.

   4) Memory leak fix in act_police, from Davide Caratti.

   5) Check RX checksum offload in RX descriptors properly in aquantia
      driver, from Dmitry Bogdanov.

   6) SKB unlink fix in various spots, from Edward Cree.

   7) ndo_dflt_fdb_dump() only works with ethernet, enforce this, from
      Eric Dumazet.

   8) Fix FID leak in mlxsw driver, from Ido Schimmel.

   9) IOTLB locking fix in vhost, from Jean-Philippe Brucker.

  10) Fix SKB truesize accounting in ipv4/ipv6/netfilter frag memory
      limits otherwise namespace exit can hang. From Jiri Wiesner.

  11) Address block parsing length fixes in x25 from Martin Schiller.

  12) IRQ and ring accounting fixes in bnxt_en, from Michael Chan.

  13) For tun interfaces, only iface delete works with rtnl ops, enforce
      this by disallowing add. From Nicolas Dichtel.

  14) Use after free in liquidio, from Pan Bian.

  15) Fix SKB use after passing to netif_receive_skb(), from Prashant
      Bhole.

  16) Static key accounting and other fixes in XPS from Sabrina Dubroca.

  17) Partially initialized flow key passed to ip6_route_output(), from
      Shmulik Ladkani.

  18) Fix RTNL deadlock during reset in ibmvnic driver, from Thomas
      Falcon.

  19) Several small TCP fixes (off-by-one on window probe abort, NULL
      deref in tail loss probe, SNMP mis-estimations) from Yuchung
      Cheng"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (93 commits)
  net/sched: cls_flower: Reject duplicated rules also under skip_sw
  bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.
  bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.
  bnxt_en: Keep track of reserved IRQs.
  bnxt_en: Fix CNP CoS queue regression.
  net/mlx4_core: Correctly set PFC param if global pause is turned off.
  Revert "net/ibm/emac: wrong bit is used for STA control"
  neighbour: Avoid writing before skb->head in neigh_hh_output()
  ipv6: Check available headroom in ip6_xmit() even without options
  tcp: lack of available data can also cause TSO defer
  ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output
  mlxsw: spectrum_switchdev: Fix VLAN device deletion via ioctl
  mlxsw: spectrum_router: Relax GRE decap matching check
  mlxsw: spectrum_switchdev: Avoid leaking FID's reference count
  mlxsw: spectrum_nve: Remove easily triggerable warnings
  ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
  sctp: frag_point sanity check
  tcp: fix NULL ref in tail loss probe
  tcp: Do not underestimate rwnd_limited
  net: use skb_list_del_init() to remove from RX sublists
  ...
2018-12-09 15:12:33 -08:00
Martin KaFai Lau
c454a46b5e bpf: Add bpf_line_info support
This patch adds bpf_line_info support.

It accepts an array of bpf_line_info objects during BPF_PROG_LOAD.
The "line_info", "line_info_cnt" and "line_info_rec_size" are added
to the "union bpf_attr".  The "line_info_rec_size" makes
bpf_line_info extensible in the future.

The new "check_btf_line()" ensures the userspace line_info is valid
for the kernel to use.

When the verifier is translating/patching the bpf_prog (through
"bpf_patch_insn_single()"), the line_infos' insn_off is also
adjusted by the newly added "bpf_adj_linfo()".

If the bpf_prog is jited, this patch also provides the jited addrs (in
aux->jited_linfo) for the corresponding line_info.insn_off.
"bpf_prog_fill_jited_linfo()" is added to fill the aux->jited_linfo.
It is currently called by the x86 jit.  Other jits can also use
"bpf_prog_fill_jited_linfo()" and it will be done in the followup patches.
In the future, if it deemed necessary, a particular jit could also provide
its own "bpf_prog_fill_jited_linfo()" implementation.

A few "*line_info*" fields are added to the bpf_prog_info such
that the user can get the xlated line_info back (i.e. the line_info
with its insn_off reflecting the translated prog).  The jited_line_info
is available if the prog is jited.  It is an array of __u64.
If the prog is not jited, jited_line_info_cnt is 0.

The verifier's verbose log with line_info will be done in
a follow up patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-09 13:54:38 -08:00
Masami Hiramatsu
0e2b81f7b5 tracing: Remove unneeded synth_event_mutex
Rmove unneeded synth_event_mutex. This mutex protects the reference
count in synth_event, however, those operational points are already
protected by event_mutex.

1. In __create_synth_event() and create_or_delete_synth_event(),
 those synth_event_mutex clearly obtained right after event_mutex.

2. event_hist_trigger_func() is trigger_hist_cmd.func() which is
 called by trigger_process_regex(), which is a part of
 event_trigger_regex_write() and this function takes event_mutex.

3. hist_unreg_all() is trigger_hist_cmd.unreg_all() which is called
 by event_trigger_regex_open() and it takes event_mutex.

4. onmatch_destroy() and onmatch_create() have long call tree,
 but both are finally invoked from event_trigger_regex_write()
 and event_trace_del_tracer(), former takes event_mutex, and latter
 ensures called under event_mutex locked.

Finally, I ensured there is no resource conflict. For safety,
I added lockdep_assert_held(&event_mutex) for each function.

Link: http://lkml.kernel.org/r/154140864134.17322.4796059721306031894.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:10 -05:00
Masami Hiramatsu
7bbab38d07 tracing: Use dyn_event framework for synthetic events
Use dyn_event framework for synthetic events. This shows
synthetic events on "tracing/dynamic_events" file in addition
to tracing/synthetic_events interface.

User can also define new events via tracing/dynamic_events
with "s:" prefix. So, the new syntax is below;

  s:[synthetic/]EVENT_NAME TYPE ARG; [TYPE ARG;]...

To remove events via tracing/dynamic_events, you can use
"-:" prefix as same as other events.

Link: http://lkml.kernel.org/r/154140861301.17322.15454611233735614508.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:10 -05:00
Masami Hiramatsu
0597c49c69 tracing/uprobes: Use dyn_event framework for uprobe events
Use dyn_event framework for uprobe events. This shows
uprobe events on "dynamic_events" file.
User can also define new uprobe events via dynamic_events.

Link: http://lkml.kernel.org/r/154140858481.17322.9091293846515154065.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:10 -05:00
Masami Hiramatsu
6212dd2968 tracing/kprobes: Use dyn_event framework for kprobe events
Use dyn_event framework for kprobe events. This shows
kprobe events on "tracing/dynamic_events" file.

User can also define new events via tracing/dynamic_events.

Link: http://lkml.kernel.org/r/154140855646.17322.6619219995865980392.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:09 -05:00
Masami Hiramatsu
5448d44c38 tracing: Add unified dynamic event framework
Add unified dynamic event framework for ftrace kprobes, uprobes
and synthetic events. Those dynamic events can be co-exist on
same file because those syntax doesn't overlap.

This introduces a framework part which provides a unified tracefs
interface and operations.

Link: http://lkml.kernel.org/r/154140852824.17322.12250362185969352095.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:09 -05:00
Masami Hiramatsu
d00bbea945 tracing: Integrate similar probe argument parsers
Integrate similar argument parsers for kprobes and uprobes events
into traceprobe_parse_probe_arg().

Link: http://lkml.kernel.org/r/154140850016.17322.9836787731210512176.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:09 -05:00
Masami Hiramatsu
faacb361f2 tracing: Simplify creation and deletion of synthetic events
Since the event_mutex and synth_event_mutex ordering issue
is gone, we can skip existing event check when adding or
deleting events, and some redundant code in error path.

This changes release_all_synth_events() to abort the process
when it hits any error and returns the error code. It succeeds
only if it has no error.

Link: http://lkml.kernel.org/r/154140847194.17322.17960275728005067803.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:09 -05:00
Masami Hiramatsu
fc800a10be tracing: Lock event_mutex before synth_event_mutex
synthetic event is using synth_event_mutex for protecting
synth_event_list, and event_trigger_write() path acquires
locks as below order.

event_trigger_write(event_mutex)
  ->trigger_process_regex(trigger_cmd_mutex)
    ->event_hist_trigger_func(synth_event_mutex)

On the other hand, synthetic event creation and deletion paths
call trace_add_event_call() and trace_remove_event_call()
which acquires event_mutex. In that case, if we keep the
synth_event_mutex locked while registering/unregistering synthetic
events, its dependency will be inversed.

To avoid this issue, current synthetic event is using a 2 phase
process to create/delete events. For example, it searches existing
events under synth_event_mutex to check for event-name conflicts, and
unlocks synth_event_mutex, then registers a new event under event_mutex
locked. Finally, it locks synth_event_mutex and tries to add the
new event to the list. But it can introduce complexity and a chance
for name conflicts.

To solve this simpler, this introduces trace_add_event_call_nolock()
and trace_remove_event_call_nolock() which don't acquire
event_mutex inside. synthetic event can lock event_mutex before
synth_event_mutex to solve the lock dependency issue simpler.

Link: http://lkml.kernel.org/r/154140844377.17322.13781091165954002713.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:09 -05:00
Masami Hiramatsu
547cd9eacd tracing/uprobes: Add busy check when cleanup all uprobes
Add a busy check loop in cleanup_all_probes() before
trying to remove all events in uprobe_events, the same way
that kprobe_events does.

Without this change, writing null to uprobe_events will
try to remove events but if one of them is enabled, it will
stop there leaving some events cleared and others not clceared.

With this change, writing null to uprobe_events makes
sure all events are not enabled before removing events.
So, it clears all events, or returns an error (-EBUSY)
with keeping all events.

Link: http://lkml.kernel.org/r/154140841557.17322.12653952888762532401.stgit@devbox

Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Tested-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:08 -05:00
Steven Rostedt (VMware)
a7b1d74e87 tracing: Change default buffer_percent to 50
After running several tests, it appears that having the reader wait till
half the buffer is full before starting to read (and causing its own events
to fill up the ring buffer constantly), works well. It keeps trace-cmd (the
main user of this interface) from dominating the traces it records.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:08 -05:00
Steven Rostedt (VMware)
03329f9939 tracing: Add tracefs file buffer_percentage
Add a "buffer_percentage" file, that allows users to specify how much of the
buffer (percentage of pages) need to be filled before waking up a task
blocked on a per cpu trace_pipe_raw file.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:08 -05:00
Steven Rostedt (VMware)
2c2b0a78b3 ring-buffer: Add percentage of ring buffer full to wake up reader
Instead of just waiting for a page to be full before waking up a pending
reader, allow the reader to pass in a "percentage" of pages that have
content before waking up a reader. This should help keep the process of
reading the events not cause wake ups that constantly cause reading of the
buffer.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:08 -05:00
Dan Carpenter
ca16b0fbb0 tracing: Have trace_stack nr_entries compare not be so subtle
Dan Carpenter reviewed the trace_stack.c code and figured he found an off by
one bug.

 "From reviewing the code, it seems possible for
  stack_trace_max.nr_entries to be set to .max_entries and in that case we
  would be reading one element beyond the end of the stack_dump_trace[]
  array.  If it's not set to .max_entries then the bug doesn't affect
  runtime."

Although it looks to be the case, it is not. Because we have:

 static unsigned long stack_dump_trace[STACK_TRACE_ENTRIES+1] =
	 { [0 ... (STACK_TRACE_ENTRIES)] = ULONG_MAX };

 struct stack_trace stack_trace_max = {
	.max_entries		= STACK_TRACE_ENTRIES - 1,
	.entries		= &stack_dump_trace[0],
 };

And:

	stack_trace_max.nr_entries = x;
	for (; x < i; x++)
		stack_dump_trace[x] = ULONG_MAX;

Even if nr_entries equals max_entries, indexing with it into the
stack_dump_trace[] array will not overflow the array. But if it is the case,
the second part of the conditional that tests stack_dump_trace[nr_entries]
to ULONG_MAX will always be true.

By applying Dan's patch, it removes the subtle aspect of it and makes the if
conditional slightly more efficient.

Link: http://lkml.kernel.org/r/20180620110758.crunhd5bfep7zuiz@kili.mountain

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-12-08 20:54:07 -05:00