IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
To make the code a bit more readable, let's move the OSI specific
initialization out of the psci_dt_cpu_init_idle() and into a separate
function.
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Before commit 1328edca4a ("cpuidle-haltpoll: Enable kvm guest polling
when dedicated physical CPUs are available") the cpuidle-haltpoll driver
could also be used in scenarios when the host does not advertise the
KVM_HINTS_REALTIME hint.
While the behavior introduced by the aforementioned commit makes sense as
the default there are cases where the old behavior is desired, for example,
when other kernel changes triggered by presence by this hint are unwanted,
for some workloads where the latency benefit from polling overweights the
loss from idle CPU capacity that otherwise would be available, or just when
running under older Qemu versions that lack this hint.
Let's provide a typical "force" module parameter that allows restoring the
old behavior.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
LP2 suspending could be unavailable, for example if it is disabled in a
device-tree. CC6 cpuidle state won't work in that case.
Acked-by: Peter De Schrijver <pdeschrijver@nvidia.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Tegra20/30/114/124 SoCs have common idling states, thus there is no much
point in having separate drivers for a similar hardware. This patch moves
Tegra114/124 arch/ drivers into the common driver without any functional
changes. The CC6 state is kept disabled on Tegra114/124 because the core
Tegra PM code needs some more work in order to support that state.
Acked-by: Peter De Schrijver <pdeschrijver@nvidia.com>
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Tegra20 and Terga30 SoCs have common C1 and CC6 idling states and thus
share the same code paths, there is no point in having separate drivers
for a similar hardware. This patch merely moves functionality of the old
driver into the new, although the CC6 state is kept disabled for now since
old driver had a rudimentary support for this state (allowing to enter
into CC6 only when secondary CPUs are put offline), while new driver can
provide a full-featured support. The new feature will be enabled by
another patch.
Acked-by: Peter De Schrijver <pdeschrijver@nvidia.com>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Tested-by: Jasper Korten <jja2000@gmail.com>
Tested-by: David Heidelberg <david@ixit.cz>
Tested-by: Nicolas Chauvet <kwizart@gmail.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
The driver's code is refactored in a way that will make it easy to
support Tegra30/114/124 SoCs by this unified driver later on. The
current functionality is equal to the old Tegra20 driver, only the
code's structure changed a tad. This is also a proper platform driver
now.
Acked-by: Peter De Schrijver <pdeschrijver@nvidia.com>
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Call cpu_latency_qos_limit() instead of pm_qos_request(), because the
latter is going to be dropped.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
Notice that pm_qos_remove_notifier() is not used at all and the only
caller of pm_qos_add_notifier() is the cpuidle core, which only needs
the PM_QOS_CPU_DMA_LATENCY notifier to invoke wake_up_all_idle_cpus()
upon changes of the PM_QOS_CPU_DMA_LATENCY target value.
First, to ensure that wake_up_all_idle_cpus() will be called
whenever the PM_QOS_CPU_DMA_LATENCY target value changes, modify the
pm_qos_add/update/remove_request() family of functions to check if
the effective constraint for the PM_QOS_CPU_DMA_LATENCY has changed
and call wake_up_all_idle_cpus() directly in that case.
Next, drop the PM_QOS_CPU_DMA_LATENCY notifier from cpuidle as it is
not necessary any more.
Finally, drop both pm_qos_add_notifier() and pm_qos_remove_notifier(),
as they have no callers now, along with cpu_dma_lat_notifier which is
only used by them.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Amit Kucheria <amit.kucheria@linaro.org>
Tested-by: Amit Kucheria <amit.kucheria@linaro.org>
Various driver updates for platforms:
- Nvidia: Fuse support for Tegra194, continued memory controller pieces
for Tegra30
- NXP/FSL: Refactorings of QuickEngine drivers to support ARM/ARM64/PPC
- NXP/FSL: i.MX8MP SoC driver pieces
- TI Keystone: ring accelerator driver
- Qualcomm: SCM driver cleanup/refactoring + support for new SoCs.
- Xilinx ZynqMP: feature checking interface for firmware. Mailbox
communication for power management
- Overall support patch set for cpuidle on more complex hierarchies
(PSCI-based)
+ Misc cleanups, refactorings of Marvell, TI, other platforms.
-----BEGIN PGP SIGNATURE-----
iQJDBAABCAAtFiEElf+HevZ4QCAJmMQ+jBrnPN6EHHcFAl4+lTYPHG9sb2ZAbGl4
b20ubmV0AAoJEIwa5zzehBx3nQcQAJm91+6hZbmMjlBySGS7ISjYvOcrI/hMgiOl
uhhEP0Dcylvf9A9x3wcIbLwixe+2pvie9DQh2u5F80ShYimidtFi/2xCfuTb9fKu
sxxKjrXWyVKhkpW0z+tedY08ftVhkwwcyD4m2C7uVl6AwTP7c367vFeU7XjF2APn
drfgmgbjm8U3XbSyAqv+k6z6tyqaCnFM7vbPupSKHgHJ3mfByxOa+XyBN2RdgBbs
0KrVfbXGv80zFIFrMPwaWG7G52bu7K68nVdgy44MpKdRZ6QTjhnR+kerFxHsYgV4
bM55Fya52nTCSTGdKaQakDtKwbAUdCDTSkxgOHGcQoyFi0R/VaEUJtcysnvLbI6c
+n/yFIzGyEdXcvIzfv2SoDYhogw19I6RR/M9K5Ni29eazkDVYx2z3rI+2QYeqCiF
u7cq52gW6JLP0SI/9kuUrRFiR8v19Ixap7qokAxgqQwYB3NzT8a7WsYPkzdpDZGQ
ETSDFMyBWT6UvBe/HWkQluBabbet53rG8BF0OHFrQuMK0u/ieKgSGuTB9XN2djEW
PHMOMz2vhi+8XTfpkskhF2tTxlA/k4R6QwCdIMpIkMRVnVQCh1XdPr3Fi2NrgB+S
kIXHD4vV6zLYh04zHyKewSPHAXWgraFpg2qKnvL5+KWMTnW6QH+RNjOt9xKDNXOd
+iDXpOad
=ONtb
-----END PGP SIGNATURE-----
Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC-related driver updates from Olof Johansson:
"Various driver updates for platforms:
- Nvidia: Fuse support for Tegra194, continued memory controller
pieces for Tegra30
- NXP/FSL: Refactorings of QuickEngine drivers to support
ARM/ARM64/PPC
- NXP/FSL: i.MX8MP SoC driver pieces
- TI Keystone: ring accelerator driver
- Qualcomm: SCM driver cleanup/refactoring + support for new SoCs.
- Xilinx ZynqMP: feature checking interface for firmware. Mailbox
communication for power management
- Overall support patch set for cpuidle on more complex hierarchies
(PSCI-based)
and misc cleanups, refactorings of Marvell, TI, other platforms"
* tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (166 commits)
drivers: soc: xilinx: Use mailbox IPI callback
dt-bindings: power: reset: xilinx: Add bindings for ipi mailbox
drivers: soc: ti: knav_qmss_queue: Pass lockdep expression to RCU lists
MAINTAINERS: Add brcmstb PCIe controller entry
soc/tegra: fuse: Unmap registers once they are not needed anymore
soc/tegra: fuse: Correct straps' address for older Tegra124 device trees
soc/tegra: fuse: Warn if straps are not ready
soc/tegra: fuse: Cache values of straps and Chip ID registers
memory: tegra30-emc: Correct error message for timed out auto calibration
memory: tegra30-emc: Firm up hardware programming sequence
memory: tegra30-emc: Firm up suspend/resume sequence
soc/tegra: regulators: Do nothing if voltage is unchanged
memory: tegra: Correct reset value of xusb_hostr
soc/tegra: fuse: Add APB DMA dependency for Tegra20
bus: tegra-aconnect: Remove PM_CLK dependency
dt-bindings: mediatek: add MT6765 power dt-bindings
soc: mediatek: cmdq: delete not used define
memory: tegra: Add support for the Tegra194 memory controller
memory: tegra: Only include support for enabled SoCs
memory: tegra: Support DVFS on Tegra186 and later
...
Merge changes updating the ACPI processor driver in order to export
acpi_processor_evaluate_cst() to the code outside of it and adding
ACPI support to the intel_idle driver based on that.
* intel_idle+acpi:
Documentation: admin-guide: PM: Add intel_idle document
intel_idle: Use ACPI _CST on server systems
intel_idle: Add module parameter to prevent ACPI _CST from being used
intel_idle: Allow ACPI _CST to be used for selected known processors
cpuidle: Allow idle states to be disabled by default
intel_idle: Use ACPI _CST for processor models without C-state tables
intel_idle: Refactor intel_idle_cpuidle_driver_init()
ACPI: processor: Export acpi_processor_evaluate_cst()
ACPI: processor: Make ACPI_PROCESSOR_CSTATE depend on ACPI_PROCESSOR
ACPI: processor: Clean up acpi_processor_evaluate_cst()
ACPI: processor: Introduce acpi_processor_evaluate_cst()
ACPI: processor: Export function to claim _CST control
Fix cpuidle_find_deepest_state() kernel documentation to avoid
warnings when compiling with W=1.
Signed-off-by: Benjamin Gaignard <benjamin.gaignard@st.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix kernel documentation comments to remove warnings when
compiling with W=1.
Signed-off-by: Benjamin Gaignard <benjamin.gaignard@st.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix warnings that show up when compiling with W=1
Signed-off-by: Benjamin Gaignard <benjamin.gaignard@st.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Some of cpuidle drivers for ARMv7 can be compile tested on this
architecture because they do not depend on mach-specific bits. Enable
compile testing for big.LITTLE, Kirkwood, Zynq, AT91, Exynos and mvebu
cpuidle drivers.
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Fix a simple bug in rotating array index.
Fixes: b26bf6ab71 ("cpuidle: New timer events oriented governor for tickless systems")
Signed-off-by: Ikjoon Jang <ikjn@chromium.org>
Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpuidle_driver_ref() and cpuidle_driver_unref() functions are not
used and the refcnt field in struct cpuidle_driver operated by them
is not updated anywhere else (so it is permanently equal to 0), so
drop both of them along with refcnt.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
When the hierarchical CPU topology layout is used in DT and the PSCI OSI
mode is supported by the PSCI FW, let's initialize a corresponding PM
domain topology by using genpd. This enables a CPU and a group of CPUs,
when attached to the topology, to be power-managed accordingly.
To trigger the attempt to initialize the genpd data structures let's use a
subsys_initcall, which should be early enough to allow CPUs, but also other
devices to be attached.
The initialization consists of parsing the PSCI OF node for the topology
and the "domain idle states" DT bindings. In case the idle states are
compatible with "domain-idle-state", the initialized genpd becomes
responsible of selecting an idle state for the PM domain, via assigning it
a genpd governor.
Note that, a successful initialization of the genpd data structures, is
followed by a call to psci_set_osi_mode(), as to try to enable the OSI mode
in the PSCI FW. In case this fails, we fall back into a degraded mode
rather than bailing out and returning error codes.
Co-developed-by: Lina Iyer <lina.iyer@linaro.org>
Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
When the hierarchical CPU topology is used and when a CPU is put offline,
that CPU prevents its PM domain from being powered off, which is because
genpd observes the corresponding attached device as being active from a
runtime PM point of view. Furthermore, any potential master PM domains are
also prevented from being powered off.
To address this limitation, let's add add a new CPU hotplug state
(CPUHP_AP_CPU_PM_STARTING) and register up/down callbacks for it, which
allows us to deal with runtime PM accordingly.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
In case we have succeeded to attach a CPU to its PM domain, let's deploy
runtime PM support for the corresponding attached device, to allow the CPU
to be powered-managed accordingly.
The triggering point for when runtime PM reference counting should be done,
has been selected to the deepest idle state for the CPU. However, from the
hierarchical point view, there may be good reasons to do runtime PM
reference counting even on shallower idle states, but at this point this
isn't supported, mainly due to limitations set by the generic PM domain.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
The per CPU variable psci_power_state, contains an array of fixed values,
which reflects the corresponding arm,psci-suspend-param parsed from DT, for
each of the available CPU idle states.
This isn't sufficient when using the hierarchical CPU topology in DT, in
combination with having PSCI OS initiated (OSI) mode enabled. More
precisely, in OSI mode, Linux is responsible of telling the PSCI FW what
idle state the cluster (a group of CPUs) should enter, while in PSCI
Platform Coordinated (PC) mode, each CPU independently votes for an idle
state of the cluster.
For this reason, introduce a per CPU variable called domain_state and
implement two helper functions to read/write its value. Then let the
domain_state take precedence over the regular selected state, when entering
and idle state.
To avoid executing the above OSI specific code in the ->enter() callback,
while operating in the default PSCI Platform Coordinated mode, let's also
add a new enter-function and use it for OSI.
Co-developed-by: Lina Iyer <lina.iyer@linaro.org>
Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
In order to enable a CPU to be power managed through its PM domain, let's
try to attach it by calling psci_dt_attach_cpu() during the cpuidle
initialization.
psci_dt_attach_cpu() returns a pointer to the attached struct device, which
later should be used for runtime PM, hence we need to store it somewhere.
Rather than adding yet another per CPU variable, let's create a per CPU
struct to collect the relevant per CPU variables.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Introduce a PSCI DT helper function, psci_dt_attach_cpu(), which takes a
CPU number as an in-parameter and tries to attach the CPU's struct device
to its corresponding PM domain.
Let's makes use of dev_pm_domain_attach_by_name(), as it allows us to
specify "psci" as the "name" of the PM domain to attach to. Additionally,
let's also prepare the attached device to be power managed via runtime PM.
Note that, the implementation of the new helper function is in a new
separate c-file, which may seems a bit too much at this point. However,
subsequent changes that implements the remaining part of the PM domain
support for cpuidle-psci, helps to justify this split.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Currently CPU's idle states are represented using the flattened model.
Let's add support for the hierarchical layout, via converting to use
of_get_cpu_state_node().
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Iterating through the idle state nodes in DT, to find out the number of
states that needs to be allocated is unnecessary, as it has already been
done from dt_init_idle_driver(). Therefore, drop the iteration and use the
number we already have at hand.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Currently CPU's idle states are represented using the flattened model.
Let's add support for the hierarchical layout, via converting to use
of_get_cpu_state_node().
Suggested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Lina Iyer <lina.iyer@linaro.org>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Co-developed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Instead of allocating 'n-1' states in psci_power_state to manage 'n'
idle states which include "ARM WFI" state, it would be simpler to have
1:1 mapping between psci_power_state and cpuidle driver states.
ARM WFI state(i.e. idx == 0) is handled specially in the generic macro
CPU_PM_CPU_IDLE_ENTER_PARAM and hence state[-1] is not possible. However
for sake of code readability, it is better to have 1:1 mapping and not
use [idx - 1] to access psci_power_state corresponding to driver cpuidle
state for idx.
psci_power_state[0] is default initialised to 0 and is never accessed
while entering WFI state.
Reported-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
In certain situations it may be useful to prevent some idle states
from being used by default while allowing user space to enable them
later on.
For this purpose, introduce a new state flag, CPUIDLE_FLAG_OFF, to
mark idle states that should be disabled by default, make the core
set CPUIDLE_STATE_DISABLED_BY_USER for those states at the
initialization time and add a new state attribute in sysfs,
"default_status", to inform user space of the initial status of
the given idle state ("disabled" if CPUIDLE_FLAG_OFF is set for it,
"enabled" otherwise).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use devm_platform_ioremap_resource() to simplify code.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Use devm_platform_ioremap_resource() to simplify code.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The data type of the target_residency_ns field in struct cpuidle_state
is u64, so it does not need to be cast into u64.
Get rid of the unnecessary type cast.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It turns out that cpuidle_driver_state_disabled() can be called
before registering the cpufreq driver on some platforms, which
was not expected when it was introduced and which leads to a NULL
pointer dereference when trying to walk the CPUs associated with
the given cpuidle driver.
Fix the problem by making cpuidle_driver_state_disabled() check if
the driver's mask of CPUs associated with it is present and to set
CPUIDLE_FLAG_UNUSABLE for the given idle state in the driver's states
list if that is not the case to cause __cpuidle_register_device() to
set CPUIDLE_STATE_DISABLED_BY_DRIVER for that state for all cpuidle
devices registered by it later.
Fixes: cbda56d5fe ("cpuidle: Introduce cpuidle_driver_state_disabled() for driver quirks")
Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 259231a045 ("cpuidle: add poll_limit_ns to cpuidle_device
structure") changed, by mistake, the target residency from the first
available sleep state to the last available sleep state (which should
be longer).
This might cause excessive polling.
Fixes: 259231a045 ("cpuidle: add poll_limit_ns to cpuidle_device structure")
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
End sentences in help text with a period (aka full stop).
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After recent cpuidle updates the "disabled" field in struct
cpuidle_state is only used by two drivers (intel_idle and shmobile
cpuidle) for marking unusable idle states, but that may as well be
achieved with the help of a state flag, so define an "unusable" idle
state flag, CPUIDLE_FLAG_UNUSABLE, make the drivers in question use
it instead of the "disabled" field and make the core set
CPUIDLE_STATE_DISABLED_BY_DRIVER for the idle states with that flag
set.
After the above changes, the "disabled" field in struct cpuidle_state
is not used any more, so drop it.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ /\t/' -i */Kconfig
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Modify cpuidle_use_deepest_state() to take an additional exit latency
limit argument to be passed to find_deepest_idle_state() and make
cpuidle_idle_call() pass dev->forced_idle_latency_limit_ns to it for
forced idle.
Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Rebase and rearrange code, subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In some cases it may be useful to specify an exit latency limit for
the idle state to be used during CPU idle time injection.
Instead of duplicating the information in struct cpuidle_device
or propagating the latency limit in the call stack, replace the
use_deepest_state field with forced_latency_limit_ns to represent
that limit, so that the deepest idle state with exit latency within
that limit is forced (i.e. no governors) when it is set.
A zero exit latency limit for forced idle means to use governors in
the usual way (analogous to use_deepest_state equal to "false" before
this change).
Additionally, add play_idle_precise() taking two arguments, the
duration of forced idle and the idle state exit latency limit, both
in nanoseconds, and redefine play_idle() as a wrapper around that
new function.
This change is preparatory, no functional impact is expected.
Suggested-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: Subject, changelog, cpuidle_use_deepest_state() kerneldoc, whitespace ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 99e98d3fb1 ("cpuidle: Consolidate disabled state checks")
overlooked the fact that the imx6q and tegra20 cpuidle drivers use
the "disabled" field in struct cpuidle_state for quirks which trigger
after the initialization of cpuidle, so reading the initial value of
that field is not sufficient for those drivers.
In order to allow them to implement the quirks without using the
"disabled" field in struct cpuidle_state, introduce a new helper
function and modify them to use it.
Fixes: 99e98d3fb1 ("cpuidle: Consolidate disabled state checks")
Reported-by: Len Brown <lenb@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are three places in teo_select() where a given amount of time
is compared with TICK_NSEC if tick_nohz_tick_stopped() returns true,
which is a bit of duplicated code.
Avoid that code duplication by defining a helper function to do the
check and using it in all of the places in question.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the current state with the maximum "early hits" metric in
teo_select() is also the one "matching" the expected idle duration,
it will be used as the candidate one for selection even if its
"misses" metric is greater than its "hits" metric, which is not
correct.
In that case, the candidate state should be shallower than the
current one and its "early hits" metric should be the maximum
among the idle states shallower than the current one.
To make that happen, modify teo_select() to save the index of
the state whose "early hits" metric is the maximum for the
range of states below the current one and go back to that state
if it turns out that the current one should be rejected.
Fixes: 159e48560f ("cpuidle: teo: Fix "early hits" handling for disabled idle states")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
One purpose of the computations in teo_update() is to determine
whether or not the (saved) time till the next timer event and the
measured idle duration fall into the same "bin", so avoid using
values that include the cpuidle overhead to obtain the latter.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, the cpuidle subsystem uses microseconds as the unit of
time which (among other things) causes the idle loop to incur some
integer division overhead for no clear benefit.
In order to allow cpuidle to measure time in nanoseconds, add two
new fields, exit_latency_ns and target_residency_ns, to represent the
exit latency and target residency of an idle state in nanoseconds,
respectively, to struct cpuidle_state and initialize them with the
help of the corresponding values in microseconds provided by drivers.
Additionally, change cpuidle_governor_latency_req() to return the
idle state exit latency constraint in nanoseconds.
Also meeasure idle state residency (last_residency_ns in struct
cpuidle_device and time_ns in struct cpuidle_driver) in nanoseconds
and update the cpuidle core and governors accordingly.
However, the menu governor still computes typical intervals in
microseconds to avoid integer overflows.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Doug Smythies <dsmythies@telus.net>
Tested-by: Doug Smythies <dsmythies@telus.net>
There are two reasons why CPU idle states may be disabled: either
because the driver has disabled them or because they have been
disabled by user space via sysfs.
In the former case, the state's "disabled" flag is set once during
the initialization of the driver and it is never cleared later (it
is read-only effectively). In the latter case, the "disable" field
of the given state's cpuidle_state_usage struct is set and it may be
changed via sysfs. Thus checking whether or not an idle state has
been disabled involves reading these two flags every time.
In order to avoid the additional check of the state's "disabled" flag
(which is effectively read-only anyway), use the value of it at the
init time to set a (new) flag in the "disable" field of that state's
cpuidle_state_usage structure and use the sysfs interface to
manipulate another (new) flag in it. This way the state is disabled
whenever the "disable" field of its cpuidle_state_usage structure is
nonzero, whatever the reason, and it is the only place to look into
to check whether or not the state has been disabled.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Fix __cpuidle_set_driver() to check if any of the CPUs in the mask has
a driver different from drv already and, if so, return -EBUSY before
updating any cpuidle_drivers per-CPU pointers.
Fixes: 82467a5a88 ("cpuidle: simplify multiple driver support")
Cc: 3.11+ <stable@vger.kernel.org> # 3.11+
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currenly haltpoll isn't aware of the 'idle=' override, the priority is
'idle=poll' > haltpoll > 'idle=halt'. When 'idle=poll' is used, cpuidle
driver is bypassed but current_driver in sys still shows 'haltpoll'.
When 'idle=halt' is used, haltpoll takes precedence and makes
'idle=halt' have no effect.
Add a check to prevent the haltpoll driver from loading if 'idle=' is
present.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The TEO governor uses idle duration "bins" defined in accordance with
the CPU idle states table provided by the driver, so that each "bin"
covers the idle duration range between the target residency of the
idle state corresponding to it and the target residency of the closest
deeper idle state. The governor collects statistics for each bin
regardless of whether or not the idle state corresponding to it is
currently enabled.
In particular, the "early hits" metric measures the likelihood of a
situation in which the idle duration measured after wakeup falls into
to given bin, but the time till the next timer (sleep length) falls
into a bin corresponding to one of the deeper idle states. It is
used when the "hits" and "misses" metrics indicate that the state
"matching" the sleep length should not be selected, so that the state
with the maximum "early hits" value is selected instead of it.
If the idle state corresponding to the given bin is disabled, it
cannot be selected and if it turns out to be the one that should be
selected, a shallower idle state needs to be used instead of it.
Nevertheless, the metrics collected for the bin corresponding to it
are still valid and need to be taken into account as though that
state had not been disabled.
As far as the "early hits" metric is concerned, teo_select() tries to
take disabled states into account, but the state index corresponding
to the maximum "early hits" value computed by it may be incorrect.
Namely, it always uses the index of the previous maximum "early hits"
state then, but there may be enabled idle states closer to the
disabled one in question. In particular, if the current candidate
state (whose index is the idx value) is closer to the disabled one
and the "early hits" value of the disabled state is greater than the
current maximum, the index of the current candidate state (idx)
should replace the "maximum early hits state" index.
Modify the code to handle that case correctly.
Fixes: b26bf6ab71 ("cpuidle: New timer events oriented governor for tickless systems")
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
The TEO governor uses idle duration "bins" defined in accordance with
the CPU idle states table provided by the driver, so that each "bin"
covers the idle duration range between the target residency of the
idle state corresponding to it and the target residency of the closest
deeper idle state. The governor collects statistics for each bin
regardless of whether or not the idle state corresponding to it is
currently enabled.
In particular, the "hits" and "misses" metrics measure the likelihood
of a situation in which both the time till the next timer (sleep
length) and the idle duration measured after wakeup fall into the
given bin. Namely, if the "hits" value is greater than the "misses"
one, that situation is more likely than the one in which the sleep
length falls into the given bin, but the idle duration measured after
wakeup falls into a bin corresponding to one of the shallower idle
states.
If the idle state corresponding to the given bin is disabled, it
cannot be selected and if it turns out to be the one that should be
selected, a shallower idle state needs to be used instead of it.
Nevertheless, the metrics collected for the bin corresponding to it
are still valid and need to be taken into account as though that
state had not been disabled.
For this reason, make teo_select() always use the "hits" and "misses"
values of the idle duration range that the sleep length falls into
even if the specific idle state corresponding to it is disabled and
if the "hits" values is greater than the "misses" one, select the
closest enabled shallower idle state in that case.
Fixes: b26bf6ab71 ("cpuidle: New timer events oriented governor for tickless systems")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
Rename a local variable in teo_select() in preparation for subsequent
code modifications, no intentional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
Prevent disabled CPU idle state with target residencies beyond the
anticipated idle duration from being taken into account by the TEO
governor.
Fixes: b26bf6ab71 ("cpuidle: New timer events oriented governor for tickless systems")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 5.1+ <stable@vger.kernel.org> # 5.1+
- Rework the main suspend-to-idle control flow to avoid repeating
"noirq" device resume and suspend operations in case of spurious
wakeups from the ACPI EC and decouple the ACPI EC wakeups support
from the LPS0 _DSM support (Rafael Wysocki).
- Extend the wakeup sources framework to expose wakeup sources as
device objects in sysfs (Tri Vo, Stephen Boyd).
- Expose system suspend statistics in sysfs (Kalesh Singh).
- Introduce a new haltpoll cpuidle driver and a new matching
governor for virtualized guests wanting to do guest-side polling
in the idle loop (Marcelo Tosatti, Joao Martins, Wanpeng Li,
Stephen Rothwell).
- Fix the menu and teo cpuidle governors to allow the scheduler tick
to be stopped if PM QoS is used to limit the CPU idle state exit
latency in some cases (Rafael Wysocki).
- Increase the resolution of the play_idle() argument to microseconds
for more fine-grained injection of CPU idle cycles (Daniel Lezcano).
- Switch over some users of cpuidle notifiers to the new QoS-based
frequency limits and drop the CPUFREQ_ADJUST and CPUFREQ_NOTIFY
policy notifier events (Viresh Kumar).
- Add new cpufreq driver based on nvmem for sun50i (Yangtao Li).
- Add support for MT8183 and MT8516 to the mediatek cpufreq driver
(Andrew-sh.Cheng, Fabien Parent).
- Add i.MX8MN support to the imx-cpufreq-dt cpufreq driver (Anson
Huang).
- Add qcs404 to cpufreq-dt-platdev blacklist (Jorge Ramirez-Ortiz).
- Update the qcom cpufreq driver (among other things, to make it
easier to extend and to use kryo cpufreq for other nvmem-based
SoCs) and add qcs404 support to it (Niklas Cassel, Douglas
RAILLARD, Sibi Sankar, Sricharan R).
- Fix assorted issues and make assorted minor improvements in the
cpufreq code (Colin Ian King, Douglas RAILLARD, Florian Fainelli,
Gustavo Silva, Hariprasad Kelam).
- Add new devfreq driver for NVidia Tegra20 (Dmitry Osipenko, Arnd
Bergmann).
- Add new Exynos PPMU events to devfreq events and extend that
mechanism (Lukasz Luba).
- Fix and clean up the exynos-bus devfreq driver (Kamil Konieczny).
- Improve devfreq documentation and governor code, fix spelling
typos in devfreq (Ezequiel Garcia, Krzysztof Kozlowski, Leonard
Crestez, MyungJoo Ham, Gaël PORTAY).
- Add regulators enable and disable to the OPP (operating performance
points) framework (Kamil Konieczny).
- Update the OPP framework to support multiple opp-suspend properties
(Anson Huang).
- Fix assorted issues and make assorted minor improvements in the OPP
code (Niklas Cassel, Viresh Kumar, Yue Hu).
- Clean up the generic power domains (genpd) framework (Ulf Hansson).
- Clean up assorted pieces of power management code and documentation
(Akinobu Mita, Amit Kucheria, Chuhong Yuan).
- Update the pm-graph tool to version 5.5 including multiple fixes
and improvements (Todd Brandt).
- Update the cpupower utility (Benjamin Weis, Geert Uytterhoeven,
Sébastien Szymanski).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl2ArZ4SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxgfYQAK80hs43vWQDmp7XKrN4pQe8+qYULAGO
fBfrFl+NG9y/cnuqnt3NtA8MoyNsMMkMLkpkEDMfSbYqqH5ehEzX5+uGJWiWx8+Y
oH5KU8MH7Tj/utYaalGzDt0AHfHZDIGC0NCUNQJVtE/4mOANFabwsCwscp4MrD5Q
WjFN8U4BrsmWgJdZ/U9QIWcDZ0I+1etCF+rZG2yxSv31FMq2Zk/Qm4YyobqCvQFl
TR9rxl08wqUmIYIz5cDjt/3AKH7NLLDqOTstbCL7cmufM5XPFc1yox69xc89UrIa
4AMgmDp7SMwFG/gdUPof0WQNmx7qxmiRAPleAOYBOZW/8jPNZk2y+RhM5NeF72m7
AFqYiuxqatkSb4IsT8fLzH9IUZOdYr8uSmoMQECw+MHdApaKFjFV8Lb/qx5+AwkD
y7pwys8dZSamAjAf62eUzJDWcEwkNrujIisGrIXrVHb7ISbweskMOmdAYn9p4KgP
dfRzpJBJ45IaMIdbaVXNpg3rP7Apfs7X1X+/ZhG6f+zHH3zYwr8Y81WPqX8WaZJ4
qoVCyxiVWzMYjY2/1lzjaAdqWojPWHQ3or3eBaK52DouyG3jY6hCDTLwU7iuqcCX
jzAtrnqrNIKufvaObEmqcmYlIIOFT7QaJCtGUSRFQLfSon8fsVSR7LLeXoAMUJKT
JWQenuNaJngK
=TBDQ
-----END PGP SIGNATURE-----
Merge tag 'pm-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These include a rework of the main suspend-to-idle code flow (related
to the handling of spurious wakeups), a switch over of several users
of cpufreq notifiers to QoS-based limits, a new devfreq driver for
Tegra20, a new cpuidle driver and governor for virtualized guests, an
extension of the wakeup sources framework to expose wakeup sources as
device objects in sysfs, and more.
Specifics:
- Rework the main suspend-to-idle control flow to avoid repeating
"noirq" device resume and suspend operations in case of spurious
wakeups from the ACPI EC and decouple the ACPI EC wakeups support
from the LPS0 _DSM support (Rafael Wysocki).
- Extend the wakeup sources framework to expose wakeup sources as
device objects in sysfs (Tri Vo, Stephen Boyd).
- Expose system suspend statistics in sysfs (Kalesh Singh).
- Introduce a new haltpoll cpuidle driver and a new matching governor
for virtualized guests wanting to do guest-side polling in the idle
loop (Marcelo Tosatti, Joao Martins, Wanpeng Li, Stephen Rothwell).
- Fix the menu and teo cpuidle governors to allow the scheduler tick
to be stopped if PM QoS is used to limit the CPU idle state exit
latency in some cases (Rafael Wysocki).
- Increase the resolution of the play_idle() argument to microseconds
for more fine-grained injection of CPU idle cycles (Daniel
Lezcano).
- Switch over some users of cpuidle notifiers to the new QoS-based
frequency limits and drop the CPUFREQ_ADJUST and CPUFREQ_NOTIFY
policy notifier events (Viresh Kumar).
- Add new cpufreq driver based on nvmem for sun50i (Yangtao Li).
- Add support for MT8183 and MT8516 to the mediatek cpufreq driver
(Andrew-sh.Cheng, Fabien Parent).
- Add i.MX8MN support to the imx-cpufreq-dt cpufreq driver (Anson
Huang).
- Add qcs404 to cpufreq-dt-platdev blacklist (Jorge Ramirez-Ortiz).
- Update the qcom cpufreq driver (among other things, to make it
easier to extend and to use kryo cpufreq for other nvmem-based
SoCs) and add qcs404 support to it (Niklas Cassel, Douglas
RAILLARD, Sibi Sankar, Sricharan R).
- Fix assorted issues and make assorted minor improvements in the
cpufreq code (Colin Ian King, Douglas RAILLARD, Florian Fainelli,
Gustavo Silva, Hariprasad Kelam).
- Add new devfreq driver for NVidia Tegra20 (Dmitry Osipenko, Arnd
Bergmann).
- Add new Exynos PPMU events to devfreq events and extend that
mechanism (Lukasz Luba).
- Fix and clean up the exynos-bus devfreq driver (Kamil Konieczny).
- Improve devfreq documentation and governor code, fix spelling typos
in devfreq (Ezequiel Garcia, Krzysztof Kozlowski, Leonard Crestez,
MyungJoo Ham, Gaël PORTAY).
- Add regulators enable and disable to the OPP (operating performance
points) framework (Kamil Konieczny).
- Update the OPP framework to support multiple opp-suspend properties
(Anson Huang).
- Fix assorted issues and make assorted minor improvements in the OPP
code (Niklas Cassel, Viresh Kumar, Yue Hu).
- Clean up the generic power domains (genpd) framework (Ulf Hansson).
- Clean up assorted pieces of power management code and documentation
(Akinobu Mita, Amit Kucheria, Chuhong Yuan).
- Update the pm-graph tool to version 5.5 including multiple fixes
and improvements (Todd Brandt).
- Update the cpupower utility (Benjamin Weis, Geert Uytterhoeven,
Sébastien Szymanski)"
* tag 'pm-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (126 commits)
cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available
cpuidle-haltpoll: do not set an owner to allow modunload
cpuidle-haltpoll: return -ENODEV on modinit failure
cpuidle-haltpoll: set haltpoll as preferred governor
cpuidle: allow governor switch on cpuidle_register_driver()
PM: runtime: Documentation: add runtime_status ABI document
pm-graph: make setVal unbuffered again for python2 and python3
powercap: idle_inject: Use higher resolution for idle injection
cpuidle: play_idle: Increase the resolution to usec
cpuidle-haltpoll: vcpu hotplug support
cpufreq: Add qcs404 to cpufreq-dt-platdev blacklist
cpufreq: qcom: Add support for qcs404 on nvmem driver
cpufreq: qcom: Refactor the driver to make it easier to extend
cpufreq: qcom: Re-organise kryo cpufreq to use it for other nvmem based qcom socs
dt-bindings: opp: Add qcom-opp bindings with properties needed for CPR
dt-bindings: opp: qcom-nvmem: Support pstates provided by a power domain
Documentation: cpufreq: Update policy notifier documentation
cpufreq: Remove CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events
PM / Domains: Verify PM domain type in dev_pm_genpd_set_performance_state()
PM / Domains: Simplify genpd_lookup_dev()
...
The downside of guest side polling is that polling is performed even
with other runnable tasks in the host. However, even if poll in kvm
can aware whether or not other runnable tasks in the same pCPU, it
can still incur extra overhead in over-subscribe scenario. Now we can
just enable guest polling when dedicated pCPUs are available.
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpuidle-haltpoll can be built as a module to allow optional late load.
Given we are setting @owner to THIS_MODULE, cpuidle will attempt to grab a
module reference every time a cpuidle_device is registered -- so
essentially all online cpus get a reference.
This prevents for the module to be unloaded later, which makes the
module_exit callback entirely unused. Thus remove the @owner and allow
module to be unloaded.
Fixes: fa86ee90eb ("add cpuidle-haltpoll driver")
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When a user loads cpuidle-haltpoll on a non KVM guest the module will
successfully load, even though idle driver registration didn't take
place.
We should instead return -ENODEV signaling the user that the driver can't
be loaded, like other error paths in haltpoll_init(). An example of such
error paths is when we return -EBUSY when attempting to register an idle
driver when it had one already (e.g. intel_idle loads at boot and then we
attempt to insert module cpuidle-haltpoll).
Fixes: fa86ee90eb ("add cpuidle-haltpoll driver")
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Right now, guest current governors have the following ratings:
* ladder -> 10
* teo -> 19
* menu -> 20
* haltpoll -> 21
* ladder + nohz=off -> 25
haltpoll governor got introduced and it is now the default governor given
its highest rating -- with ladder+nohz being the exception -- regardless of
idle driver in the guest. An example of an undesirable case is x86 KVM
guests with MWAIT which have intel_idle registered first, and consequently
will have haltpoll be used as governor which would get limited to a poll
state and state 1 and the other states wouldn't get used.
To keep the previous defaults we decrease rating of governor to 9 (below
current lowest rating) and thus rely on @governor switch on
cpuidle_register_driver() to tie in haltpoll idle driver and governor
together.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The recently introduced haltpoll driver is largely only useful with
haltpoll governor. To allow drivers to associate with a particular idle
behaviour, add a @governor property to 'struct cpuidle_driver' and thus
allow a cpuidle driver to switch to a *preferred* governor on idle driver
registration. We save the previous governor, and when an idle driver is
unregistered we switch back to that.
The @governor can be overridden by cpuidle.governor= boot param or
alternatively be ignored if the governor doesn't exist.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus
past the online ones and thus fail to register the idle driver.
This is because cpuidle_add_sysfs() will return with -ENODEV as a
consequence from get_cpu_device() return no device for a non-existing
CPU.
Instead switch to cpuidle_register_driver() and manually register each
of the present cpus through cpuhp_setup_state() callbacks and future
ones that get onlined or offlined. This mimmics similar logic that
intel_idle does.
Fixes: fa86ee90eb ("add cpuidle-haltpoll driver")
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Notice that setting measured_us to UINT_MAX in teo_update() earlier
doesn't change the behavior of the following code, so do that and
eliminate a redundant check used for setting measured_us to UINT_MAX.
This change is not expected to alter functionality.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Current PSCI code handles idle state entry through the
psci_cpu_suspend_enter() API, that takes an idle state index as a
parameter and convert the index into a previously initialized
power_state parameter before calling the PSCI.CPU_SUSPEND() with it.
This is unwieldly, since it forces the PSCI firmware layer to keep track
of power_state parameter for every idle state so that the
index->power_state conversion can be made in the PSCI firmware layer
instead of the CPUidle driver implementations.
Move the power_state handling out of drivers/firmware/psci
into the respective ACPI/DT PSCI CPUidle backends and convert
the psci_cpu_suspend_enter() API to get the power_state
parameter as input, which makes it closer to its firmware
interface PSCI.CPU_SUSPEND() API.
A notable side effect is that the PSCI ACPI/DT CPUidle backends
now can directly handle (and if needed update) power_state
parameters before handing them over to the PSCI firmware
interface to trigger PSCI.CPU_SUSPEND() calls.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Will Deacon <will@kernel.org>
Allow selection of the PSCI CPUidle in the kernel by updating
the respective Kconfig entry.
Remove PSCI callbacks from ARM/ARM64 generic CPU ops
to prevent the PSCI idle driver from clashing with the generic
ARM CPUidle driver initialization, that relies on CPU ops
to initialize and enter idle states.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Will Deacon <will@kernel.org>
PSCI firmware is the standard power management control for
all ARM64 based platforms and it is also deployed on some
ARM 32 bit platforms to date.
Idle state entry in PSCI is currently achieved by calling
arm_cpuidle_init() and arm_cpuidle_suspend() in a generic
idle driver, which in turn relies on ARM/ARM64 CPUidle back-end
to relay the call into PSCI firmware if PSCI is the boot method.
Given that PSCI is the standard idle entry method on ARM64 systems
(which means that no other CPUidle driver are expected on ARM64
platforms - so PSCI is already a generic idle driver), in order to
simplify idle entry and code maintenance, it makes sense to have a PSCI
specific idle driver so that idle code that it is currently living in
drivers/firmware directory can be hoisted out of it and moved
where it belongs, into a full-fledged PSCI driver, leaving PSCI code
in drivers/firmware as a pure firmware interface, as it should be.
Implement a PSCI CPUidle driver. By default it is a silent Kconfig entry
which is left unselected, since it selection would clash with the
generic ARM CPUidle driver that provides a PSCI based idle driver
through the arm/arm64 arches back-ends CPU operations.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Will Deacon <will@kernel.org>
CPUidle back-end operations are not implemented in some platforms
but this should not be considered an error serious enough to be
logged. Check the arm_cpuidle_init() return value to detect whether
the failure must be reported or not in the kernel log and do
not log it if the platform does not support CPUidle operations.
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Will Deacon <will@kernel.org>
The TEO goveror prevents the scheduler tick from being stopped (unless
stopped already) if there is a PM QoS latency constraint for the given
CPU and the target residency of the deepest idle state matching that
constraint is below the tick boundary.
However, that is problematic if CPUs with PM QoS latency constraints
are idle for long times, because it effectively causes the tick to
run on them all the time which is wasteful. [It is also confusing
and questionable if they are full dynticks CPUs.]
To address that issue, modify the TEO governor to carry out the
entire search for the most suitable idle state (from the target
residency perspective) even if a latency constraint is present,
to allow it to determine the expected idle duration in all cases.
Also, when using the last several measured idle duration values
to refine the idle state selection, make it compare those values
with the current expected idle duration value (instead of
comparing them with the target residency of the idle state
selected so far) which should prevent the tick from being
retained when it makes sense to stop it sometimes (especially
in the presence of PM QoS latency constraints).
Fixes: b26bf6ab71 ("cpuidle: New timer events oriented governor for tickless systems")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After commit 554c8aa8ec ("sched: idle: Select idle state before
stopping the tick") the menu governor prevents the scheduler tick from
being stopped (unless stopped already) if there is a PM QoS latency
constraint for the given CPU and the target residency of the deepest
idle state matching that constraint is below the tick boundary.
However, that is problematic if CPUs with PM QoS latency constraints
are idle for long times, because it effectively causes the tick to
run on them all the time which is wasteful. [It is also confusing
and questionable if they are full dynticks CPUs.]
To address that issue, make the menu governor allow the tick to be
stopped only if the idle duration predicted by it is beyond the tick
boundary, except when the shallowest idle state is selected upfront
and it is not a "polling" one.
Fixes: 554c8aa8ec ("sched: idle: Select idle state before stopping the tick")
Link: https://lore.kernel.org/lkml/79b247b3-e056-610e-9a07-e685dfdaa6c9@gmail.com/
Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Tested-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When performing guest side polling, it is not necessary to
also perform host side polling.
So disable host side polling, via the new MSR interface,
when loading cpuidle-haltpoll driver.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpuidle_haltpoll governor, in conjunction with the haltpoll cpuidle
driver, allows guest vcpus to poll for a specified amount of time before
halting.
This provides the following benefits to host side polling:
1) The POLL flag is set while polling is performed, which allows
a remote vCPU to avoid sending an IPI (and the associated
cost of handling the IPI) when performing a wakeup.
2) The VM-exit cost can be avoided.
The downside of guest side polling is that polling is performed
even with other runnable tasks in the host.
Results comparing halt_poll_ns and server/client application
where a small packet is ping-ponged:
host --> 31.33
halt_poll_ns=300000 / no guest busy spin --> 33.40 (93.8%)
halt_poll_ns=0 / guest_halt_poll_ns=300000 --> 32.73 (95.7%)
For the SAP HANA benchmarks (where idle_spin is a parameter
of the previous version of the patch, results should be the
same):
hpns == halt_poll_ns
idle_spin=0/ idle_spin=800/ idle_spin=0/
hpns=200000 hpns=0 hpns=800000
DeleteC06T03 (100 thread) 1.76 1.71 (-3%) 1.78 (+1%)
InsertC16T02 (100 thread) 2.14 2.07 (-3%) 2.18 (+1.8%)
DeleteC00T01 (1 thread) 1.34 1.28 (-4.5%) 1.29 (-3.7%)
UpdateC00T03 (1 thread) 4.72 4.18 (-12%) 4.53 (-5%)
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since this field is shared by all governors, move it to
cpuidle device structure.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a poll_limit_ns variable to cpuidle_device structure.
Calculate and configure it in the new cpuidle_poll_time
function, in case its zero.
Individual governors are allowed to override this value.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add a cpuidle driver that calls the architecture default_idle routine.
To be used in conjunction with the haltpoll governor.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpufreq:
cpufreq: Make cpufreq_generic_init() return void
cpufreq: imx-cpufreq-dt: Add i.MX8MN support
cpufreq: Add QoS requests for userspace constraints
cpufreq: intel_pstate: Reuse refresh_frequency_limits()
cpufreq: Register notifiers with the PM QoS framework
PM / QoS: Add support for MIN/MAX frequency constraints
PM / QOS: Pass request type to dev_pm_qos_read_value()
PM / QOS: Rename __dev_pm_qos_read_value() and dev_pm_qos_raw_read_value()
PM / QOS: Pass request type to dev_pm_qos_{add|remove}_notifier()
dev_pm_qos_read_value() will soon need to support more constraint types
(min/max frequency) and will have another argument to it, i.e. type of
the constraint. While that is fine for the existing users of
dev_pm_qos_read_value(), but not that optimal for the callers of
__dev_pm_qos_read_value() and dev_pm_qos_raw_read_value() as all the
callers of these two routines are only looking for resume latency
constraint.
Lets make these two routines care only about the resume latency
constraint and rename them to __dev_pm_qos_resume_latency() and
dev_pm_qos_raw_resume_latency().
Suggested-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 4122 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this file is released under the gplv2
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 68 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this code is licenced under the gpl version 2 as described in the
copying file that acompanies the linux kernel
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 1 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528171439.466585205@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms and conditions of the gnu general public license
version 2 as published by the free software foundation this program
is distributed in the hope it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not see http www gnu org
licenses
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 228 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528171438.107155473@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 3 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version [author] [kishon] [vijay] [abraham]
[i] [kishon]@[ti] [com] this program is distributed in the hope that
it will be useful but without any warranty without even the implied
warranty of merchantability or fitness for a particular purpose see
the gnu general public license for more details
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version [author] [graeme] [gregory]
[gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
[kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
[hk] [hemahk]@[ti] [com] this program is distributed in the hope
that it will be useful but without any warranty without even the
implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 1105 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 3029 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add SPDX license identifiers to all Make/Kconfig files which:
- Have no license information of any form
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To be able to predict the sleep duration for a CPU entering idle, it
is essential to know the expiration time of the next timer. Both the
teo and the menu cpuidle governors already use this information for
CPU idle state selection.
Moving forward, a similar prediction needs to be made for a group of
idle CPUs rather than for a single one and the following changes
implement a new genpd governor for that purpose.
In order to support that feature, add a new function called
tick_nohz_get_next_hrtimer() that will return the next hrtimer
expiration time of a given CPU to be invoked after deciding
whether or not to stop the scheduler tick on that CPU.
Make the cpuidle core call tick_nohz_get_next_hrtimer() right
before invoking the ->enter() callback provided by the cpuidle
driver for the given state and store its return value in the
per-CPU struct cpuidle_device, so as to make it available to code
outside of cpuidle.
Note that at the point when cpuidle calls tick_nohz_get_next_hrtimer(),
the governor's ->select() callback has already returned and indicated
whether or not the tick should be stopped, so in fact the value
returned by tick_nohz_get_next_hrtimer() always is the next hrtimer
expiration time for the given CPU, possibly including the tick (if
it hasn't been stopped).
Co-developed-by: Lina Iyer <lina.iyer@linaro.org>
Co-developed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since commit 45f1ff59e2 ("cpuidle: Return nohz hint from
cpuidle_select()") Exynos CPUidle driver stopped entering C1 (AFTR) mode
on Exynos4412-based Trats2 board.
Further analysis revealed that the CPUidle framework changed the way
it handles predicted timer ticks and reported target residency for the
given idle states. As a result, the C1 (AFTR) state was not chosen
anymore on completely idle device. The main issue was to high target
residency value. The similar C1 (AFTR) state for 'coupled' CPUidle
version used 10 times lower value for the target residency, despite
the fact that it is the same state from the hardware perspective.
The 100000us value for standard C1 (AFTR) mode is there from the begining
of the support for this idle state, added by the commit 67173ca492
("ARM: EXYNOS: Add support AFTR mode on EXYNOS4210"). That commit doesn't
give any reason for it, instead it looks like it was blindly copied from
the WFI/IDLE state of the same driver that time. That time, that value
was probably not really used by the framework for any critical decision,
so it didn't matter that much.
Now it turned out to be an issue, so unify the target residency with the
'coupled' version, as it seems to better match the real use case values
and restores the operation of the Exynos CPUidle driver on the idle
device.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After commit 61cb5758d3 ("cpuidle: Add cpuidle.governor= command
line parameter") new cpuidle governors are not added to the list
of available governors, so governor selection via sysfs doesn't
work as expected (even though it is rarely used anyway).
Fix that by making cpuidle_register_governor() add new governors to
cpuidle_governors again.
Fixes: 61cb5758d3 ("cpuidle: Add cpuidle.governor= command line parameter")
Reported-by: Kees Cook <keescook@chromium.org>
Cc: 5.0+ <stable@vger.kernel.org> # 5.0+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The variance computation in get_typical_interval() may overflow if
the square of the value of diff exceeds the maximum for the int64_t
data type value which basically is the case when it is of the order
of UINT_MAX.
However, data points so far in the future don't matter for idle
state selection anyway, so change the initial threshold value in
get_typical_interval() to INT_MAX which will cause more "outlying"
data points to be discarded without affecting the selection result.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, the DT of the idle states will be parsed first whether it's
compatible or not. This could cause a warning message that comes from if
the CPU doesn't support identical idle states. E.g. Tegra186 can run
with 2 Cortex-A57 and 2 Denver cores with different idle states on
different types of these cores.
So fix it by checking the match node earlier, then it can make sure it
only goes through the idle states that the CPU supported.
Signed-off-by: Joseph Lo <josephl@nvidia.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The default time is declared in units of microsecnds,
but is used as nanoseconds, resulting in significant
accounting errors for idle state 0 time when all idle
states deeper than 0 are disabled.
Under these unusual conditions, we don't really care
about the poll time limit anyhow.
Fixes: 800fb34a99 ("cpuidle: poll_state: Disregard disable idle states")
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The venerable menu governor does some things that are quite
questionable in my view.
First, it includes timer wakeups in the pattern detection data and
mixes them up with wakeups from other sources which in some cases
causes it to expect what essentially would be a timer wakeup in a
time frame in which no timer wakeups are possible (because it knows
the time until the next timer event and that is later than the
expected wakeup time).
Second, it uses the extra exit latency limit based on the predicted
idle duration and depending on the number of tasks waiting on I/O,
even though those tasks may run on a different CPU when they are
woken up. Moreover, the time ranges used by it for the sleep length
correction factors depend on whether or not there are tasks waiting
on I/O, which again doesn't imply anything in particular, and they
are not correlated to the list of available idle states in any way
whatever.
Also, the pattern detection code in menu may end up considering
values that are too large to matter at all, in which cases running
it is a waste of time.
A major rework of the menu governor would be required to address
these issues and the performance of at least some workloads (tuned
specifically to the current behavior of the menu governor) is likely
to suffer from that. It is thus better to introduce an entirely new
governor without them and let everybody use the governor that works
better with their actual workloads.
The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it applies a different approach to that problem.
First, it doesn't use "correction factors" for the time till the
closest timer, but instead it tries to correlate the measured idle
duration values with the available idle states and use that
information to pick up the idle state that is most likely to "match"
the upcoming CPU idle interval.
Second, it doesn't take the number of "I/O waiters" into account at
all and the pattern detection code in it avoids taking timer wakeups
into account. It also only uses idle duration values less than the
current time till the closest timer (with the tick excluded) for that
purpose.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Notable changes:
- Mitigations for Spectre v2 on some Freescale (NXP) CPUs.
- A large series adding support for pass-through of Nvidia V100 GPUs to guests
on Power9.
- Another large series to enable hardware assistance for TLB table walk on
MPC8xx CPUs.
- Some preparatory changes to our DMA code, to make way for further cleanups
from Christoph.
- Several fixes for our Transactional Memory handling discovered by fuzzing the
signal return path.
- Support for generating our system call table(s) from a text file like other
architectures.
- A fix to our page fault handler so that instead of generating a WARN_ON_ONCE,
user accesses of kernel addresses instead print a ratelimited and
appropriately scary warning.
- A cosmetic change to make our unhandled page fault messages more similar to
other arches and also more compact and informative.
- Freescale updates from Scott:
"Highlights include elimination of legacy clock bindings use from dts
files, an 83xx watchdog handler, fixes to old dts interrupt errors, and
some minor cleanup."
And many clean-ups, reworks and minor fixes etc.
Thanks to:
Alexandre Belloni, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Arnd Bergmann, Benjamin Herrenschmidt, Breno Leitao, Christian Lamparter,
Christophe Leroy, Christoph Hellwig, Daniel Axtens, Darren Stevens, David
Gibson, Diana Craciun, Dmitry V. Levin, Firoz Khan, Geert Uytterhoeven, Greg
Kurz, Gustavo Romero, Hari Bathini, Joel Stanley, Kees Cook, Madhavan
Srinivasan, Mahesh Salgaonkar, Markus Elfring, Mathieu Malaterre, Michal
Suchánek, Naveen N. Rao, Nick Desaulniers, Oliver O'Halloran, Paul Mackerras,
Ram Pai, Ravi Bangoria, Rob Herring, Russell Currey, Sabyasachi Gupta, Sam
Bobroff, Satheesh Rajendran, Scott Wood, Segher Boessenkool, Stephen Rothwell,
Tang Yuantian, Thiago Jung Bauermann, Yangtao Li, Yuantian Tang, Yue Haibing.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcJLwZAAoJEFHr6jzI4aWAAv4P/jMvP52lA90i2E8G72LOVSF1
33DbE/Okib3VfmmMcXZpgpEfwIcEmJcIj86WWcLWzBfXLunehkgwh+AOfBLwqWch
D08+RR9EZb7ppvGe91hvSgn4/28CWVKAxuDviSuoE1OK8lOTncu889r2+AxVFZiY
f6Al9UPlB3FTJonNx8iO4r/GwrPigukjbzp1vkmJJg59LvNUrMQ1Fgf9D3cdlslH
z4Ff9zS26RJy7cwZYQZI4sZXJZmeQ1DxOZ+6z6FL/nZp/O4WLgpw6C6o1+vxo1kE
9ZnO/3+zIRhoWiXd6OcOQXBv3NNCjJZlXh9HHAiL8m5ZqbmxrStQWGyKW/jjEZuK
wVHxfUT19x9Qy1p+BH3XcUNMlxchYgcCbEi5yPX2p9ZDXD6ogNG7sT1+NO+FBTww
ueCT5PCCB/xWOccQlBErFTMkFXFLtyPDNFK7BkV7uxbH0PQ+9guCvjWfBZti6wjD
/6NK4mk7FpmCiK13Y1xjwC5OqabxLUYwtVuHYOMr5TOPh8URUPS4+0pIOdoYDM6z
Ensrq1CC843h59MWADgFHSlZ78FRtZlG37JAXunjLbqGupLOvL7phC9lnwkylHga
2hWUWFeOV8HFQBP4gidZkLk64pkT9LzqHgdgIB4wUwrhc8r2mMZGdQTq5H7kOn3Q
n9I48PWANvEC0PBCJ/KL
=cr6s
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Mitigations for Spectre v2 on some Freescale (NXP) CPUs.
- A large series adding support for pass-through of Nvidia V100 GPUs
to guests on Power9.
- Another large series to enable hardware assistance for TLB table
walk on MPC8xx CPUs.
- Some preparatory changes to our DMA code, to make way for further
cleanups from Christoph.
- Several fixes for our Transactional Memory handling discovered by
fuzzing the signal return path.
- Support for generating our system call table(s) from a text file
like other architectures.
- A fix to our page fault handler so that instead of generating a
WARN_ON_ONCE, user accesses of kernel addresses instead print a
ratelimited and appropriately scary warning.
- A cosmetic change to make our unhandled page fault messages more
similar to other arches and also more compact and informative.
- Freescale updates from Scott:
"Highlights include elimination of legacy clock bindings use from
dts files, an 83xx watchdog handler, fixes to old dts interrupt
errors, and some minor cleanup."
And many clean-ups, reworks and minor fixes etc.
Thanks to: Alexandre Belloni, Alexey Kardashevskiy, Andrew Donnellan,
Aneesh Kumar K.V, Arnd Bergmann, Benjamin Herrenschmidt, Breno Leitao,
Christian Lamparter, Christophe Leroy, Christoph Hellwig, Daniel
Axtens, Darren Stevens, David Gibson, Diana Craciun, Dmitry V. Levin,
Firoz Khan, Geert Uytterhoeven, Greg Kurz, Gustavo Romero, Hari
Bathini, Joel Stanley, Kees Cook, Madhavan Srinivasan, Mahesh
Salgaonkar, Markus Elfring, Mathieu Malaterre, Michal Suchánek, Naveen
N. Rao, Nick Desaulniers, Oliver O'Halloran, Paul Mackerras, Ram Pai,
Ravi Bangoria, Rob Herring, Russell Currey, Sabyasachi Gupta, Sam
Bobroff, Satheesh Rajendran, Scott Wood, Segher Boessenkool, Stephen
Rothwell, Tang Yuantian, Thiago Jung Bauermann, Yangtao Li, Yuantian
Tang, Yue Haibing"
* tag 'powerpc-4.21-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (201 commits)
Revert "powerpc/fsl_pci: simplify fsl_pci_dma_set_mask"
powerpc/zImage: Also check for stdout-path
powerpc: Fix HMIs on big-endian with CONFIG_RELOCATABLE=y
macintosh: Use of_node_name_{eq, prefix} for node name comparisons
ide: Use of_node_name_eq for node name comparisons
powerpc: Use of_node_name_eq for node name comparisons
powerpc/pseries/pmem: Convert to %pOFn instead of device_node.name
powerpc/mm: Remove very old comment in hash-4k.h
powerpc/pseries: Fix node leak in update_lmb_associativity_index()
powerpc/configs/85xx: Enable CONFIG_DEBUG_KERNEL
powerpc/dts/fsl: Fix dtc-flagged interrupt errors
clk: qoriq: add more compatibles strings
powerpc/fsl: Use new clockgen binding
powerpc/83xx: handle machine check caused by watchdog timer
powerpc/fsl-rio: fix spelling mistake "reserverd" -> "reserved"
powerpc/fsl_pci: simplify fsl_pci_dma_set_mask
arch/powerpc/fsl_rmu: Use dma_zalloc_coherent
vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver
vfio_pci: Allow regions to add own capabilities
vfio_pci: Allow mapping extra regions
...
Add two new metrics for CPU idle states, "above" and "below", to count
the number of times the given state had been asked for (or entered
from the kernel's perspective), but the observed idle duration turned
out to be too short or too long for it (respectively).
These metrics help to estimate the quality of the CPU idle governor
in use.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
of_find_node_by_path() acquires a reference to the node
returned by it and that reference needs to be dropped by its caller.
bl_idle_init() doesn't do that, so fix it.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Add cpuidle.governor= command line parameter to allow the default
cpuidle governor to be replaced.
That is useful, for example, if someone running a tickful kernel
wants to use the menu governor on it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When computing the limit of time to spend in the loop in poll_idle(),
use the target residency of the first enabled idle state deeper than
state 0 instead of always using the target residency of state 1.
This helps when state 1 is disabled for diagnostics, for instance.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When booting a pseries kernel with PREEMPT enabled, it dumps the
following warning:
BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
caller is pseries_processor_idle_init+0x5c/0x22c
CPU: 13 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc3-00090-g12201a0128bc-dirty #828
Call Trace:
[c000000429437ab0] [c0000000009c8878] dump_stack+0xec/0x164 (unreliable)
[c000000429437b00] [c0000000005f2f24] check_preemption_disabled+0x154/0x160
[c000000429437b90] [c000000000cab8e8] pseries_processor_idle_init+0x5c/0x22c
[c000000429437c10] [c000000000010ed4] do_one_initcall+0x64/0x300
[c000000429437ce0] [c000000000c54500] kernel_init_freeable+0x3f0/0x500
[c000000429437db0] [c0000000000112dc] kernel_init+0x2c/0x160
[c000000429437e20] [c00000000000c1d0] ret_from_kernel_thread+0x5c/0x6c
This happens because the code calls get_lppaca() which calls
get_paca() and it checks if preemption is disabled through
check_preemption_disabled().
Preemption should be disabled because the per CPU variable may make no
sense if there is a preemption (and a CPU switch) after it reads the
per CPU data and when it is used.
In this device driver specifically, it is not a problem, because this
code just needs to have access to one lppaca struct, and it does not
matter if it is the current per CPU lppaca struct or not (i.e. when
there is a preemption and a CPU migration).
That said, the most appropriate fix seems to be related to avoiding
the debug_smp_processor_id() call at get_paca(), instead of calling
preempt_disable() before get_paca().
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The only reason that remains, to why the ARM cpuidle driver calls
cpuidle_register_driver(), is to avoid printing an error message in case
another driver already have been registered for the CPU. This seems a bit
silly, but more importantly, if that is a common scenario, perhaps we
should change cpuidle_register() accordingly instead.
In either case, let's consolidate the code, by converting to use
cpuidle_register|unregister(), which also avoids the unnecessary allocation
of the struct cpuidle_device.
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There's no point to register the cpuidle driver for the current CPU, when
the initialization of the arch specific back-end data fails by returning
-ENXIO.
Instead, let's re-order the sequence to its original flow, by first trying
to initialize the back-end part and then act accordingly on the returned
error code. Additionally, let's print the error message, no matter of what
error code that was returned.
Fixes: a0d46a3dfd (ARM: cpuidle: Register per cpuidle device)
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Fix build regression in the intel_pstate driver that doesn't
build without CONFIG_ACPI after recent changes (Dominik Brodowski).
- One of the heuristics in the menu cpuidle governor is based on a
function returning 0 most of the time, so drop it and clean up
the scheduler code related to it (Daniel Lezcano).
- Prevent the arm_big_little cpufreq driver from being used on ARM64
which is not suitable for it and drop the arm_big_little_dt driver
that is not used any more (Sudeep Holla).
- Prevent the hung task watchdog from triggering during resume from
system-wide sleep states by disabling it before freezing tasks and
enabling it again after they have been thawed (Vitaly Kuznetsov).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJb2BJ7AAoJEILEb/54YlRx/kwP/iD7tUUZ6mT84OI0FTbEj8A/
fM+uHrwy25PmqyWGGtbHpaWU9OxVxUReSicsBCt+2LZmX3sFYpbSb243mv3pmxqb
A0kLflG4lWCKJNIfa/a3OMDTUw26mxSTCidE3jJXkd8HkWrzeAWvMair+UCuzMf3
A4Omu0IkNL8C0MKtUOb3PlUk3dnLYMxuairNhozBPhi+P+0tLW9/9XvgPJBVhnbZ
CKn/aFsDoc08tAfxC8N32cgKwE7nbeIgTJTBFyu2lQmInsd4TTuoM50vSC5i+x88
AmBOoH9IX0fhXJ6hgm+VMW8+x9S+H7jAVy/3C2xoUBeCclzlxX6eUCtjV5YNZqqn
1nXQfGeAwgzX6Tyu6HjM7vjbfObk59ZwpmDRPJEUEhLDEBMS+iDStlp9zmKTedNm
G4iSTzS6qJCNPtx4y5wkLp/FvzTofIuWqVFJSJC4+EoVKkbbw9xwaY+JKXUt1Uwx
j+U6EtRhzL/kVX0nq+iQXXeANxCFNzI56Ov5O7mxjF1m/hDE/Gb2QEeIb6nRZC2A
H3I2so2J3h1yTgadpGFFvJWaqfHkgcBTsm06tSgHVb86quiTANJIQ9mqfFyOzDDJ
KaZ82MROt7UuCMI6X9n+oIBDZWLHmADge6RdHCD1wB+zrUmusCtNEHUZACXd0mPf
s8MUK4bWVhViVXGS5bMP
=/bnR
-----END PGP SIGNATURE-----
Merge tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These remove a questionable heuristic from the menu cpuidle governor,
fix a recent build regression in the intel_pstate driver, clean up ARM
big-Little support in cpufreq and fix up hung task watchdog's
interaction with system-wide power management transitions.
Specifics:
- Fix build regression in the intel_pstate driver that doesn't build
without CONFIG_ACPI after recent changes (Dominik Brodowski).
- One of the heuristics in the menu cpuidle governor is based on a
function returning 0 most of the time, so drop it and clean up the
scheduler code related to it (Daniel Lezcano).
- Prevent the arm_big_little cpufreq driver from being used on ARM64
which is not suitable for it and drop the arm_big_little_dt driver
that is not used any more (Sudeep Holla).
- Prevent the hung task watchdog from triggering during resume from
system-wide sleep states by disabling it before freezing tasks and
enabling it again after they have been thawed (Vitaly Kuznetsov)"
* tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
kernel: hung_task.c: disable on suspend
cpufreq: remove unused arm_big_little_dt driver
cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64
cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI
cpuidle: menu: Remove get_loadavg() from the performance multiplier
sched: Factor out nr_iowait and nr_iowait_cpu
There are several definitions of those functions/macros in places that
mess with fixed-point load averages. Provide an official version.
[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function get_loadavg() returns almost always zero. To be more
precise, statistically speaking for a total of 1023379 times passing
in the function, the load is equal to zero 1020728 times, greater than
100, 610 times, the remaining is between 0 and 5.
In 2011, the get_loadavg() was removed from the Android tree because
of the above [1]. At this time, the load was:
unsigned long this_cpu_load(void)
{
struct rq *this = this_rq();
return this->cpu_load[0];
}
In 2014, the code was changed by commit 372ba8cb46 (cpuidle: menu: Lookup CPU
runqueues less) and the load is:
void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
{
struct rq *rq = this_rq();
*nr_waiters = atomic_read(&rq->nr_iowait);
*load = rq->load.weight;
}
with the same result.
Both measurements show using the load in this code path does no matter
anymore. Removing it.
[1] 4dedd9f124
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the minimum interval taken into account in the average computation
loop in get_typical_interval() is less than the expected idle
duration determined so far, the resultant average cannot be greater
than that value as well and the entire return result of the function
is going to be discarded anyway going forward.
In that case, it is a waste of time to carry out the remaining
computations in get_typical_interval(), so avoid that by returning
early if the minimum interval is not below the expected idle duration.
No intentional changes of behavior.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Since the correction factor cannot be greater than RESOLUTION * DECAY,
the result of the predicted_us computation in menu_select() cannot be
greater than data->next_timer_us, so it is not necessary to compare
the "typical interval" value coming from get_typical_interval() with
data->next_timer_us separately.
It is sufficient to copmare predicted_us with the return value of
get_typical_interval() directly, so do that and drop the now
redundant expected_interval variable.
No intentional changes of behavior.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
After some recent menu governor changes, the promotion of the
"polling" state to a physical one is mostly controlled by the
latency limit (resulting from the "interactivity" factor) and
not by the time to the closest timer event, so it should be
sufficient to check the exit latency of that state for this
purpose (of course, its target residency still needs to be
within the next timer event range for energy-efficiency).
Also, the physical state the "polling" one is promoted to need not
be the next one in principle (in case the next state is disabled,
for example).
For these reasons, simplify the checks made to decide whether or
not to promote the "polling" state to a physical one and update
the target idle duration when it is promoted in case the residency
of the new state turns out to be above the tick boundary (in which
case there is no reason to stop the tick).
Tested-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If need_resched() returns "false", breaking out of the loop in
poll_idle() will cause a new idle state to be selected, so in fact
it usually doesn't make sense to spin in it longer than the target
residency of the second state. [Note that the "polling" state is
used only if there is at least one "real" state defined in addition
to it, so the second state is always there.] On the other hand,
breaking out of it early (say in case the next state is disabled)
shouldn't hurt as it is polling anyway.
For this reason, make the loop in poll_idle() break if the CPU has
been spinning longer than the target residency of the second state
(the "polling" state can only be state[0]).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
It is better to always update data->bucket before returning from
menu_select() to avoid updating the correction factor for a stale
bucket, so combine the latency_req == 0 special check with the more
general check below.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
If the next timer event (not including the tick) is closer than the
target residency of the second state or the PM QoS latency constraint
is below its exit latency, state[0] will be used regardless of any
other factors, so skip the computations in menu_select() then and
return 0 straight away from it.
Still, do that after the bucket has been determined to avoid
updating the correction factor for a stale bucket.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
It is not necessary to update data->last_state_idx in menu_select()
as it only is used in menu_update() which only runs when
data->needs_update is set and that is set only when updating
data->last_state_idx in menu_reflect().
Accordingly, drop the update of data->last_state_idx from
menu_select() and get rid of the (now redundant) "out" label
from it.
No intentional behavior changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Rearrange the code in menu_select() so that the loop over idle states
always starts from 0 and get rid of the first_idx variable.
While at it, add two empty lines to separate conditional statements
from one another.
No intentional behavior changes.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Since menu_select() can only set first_idx to 1 if the exit latency
of the second state is not greater than the latency limit, it should
first determine that limit. Thus first_idx should be computed after
the "interactivity" factor has been taken into account.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewedy-by: Daniel Lezcano <daniel.lezcano@linaro.org>
If the CPU exits the "polling" state due to the time limit in the
loop in poll_idle(), this is not a real wakeup and it just means
that the "polling" state selection was not adequate. The governor
mispredicted short idle duration, but had a more suitable state been
selected, the CPU might have spent more time in it. In fact, there
is no reason to expect that there would have been a wakeup event
earlier than the next timer in that case.
Handling such cases as regular wakeups in menu_update() may cause the
menu governor to make suboptimal decisions going forward, but ignoring
them altogether would not be correct either, because every time
menu_select() is invoked, it makes a separate new attempt to predict
the idle duration taking distinct time to the closest timer event as
input and the outcomes of all those attempts should be recorded.
For this reason, make menu_update() always assume that if the
"polling" state was exited due to the time limit, the next proper
wakeup event for the CPU would be the next timer event (not
including the tick).
Fixes: a37b969a61 "cpuidle: poll_state: Add time limit to poll_idle()"
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
The predicted_us field in struct menu_device is only accessed in
menu_select(), so replace it with a local variable in that function.
With that, stop using expected_interval instead of predicted_us to
store the new predicted idle duration value if it is set to the
selected state's target residency which is quite confusing.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Currently, ktime_us_delta() is invoked unconditionally to compute the
idle residency of the CPU, but it only makes sense to do that if a
valid idle state has been entered, so move the ktime_us_delta()
invocation after the entered_state >= 0 check.
While at it, merge two comment blocks in there into one and drop
a space between type casting of diff.
This patch has no functional changes.
Signed-off-by: Fieah Lim <kw@fieahl.im>
[ rjw: Changelog cleanup, comment format fix ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpuidle_get_last_residency() is just a wrapper for retrieving
the last_residency member of struct cpuidle_device. It's also
weirdly the only wrapper function for accessing cpuidle_* struct
member (by my best guess is it could be a leftover from v2.x).
Anyhow, since the only two users (the ladder and menu governors)
can access dev->last_residency directly, and it's more intuitive to
do it that way, let's just get rid of the wrapper.
This patch tidies up CPU idle code a bit without functional changes.
Signed-off-by: Fieah Lim <kw@fieahl.im>
[ rjw: Changelog cleanup ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The case addressed by commit 5ef499cd57 (cpuidle: menu: Handle
stopped tick more aggressively) in the stopped tick case is present
when the tick has not been stopped yet too. Namely, if only two CPU
idle states, shallow state A with target residency significantly
below the tick boundary and deep state B with target residency
significantly above it, are available and the predicted idle
duration is above the tick boundary, but below the target residency
of state B, state A will be selected and the CPU may spend indefinite
amount of time in it, which is not quite energy-efficient.
However, if the tick has not been stopped yet and the governor is
about to select a shallow idle state for the CPU even though the idle
duration predicted by it is above the tick boundary, it should be
fine to wake up the CPU early, so the tick can be retained then and
the governor will have a chance to select a deeper state when it runs
next time.
[Note that when this really happens, it will make the idle duration
predictor believe that the CPU might be idle longer than predicted,
which will make it more likely to predict longer idle durations going
forward, but that will also cause deeper idle states to be selected
going forward, on average, which is what's needed here.]
Fixes: 87c9fe6ee4 (cpuidle: menu: Avoid selecting shallow states with stopped tick)
Reported-by: Leo Yan <leo.yan@linaro.org>
Cc: 4.17+ <stable@vger.kernel.org> # 4.17+: 5ef499cd57 (cpuidle: menu: Handle ...)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Make the idle loop handle stopped scheduler tick correctly (Rafael
Wysocki).
- Prevent the menu cpuidle governor from letting CPUs spend too much
time in shallow idle states when it is invoked with scheduler tick
stopped and clean it up somewhat (Rafael Wysocki).
- Avoid invoking the platform firmware to make the platform enter
the ACPI S3 sleep state with suspended PCIe root ports which may
confuse the firmware and cause it to crash (Rafael Wysocki).
- Fix sysfs-related race in the ondemand and conservative cpufreq
governors which may cause the system to crash if the governor
module is removed during an update of CPU frequency limits (Henry
Willard).
- Select SRCU when building the system wakeup framework to avoid a
build issue in it (zhangyi).
- Make the descriptions of ACPI C-states vendor-neutral to avoid
confusion (Prarit Bhargava).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJbfSnVAAoJEILEb/54YlRxBn4QAKQ8PqkSYkBby+1hb90ET4dk
VaLkbCYXuzLK5rIDvnbYOALhVKo4B29Ex5GdCLN7cWkZMkrVKe7oX8QQTnp3/7lF
URjTKgTNec5uJG652PrE3ESAa3X/kYggj6aeQOxDR4iYKzcpJEQ92ekFW+SoJTNp
Jc2kZh3qkC2On64GB3ibsZaKnmHfPvLg0t4agwzuYq/Gff8NRJFk7kMwAPzqGzZo
b2UVRcYFWIRkJjgmU9iInoeHIY8mBdT3IiKwTemZP1dOhb5T1AHOXwGTk6/cS+RH
A9qx4eg7I3R00KmnYvO8WytYJeOu2qb83GIUx4fIJGOqfvevm5xkxB9F+nfE+ouj
ROBqO4+X4XfQGPw8slayg0rJjI9JSkXLnLdl0Qw2WRlbc4/fVWntra1C57EeKFBR
EG9UAF9+7nUUx0bOCLsfFF3+r9R3SDUjk7b4thyhYncyQRsYC+FL7ztlxnMzVtzW
M5SF2sPrpcQzqmcszdUdbESI10n5X8m/crJW4rsbTxBpAM+coO+uLcvHWOY4MpkW
BgBsR6bMDAlG/VlTFgeeP/tkCRd5zNlJi7yBFItXuOoVKXpnHCJuxq2WkZ1Rb74M
Gk1d3TduekHJm8VsLEdCJR/tEk1cMc0zVUD/a1yzI4Z21QxvXUCqMDdws4/Ey184
qmKgNR9R94vSC5xIPRhM
=9GrU
-----END PGP SIGNATURE-----
Merge tag 'pm-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These fix the main idle loop and the menu cpuidle governor, clean up
the latter, fix a mistake in the PCI bus type's support for system
suspend and resume, fix the ondemand and conservative cpufreq
governors, address a build issue in the system wakeup framework and
make the ACPI C-states desciptions less confusing.
Specifics:
- Make the idle loop handle stopped scheduler tick correctly (Rafael
Wysocki).
- Prevent the menu cpuidle governor from letting CPUs spend too much
time in shallow idle states when it is invoked with scheduler tick
stopped and clean it up somewhat (Rafael Wysocki).
- Avoid invoking the platform firmware to make the platform enter the
ACPI S3 sleep state with suspended PCIe root ports which may
confuse the firmware and cause it to crash (Rafael Wysocki).
- Fix sysfs-related race in the ondemand and conservative cpufreq
governors which may cause the system to crash if the governor
module is removed during an update of CPU frequency limits (Henry
Willard).
- Select SRCU when building the system wakeup framework to avoid a
build issue in it (zhangyi).
- Make the descriptions of ACPI C-states vendor-neutral to avoid
confusion (Prarit Bhargava)"
* tag 'pm-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: menu: Handle stopped tick more aggressively
sched: idle: Avoid retaining the tick when it has been stopped
PCI / ACPI / PM: Resume all bridges on suspend-to-RAM
cpuidle: menu: Update stale polling override comment
cpufreq: governor: Avoid accessing invalid governor_data
x86/ACPI/cstate: Make APCI C1 FFH MWAIT C-state description vendor-neutral
cpuidle: menu: Fix white space
PM / sleep: wakeup: Fix build error caused by missing SRCU support
Commit 87c9fe6ee4 (cpuidle: menu: Avoid selecting shallow states
with stopped tick) missed the case when the target residencies of
deep idle states of CPUs are above the tick boundary which may cause
the CPU to get stuck in a shallow idle state for a long time.
Say there are two CPU idle states available: one shallow, with the
target residency much below the tick boundary and one deep, with
the target residency significantly above the tick boundary. In
that case, if the tick has been stopped already and the expected
next timer event is relatively far in the future, the governor will
assume the idle duration to be equal to TICK_USEC and it will select
the idle state for the CPU accordingly. However, that will cause the
shallow state to be selected even though it would have been more
energy-efficient to select the deep one.
To address this issue, modify the governor to always use the time
till the closest timer event instead of the predicted idle duration
if the latter is less than the tick period length and the tick has
been stopped already. Also make it extend the search for a matching
idle state if the tick is stopped to avoid settling on a shallow
state if deep states with target residencies above the tick period
length are available.
In addition, make it always indicate that the tick should be stopped
if it has been stopped already for consistency.
Fixes: 87c9fe6ee4 (cpuidle: menu: Avoid selecting shallow states with stopped tick)
Reported-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: 4.17+ <stable@vger.kernel.org> # 4.17+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Notable changes:
- A fix for a bug in our page table fragment allocator, where a page table page
could be freed and reallocated for something else while still in use, leading
to memory corruption etc. The fix reuses pt_mm in struct page (x86 only) for
a powerpc only refcount.
- Fixes to our pkey support. Several are user-visible changes, but bring us in
to line with x86 behaviour and/or fix outright bugs. Thanks to Florian Weimer
for reporting many of these.
- A series to improve the hvc driver & related OPAL console code, which have
been seen to cause hardlockups at times. The hvc driver changes in particular
have been in linux-next for ~month.
- Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.
- Remove Power8 DD1 and Power9 DD1 support, neither chip should be in use
anywhere other than as a paper weight.
- An optimised memcmp implementation using Power7-or-later VMX instructions
- Support for barrier_nospec on some NXP CPUs.
- Support for flushing the count cache on context switch on some IBM CPUs
(controlled by firmware), as a Spectre v2 mitigation.
- A series to enhance the information we print on unhandled signals to bring it
into line with other arches, including showing the offending VMA and dumping
the instructions around the fault.
Thanks to:
Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Alexey
Spirkov, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar,
Arnd Bergmann, Bartosz Golaszewski, Benjamin Herrenschmidt, Bharat Bhushan,
Bjoern Noetel, Boqun Feng, Breno Leitao, Bryant G. Ly, Camelia Groza,
Christophe Leroy, Christoph Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt,
Darren Stevens, Dave Young, David Gibson, Diana Craciun, Finn Thain, Florian
Weimer, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel Stanley,
Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh Salgaonkar, Markus
Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues, Michael Hanselmann, Michael
Neuling, Michael Schmitz, Mukesh Ojha, Murilo Opsfelder Araujo, Nicholas
Piggin, Parth Y Shah, Paul Mackerras, Paul Menzel, Ram Pai, Randy Dunlap,
Rashmica Gupta, Reza Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff,
Scott Wood, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson,
Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat Rao
B, zhong jiang.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAlt2O6cTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgC7hD/4+cj796Df7GsVsIMxzQm7SS9dklIdO
JuKj2Nr5HRzTH59jWlXukLG9mfTNCFgFJB4gEpK1ArDOTcHTCI9RRsLZTZ/kum66
7Pd+7T40dLYXB5uecuUs0vMXa2fI3syKh1VLzACSXv3Dh9BBIKQBwW/aD2eww4YI
1fS5LnXZ2PSxfr6KNAC6ogZnuaiD0sHXOYrtGHq+S/TFC7+Z6ySa6+AnPS+hPVoo
/rHDE1Khr66aj7uk+PP2IgUrCFj6Sbj6hTVlS/iAuwbMjUl9ty6712PmvX9x6wMZ
13hJQI+g6Ci+lqLKqmqVUpXGSr6y4NJGPS/Hko4IivBTJApI+qV/tF2H9nxU+6X0
0RqzsMHPHy13n2torA1gC7ttzOuXPI4hTvm6JWMSsfmfjTxLANJng3Dq3ejh6Bqw
76EMowpDLexwpy7/glPpqNdsP4ySf2Qm8yq3mR7qpL4m3zJVRGs11x+s5DW8NKBL
Fl5SqZvd01abH+sHwv6NLaLkEtayUyohxvyqu2RU3zu5M5vi7DhqstybTPjKPGu0
icSPh7b2y10WpOUpC6lxpdi8Me8qH47mVc/trZ+SpgBrsuEmtJhGKszEnzRCOqos
o2IhYHQv3lQv86kpaAFQlg/RO+Lv+Lo5qbJ209V+hfU5nYzXpEulZs4dx1fbA+ze
fK8GEh+u0L4uJg==
=PzRz
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- A fix for a bug in our page table fragment allocator, where a page
table page could be freed and reallocated for something else while
still in use, leading to memory corruption etc. The fix reuses
pt_mm in struct page (x86 only) for a powerpc only refcount.
- Fixes to our pkey support. Several are user-visible changes, but
bring us in to line with x86 behaviour and/or fix outright bugs.
Thanks to Florian Weimer for reporting many of these.
- A series to improve the hvc driver & related OPAL console code,
which have been seen to cause hardlockups at times. The hvc driver
changes in particular have been in linux-next for ~month.
- Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.
- Remove Power8 DD1 and Power9 DD1 support, neither chip should be in
use anywhere other than as a paper weight.
- An optimised memcmp implementation using Power7-or-later VMX
instructions
- Support for barrier_nospec on some NXP CPUs.
- Support for flushing the count cache on context switch on some IBM
CPUs (controlled by firmware), as a Spectre v2 mitigation.
- A series to enhance the information we print on unhandled signals
to bring it into line with other arches, including showing the
offending VMA and dumping the instructions around the fault.
Thanks to: Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey
Kardashevskiy, Alexey Spirkov, Alistair Popple, Andrew Donnellan,
Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Bartosz Golaszewski,
Benjamin Herrenschmidt, Bharat Bhushan, Bjoern Noetel, Boqun Feng,
Breno Leitao, Bryant G. Ly, Camelia Groza, Christophe Leroy, Christoph
Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt, Darren Stevens, Dave
Young, David Gibson, Diana Craciun, Finn Thain, Florian Weimer,
Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel
Stanley, Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh
Salgaonkar, Markus Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues,
Michael Hanselmann, Michael Neuling, Michael Schmitz, Mukesh Ojha,
Murilo Opsfelder Araujo, Nicholas Piggin, Parth Y Shah, Paul
Mackerras, Paul Menzel, Ram Pai, Randy Dunlap, Rashmica Gupta, Reza
Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff, Scott Wood,
Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson, Thiago
Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat
Rao, zhong jiang"
* tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (234 commits)
powerpc/mm/book3s/radix: Add mapping statistics
powerpc/uaccess: Enable get_user(u64, *p) on 32-bit
powerpc/mm/hash: Remove unnecessary do { } while(0) loop
powerpc/64s: move machine check SLB flushing to mm/slb.c
powerpc/powernv/idle: Fix build error
powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range
powerpc/mm: remove warning about ‘type’ being set
powerpc/32: Include setup.h header file to fix warnings
powerpc: Move `path` variable inside DEBUG_PROM
powerpc/powermac: Make some functions static
powerpc/powermac: Remove variable x that's never read
cxl: remove a dead branch
powerpc/powermac: Add missing include of header pmac.h
powerpc/kexec: Use common error handling code in setup_new_fdt()
powerpc/xmon: Add address lookup for percpu symbols
powerpc/mm: remove huge_pte_offset_and_shift() prototype
powerpc/lib: Use patch_site to patch copy_32 functions once cache is enabled
powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
powerpc/fadump: merge adjacent memory ranges to reduce PT_LOAD segements
powerpc/fadump: handle crash memory ranges array index overflow
...
The comment to explain why the menu governor uses idle state 1
instead of idle state 0 as the first one sometimes is stale (among
other things it mentions a user setting not present any more),
so update it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Export pnv_idle_states and nr_pnv_idle_states so that its accessible to
cpuidle driver. Use properties from pnv_idle_states structure for powernv
cpuidle_init.
Signed-off-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It's perfectly fine to have multiple cpuidle driver compiled in the
build configuration. However, it's not good to throw error on driver
registration failure if some other driver is already initialised and
assigned. In such cases, __cpuidle_register_driver returns -EBUSY and
we can check for such error before throwing the error.
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support live
patching again.
- Add support for patching barrier_nospec in copy_from_user() and syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu Malaterre.
- Several series optimising various parts of the 32-bit code from Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K" ("GEFanuc,C2K"),
which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by Steve, and
a fix to our powernv cpuidle driver. Then there's a series touching mm, x86 and
fs/proc/task_mmu.c, which cleans up some details around pkey support. It was
ack'ed/reviewed by Ingo & Dave and has been in next for several weeks.
Thanks to:
Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al Viro, Andrew
Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Balbir Singh,
Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King, Dave
Hansen, Fabio Estevam, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Haren
Myneni, Hari Bathini, Ingo Molnar, Jonathan Neuschäfer, Josh Poimboeuf,
Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu
Malaterre, Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica Gupta, Ravi
Bangoria, Russell Currey, Sam Bobroff, Samuel Mendoza-Jonas, Segher
Boessenkool, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stewart Smith,
Thiago Jung Bauermann, Torsten Duwe, Vaibhav Jain, Wei Yongjun, Wolfram Sang,
Yisheng Xie, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJbGQKBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBq
TRAAioK7rz5xYMkxaM3Ng3ybobEeNAwQqOolz98xvmnB9SfDWNuc99vf8cGu0/fQ
zc8AKZ5RcnwipOjyGlxW9oa1ZhVq0xtYnQPiYLEKMdLQmh5D+C7+KpvAd1UElweg
ub40/xDySWfMujfuMSF9JDCWPIXyojt4Xg5nJKIVRrAm/3YMe/+i5Am7NWHuMCEb
aQmZtlYW5Mz81XY0968hjpUO6eKFRmsaM7yFAhGTXx6+oLRpGj1PZB4AwdRIKS2L
Ak7q/VgxtE4W+s3a0GK2s+eXIhGKeFuX9AVnx3nti+8/K1OqrqhDcLMUC/9JpCpv
EvOtO7dxPnZujHjdu4Eai/xNoo4h6zRy7bWqve9LoBM40CP5jljKzu1lwqqb5yO0
jC7/aXhgiSIxxcRJLjoI/TYpZPu40MifrkydmczykdPyPCnMIWEJDcj4KsRL/9Y8
9SSbJzRNC/SgQNTbUYPZFFi6G0QaMmlcbCb628k8QT+Gn3Xkdf/ZtxzqEyoF4Irq
46kFBsiSSK4Bu0rVlcUtJQLgdqytWULO6NKEYnD67laxYcgQd8pGFQ8SjZhRZLgU
q5LA3HIWhoAI4M0wZhOnKXO6JfiQ1UbO8gUJLsWsfF0Fk5KAcdm+4kb4jbI1H4Qk
Vol9WNRZwEllyaiqScZN9RuVVuH0GPOZeEH1dtWK+uWi0lM=
=ZlBf
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support
live patching again.
- Add support for patching barrier_nospec in copy_from_user() and
syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu
Malaterre.
- Several series optimising various parts of the 32-bit code from
Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K"
("GEFanuc,C2K"), which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by
Steve, and a fix to our powernv cpuidle driver. Then there's a series
touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details
around pkey support. It was ack'ed/reviewed by Ingo & Dave and has
been in next for several weeks.
Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al
Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd
Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe
Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain,
Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo
Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal,
Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre,
Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica
Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel
Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo,
Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe,
Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing"
* tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits)
powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
cpuidle: powernv: Fix promotion from snooze if next state disabled
powerpc: fix build failure by disabling attribute-alias warning in pci_32
ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"
powerpc: fix spelling mistake: "Usupported" -> "Unsupported"
powerpc/pkeys: Detach execute_only key on !PROT_EXEC
powerpc/powernv: copy/paste - Mask SO bit in CR
powerpc: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove support for Marvell mv64x60 i2c controller
powerpc/boot: Remove support for Marvell MPSC serial controller
powerpc/embedded6xx: Remove C2K board support
powerpc/lib: optimise PPC32 memcmp
powerpc/lib: optimise 32 bits __clear_user()
powerpc/time: inline arch_vtime_task_switch()
powerpc/Makefile: set -mcpu=860 flag for the 8xx
powerpc: Implement csum_ipv6_magic in assembly
powerpc/32: Optimise __csum_partial()
powerpc/lib: Adjust .balign inside string functions for PPC32
...
The commit 78eaa10f02 ("cpuidle: powernv/pseries: Auto-promotion of
snooze to deeper idle state") introduced a timeout for the snooze idle
state so that it could be eventually be promoted to a deeper idle
state. The snooze timeout value is static and set to the target
residency of the next idle state, which would train the cpuidle
governor to pick the next idle state eventually.
The unfortunate side-effect of this is that if the next idle state(s)
is disabled, the CPU will forever remain in snooze, despite the fact
that the system is completely idle, and other deeper idle states are
available.
This patch fixes the issue by dynamically setting the snooze timeout
to the target residency of the next enabled state on the device.
Before Patch:
POWER8 : Only nap disabled.
$ cpupower monitor sleep 30
sleep took 30.01297 seconds and exited with status 0
|Idle_Stats
PKG |CORE|CPU | snoo | Nap | Fast
0| 8| 0| 96.41| 0.00| 0.00
0| 8| 1| 96.43| 0.00| 0.00
0| 8| 2| 96.47| 0.00| 0.00
0| 8| 3| 96.35| 0.00| 0.00
0| 8| 4| 96.37| 0.00| 0.00
0| 8| 5| 96.37| 0.00| 0.00
0| 8| 6| 96.47| 0.00| 0.00
0| 8| 7| 96.47| 0.00| 0.00
POWER9: Shallow states (stop0lite, stop1lite, stop2lite, stop0, stop1,
stop2) disabled:
$ cpupower monitor sleep 30
sleep took 30.05033 seconds and exited with status 0
|Idle_Stats
PKG |CORE|CPU | snoo | stop | stop | stop | stop | stop | stop | stop | stop
0| 16| 0| 89.79| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00
0| 16| 1| 90.12| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00
0| 16| 2| 90.21| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00
0| 16| 3| 90.29| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00
After Patch:
POWER8 : Only nap disabled.
$ cpupower monitor sleep 30
sleep took 30.01200 seconds and exited with status 0
|Idle_Stats
PKG |CORE|CPU | snoo | Nap | Fast
0| 8| 0| 16.58| 0.00| 77.21
0| 8| 1| 18.42| 0.00| 75.38
0| 8| 2| 4.70| 0.00| 94.09
0| 8| 3| 17.06| 0.00| 81.73
0| 8| 4| 3.06| 0.00| 95.73
0| 8| 5| 7.00| 0.00| 96.80
0| 8| 6| 1.00| 0.00| 98.79
0| 8| 7| 5.62| 0.00| 94.17
POWER9: Shallow states (stop0lite, stop1lite, stop2lite, stop0, stop1,
stop2) disabled:
$ cpupower monitor sleep 30
sleep took 30.02110 seconds and exited with status 0
|Idle_Stats
PKG |CORE|CPU | snoo | stop | stop | stop | stop | stop | stop | stop | stop
0| 0| 0| 0.69| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 9.39| 89.70
0| 0| 1| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.05| 93.21
0| 0| 2| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 89.93
0| 0| 3| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 0.00| 93.26
Fixes: 78eaa10f02 ("cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is some code duplication related to the PM QoS handling between
the existing cpuidle governors, so move that code to a common helper
function and call that from the governors.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
PM_QOS_RESUME_LATENCY_NO_CONSTRAINT is defined as the 32-bit integer
maximum, so it is not necessary to test the return value of
dev_pm_qos_raw_read_value() against it directly in the menu and
ladder cpuidle governors.
Drop these redundant checks.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If the scheduler tick has been stopped already and the governor
selects a shallow idle state, the CPU can spend a long time in that
state if the selection is based on an inaccurate prediction of idle
time. That effect turns out to be relevant, so it needs to be
mitigated.
To that end, modify the menu governor to discard the result of the
idle time prediction if the tick is stopped and the predicted idle
time is less than the tick period length, unless the tick timer is
going to expire soon.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
If the tick isn't stopped, the target residency of the state selected
by the menu governor may be greater than the actual time to the next
tick and that means lost energy.
To avoid that, make tick_nohz_get_sleep_length() return the current
time to the next event (before stopping the tick) in addition to the
estimated one via an extra pointer argument and make menu_select()
use that value to refine the state selection when necessary.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Add a new pointer argument to cpuidle_select() and to the ->select
cpuidle governor callback to allow a boolean value indicating
whether or not the tick should be stopped before entering the
selected state to be returned from there.
Make the ladder governor ignore that pointer (to preserve its
current behavior) and make the menu governor return 'false" through
it if:
(1) the idle exit latency is constrained at 0, or
(2) the selected state is a polling one, or
(3) the expected idle period duration is within the tick period
range.
In addition to that, the correction factor computations in the menu
governor need to take the possibility that the tick may not be
stopped into account to avoid artificially small correction factor
values. To that end, add a mechanism to record tick wakeups, as
suggested by Peter Zijlstra, and use it to modify the menu_update()
behavior when tick wakeup occurs. Namely, if the CPU is woken up by
the tick and the return value of tick_nohz_get_sleep_length() is not
within the tick boundary, the predicted idle duration is likely too
short, so make menu_update() try to compensate for that by updating
the governor statistics as though the CPU was idle for a long time.
Since the value returned through the new argument pointer of
cpuidle_select() is not used by its caller yet, this change by
itself is not expected to alter the functionality of the code.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Rik reports that he sees an increase in CPU use in one benchmark
due to commit 612f1a22f067 "cpuidle: poll_state: Add time limit to
poll_idle()" that caused poll_idle() to call local_clock() in every
iteration of the loop. Utilization increase generally means more
non-idle time with respect to total CPU time (on the average) which
implies reduced CPU frequency.
Doug reports that limiting the rate of local_clock() invocations
in there causes much less power to be drawn during a CPU-intensive
parallel workload (with idle states 1 and 2 disabled to enforce more
state 0 residency).
These two reports together suggest that executing local_clock() on
multiple CPUs in parallel at a high rate may cause chips to get hot
and trigger thermal/power limits on them to kick in, so reduce the
rate of local_clock() invocations in poll_idle() to avoid that issue.
Fixes: 612f1a22f067 "cpuidle: poll_state: Add time limit to poll_idle()"
Reported-by: Rik van Riel <riel@surriel.com>
Reported-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Add a new attribute group called "s2idle" under the sysfs directory
of each cpuidle state that supports the ->enter_s2idle callback
and put two new attributes, "usage" and "time", into that group to
represent the number of times the given state was requested for
suspend-to-idle and the total time spent in suspend-to-idle after
requesting that state, respectively.
That will allow diagnostic information related to suspend-to-idle
to be collected without enabling advanced debug features and
analyzing dmesg output.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
All the needed code has been already merged to mach-exynos core in
commit af9971144d ("ARM: EXYNOS: add coupled cpuidle support for
Exynos3250"), so enable support for coupled variant also for Exynos3250
SoCs.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If poll_idle() is allowed to spin until need_resched() returns 'true',
it may actually spin for a much longer time than expected by the idle
governor, since set_tsk_need_resched() is not always called by the
timer interrupt handler. If that happens, the CPU may spend much
more time than anticipated in the "polling" state.
To prevent that from happening, limit the time of the spinning loop
in poll_idle().
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Doug Smythies <dsmythies@telus.net>
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Highlights:
- Enable support for memory protection keys aka "pkeys" on Power7/8/9 when
using the hash table MMU.
- Extend our interrupt soft masking to support masking PMU interrupts as well
as "normal" interrupts, and then use that to implement local_t for a ~4x
speedup vs the current atomics-based implementation.
- A new driver "ocxl" for "Open Coherent Accelerator Processor Interface
(OpenCAPI)" devices.
- Support for new device tree properties on PowerVM to describe hotpluggable
memory and devices.
- Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 64-bit VDSO.
- Freescale updates from Scott:
"Contains fixes for CPM GPIO and an FSL PCI erratum workaround, plus a
minor cleanup patch."
As well as quite a lot of other changes all over the place, and small fixes and
cleanups as always.
Thanks to:
Alan Modra, Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andreas
Schwab, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman
Khandual, Anton Blanchard, Arnd Bergmann, Balbir Singh, Benjamin
Herrenschmidt, Bhaktipriya Shridhar, Bryant G. Ly, Cédric Le Goater,
Christophe Leroy, Christophe Lombard, Cyril Bur, David Gibson, Desnes A. Nunes
do Rosario, Dmitry Torokhov, Frederic Barrat, Geert Uytterhoeven, Guilherme G.
Piccoli, Gustavo A. R. Silva, Gustavo Romero, Ivan Mikhaylov, Joakim
Tjernlund, Joe Perches, Josh Poimboeuf, Juan J. Alvarez, Julia Cartwright,
Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre,
Michael Bringmann, Michael Hanselmann, Michael Neuling, Nathan Fontenot,
Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Philippe Bergheaud, Ram Pai,
Russell Currey, Santosh Sivaraj, Scott Wood, Seth Forshee, Simon Guo, Stewart
Smith, Sukadev Bhattiprolu, Thiago Jung Bauermann, Vaibhav Jain, Vasyl
Gomonovych.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJadF6wExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYA2
nBAAnguCEyAIYpc+ffE3WU9xJEWxa6bKuVufHcUFVntGiGD+igmMS+SHp4ay3Aos
HcA4WFrpzNb2KZ++kmFWtAKWnMfCiW9xuYJNicjr7X5ZiVBEhLWN/mQCwBKs3p6L
5+HhvytcdkKVbEcyVjEGvRL40AyxXNOI02o6Co9X8vanHsmWB4q0eWe4PHstZqlg
6K6kazMp+NTvEFYwKNXDOvuHouKSL57l14SLROH7CpJkNTOQ9s+W59/LmnuCjRlu
o70b7iWOAEbF9tvMma1ksDZVNj7mSyaymLYCyOXu4CkuuleJacZYJ9oQGNddoIbC
wk7l93vPT/yze7DYg8x3uXpKcaDEvEepPuQ/ubz+UXFQWuJtl5ej6Cv+0eOmyZIs
+bjWhGHKdNttnsiPlTRCX/gWD13RE1dB6xXJlfOJ7Oz9OnXXK8ZKc1NTREbQXRWM
8tClAwf9upWpm86GHPVnyrgYbgZo5b1os4SoS8e3kESzakrQVQP7J376u2DtccRq
2AGqjJ+tl5tYPnhm8zG1cNrpqHHpgkNGqLS7DvWRg3EPmEKVQcltN1b/0aKaAjHA
aTRofjrVo+jJ4MX1uyEo59yNCEQPfjkmHRQdLwm+xjWTzEPfIMzpWyXm14tawDQf
OjcAe90W/qQ18brw4z+2BI14J76XziOSX/QcunOn1u/sqaM=
=3rYn
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights:
- Enable support for memory protection keys aka "pkeys" on Power7/8/9
when using the hash table MMU.
- Extend our interrupt soft masking to support masking PMU interrupts
as well as "normal" interrupts, and then use that to implement
local_t for a ~4x speedup vs the current atomics-based
implementation.
- A new driver "ocxl" for "Open Coherent Accelerator Processor
Interface (OpenCAPI)" devices.
- Support for new device tree properties on PowerVM to describe
hotpluggable memory and devices.
- Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE to the 64-bit
VDSO.
- Freescale updates from Scott: fixes for CPM GPIO and an FSL PCI
erratum workaround, plus a minor cleanup patch.
As well as quite a lot of other changes all over the place, and small
fixes and cleanups as always.
Thanks to: Alan Modra, Alastair D'Silva, Alexey Kardashevskiy,
Alistair Popple, Andreas Schwab, Andrew Donnellan, Aneesh Kumar K.V,
Anju T Sudhakar, Anshuman Khandual, Anton Blanchard, Arnd Bergmann,
Balbir Singh, Benjamin Herrenschmidt, Bhaktipriya Shridhar, Bryant G.
Ly, Cédric Le Goater, Christophe Leroy, Christophe Lombard, Cyril Bur,
David Gibson, Desnes A. Nunes do Rosario, Dmitry Torokhov, Frederic
Barrat, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo A. R. Silva,
Gustavo Romero, Ivan Mikhaylov, Joakim Tjernlund, Joe Perches, Josh
Poimboeuf, Juan J. Alvarez, Julia Cartwright, Kamalesh Babulal,
Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre, Michael
Bringmann, Michael Hanselmann, Michael Neuling, Nathan Fontenot,
Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Philippe Bergheaud,
Ram Pai, Russell Currey, Santosh Sivaraj, Scott Wood, Seth Forshee,
Simon Guo, Stewart Smith, Sukadev Bhattiprolu, Thiago Jung Bauermann,
Vaibhav Jain, Vasyl Gomonovych"
* tag 'powerpc-4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (199 commits)
powerpc/mm/radix: Fix build error when RADIX_MMU=n
macintosh/ams-input: Use true and false for boolean values
macintosh: change some data types from int to bool
powerpc/watchdog: Print the NIP in soft_nmi_interrupt()
powerpc/watchdog: regs can't be null in soft_nmi_interrupt()
powerpc/watchdog: Tweak watchdog printks
powerpc/cell: Remove axonram driver
rtc-opal: Fix handling of firmware error codes, prevent busy loops
powerpc/mpc52xx_gpt: make use of raw_spinlock variants
macintosh/adb: Properly mark continued kernel messages
powerpc/pseries: Fix cpu hotplug crash with memoryless nodes
powerpc/numa: Ensure nodes initialized for hotplug
powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes
powerpc/kernel: Block interrupts when updating TIDR
powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn
powerpc/mm/nohash: do not flush the entire mm when range is a single page
powerpc/pseries: Add Initialization of VF Bars
powerpc/pseries/pci: Associate PEs to VFs in configure SR-IOV
powerpc/eeh: Add EEH notify resume sysfs
powerpc/eeh: Add EEH operations to notify resume
...
For shared processor guests (e.g., KVM), add an idle polling mode rather
than immediately returning to the hypervisor when the guest CPU goes
idle.
Test setup is a 2 socket POWER9 with 4 guests running, each with vCPUs
equal to 1/2 of real of CPUs. Saturated each guest with tbench. Using
polling idle gives about 1.4x throughput.
Kernel compile speed was not changed significantly.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since e1689795a7 ("cpuidle: Add common time keeping and irq enabling"),
cpuidle drivers are expected to return from ->enter with irqs disabled.
Update the cpuidle-powernv snooze and cede loops to disable irqs before
returning.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since e1689795a7 ("cpuidle: Add common time keeping and irq enabling"),
cpuidle drivers are expected to return from ->enter with irqs disabled.
Update the cpuidle-powernv snooze loop to disable irqs before returning.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Checks if the new governor is NULL before updating the
cupidle_curr_governor.
Signed-off-by: gaurav jindal <gauravjindal1104@gmail.com>
[ rjw : Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Non-highlights:
- Five fixes for the >128T address space handling, both to fix bugs in our
implementation and to bring the semantics exactly into line with x86.
Highlights:
- Support for a new OPAL call on bare metal machines which gives us a true NMI
(ie. is not masked by MSR[EE]=0) for debugging etc.
- Support for Power9 DD2 in the CXL driver.
- Improvements to machine check handling so that uncorrectable errors can be
reported into the generic memory_failure() machinery.
- Some fixes and improvements for VPHN, which is used under PowerVM to notify
the Linux partition of topology changes.
- Plumbing to enable TM (transactional memory) without suspend on some Power9
processors (PPC_FEATURE2_HTM_NO_SUSPEND).
- Support for emulating vector loads form cache-inhibited memory, on some
Power9 revisions.
- Disable the fast-endian switch "syscall" by default (behind a CONFIG), we
believe it has never had any users.
- A major rework of the API drivers use when initiating and waiting for long
running operations performed by OPAL firmware, and changes to the
powernv_flash driver to use the new API.
- Several fixes for the handling of FP/VMX/VSX while processes are using
transactional memory.
- Optimisations of TLB range flushes when using the radix MMU on Power9.
- Improvements to the VAS facility used to access coprocessors on Power9, and
related improvements to the way the NX crypto driver handles requests.
- Implementation of PMEM_API and UACCESS_FLUSHCACHE for 64-bit.
Thanks to:
Alexey Kardashevskiy, Alistair Popple, Allen Pais, Andrew Donnellan, Aneesh
Kumar K.V, Arnd Bergmann, Balbir Singh, Benjamin Herrenschmidt, Breno Leitao,
Christophe Leroy, Christophe Lombard, Cyril Bur, Frederic Barrat, Gautham R.
Shenoy, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo Romero, Haren
Myneni, Joel Stanley, Kamalesh Babulal, Kautuk Consul, Markus Elfring, Masami
Hiramatsu, Michael Bringmann, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pedro Miraglia Franco de
Carvalho, Philippe Bergheaud, Sandipan Das, Seth Forshee, Shriya, Stephen
Rothwell, Stewart Smith, Sukadev Bhattiprolu, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, William A. Kennington III.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJaDXGuAAoJEFHr6jzI4aWAEqwP/0TA35KFAK6wqfkCf67z4q+O
I+5piI4eDV4jdCakfoIN1JfjhQRULNePSoCHTccan30mu/bm30p69xtOLL2/h5xH
Mhz/eDBAOo0lrT20nyZfYMW3FnM66wnNf++qJ0O+8L052r4WOB02J0k1uM1ST01D
5Lb5mUoxRLRzCgKRYAYWJifn+IFPUB9NMsvMTym94krAFlIjIzMEQXhDoln+jJMr
QmY5f1BTA/fLfXobn0zwoc/C1oa2PUtxd+rxbwGrLoZ6G843mMqUi90SMr5ybhXp
RzepnBTj4by3vOsnk/X1mANyaZfLsunp75FwnjHdPzKrAS/TuPp8D/iSxxE/PzEq
cLwJFBnFXSgQMefDErXxhHSDz2dAg5r14rsTpDcq2Ko8TPV4rPsuSfmbd9Txekb0
yWHsjoJUBBMl2QcWqIHl+AlV8j1RklF6solcTBcGnH1CZJMfa05VKXV7xGEvOHa0
RJ+/xPyR9KjoB/SUp++9Vmx/M6SwQYFOJlr3Zpg9LNtR8WpoPYu1E6eO+u1Hhzny
eJqaNstH+i+VdY9eqszkAsEBh8o9M/+b+7Wx7TetvU+v368CbXtgFYs9qy2oZjPF
t9sY/BHaHZ8eZ7I00an77a0fVV5B1PVASUtIz5CqkwGpMvX6Z6W2K/XUUFI61kuu
E06HS6Ht8UPJAzrAPUMl
=Rq81
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"A bit of a small release, I suspect in part due to me travelling for
KS. But my backlog of patches to review is smaller than usual, so I
think in part folks just didn't send as much this cycle.
Non-highlights:
- Five fixes for the >128T address space handling, both to fix bugs
in our implementation and to bring the semantics exactly into line
with x86.
Highlights:
- Support for a new OPAL call on bare metal machines which gives us a
true NMI (ie. is not masked by MSR[EE]=0) for debugging etc.
- Support for Power9 DD2 in the CXL driver.
- Improvements to machine check handling so that uncorrectable errors
can be reported into the generic memory_failure() machinery.
- Some fixes and improvements for VPHN, which is used under PowerVM
to notify the Linux partition of topology changes.
- Plumbing to enable TM (transactional memory) without suspend on
some Power9 processors (PPC_FEATURE2_HTM_NO_SUSPEND).
- Support for emulating vector loads form cache-inhibited memory, on
some Power9 revisions.
- Disable the fast-endian switch "syscall" by default (behind a
CONFIG), we believe it has never had any users.
- A major rework of the API drivers use when initiating and waiting
for long running operations performed by OPAL firmware, and changes
to the powernv_flash driver to use the new API.
- Several fixes for the handling of FP/VMX/VSX while processes are
using transactional memory.
- Optimisations of TLB range flushes when using the radix MMU on
Power9.
- Improvements to the VAS facility used to access coprocessors on
Power9, and related improvements to the way the NX crypto driver
handles requests.
- Implementation of PMEM_API and UACCESS_FLUSHCACHE for 64-bit.
Thanks to: Alexey Kardashevskiy, Alistair Popple, Allen Pais, Andrew
Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Balbir Singh, Benjamin
Herrenschmidt, Breno Leitao, Christophe Leroy, Christophe Lombard,
Cyril Bur, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven,
Guilherme G. Piccoli, Gustavo Romero, Haren Myneni, Joel Stanley,
Kamalesh Babulal, Kautuk Consul, Markus Elfring, Masami Hiramatsu,
Michael Bringmann, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pedro Miraglia
Franco de Carvalho, Philippe Bergheaud, Sandipan Das, Seth Forshee,
Shriya, Stephen Rothwell, Stewart Smith, Sukadev Bhattiprolu, Tyrel
Datwyler, Vaibhav Jain, Vaidyanathan Srinivasan, and William A.
Kennington III"
* tag 'powerpc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (151 commits)
powerpc/64s: Fix Power9 DD2.0 workarounds by adding DD2.1 feature
powerpc/64s: Fix masking of SRR1 bits on instruction fault
powerpc/64s: mm_context.addr_limit is only used on hash
powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation
powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary
powerpc/64s/hash: Fix fork() with 512TB process address space
powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
powerpc/64s/hash: Fix 512T hint detection to use >= 128T
powerpc: Fix DABR match on hash based systems
powerpc/signal: Properly handle return value from uprobe_deny_signal()
powerpc/fadump: use kstrtoint to handle sysfs store
powerpc/lib: Implement UACCESS_FLUSHCACHE API
powerpc/lib: Implement PMEM API
powerpc/powernv/npu: Don't explicitly flush nmmu tlb
powerpc/powernv/npu: Use flush_all_mm() instead of flush_tlb_mm()
powerpc/powernv/idle: Round up latency and residency values
powerpc/kprobes: refactor kprobe_lookup_name for safer string operations
powerpc/kprobes: Blacklist emulate_update_regs() from kprobes
powerpc/kprobes: Do not disable interrupts for optprobes and kprobes_on_ftrace
powerpc/kprobes: Disable preemption before invoking probe handler for optprobes
...
* pm-cpuidle:
intel_idle: Graceful probe failure when MWAIT is disabled
cpuidle: Avoid assignment in if () argument
cpuidle: Clean up cpuidle_enable_device() error handling a bit
cpuidle: ladder: Add per CPU PM QoS resume latency support
ARM: cpuidle: Refactor rollback operations if init fails
ARM: cpuidle: Correct driver unregistration if init fails
intel_idle: replace conditionals with static_cpu_has(X86_FEATURE_ARAT)
cpuidle: fix broadcast control when broadcast can not be entered
Conflicts:
drivers/idle/intel_idle.c
On PowerNV platforms, firmware provides exit latency and
target residency for each of the idle states in nano
seconds. Cpuidle framework expects the values in micro
seconds. Round up to nearest micro seconds to avoid errors
in cases where the values are defined as fractional micro
seconds.
Default idle state of 'snooze' has exit latency of zero. If
other states have fractional micro second exit latency, they
would get rounded down to zero micro second and make cpuidle
framework choose deeper idle state when snooze loop is the
right choice.
Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Clean up cpuidle_enable_device() to avoid doing an assignment
in an expression evaluated as an argument of if (), which also
makes the code in question more readable.
Signed-off-by: Gaurav Jindal <gauravjindal1104@gmail.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Do not fetch per CPU drv if cpuidle_curr_governor is NULL
to avoid useless per CPU processing.
Signed-off-by: Gaurav Jindal <gauravjindal1104@gmail.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Individual CPUs may have special requirements to not enter
deep idle states. For example, a CPU running real time
applications would not want to enter deep idle states to
avoid latency impacts. At the same time other CPUs that
do not have such a requirement could allow deep idle
states to save power.
This was already implemented in the menu governor.
Implementing similar changes in the ladder governor which
gets selected when CONFIG_NO_HZ and CONFIG_NO_HZ_IDLE are not
set. Refer following commits for the menu governor changes.
Signed-off-by: Ramesh Thomas <ramesh.thomas@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The special value of 0 for device resume latency PM QoS means
"no restriction", but there are two problems with that.
First, device resume latency PM QoS requests with 0 as the
value are always put in front of requests with positive
values in the priority lists used internally by the PM QoS
framework, causing 0 to be chosen as an effective constraint
value. However, that 0 is then interpreted as "no restriction"
effectively overriding the other requests with specific
restrictions which is incorrect.
Second, the users of device resume latency PM QoS have no
way to specify that *any* resume latency at all should be
avoided, which is an artificial limitation in general.
To address these issues, modify device resume latency PM QoS to
use S32_MAX as the "no constraint" value and 0 as the "no
latency at all" one and rework its users (the cpuidle menu
governor, the genpd QoS governor and the runtime PM framework)
to follow these changes.
Also add a special "n/a" value to the corresponding user space I/F
to allow user space to indicate that it cannot accept any resume
latencies at all for the given device.
Fixes: 85dc0b8a40 (PM / QoS: Make it possible to expose PM QoS latency constraints)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197323
Reported-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Tero Kristo <t-kristo@ti.com>
Reviewed-by: Ramesh Thomas <ramesh.thomas@intel.com>
MIPS will soon not be a part of Imagination Technologies, and as such
many @imgtec.com email addresses will no longer be valid. This patch
updates the addresses for those who:
- Have 10 or more patches in mainline authored using an @imgtec.com
email address, or any patches dated within the past year.
- Are still with Imagination but leaving as part of the MIPS business
unit, as determined from an internal email address list.
- Haven't already updated their email address (ie. JamesH) or expressed
a desire to be excluded (ie. Maciej).
- Acked v2 or earlier of this patch, which leaves Deng-Cheng, Matt &
myself.
New addresses are of the form firstname.lastname@mips.com, and all
verified against an internal email address list. An entry is added to
.mailmap for each person such that get_maintainer.pl will report the new
addresses rather than @imgtec.com addresses which will soon be dead.
Instances of the affected addresses throughout the tree are then
mechanically replaced with the new @mips.com address.
Signed-off-by: Paul Burton <paul.burton@mips.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@imgtec.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@mips.com>
Acked-by: Dengcheng Zhu <dengcheng.zhu@mips.com>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Matt Redfearn <matt.redfearn@mips.com>
Acked-by: Matt Redfearn <matt.redfearn@mips.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: trivial@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If init fails, we need execute two levels rollback operations: the first
level is for the failed CPU rollback operations, the second level is to
iterate all succeeded CPUs to cancel their registration; currently the
code uses one function to finish these two levels rollback operations.
This commit is to refactor rollback operations, so it adds a new
function arm_idle_init_cpu() to encapsulate one specified CPU driver
registration and rollback the first level operations; and use function
arm_idle_init() to iterate all CPUs and finish the second level's
rollback operations.
Suggested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If cpuidle init fails, the code misses to unregister the driver for
current CPU. Furthermore, we also need to rollback to cancel all
previous CPUs registration; but the code retrieves driver handler by
using function cpuidle_get_driver(), this function returns back
current CPU driver handler but not previous CPU's handler, which leads
to the failure handling code cannot unregister previous CPUs driver.
This commit fixes two mentioned issues, it adds error handling path
'goto out_unregister_drv' for current CPU driver unregistration; and
it is to replace cpuidle_get_driver() with cpuidle_get_cpu_driver(),
the later function can retrieve driver handler for previous CPUs
according to the CPU device handler so can unregister the driver
properly.
This patch also adds extra error handling paths 'goto out_kfree_dev'
and 'goto out_kfree_drv' and adjusts the freeing sentences for previous
CPUs; so make the code more readable for freeing 'dev' and 'drv'
structures.
Suggested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Fixes: d50a7d8acd (ARM: cpuidle: Support asymmetric idle definition)
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When failing to enter broadcast timer mode for an idle state that
requires it, a new state is selected that does not require broadcast,
but the broadcast variable remains set. This causes
tick_broadcast_exit to be called despite not having entered broadcast
mode.
This causes the WARN_ON_ONCE(!irqs_disabled()) to trigger in some
cases. It does not appear to cause problems for code today, but seems
to violate the interface so should be fixed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In case there are no DT idle states defined or
cpuidle_register_driver() fails, the copy of the idle driver is leaked:
unreferenced object 0xede0dc00 (size 1024):
comm "swapper/0", pid 1, jiffies 4294937431 (age 744.510s)
hex dump (first 32 bytes):
94 9e 0b c1 00 00 00 00 00 00 00 00 00 00 00 00 ................
57 46 49 00 00 00 00 00 00 00 00 00 00 00 00 00 WFI.............
backtrace:
[<c1295f04>] arm_idle_init+0x44/0x1ac
[<c0301e6c>] do_one_initcall+0x3c/0x16c
[<c1200d70>] kernel_init_freeable+0x110/0x1d0
[<c0cb3624>] kernel_init+0x8/0x114
[<c0307a98>] ret_from_fork+0x14/0x3c
So fix this by freeing the unregistered copy in error case.
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Fixes: d50a7d8acd (ARM: cpuidle: Support asymmetric idle definition)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull MIPS updates from Ralf Baechle:
"This is the main pull request for 4.14 for MIPS; below a summary of
the non-merge commits:
CM:
- Rename mips_cm_base to mips_gcr_base
- Specify register size when generating accessors
- Use BIT/GENMASK for register fields, order & drop shifts
- Add cluster & block args to mips_cm_lock_other()
CPC:
- Use common CPS accessor generation macros
- Use BIT/GENMASK for register fields, order & drop shifts
- Introduce register modify (set/clear/change) accessors
- Use change_*, set_* & clear_* where appropriate
- Add CM/CPC 3.5 register definitions
- Use GlobalNumber macros rather than magic numbers
- Have asm/mips-cps.h include CM & CPC headers
- Cluster support for topology functions
- Detect CPUs in secondary clusters
CPS:
- Read GIC_VL_IDENT directly, not via irqchip driver
DMA:
- Consolidate coherent and non-coherent dma_alloc code
- Don't use dma_cache_sync to implement fd_cacheflush
FPU emulation / FP assist code:
- Another series of 14 commits fixing corner cases such as NaN
propgagation and other special input values.
- Zero bits 32-63 of the result for a CLASS.D instruction.
- Enhanced statics via debugfs
- Do not use bools for arithmetic. GCC 7.1 moans about this.
- Correct user fault_addr type
Generic MIPS:
- Enhancement of stack backtraces
- Cleanup from non-existing options
- Handle non word sized instructions when examining frame
- Fix detection and decoding of ADDIUSP instruction
- Fix decoding of SWSP16 instruction
- Refactor handling of stack pointer in get_frame_info
- Remove unreachable code from force_fcr31_sig()
- Convert to using %pOF instead of full_name
- Remove the R6000 support.
- Move FP code from *_switch.S to *_fpu.S
- Remove unused ST_OFF from r2300_switch.S
- Allow platform to specify multiple its.S files
- Add #includes to various files to ensure code builds reliable and
without warning..
- Remove __invalidate_kernel_vmap_range
- Remove plat_timer_setup
- Declare various variables & functions static
- Abstract CPU core & VP(E) ID access through accessor functions
- Store core & VP IDs in GlobalNumber-style variable
- Unify checks for sibling CPUs
- Add CPU cluster number accessors
- Prevent direct use of generic_defconfig
- Make CONFIG_MIPS_MT_SMP default y
- Add __ioread64_copy
- Remove unnecessary inclusions of linux/irqchip/mips-gic.h
GIC:
- Introduce asm/mips-gic.h with accessor functions
- Use new GIC accessor functions in mips-gic-timer
- Remove counter access functions from irq-mips-gic.c
- Remove gic_read_local_vp_id() from irq-mips-gic.c
- Simplify shared interrupt pending/mask reads in irq-mips-gic.c
- Simplify gic_local_irq_domain_map() in irq-mips-gic.c
- Drop gic_(re)set_mask() functions in irq-mips-gic.c
- Remove gic_set_polarity(), gic_set_trigger(), gic_set_dual_edge(),
gic_map_to_pin() and gic_map_to_vpe() from irq-mips-gic.c.
- Convert remaining shared reg access, local int mask access and
remaining local reg access to new accessors
- Move GIC_LOCAL_INT_* to asm/mips-gic.h
- Remove GIC_CPU_INT* macros from irq-mips-gic.c
- Move various definitions to the driver
- Remove gic_get_usm_range()
- Remove __gic_irq_dispatch() forward declaration
- Remove gic_init()
- Use mips_gic_present() in place of gic_present and remove
gic_present
- Move gic_get_c0_*_int() to asm/mips-gic.h
- Remove linux/irqchip/mips-gic.h
- Inline __gic_init()
- Inline gic_basic_init()
- Make pcpu_masks a per-cpu variable
- Use pcpu_masks to avoid reading GIC_SH_MASK*
- Clean up mti, reserved-cpu-vectors handling
- Use cpumask_first_and() in gic_set_affinity()
- Let the core set struct irq_common_data affinity
microMIPS:
- Fix microMIPS stack unwinding on big endian systems
MIPS-GIC:
- SYNC after enabling GIC region
NUMA:
- Remove the unused parent_node() macro
R6:
- Constify r2_decoder_tables
- Add accessor & bit definitions for GlobalNumber
SMP:
- Constify smp ops
- Allow boot_secondary SMP op to return errors
VDSO:
- Drop gic_get_usm_range() usage
- Avoid use of linux/irqchip/mips-gic.h
Platform changes:
Alchemy:
- Add devboard machine type to cpuinfo
- update cpu feature overrides
- Threaded carddetect irqs for devboards
AR7:
- allow NULL clock for clk_get_rate
BCM63xx:
- Fix ENETDMA_6345_MAXBURST_REG offset
- Allow NULL clock for clk_get_rate
CI20:
- Enable GPIO and RTC drivers in defconfig
- Add ethernet and fixed-regulator nodes to DTS
Generic platform:
- Move Boston and NI 169445 FIT image source to their own files
- Include asm/bootinfo.h for plat_fdt_relocated()
- Include asm/time.h for get_c0_*_int()
- Include asm/bootinfo.h for plat_fdt_relocated()
- Include asm/time.h for get_c0_*_int()
- Allow filtering enabled boards by requirements
- Don't explicitly disable CONFIG_USB_SUPPORT
- Bump default NR_CPUS to 16
JZ4700:
- Probe the jz4740-rtc driver from devicetree
Lantiq:
- Drop check of boot select from the spi-falcon driver.
- Drop check of boot select from the lantiq-flash MTD driver.
- Access boot cause register in the watchdog driver through regmap
- Add device tree binding documentation for the watchdog driver
- Add docs for the RCU DT bindings.
- Convert the fpi bus driver to a platform_driver
- Remove ltq_reset_cause() and ltq_boot_select(
- Switch to a proper reset driver
- Switch to a new drivers/soc GPHY driver
- Add an USB PHY driver for the Lantiq SoCs using the RCU module
- Use of_platform_default_populate instead of __dt_register_buses
- Enable MFD_SYSCON to be able to use it for the RCU MFD
- Replace ltq_boot_select() with dummy implementation.
Loongson 2F:
- Allow NULL clock for clk_get_rate
Malta:
- Use new GIC accessor functions
NI 169445:
- Add support for NI 169445 board.
- Only include in 32r2el kernels
Octeon:
- Add support for watchdog of 78XX SOCs.
- Add support for watchdog of CN68XX SOCs.
- Expose support for mips32r1, mips32r2 and mips64r1
- Enable more drivers in config file
- Add support for accessing the boot vector.
- Remove old boot vector code from watchdog driver
- Define watchdog registers for 70xx, 73xx, 78xx, F75xx.
- Make CSR functions node aware.
- Allow access to CIU3 IRQ domains.
- Misc cleanups in the watchdog driver
Omega2+:
- New board, add support and defconfig
Pistachio:
- Enable Root FS on NFS in defconfig
Ralink:
- Add Mediatek MT7628A SoC
- Allow NULL clock for clk_get_rate
- Explicitly request exclusive reset control in the pci-mt7620 PCI driver.
SEAD3:
- Only include in 32 bit kernels by default
VoCore:
- Add VoCore as a vendor t0 dt-bindings
- Add defconfig file"
* '4.14-features' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (167 commits)
MIPS: Refactor handling of stack pointer in get_frame_info
MIPS: Stacktrace: Fix microMIPS stack unwinding on big endian systems
MIPS: microMIPS: Fix decoding of swsp16 instruction
MIPS: microMIPS: Fix decoding of addiusp instruction
MIPS: microMIPS: Fix detection of addiusp instruction
MIPS: Handle non word sized instructions when examining frame
MIPS: ralink: allow NULL clock for clk_get_rate
MIPS: Loongson 2F: allow NULL clock for clk_get_rate
MIPS: BCM63XX: allow NULL clock for clk_get_rate
MIPS: AR7: allow NULL clock for clk_get_rate
MIPS: BCM63XX: fix ENETDMA_6345_MAXBURST_REG offset
mips: Save all registers when saving the frame
MIPS: Add DWARF unwinding to assembly
MIPS: Make SAVE_SOME more standard
MIPS: Fix issues in backtraces
MIPS: jz4780: DTS: Probe the jz4740-rtc driver from devicetree
MIPS: Ci20: Enable RTC driver
watchdog: octeon-wdt: Add support for 78XX SOCs.
watchdog: octeon-wdt: Add support for cn68XX SOCs.
watchdog: octeon-wdt: File cleaning.
...
- Drop the P-state selection algorithm based on a PID controller
from intel_pstate and make it use the same P-state selection
method (based on the CPU load) for all types of systems in the
active mode (Rafael Wysocki, Srinivas Pandruvada).
- Rework the cpufreq core and governors to make it possible to
take cross-CPU utilization updates into account and modify the
schedutil governor to actually do so (Viresh Kumar).
- Clean up the handling of transition latency information in the
cpufreq core and untangle it from the information on which drivers
cannot do dynamic frequency switching (Viresh Kumar).
- Add support for new SoCs (MT2701/MT7623 and MT7622) to the
mediatek cpufreq driver and update its DT bindings (Sean Wang).
- Modify the cpufreq dt-platdev driver to autimatically create
cpufreq devices for the new (v2) Operating Performance Points
(OPP) DT bindings and update its whitelist of supported systems
(Viresh Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen,
Finley Xiao).
- Add support for Ux500 to the cpufreq-dt driver and drop the
obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).
- Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
Nguyen).
- Fix and clean up assorted issues in the cpufreq drivers and core
(Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).
- Update the IO-wait boost handling in the schedutil governor to
make it less aggressive (Joel Fernandes).
- Rework system suspend diagnostics to make it print fewer messages
to the kernel log by default, add a sysfs knob to allow more
suspend-related messages to be printed and add Low Power S0 Idle
constraints checks to the ACPI suspend-to-idle code (Rafael
Wysocki, Srinivas Pandruvada).
- Prefer suspend-to-idle over S3 on ACPI-based systems with the
ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
interface present in the ACPI tables (Rafael Wysocki).
- Update documentation related to system sleep and rename a number
of items in the code to make it cleare that they are related to
suspend-to-idle (Rafael Wysocki).
- Export a variable allowing device drivers to check the target
system sleep state from the core system suspend code (Florian
Fainelli).
- Clean up the cpuidle subsystem to handle the polling state on
x86 in a more straightforward way and to use %pOF instead of
full_name (Rafael Wysocki, Rob Herring).
- Update the devfreq framework to fix and clean up a few minor
issues (Chanwoo Choi, Rob Herring).
- Extend diagnostics in the generic power domains (genpd) framework
and clean it up slightly (Thara Gopinath, Rob Herring).
- Fix and clean up a couple of issues in the operating performance
points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).
- Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
(AVS) driver (David Wu).
- Fix the usage of notifiers in CPU power management on some
platforms (Alex Shi).
- Update the pm-graph system suspend/hibernation and boot profiling
utility (Todd Brandt).
- Make it possible to run the cpupower utility without CPU0 (Prarit
Bhargava).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZrcDJAAoJEILEb/54YlRx9FUQAIUKvWBAARc61ZIZXjbqZF1v
aEMOBuksFns0CMekdptSic6n4wc81E/XYMS8yDhOOMpyDzfAZsTWjmu+gKwN7w3l
E/yf/NVlhob9JZ7MqGgqD4EUFfFIaKBXPlWFdDi2rdCUXE2L8xJ7rla8i7zyZlc5
pYHfAppBbF4qUcEY4OoOVOOGRZCfMdiLXj0iZOhMX8Y6yLBRk/AjnVADYsF33hoj
gBEfomU+H0K5V8nQEp0ZFKDArPwL+oElHQj6i+nxBpGfPM5evvLXhHOyR6AsldJ5
J4YI1kMuQNSCmvHMqOTxTYyJf8Jcf3Fj4wcjwaVMVGceY1lz6McAKknnFnCqCvz+
mskn84gFCBCM8EoJDqRf0b9MQHcuRyQKM+yw4tjnR9r8yd32erb85ZWFHcPWYhCT
fZatNOwFFv2MU+2vo5J3yeUNSWIKT+uBjy+tKPbrDkUwpKZVRj3Oj+hP3Mq9NE8U
YBqltsj7tmrdA634zI8C7jfS6wF221S0fId/iPszwmPJaVn/lq8Ror7pWL5YI8U7
SCJFjiqDiGmAcQEkuWwFAQnscZkyHpO+Y3A+jfXl/izoaZETaI5+ceIHBaocm3+5
XrOOpHS3ik8EHf9ji0KFCKZ/pYDwllday3cBQPWo3sMIzpQ2lrjbqdnE1cVnBrld
OtHZAeD/jLUXuY6XW2jN
=mAiV
-----END PGP SIGNATURE-----
Merge tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"This time (again) cpufreq gets the majority of changes which mostly
are driver updates (including a major consolidation of intel_pstate),
some schedutil governor modifications and core cleanups.
There also are some changes in the system suspend area, mostly related
to diagnostics and debug messages plus some renames of things related
to suspend-to-idle. One major change here is that suspend-to-idle is
now going to be preferred over S3 on systems where the ACPI tables
indicate to do so and provide requsite support (the Low Power Idle S0
_DSM in particular). The system sleep documentation and the tools
related to it are updated too.
The rest is a few cpuidle changes (nothing major), devfreq updates,
generic power domains (genpd) framework updates and a few assorted
modifications elsewhere.
Specifics:
- Drop the P-state selection algorithm based on a PID controller from
intel_pstate and make it use the same P-state selection method
(based on the CPU load) for all types of systems in the active mode
(Rafael Wysocki, Srinivas Pandruvada).
- Rework the cpufreq core and governors to make it possible to take
cross-CPU utilization updates into account and modify the schedutil
governor to actually do so (Viresh Kumar).
- Clean up the handling of transition latency information in the
cpufreq core and untangle it from the information on which drivers
cannot do dynamic frequency switching (Viresh Kumar).
- Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek
cpufreq driver and update its DT bindings (Sean Wang).
- Modify the cpufreq dt-platdev driver to autimatically create
cpufreq devices for the new (v2) Operating Performance Points (OPP)
DT bindings and update its whitelist of supported systems (Viresh
Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley
Xiao).
- Add support for Ux500 to the cpufreq-dt driver and drop the
obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).
- Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
Nguyen).
- Fix and clean up assorted issues in the cpufreq drivers and core
(Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).
- Update the IO-wait boost handling in the schedutil governor to make
it less aggressive (Joel Fernandes).
- Rework system suspend diagnostics to make it print fewer messages
to the kernel log by default, add a sysfs knob to allow more
suspend-related messages to be printed and add Low Power S0 Idle
constraints checks to the ACPI suspend-to-idle code (Rafael
Wysocki, Srinivas Pandruvada).
- Prefer suspend-to-idle over S3 on ACPI-based systems with the
ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
interface present in the ACPI tables (Rafael Wysocki).
- Update documentation related to system sleep and rename a number of
items in the code to make it cleare that they are related to
suspend-to-idle (Rafael Wysocki).
- Export a variable allowing device drivers to check the target
system sleep state from the core system suspend code (Florian
Fainelli).
- Clean up the cpuidle subsystem to handle the polling state on x86
in a more straightforward way and to use %pOF instead of full_name
(Rafael Wysocki, Rob Herring).
- Update the devfreq framework to fix and clean up a few minor issues
(Chanwoo Choi, Rob Herring).
- Extend diagnostics in the generic power domains (genpd) framework
and clean it up slightly (Thara Gopinath, Rob Herring).
- Fix and clean up a couple of issues in the operating performance
points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).
- Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
(AVS) driver (David Wu).
- Fix the usage of notifiers in CPU power management on some
platforms (Alex Shi).
- Update the pm-graph system suspend/hibernation and boot profiling
utility (Todd Brandt).
- Make it possible to run the cpupower utility without CPU0 (Prarit
Bhargava)"
* tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits)
cpuidle: Make drivers initialize polling state
cpuidle: Move polling state initialization code to separate file
cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol
cpufreq: imx6q: Fix imx6sx low frequency support
cpufreq: speedstep-lib: make several arrays static, makes code smaller
PM: docs: Delete the obsolete states.txt document
PM: docs: Describe high-level PM strategies and sleep states
PM / devfreq: Fix memory leak when fail to register device
PM / devfreq: Add dependency on PM_OPP
PM / devfreq: Move private devfreq_update_stats() into devfreq
PM / devfreq: Convert to using %pOF instead of full_name
PM / AVS: rockchip-io: add io selectors and supplies for RV1108
cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
cpufreq: dt-platdev: Drop few entries from whitelist
cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
ARM: ux500: don't select CPUFREQ_DT
cpuidle: Convert to using %pOF instead of full_name
cpufreq: Convert to using %pOF instead of full_name
PM / Domains: Convert to using %pOF instead of full_name
cpufreq: Cap the default transition delay value to 10 ms
...
* pm-sleep:
ACPI / PM: Check low power idle constraints for debug only
PM / s2idle: Rename platform operations structure
PM / s2idle: Rename ->enter_freeze to ->enter_s2idle
PM / s2idle: Rename freeze_state enum and related items
PM / s2idle: Rename PM_SUSPEND_FREEZE to PM_SUSPEND_TO_IDLE
ACPI / PM: Prefer suspend-to-idle over S3 on some systems
platform/x86: intel-hid: Wake up Dell Latitude 7275 from suspend-to-idle
PM / suspend: Define pr_fmt() in suspend.c
PM / suspend: Use mem_sleep_labels[] strings in messages
PM / sleep: Put pm_test under CONFIG_PM_SLEEP_DEBUG
PM / sleep: Check pm_wakeup_pending() in __device_suspend_noirq()
PM / core: Add error argument to dpm_show_time()
PM / core: Split dpm_suspend_noirq() and dpm_resume_noirq()
PM / s2idle: Rearrange the main suspend-to-idle loop
PM / timekeeping: Print debug messages when requested
PM / sleep: Mark suspend/hibernation start and finish
PM / sleep: Do not print debug messages by default
PM / suspend: Export pm_suspend_target_state
Make the drivers that want to include the polling state into their
states table initialize it explicitly and drop the initialization of
it (which in fact is conditional, but that is not obvious from the
code) from the core.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Move the polling state initialization code to a separate file built
conditionally on CONFIG_ARCH_HAS_CPU_RELAX to get rid of the #ifdef
in driver.c.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
On some architectures the first (index 0) idle state is a polling
one and it doesn't really save energy, so there is the
CPUIDLE_DRIVER_STATE_START symbol allowing some pieces of
cpuidle code to avoid using that state.
However, this makes the code rather hard to follow. It is better
to explicitly avoid the polling state, so add a new cpuidle state
flag CPUIDLE_FLAG_POLLING to mark it and make the relevant code
check that flag for the first state instead of using the
CPUIDLE_DRIVER_STATE_START symbol.
In the ACPI processor driver that cannot always rely on the state
flags (like before the states table has been set up) define
a new internal symbol ACPI_IDLE_STATE_START equivalent to the
CPUIDLE_DRIVER_STATE_START one and drop the latter.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Up until now we have open-coded checks for whether CPUs are siblings,
with slight variations on whether we consider the package ID or not.
This will only get more complex when we introduce cluster support, so in
preparation for that this patch introduces a cpus_are_siblings()
function which can be used to check whether or not 2 CPUs are siblings
in a consistent manner.
By checking globalnumber with the VP ID masked out this also has the
neat side effect of being ready for multi-cluster systems already.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/17011/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
We currently have fields in struct cpuinfo_mips for the core & VP(E) ID
of a particular CPU, and various pieces of code directly access those
fields. This patch abstracts such access by introducing accessor
functions cpu_core(), cpu_set_core(), cpu_vpe_id() & cpu_set_vpe_id()
and having code that needs to access these values call those functions
rather than directly accessing the struct cpuinfo_mips fields. This
prepares us for changes to the way in which those values are stored in
later patches.
The cpu_vpe_id() function is introduced even though we already had a
cpu_vpe_id() macro for a couple of reasons:
1) It's more consistent with the core, and future cluster, accessors.
2) It ensures a sensible return type without explicit casts.
3) It's generally preferable to use functions rather than macros.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/17009/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
struct call_single_data is used in IPIs to transfer information between
CPUs. Its size is bigger than sizeof(unsigned long) and less than
cache line size. Currently it is not allocated with any explicit alignment
requirements. This makes it possible for allocated call_single_data to
cross two cache lines, which results in double the number of the cache lines
that need to be transferred among CPUs.
This can be fixed by requiring call_single_data to be aligned with the
size of call_single_data. Currently the size of call_single_data is the
power of 2. If we add new fields to call_single_data, we may need to
add padding to make sure the size of new definition is the power of 2
as well.
Fortunately, this is enforced by GCC, which will report bad sizes.
To set alignment requirements of call_single_data to the size of
call_single_data, a struct definition and a typedef is used.
To test the effect of the patch, I used the vm-scalability multiple
thread swap test case (swap-w-seq-mt). The test will create multiple
threads and each thread will eat memory until all RAM and part of swap
is used, so that huge number of IPIs are triggered when unmapping
memory. In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data, because of faster IPIs.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Huang, Ying <ying.huang@intel.com>
[ Add call_single_data_t and align with size of call_single_data. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we have a custom printf format specifier, convert users of
full_name to use %pOF instead. This is preparation to remove storing
of the full path string for each node.
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Rename the ->enter_freeze cpuidle driver callback to ->enter_s2idle
to make it clear that it is used for entering suspend-to-idle and
rename the related functions, variables and so on accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently, we use the opal call opal_slw_set_reg() to inform the
Sleep-Winkle Engine (SLW) to restore the contents of some of the
Hypervisor state on wakeup from deep idle states that lose full
hypervisor context (characterized by the flag
OPAL_PM_LOSE_FULL_CONTEXT).
However, the current code has a bug in that if opal_slw_set_reg()
fails, we don't disable the use of these deep states (winkle on
POWER8, stop4 onwards on POWER9).
This patch fixes this bug by ensuring that if programing the
sleep-winkle engine to restore the hypervisor states in
pnv_save_sprs_for_deep_states() fails, then we exclude such states by
clearing the OPAL_PM_LOSE_FULL_CONTEXT flag from
supported_cpuidle_states. As a result POWER8 will be prevented from
using winkle for CPU-Hotplug, and POWER9 will put the offlined CPUs to
the default stop state when available.
Further, we ensure in the initialization of the cpuidle-powernv driver
to only include those states whose flags are present in
supported_cpuidle_states, thereby skipping OPAL_PM_LOSE_FULL_CONTEXT
states when they have been disabled due to stop-api failure.
Fixes: 1e1601b38e ("powerpc/powernv/idle: Restore SPRs for deep idle
states via stop API.")
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Highlights include:
- Support for STRICT_KERNEL_RWX on 64-bit server CPUs.
- Platform support for FSP2 (476fpe) board
- Enable ZONE_DEVICE on 64-bit server CPUs.
- Generic & powerpc spin loop primitives to optimise busy waiting
- Convert VDSO update function to use new update_vsyscall() interface
- Optimisations to hypercall/syscall/context-switch paths
- Improvements to the CPU idle code on Power8 and Power9.
As well as many other fixes and improvements.
Thanks to:
Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman Khandual, Anton
Blanchard, Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
Lombard, Colin Ian King, Dan Carpenter, Gautham R. Shenoy, Hari Bathini, Ian
Munsie, Ivan Mikhaylov, Javier Martinez Canillas, Madhavan Srinivasan,
Masahiro Yamada, Matt Brown, Michael Neuling, Michal Suchanek, Murilo
Opsfelder Araujo, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul
Mackerras, Pavel Machek, Russell Currey, Santosh Sivaraj, Stephen Rothwell,
Thiago Jung Bauermann, Yang Li.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZXyPCAAoJEFHr6jzI4aWAI9QQAISf2x5y//cqCi4ISyQB5pTq
KLS/yQajNkQOw7c0fzBZOaH5Xd/SJ6AcKWDg8yDlpDR3+sRRsr98iIRECgKS5I7/
DxD9ywcbSoMXFQQo1ZMCp5CeuMUIJRtugBnUQM+JXCSUCPbznCY5DchDTLyTBTpO
MeMVhI//JxthhoOMA9MudiEGaYCU9ho442Z4OJUSiLUv8WRbvQX9pTqoc4vx1fxA
BWf2mflztBVcIfKIyxIIIlDLukkMzix6gSYPMCbC7lzkbnU7JSqKiheJXjo1gJS2
ePHKDxeNR2/QP0g/j3aT/MR1uTt9MaNBSX3gANE1xQ9OoJ8m1sOtCO4gNbSdLWae
eXhDnoiEp930DRZOeEioOItuWWoxFaMyYk3BMmRKV4QNdYL3y3TRVeL2/XmRGqYL
Lxz4IY/x5TteFEJNGcRX90uizNSi8AaEXPF16pUk8Ctt6eH3ZSwPMv2fHeYVCMr0
KFlKHyaPEKEoztyzLcUR6u9QB56yxDN58bvLpd32AeHvKhqyxFoySy59x9bZbatn
B2y8mmDItg860e0tIG6jrtplpOVvL8i5jla5RWFVoQDuxxrLAds3vG9JZQs+eRzx
Fiic93bqeUAS6RzhXbJ6QQJYIyhE7yqpcgv7ME1W87SPef3HPBk9xlp3yIDwdA2z
bBDvrRnvupusz8qCWrxe
=w8rj
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Support for STRICT_KERNEL_RWX on 64-bit server CPUs.
- Platform support for FSP2 (476fpe) board
- Enable ZONE_DEVICE on 64-bit server CPUs.
- Generic & powerpc spin loop primitives to optimise busy waiting
- Convert VDSO update function to use new update_vsyscall() interface
- Optimisations to hypercall/syscall/context-switch paths
- Improvements to the CPU idle code on Power8 and Power9.
As well as many other fixes and improvements.
Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman
Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt,
Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter,
Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier
Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown,
Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N.
Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek,
Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung
Bauermann, Yang Li"
* tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs
powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix
powerpc/mm/hash: Implement mark_rodata_ro() for hash
powerpc/vmlinux.lds: Align __init_begin to 16M
powerpc/lib/code-patching: Use alternate map for patch_instruction()
powerpc/xmon: Add patch_instruction() support for xmon
powerpc/kprobes/optprobes: Use patch_instruction()
powerpc/kprobes: Move kprobes over to patch_instruction()
powerpc/mm/radix: Fix execute permissions for interrupt_vectors
powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp()
powerpc/64s: Blacklist rtas entry/exit from kprobes
powerpc/64s: Blacklist functions invoked on a trap
powerpc/64s: Un-blacklist system_call() from kprobes
powerpc/64s: Move system_call() symbol to just after setting MSR_EE
powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
powerpc/64s: Convert .L__replay_interrupt_return to a local label
powerpc64/elfv1: Only dereference function descriptor for non-text symbols
cxl: Export library to support IBM XSL
powerpc/dts: Use #include "..." to include local DT
powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8
...
- Rework suspend-to-idle to allow it to take wakeup events signaled
by the EC into account on ACPI-based platforms in order to properly
support power button wakeup from suspend-to-idle on recent Dell
laptops (Rafael Wysocki).
That includes the core suspend-to-idle code rework, support for
the Low Power S0 _DSM interface, and support for the ACPI INT0002
Virtual GPIO device from Hans de Goede (required for USB keyboard
wakeup from suspend-to-idle to work on some machines).
- Stop trying to export the current CPU frequency via /proc/cpuinfo
on x86 as that is inaccurate and confusing (Len Brown).
- Rework the way in which the current CPU frequency is exported by
the kernel (over the cpufreq sysfs interface) on x86 systems with
the APERF and MPERF registers by always using values read from
these registers, when available, to compute the current frequency
regardless of which cpufreq driver is in use (Len Brown).
- Rework the PCI/ACPI device wakeup infrastructure to remove the
questionable and artificial distinction between "devices that
can wake up the system from sleep states" and "devices that can
generate wakeup signals in the working state" from it, which
allows the code to be simplified quite a bit (Rafael Wysocki).
- Fix the wakeup IRQ framework by making it use SRCU instead of
RCU which doesn't allow sleeping in the read-side critical
sections, but which in turn is expected to be allowed by the
IRQ bus locking infrastructure (Thomas Gleixner).
- Modify some computations in the intel_pstate driver to avoid
rounding errors resulting from them (Srinivas Pandruvada).
- Reduce the overhead of the intel_pstate driver in the HWP
(hardware-managed P-states) mode and when the "performance"
P-state selection algorithm is in use by making it avoid
registering scheduler callbacks in those cases (Len Brown).
- Rework the energy_performance_preference sysfs knob in
intel_pstate by changing the values that correspond to
different symbolic hint names used by it (Len Brown).
- Make it possible to use more than one cpuidle driver at the same
time on ARM (Daniel Lezcano).
- Make it possible to prevent the cpuidle menu governor from using
the 0 state by disabling it via sysfs (Nicholas Piggin).
- Add support for FFH (Fixed Functional Hardware) MWAIT in ACPI C1
on AMD systems (Yazen Ghannam).
- Make the CPPC cpufreq driver take the lowest nonlinear performance
information into account (Prashanth Prakash).
- Add support for hi3660 to the cpufreq-dt driver, fix the
imx6q driver and clean up the sfi, exynos5440 and intel_pstate
drivers (Colin Ian King, Krzysztof Kozlowski, Octavian Purdila,
Rafael Wysocki, Tao Wang).
- Fix a few minor issues in the generic power domains (genpd)
framework and clean it up somewhat (Krzysztof Kozlowski,
Mikko Perttunen, Viresh Kumar).
- Fix a couple of minor issues in the operating performance points
(OPP) framework and clean it up somewhat (Viresh Kumar).
- Fix a CONFIG dependency in the hibernation core and clean it up
slightly (Balbir Singh, Arvind Yadav, BaoJun Luo).
- Add rk3228 support to the rockchip-io adaptive voltage scaling
(AVS) driver (David Wu).
- Fix an incorrect bit shift operation in the RAPL power capping
driver (Adam Lessnau).
- Add support for the EPP field in the HWP (hardware managed
P-states) control register, HWP.EPP, to the x86_energy_perf_policy
tool and update msr-index.h with HWP.EPP values (Len Brown).
- Fix some minor issues in the turbostat tool (Len Brown).
- Add support for AMD family 0x17 CPUs to the cpupower tool and fix
a minor issue in it (Sherry Hurwitz).
- Assorted cleanups, mostly related to the constification of some
data structures (Arvind Yadav, Joe Perches, Kees Cook, Krzysztof
Kozlowski).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZWrICAAoJEILEb/54YlRxZYMQAIRhfbyDxKq+ByvSilUS8kTA
AItwJ8FFzykhiwN75Cqabg4rAGyWma7IRs1vzU7zeC1aEQIn+bTQtvk+utZNI+g2
ANFlDha20q/sXsP/CDMMTIAdW9tSOC0TOvFI9s2V2Y8dJZhoekO4ctx34FAfUS5d
Ao6rwSAWCMsCXcGaTAlqTA+TEJmBG7u6Iq6hq6ngltoFwOv3mWWBVn52VVaJ7SMp
9/IPbbLGMFAedrgEBRGCR+MME1xZZpvcZIJaTt1Mgn7Cx3cJaysIUAvqY/SsvFGq
5FcUTcF2qpK3+AGawiAxZIjvOBsGRtIwqKinNIzYWs/NjiIdzmgVAmTeuPtTqp+5
HFehUdtkFcnuDnLqSNzAaZUa7tw84cJkwnbVMnesx0MkG6rZ1SeL22E2Sabpcdsh
3Yo1ThzJSxi59DhiiE92EQnNCEjmCldRy+8q5Ag035muxl6EJYvuNBMnZv/BMCUn
ltSNOrmps1DlN+Col8ORIeNzQ1YjYzWMqKAYzSbyccm4ug/iSHx0/DuESmQ4GTlF
YCwkmqyWiHrBwpl51jc+4a7SGlMmKRqU+MJes0CjagaaqoUAb8qeBOpzEJ0yNwjZ
wtI41l6blE6kbMD3yqGdCfiB2S7GlPVoxa15eX1wRyLH3fLjwwrzJirEaiBS86tI
1PzHZEOmBlh3DYC6DBKA
=Wsph
-----END PGP SIGNATURE-----
Merge tag 'pm-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The big ticket items here are the rework of suspend-to-idle in order
to add proper support for power button wakeup from it on recent Dell
laptops and the rework of interfaces exporting the current CPU
frequency on x86.
In addition to that, support for a few new pieces of hardware is
added, the PCI/ACPI device wakeup infrastructure is simplified
significantly and the wakeup IRQ framework is fixed to unbreak the IRQ
bus locking infrastructure.
Also, there are some functional improvements for intel_pstate, tools
updates and small fixes and cleanups all over.
Specifics:
- Rework suspend-to-idle to allow it to take wakeup events signaled
by the EC into account on ACPI-based platforms in order to properly
support power button wakeup from suspend-to-idle on recent Dell
laptops (Rafael Wysocki).
That includes the core suspend-to-idle code rework, support for the
Low Power S0 _DSM interface, and support for the ACPI INT0002
Virtual GPIO device from Hans de Goede (required for USB keyboard
wakeup from suspend-to-idle to work on some machines).
- Stop trying to export the current CPU frequency via /proc/cpuinfo
on x86 as that is inaccurate and confusing (Len Brown).
- Rework the way in which the current CPU frequency is exported by
the kernel (over the cpufreq sysfs interface) on x86 systems with
the APERF and MPERF registers by always using values read from
these registers, when available, to compute the current frequency
regardless of which cpufreq driver is in use (Len Brown).
- Rework the PCI/ACPI device wakeup infrastructure to remove the
questionable and artificial distinction between "devices that can
wake up the system from sleep states" and "devices that can
generate wakeup signals in the working state" from it, which allows
the code to be simplified quite a bit (Rafael Wysocki).
- Fix the wakeup IRQ framework by making it use SRCU instead of RCU
which doesn't allow sleeping in the read-side critical sections,
but which in turn is expected to be allowed by the IRQ bus locking
infrastructure (Thomas Gleixner).
- Modify some computations in the intel_pstate driver to avoid
rounding errors resulting from them (Srinivas Pandruvada).
- Reduce the overhead of the intel_pstate driver in the HWP
(hardware-managed P-states) mode and when the "performance" P-state
selection algorithm is in use by making it avoid registering
scheduler callbacks in those cases (Len Brown).
- Rework the energy_performance_preference sysfs knob in intel_pstate
by changing the values that correspond to different symbolic hint
names used by it (Len Brown).
- Make it possible to use more than one cpuidle driver at the same
time on ARM (Daniel Lezcano).
- Make it possible to prevent the cpuidle menu governor from using
the 0 state by disabling it via sysfs (Nicholas Piggin).
- Add support for FFH (Fixed Functional Hardware) MWAIT in ACPI C1 on
AMD systems (Yazen Ghannam).
- Make the CPPC cpufreq driver take the lowest nonlinear performance
information into account (Prashanth Prakash).
- Add support for hi3660 to the cpufreq-dt driver, fix the imx6q
driver and clean up the sfi, exynos5440 and intel_pstate drivers
(Colin Ian King, Krzysztof Kozlowski, Octavian Purdila, Rafael
Wysocki, Tao Wang).
- Fix a few minor issues in the generic power domains (genpd)
framework and clean it up somewhat (Krzysztof Kozlowski, Mikko
Perttunen, Viresh Kumar).
- Fix a couple of minor issues in the operating performance points
(OPP) framework and clean it up somewhat (Viresh Kumar).
- Fix a CONFIG dependency in the hibernation core and clean it up
slightly (Balbir Singh, Arvind Yadav, BaoJun Luo).
- Add rk3228 support to the rockchip-io adaptive voltage scaling
(AVS) driver (David Wu).
- Fix an incorrect bit shift operation in the RAPL power capping
driver (Adam Lessnau).
- Add support for the EPP field in the HWP (hardware managed
P-states) control register, HWP.EPP, to the x86_energy_perf_policy
tool and update msr-index.h with HWP.EPP values (Len Brown).
- Fix some minor issues in the turbostat tool (Len Brown).
- Add support for AMD family 0x17 CPUs to the cpupower tool and fix a
minor issue in it (Sherry Hurwitz).
- Assorted cleanups, mostly related to the constification of some
data structures (Arvind Yadav, Joe Perches, Kees Cook, Krzysztof
Kozlowski)"
* tag 'pm-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (69 commits)
cpufreq: Update scaling_cur_freq documentation
cpufreq: intel_pstate: Clean up after performance governor changes
PM: hibernate: constify attribute_group structures.
cpuidle: menu: allow state 0 to be disabled
intel_idle: Use more common logging style
PM / Domains: Fix missing default_power_down_ok comment
PM / Domains: Fix unsafe iteration over modified list of domains
PM / Domains: Fix unsafe iteration over modified list of domain providers
PM / Domains: Fix unsafe iteration over modified list of device links
PM / Domains: Handle safely genpd_syscore_switch() call on non-genpd device
PM / Domains: Call driver's noirq callbacks
PM / core: Drop run_wake flag from struct dev_pm_info
PCI / PM: Simplify device wakeup settings code
PCI / PM: Drop pme_interrupt flag from struct pci_dev
ACPI / PM: Consolidate device wakeup settings code
ACPI / PM: Drop run_wake from struct acpi_device_wakeup_flags
PM / QoS: constify *_attribute_group.
PM / AVS: rockchip-io: add io selectors and supplies for rk3228
powercap/RAPL: prevent overridding bits outside of the mask
PM / sysfs: Constify attribute groups
...
The menu driver does not allow state0 to be disabled completely.
If it is disabled but other enabled states don't meet latency
requirements, it is still used.
Fix this by starting with the first enabled idle state. Fall back
to state 0 if no idle states are enabled (arguably this should be
-EINVAL if it is attempted, but this is the minimal fix).
Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
A memory barrier is not required after the task wakes up,
only if we clear the polling flag before waking. The case
where we have work to do is the important one, so optimise
for it.
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ensure these don't get put into bouncing cachelines.
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
local_irq_enable can cause interrupts to be taken which could
take significant amount of processing time. The idle process
should set its polling flag before this, so another process that
wakes it during this time will not have to send an IPI.
Expand the TIF_POLLING_NRFLAG coverage to as large as possible.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some hardware have clusters with different idle states. The current code does
not support this and fails as it expects all the idle states to be identical.
Because of this, the Mediatek mtk8173 had to create the same idle state for a
big.Little system and now the Hisilicon 960 is facing the same situation.
Solve this by simply assuming the multiple driver will be needed for all the
platforms using the ARM generic cpuidle driver which makes sense because of the
different topologies we can support with a single kernel for ARM32 or ARM64.
Every CPU has its own driver, so every single CPU can specify in the DT the
idle states.
This simple approach allows to support the future dynamIQ system, current SMP
and HMP.
Tested on:
- 96boards: Hikey 620
- 96boards: Hikey 960
- 96boards: dragonboard410c
- Mediatek 8173
Tested-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Conflicts:
kernel/sched/Makefile
Pick up the waitqueue related renames - it didn't get much feedback,
so it appears to be uncontroversial. Famous last words? ;-)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This simplifies the asm and fixes irq-off tracing over sleep
instructions.
Also move powersave_nap check for POWER8 into C code, and move
PSSCR register value calculation for POWER9 into C.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
'of_node_put()' should be called on pointer returned by
'of_parse_phandle()' when done. In this function this is done in all path
except this 'continue', so add it.
Fixes: 97735da074 (drivers: cpuidle: Add status property to ARM idle states)
Signed-off-by: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The current code in the cpuidle-powernv intialization only allows deep
stop states (indicated by OPAL_PM_STOP_INST_DEEP) which lose timebase
(indicated by OPAL_PM_TIMEBASE_STOP). This assumption goes back to
POWER8 time where deep states used to lose the timebase. However, on
POWER9, we do have stop states that are deep (they lose hypervisor
state) but retain the timebase.
Fix the initialization code in the cpuidle-powernv driver to allow
such deep states.
Further, there is a bug in cpuidle-powernv driver with
CONFIG_TICK_ONESHOT=n where we end up incrementing the nr_idle_states
even if a platform idle state which loses time base was not added to
the cpuidle table.
Fix this by ensuring that the nr_idle_states variable gets incremented
only when the platform idle state was added to the cpuidle table.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Ville reported that on his Core2, which has TSC stop in idle, we would
always report very short idle durations. He tracked this down to
commit:
e93e59ce5b ("cpuidle: Replace ktime_get() with local_clock()")
which replaces ktime_get() with local_clock().
Add a sched_clock_idle_wakeup_event() call, which will re-sync the
clock with ktime_get_ns() when TSC is unstable and no-op otherwise.
Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e93e59ce5b ("cpuidle: Replace ktime_get() with local_clock()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* pm-domains:
PM / Domains: Add DT file to MAINTAINERS
PM / Domains: Fix DT example
* pm-cpuidle:
x86/intel_idle: add Gemini Lake support
cpuidle: check dev before usage in cpuidle_use_deepest_state()
* pm-sleep:
ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle
PM / wakeup: Integrate mechanism to abort transitions in progress
* powercap:
powercap: intel_rapl: Add support for Gemini Lake
In case of there is no cpuidle devices registered, dev will be null, and
panic will be triggered like below;
In this patch, add checking of dev before usage, like that done in
cpuidle_idle_call.
Panic without fix:
[ 184.961328] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 184.961328] IP: cpuidle_use_deepest_state+0x30/0x60
...
[ 184.961328] play_idle+0x8d/0x210
[ 184.961328] ? __schedule+0x359/0x8e0
[ 184.961328] ? _raw_spin_unlock_irqrestore+0x28/0x50
[ 184.961328] ? kthread_queue_delayed_work+0x41/0x80
[ 184.961328] clamp_idle_injection_func+0x64/0x1e0
Fixes: bb8313b603 (cpuidle: Allow enforcing deepest idle state selection)
Signed-off-by: Li, Fei <fei.li@intel.com>
Tested-by: Shi, Feng <fengx.shi@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: 4.10+ <stable@vger.kernel.org> # 4.10+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When in the snooze_loop() we want to take up the least amount of
resources. On my version of gcc (6.3), we end up with an extra
branch because it predicts snooze_timeout_en to be false, whereas it
is almost always true.
Use likely() to avoid the branch and be a little nicer to the
other non idle threads on the core.
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The powerpc64 kernel exception handlers have preserved thread priorities
for a long time now, so there is no need to continually set it.
Just set it once on entry and once exit.
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The core of snooze_loop() continually bounces between low and very
low thread priority. Changing thread priorities is an expensive
operation that can negatively impact other threads on a core.
All CPUs that can run PowerNV support very low priority, so we can
avoid the change completely.
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
'core' in cps_cpuidle_init has never been used and is unnecessary, so
remove the dead code.
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
drv->cpumask defaults to cpu_possible_mask in __cpuidle_driver_init().
On PowerNV platform cpu_present could be less than cpu_possible in cases
where firmware detects the cpu, but it is not available to the OS. When
CONFIG_HOTPLUG_CPU=n, such cpus are not hotplugable at runtime and hence
we skip creating cpu_device.
This breaks cpuidle on powernv where register_cpu() is not called for
cpus in cpu_possible_mask that cannot be hot-added at runtime.
Trying cpuidle_register_device() on cpu without cpu_device will cause
crash like this:
cpu 0xf: Vector: 380 (Data SLB Access) at [c000000ff1503490]
pc: c00000000022c8bc: string+0x34/0x60
lr: c00000000022ed78: vsnprintf+0x284/0x42c
sp: c000000ff1503710
msr: 9000000000009033
dar: 6000000060000000
current = 0xc000000ff1480000
paca = 0xc00000000fe82d00 softe: 0 irq_happened: 0x01
pid = 1, comm = swapper/8
Linux version 4.11.0-rc2 (sv@sagarika) (gcc version 4.9.4
(Buildroot 2017.02-00004-gc28573e) ) #15 SMP Fri Mar 17 19:32:02 IST 2017
enter ? for help
[link register ] c00000000022ed78 vsnprintf+0x284/0x42c
[c000000ff1503710] c00000000022ebb8 vsnprintf+0xc4/0x42c (unreliable)
[c000000ff1503800] c00000000022ef40 vscnprintf+0x20/0x44
[c000000ff1503830] c0000000000ab61c vprintk_emit+0x94/0x2cc
[c000000ff15038a0] c0000000000acc9c vprintk_func+0x60/0x74
[c000000ff15038c0] c000000000619694 printk+0x38/0x4c
[c000000ff15038e0] c000000000224950 kobject_get+0x40/0x60
[c000000ff1503950] c00000000022507c kobject_add_internal+0x60/0x2c4
[c000000ff15039e0] c000000000225350 kobject_init_and_add+0x70/0x78
[c000000ff1503a60] c00000000053c288 cpuidle_add_sysfs+0x9c/0xe0
[c000000ff1503ae0] c00000000053aeac cpuidle_register_device+0xd4/0x12c
[c000000ff1503b30] c00000000053b108 cpuidle_register+0x98/0xcc
[c000000ff1503bc0] c00000000085eaf0 powernv_processor_idle_init+0x140/0x1e0
[c000000ff1503c60] c00000000000cd60 do_one_initcall+0xc0/0x15c
[c000000ff1503d20] c000000000833e84 kernel_init_freeable+0x1a0/0x25c
[c000000ff1503dc0] c00000000000d478 kernel_init+0x24/0x12c
[c000000ff1503e30] c00000000000b564 ret_from_kernel_thread+0x5c/0x78
This patch fixes the bug by passing correct cpumask from
powernv-cpuidle driver.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
[ rjw: Comment massage ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The various properties associated with powernv idle states such as
names, flags, residency-ns, latencies-ns, psscr, psscr-mask are
exposed in the device-tree as property arrays such the pointwise
entries in each of these arrays correspond to the properties of the
same idle state.
This patch validates that the lengths of the property arrays are the
same. If there is a mismatch, the patch will ensure that we bail out
and not expose the platform idle states via cpuidle.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If a given cpu is not in cpu_present and cpu hotplug
is disabled, arch can skip setting up the cpu_dev.
Arch cpuidle driver should pass correct cpu mask
for registration, but failing to do so by the driver
causes error to propagate and crash like this:
[ 30.076045] Unable to handle kernel paging request for data at address 0x00000048
[ 30.076100] Faulting instruction address: 0xc0000000007b2f30
cpu 0x4d: Vector: 300 (Data Access) at [c000003feb18b670]
pc: c0000000007b2f30: kobject_get+0x20/0x70
lr: c0000000007b3c94: kobject_add_internal+0x54/0x3f0
sp: c000003feb18b8f0
msr: 9000000000009033
dar: 48
dsisr: 40000000
current = 0xc000003fd2ed8300
paca = 0xc00000000fbab500 softe: 0 irq_happened: 0x01
pid = 1, comm = swapper/0
Linux version 4.11.0-rc2-svaidy+ (sv@sagarika) (gcc version 6.2.0
20161005 (Ubuntu 6.2.0-5ubuntu12) ) #10 SMP Sun Mar 19 00:08:09 IST 2017
enter ? for help
[c000003feb18b960] c0000000007b3c94 kobject_add_internal+0x54/0x3f0
[c000003feb18b9f0] c0000000007b43a4 kobject_init_and_add+0x64/0xa0
[c000003feb18ba70] c000000000e284f4 cpuidle_add_sysfs+0xb4/0x130
[c000003feb18baf0] c000000000e26038 cpuidle_register_device+0x118/0x1c0
[c000003feb18bb30] c000000000e26c48 cpuidle_register+0x78/0x120
[c000003feb18bbc0] c00000000168fd9c powernv_processor_idle_init+0x110/0x1c4
[c000003feb18bc40] c00000000000cff8 do_one_initcall+0x68/0x1d0
[c000003feb18bd00] c0000000016242f4 kernel_init_freeable+0x280/0x360
[c000003feb18bdc0] c00000000000d864 kernel_init+0x24/0x160
[c000003feb18be30] c00000000000b4e8 ret_from_kernel_thread+0x5c/0x74
Validating cpu_dev fixes the crash and reports correct error message like:
[ 30.163506] Failed to register cpuidle device for cpu136
[ 30.173329] Registration of powernv driver failed.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
[ rjw: Comment massage ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull sched.h split-up from Ingo Molnar:
"The point of these changes is to significantly reduce the
<linux/sched.h> header footprint, to speed up the kernel build and to
have a cleaner header structure.
After these changes the new <linux/sched.h>'s typical preprocessed
size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
lines), which is around 40% faster to build on typical configs.
Not much changed from the last version (-v2) posted three weeks ago: I
eliminated quirks, backmerged fixes plus I rebased it to an upstream
SHA1 from yesterday that includes most changes queued up in -next plus
all sched.h changes that were pending from Andrew.
I've re-tested the series both on x86 and on cross-arch defconfigs,
and did a bisectability test at a number of random points.
I tried to test as many build configurations as possible, but some
build breakage is probably still left - but it should be mostly
limited to architectures that have no cross-compiler binaries
available on kernel.org, and non-default configurations"
* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
sched/headers: Clean up <linux/sched.h>
sched/headers: Remove #ifdefs from <linux/sched.h>
sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
sched/core: Remove unused prefetch_stack()
sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
sched/headers: Remove <linux/signal.h> from <linux/sched.h>
sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
sched/headers: Remove the runqueue_is_locked() prototype
sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
...
- Fix for a cpuidle menu governor problem that started to take an
unnecessary spinlock after one of the recent updates and that
did not play well with the RT patch (Rafael Wysocki).
- Fix for the new intel_pstate operation mode switching feature
added recently that did not reinitialize P-state limits properly
when switching operation modes (Rafael Wysocki).
- Removal of unused global notifiers from the PM QoS framework
(Viresh Kumar).
- Generic power domains framework update to make it handle
asynchronous invocations of PM callbacks in the "noirq" phases
of system suspend/hibernation correctly (Ulf Hansson).
- Two hibernation core cleanups (Rafael Wysocki).
- intel_idle cleanup related to the sysfs interface (Len Brown).
- Off-by-one bug fix in the OPP (Operating Performance Points)
framework (Andrzej Hajda).
- OPP framework's documentation fix (Viresh Kumar).
- cpufreq qoriq driver cleanup (Tang Yuantian).
- Fixes for typos in comments in the device runtime PM framework
(Christophe Jaillet).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYuLeGAAoJEILEb/54YlRxJvcP/0BRmh8Hn4Itx/NIWNg71X6j
U+v8Pn8T3MP33gCcleLYlgre2JUIAUDmhdK99+UOx+/abhjMhQSaF3HhTOwYaPtQ
6njoHVS0NnfqUf+x5kp+EpRxBVNucYVbdRTVd1DsHIeLLz/96DFOzb/R7tko/pKx
pFMWvNdotHLLgXOG1UvdRimwTDlFMffxFzD8Se53LPjRXS0S73A5VWfqZOye44Re
j3W1AJ0Idgq5uduA6J8x1MWbaxDq1h+j6CSUm05yvqrINzxXwXt0Hv6stCQTo+Gb
YMdiBd8MujNyAgcchw3jiDQ8Vp+zmfLPcHrfPe//SSefj26eB8LyVNSYelvbUdOz
cNjvyErva37MmaegCL9QC7WbLM+A7VE6bU6YzDCi/rR8jYMJ51Fb9jGiYb/oimry
OLlblEekikUsskWv4hGV1JVt5VhmUMlagWtexxn+lMszATcZro0tfXu/vgQWksYs
noUnwuWJWxvj2aNMsvbzW3HLlTGSmYl2UxJ7IymQQaTDblwF9Kg61rm3+5coUctd
ifceynDVp9Gju25faYgZ+Dq9+o8ktlOGOHRRPdLIRNJ/T+4tUDnlGkdbPb+Tfn03
XUIzYCu74U8/oW8gOk6t0WpmWzvxEXNgdirdEIR6y3loYIC0Jr3v4gyD975Eug74
Hzfrdg7ignAmWV+nf6UY
=SeUF
-----END PGP SIGNATURE-----
Merge tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates deom Rafael Wysocki:
"These fix two bugs introduced by recent power management updates (in
the cpuidle menu governor and intel_pstate) and a few other issues,
clean up things and remove unused code.
Specifics:
- Fix for a cpuidle menu governor problem that started to take an
unnecessary spinlock after one of the recent updates and that did
not play well with the RT patch (Rafael Wysocki).
- Fix for the new intel_pstate operation mode switching feature added
recently that did not reinitialize P-state limits properly when
switching operation modes (Rafael Wysocki).
- Removal of unused global notifiers from the PM QoS framework
(Viresh Kumar).
- Generic power domains framework update to make it handle
asynchronous invocations of PM callbacks in the "noirq" phases of
system suspend/hibernation correctly (Ulf Hansson).
- Two hibernation core cleanups (Rafael Wysocki).
- intel_idle cleanup related to the sysfs interface (Len Brown).
- Off-by-one bug fix in the OPP (Operating Performance Points)
framework (Andrzej Hajda).
- OPP framework's documentation fix (Viresh Kumar).
- cpufreq qoriq driver cleanup (Tang Yuantian).
- Fixes for typos in comments in the device runtime PM framework
(Christophe Jaillet)"
* tag 'pm-extra-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / OPP: Documentation: Fix opp-microvolt in examples
intel_idle: stop exposing platform acronyms in sysfs
cpufreq: intel_pstate: Fix limits issue with operation mode switching
PM / hibernate: Define pr_fmt() and use pr_*() instead of printk()
PM / hibernate: Untangle power_down()
cpuidle: menu: Avoid taking spinlock for accessing QoS values
PM / QoS: Remove global notifiers
PM / runtime: Fix some typos
cpufreq: qoriq: clean up unused code
PM / OPP: fix off-by-one bug in dev_pm_opp_get_max_volt_latency loop
PM / Domains: Power off masters immediately in the power off sequence
PM / Domains: Rename is_async to one_dev_on for genpd_power_off()
PM / Domains: Move genpd_power_off() above genpd_power_on()
We are going to split <linux/sched/stat.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/stat.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/loadavg.h> out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.
Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.
Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/idle.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/idle.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After commit 9908859aca (cpuidle/menu: add per CPU PM QoS resume
latency consideration) the cpuidle menu governor calls
dev_pm_qos_read_value() on CPU devices to read the current resume
latency QoS constraint values for them. That function takes a spinlock
to prevent the device's power.qos pointer from becoming NULL during
the access which is a problem for the RT patchset where spinlocks are
converted into mutexes and the idle loop stops working.
However, it is not even necessary for the menu governor to take
that spinlock, because the power.qos pointer accessed under it
cannot be modified during the access anyway.
For this reason, introduce a "raw" routine for accessing device
QoS resume latency constraints without locking and use it in the
menu governor.
Fixes: 9908859aca (cpuidle/menu: add per CPU PM QoS resume latency consideration)
Acked-by: Alex Shi <alex.shi@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Highlights include:
- Support for direct mapped LPC on POWER9, giving Linux direct access to
devices that may be on there such as a UART.
- Memory hotplug support for the Power9 Radix MMU.
- Add new AUX vectors describing the processor's cache geometry, to be used by
glibc.
- The ability for a guest to ask the hypervisor to resize the guest's hash
table, and in addition support for doing so automatically when memory is
hotplugged into/out-of the guest. This allows the hash table to be sized
based on the current memory usage of the guest, rather than the maximum
possible memory usage.
- Implementation of optprobes (kprobe optimisation) for powerpc.
In addition there's the topic branch shared with the KVM tree, which includes
support for guests to use the Radix MMU on Power9.
Thanks to:
Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton Blanchard,
Benjamin Herrenschmidt, Chris Packham, Daniel Axtens, Daniel Borkmann, David
Gibson, Finn Thain, Gautham R. Shenoy, Gavin Shan, Greg Kurz, Joel Stanley,
John Allen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi
Bangoria, Reza Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJYrWj0AAoJEFHr6jzI4aWAsn4P/08Kz3TtOvDuuPGVNoO7fWOn
ag5/zVt8R4FuCALqWpAZbVqMuUU4wLxG0RuWlmBNYYhrjMC6JxHpOSjQXxM2D7YT
CdGTJxG414r6HMOeToL9i/z33o0m+KT07tscer+QMKlXVKCR2z0fEJch+zoPBHYA
5eVBFpLLTtbNiX9UnrcM/vYz61d56kT4YJey9/8qbAkTAc1rMPa8ucU5UiKYJ7yX
mF8cd7WE+7aqif9V8yN59G2rcbz+h3pbMw/gzImiYsYrUj4fLjU+VTKL5PPT+UFy
WWBXD3MAMm1dksMMZi3hgoo2BZhDn3RkymeYi6Jo4kDknNMPZzkMxGyvaJ8Eq5H/
bIYXdS1AbTtvaaEEuWqDFjpnChOEvj/8IeqitU0jjlql8BVjNKg/ESaaKucZr+pO
pk2Mfvw0Tb/lxJT5qj27yq4aRsxJwdFOoPYCN7MquPp/wV2Tg5M6h4nVQ4T6Wl0S
tMFQeCqXflhDWh0Xgr2UXpF66/YTj3Du5LasOTkgGeU30Z8TcNGFEmDWShKP3cEm
e0oQE+OHhIGN4WSBXBAto/Gw8/0v3pXlMs+VEfeHqenwPss2sWtSwXWe8khmiy9e
DJ48sTzj75/Zx1fiqRldw9YEnrL+NK0eOOpvzxeyKpfvdUytc+chFfEqUmO6kl3Z
DW2UmlZxmW+b0SfexCHL
=Icle
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Support for direct mapped LPC on POWER9, giving Linux direct access
to devices that may be on there such as a UART.
- Memory hotplug support for the Power9 Radix MMU.
- Add new AUX vectors describing the processor's cache geometry, to
be used by glibc.
- The ability for a guest to ask the hypervisor to resize the guest's
hash table, and in addition support for doing so automatically when
memory is hotplugged into/out-of the guest. This allows the hash
table to be sized based on the current memory usage of the guest,
rather than the maximum possible memory usage.
- Implementation of optprobes (kprobe optimisation) for powerpc.
In addition there's the topic branch shared with the KVM tree, which
includes support for guests to use the Radix MMU on Power9.
Thanks to:
Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton
Blanchard, Benjamin Herrenschmidt, Chris Packham, Daniel Axtens,
Daniel Borkmann, David Gibson, Finn Thain, Gautham R. Shenoy, Gavin
Shan, Greg Kurz, Joel Stanley, John Allen, Madhavan Srinivasan,
Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Nathan Fontenot,
Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Reza
Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun"
* tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (129 commits)
powerpc/mm/radix: Skip ptesync in pte update helpers
powerpc/mm/radix: Use ptep_get_and_clear_full when clearing pte for full mm
powerpc/mm/radix: Update pte update sequence for pte clear case
powerpc/mm: Update PROTFAULT handling in the page fault path
powerpc/xmon: Fix data-breakpoint
powerpc/mm: Fix build break with BOOK3S_64=n and MEMORY_HOTPLUG=y
powerpc/mm: Fix build break when CMA=n && SPAPR_TCE_IOMMU=y
powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n
powerpc/pseries: Fix typo in parameter description
powerpc/kprobes: Remove kprobe_exceptions_notify()
kprobes: Introduce weak variant of kprobe_exceptions_notify()
powerpc/ftrace: Fix confusing help text for DISABLE_MPROFILE_KERNEL
powerpc/powernv: Fix opal_exit tracepoint opcode
powerpc: Add a prototype for mcount() so it can be versioned
powerpc: Drop GPL from of_node_to_nid() export to match other arches
powerpc/kprobes: Optimize kprobe in kretprobe_trampoline()
powerpc/kprobes: Implement Optprobes
powerpc/kprobes: Fixes for kprobe_lookup_name() on BE
powerpc: Add helper to check if offset is within relative branch range
powerpc/bpf: Introduce __PPC_SH64()
...
The power9_idle_stop method currently takes only the requested stop
level as a parameter and picks up the rest of the PSSCR bits from a
hand-coded macro. This is not a very flexible design, especially when
the firmware has the capability to communicate the psscr value and the
mask associated with a particular stop state via device tree.
This patch modifies the power9_idle_stop API to take as parameters the
PSSCR value and the PSSCR mask corresponding to the stop state that
needs to be set. These PSSCR value and mask are respectively obtained
by parsing the "ibm,cpu-idle-state-psscr" and
"ibm,cpu-idle-state-psscr-mask" fields from the device tree.
In addition to this, the patch adds support for handling stop states
for which ESL and EC bits in the PSSCR are zero. As per the
architecture, a wakeup from these stop states resumes execution from
the subsequent instruction as opposed to waking up at the System
Vector.
The older firmware sets only the Requested Level (RL) field in the
psscr and psscr-mask exposed in the device tree. For older firmware
where psscr-mask=0xf, this patch will set the default sane values that
the set for for remaining PSSCR fields (i.e PSLL, MTL, ESL, EC, and
TR). For the new firmware, the patch will validate that the invariants
required by the ISA for the psscr values are maintained by the
firmware.
This skiboot patch that exports fully populated PSSCR values and the
mask for all the stop states can be found here:
https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html
[Optimize the number of instructions before entering STOP with
ESL=EC=0, validate the PSSCR values provided by the firimware
maintains the invariants required as per the ISA suggested by Balbir
Singh]
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In the current code for powernv_add_idle_states, there is a lot of code
duplication while initializing an idle state in powernv_states table.
Add an inline helper function to populate the powernv_states[] table
for a given idle state. Invoke this for populating the "Nap",
"Fastsleep" and the stop states in powernv_add_idle_states.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There may be special requirements on CPU response time, like if a
interrupt is pinned to a CPU, that CPU should not go into excessively
deep idle states. For this reason, add a mechanism for adding
PM QoS resume latency constraints for individual CPUs and modify the
menu governor to take them into account.
To that end, extend the device PM QoS pm_qos_resume_latency attribute
to CPUs, which is possible, because the exit latency for CPUs is
effectively equivalent to the resume latency for devices.
Signed-off-by: Alex Shi <alex.shi@linaro.org>
Acked-by: Rik van Riel <riel@redhat.com>
[ rjw : Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Obsolete commit 71abbbf856 (cpuidle: extend cpuidle and menu governor
to handle dynamic states) wanted to introduce dynamic C-states, but that
idea was dropped long ago. The nonsense deeper C-state checking
remained, though.
Since both target_residency and exit_latency are longer for deeper
idle state, there's no need to waste CPU time on useless checks.
Signed-off-by: Alex Shi <alex.shi@linaro.org>
Acked-by: Rik van Riel <riel@redhat.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since cpuidle_use_deepest_state() is not static, add a proper
kerneldoc comment to it to document its purpose.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
In function cpuidle_add_state_sysfs(), variable ret takes the return
value. Its value should be negative on errors. Because ret is reset in
the loop, its value will be 0 during the second and after repeat of the
loop. If kzalloc() returns a NULL pointer then, it will return 0. It may
be better to explicitly assign "-ENOMEM" when the call to kzalloc()
fails.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=188901
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When idle injection is used to cap power, we need to override the
governor's choice of idle states.
For this reason, make it possible the deepest idle state selection to
be enforced by setting a flag on a given CPU to achieve the maximum
potential power draw reduction.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
powernv_idle_driver isn't exported, it can be made static. Found by sparse.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
enter_freeze() callback is expected atleast to do the same as enter()
but it has to guarantee that interrupts aren't enabled at any point
in its execution, as the tick is frozen.
CPUs execute ->enter_freeze with the local tick or entire timekeeping
suspended, so it must not re-enable interrupts at any point (even
temporarily) or attempt to change states of clock event devices.
It will be called when the system goes to suspend-to-idle and will
reduce power usage because CPUs won't be awaken for unnecessary IRQs
(i.e. woken up only on IRQs from "wakeup sources")
We can reuse the same code for both the enter() and enter_freeze()
callbacks as along as they don't re-enable interrupts. Only "coupled"
cpuidle mechanism enables interrupts and doing that with timekeeping
suspended is generally not safe.
Since this generic DT based idle driver doesn't support "coupled"
states, it is safe to assume that the interrupts are not re-enabled.
This patch assign enter_freeze to same as enter callback function which
helps to save power without any intermittent spurious wakeups from
suspend-to-idle.
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Tested-by: Andy Gross <andy.gross@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The governor's code use try_module_get() and put_module() to refcount
the governor's module. But the governors are not compiled as module.
The refcount does not prevent to switch the governor or unload
a module as they aren't compiled as modules. The code is pointless,
so remove it.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull MIPS updates from Ralf Baechle:
"This is the main MIPS pull request for 4.9:
MIPS core arch code:
- traps: 64bit kernels should read CP0_EBase 64bit
- traps: Convert ebase to KSEG0
- c-r4k: Drop bc_wback_inv() from icache flush
- c-r4k: Split user/kernel flush_icache_range()
- cacheflush: Use __flush_icache_user_range()
- uprobes: Flush icache via kernel address
- KVM: Use __local_flush_icache_user_range()
- c-r4k: Fix flush_icache_range() for EVA
- Fix -mabi=64 build of vdso.lds
- VDSO: Drop duplicated -I*/-E* aflags
- tracing: move insn_has_delay_slot to a shared header
- tracing: disable uprobe/kprobe on compact branch instructions
- ptrace: Fix regs_return_value for kernel context
- Squash lines for simple wrapper functions
- Move identification of VP(E) into proc.c from smp-mt.c
- Add definitions of SYNC barrierstype values
- traps: Ensure full EBase is written
- tlb-r4k: If there are wired entries, don't use TLBINVF
- Sanitise coherentio semantics
- dma-default: Don't check hw_coherentio if device is non-coherent
- Support per-device DMA coherence
- Adjust MIPS64 CAC_BASE to reflect Config.K0
- Support generating Flattened Image Trees (.itb)
- generic: Introduce generic DT-based board support
- generic: Convert SEAD-3 to a generic board
- Enable hardened usercopy
- Don't specify STACKPROTECTOR in defconfigs
Octeon:
- Delete dead code and files across the platform.
- Change to use all memory into use by default.
- Rename upper case variables in setup code to lowercase.
- Delete legacy hack for broken bootloaders.
- Leave maintaining the link state to the actual ethernet/PHY drivers.
- Add DTS for D-Link DSR-500N.
- Fix PCI interrupt routing on D-Link DSR-500N.
Pistachio:
- Remove ANDROID_TIMED_OUTPUT from defconfig
TX39xx:
- Move GPIO setup from .mem_setup() to .arch_init()
- Convert to Common Clock Framework
TX49xx:
- Move GPIO setup from .mem_setup() to .arch_init()
- Convert to Common Clock Framework
txx9wdt:
- Add missing clock (un)prepare calls for CCF
BMIPS:
- Add PW, GPIO SDHCI and NAND device node names
- Support APPENDED_DTB
- Add missing bcm97435svmb to DT_NONE
- Rename bcm96358nb4ser to bcm6358-neufbox4-sercom
- Add DT examples for BCM63268, BCM3368 and BCM6362
- Add support for BCM3368 and BCM6362
PCI
- Reduce stack frame usage
- Use struct list_head lists
- Support for CONFIG_PCI_DOMAINS_GENERIC
- Make pcibios_set_cache_line_size an initcall
- Inline pcibios_assign_all_busses
- Split pci.c into pci.c & pci-legacy.c
- Introduce CONFIG_PCI_DRIVERS_LEGACY
- Support generic drivers
CPC
- Convert bare 'unsigned' to 'unsigned int'
- Avoid lock when MIPS CM >= 3 is present
GIC:
- Delete unused file smp-gic.c
mt7620:
- Delete unnecessary assignment for the field "owner" from PCI
BCM63xx:
- Let clk_disable() return immediately if clk is NULL
pm-cps:
- Change FSB workaround to CPU blacklist
- Update comments on barrier instructions
- Use MIPS standard lightweight ordering barrier
- Use MIPS standard completion barrier
- Remove selection of sync types
- Add MIPSr6 CPU support
- Support CM3 changes to Coherence Enable Register
SMP:
- Wrap call to mips_cpc_lock_other in mips_cm_lock_other
- Introduce mechanism for freeing and allocating IPIs
cpuidle:
- cpuidle-cps: Enable use with MIPSr6 CPUs.
SEAD3:
- Rewrite to use DT and generic kernel feature.
USB:
- host: ehci-sead3: Remove SEAD-3 EHCI code
FBDEV:
- cobalt_lcdfb: Drop SEAD3 support
dt-bindings:
- Document a binding for simple ASCII LCDs
auxdisplay:
- img-ascii-lcd: driver for simple ASCII LCD displays
irqchip i8259:
- i8259: Add domain before mapping parent irq
- i8259: Allow platforms to override poll function
- i8259: Remove unused i8259A_irq_pending
Malta:
- Rewrite to use DT
of/platform:
- Probe "isa" busses by default
CM:
- Print CM error reports upon bus errors
Module:
- Migrate exception table users off module.h and onto extable.h
- Make various drivers explicitly non-modular:
- Audit and remove any unnecessary uses of module.h
mailmap:
- Canonicalize to Qais' current email address.
Documentation:
- MIPS supports HAVE_REGS_AND_STACK_ACCESS_API
Loongson1C:
- Add CPU support for Loongson1C
- Add board support
- Add defconfig
- Add RTC support for Loongson1C board
All this except one Documentation fix has sat in linux-next and has
survived Imagination's automated build test system"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (127 commits)
Documentation: MIPS supports HAVE_REGS_AND_STACK_ACCESS_API
MIPS: ptrace: Fix regs_return_value for kernel context
MIPS: VDSO: Drop duplicated -I*/-E* aflags
MIPS: Fix -mabi=64 build of vdso.lds
MIPS: Enable hardened usercopy
MIPS: generic: Convert SEAD-3 to a generic board
MIPS: generic: Introduce generic DT-based board support
MIPS: Support generating Flattened Image Trees (.itb)
MIPS: Adjust MIPS64 CAC_BASE to reflect Config.K0
MIPS: Print CM error reports upon bus errors
MIPS: Support per-device DMA coherence
MIPS: dma-default: Don't check hw_coherentio if device is non-coherent
MIPS: Sanitise coherentio semantics
MIPS: PCI: Support generic drivers
MIPS: PCI: Introduce CONFIG_PCI_DRIVERS_LEGACY
MIPS: PCI: Split pci.c into pci.c & pci-legacy.c
MIPS: PCI: Inline pcibios_assign_all_busses
MIPS: PCI: Make pcibios_set_cache_line_size an initcall
MIPS: PCI: Support for CONFIG_PCI_DOMAINS_GENERIC
MIPS: PCI: Use struct list_head lists
...
When doing an nmi backtrace of many cores, most of which are idle, the
output is a little overwhelming and very uninformative. Suppress
messages for cpus that are idling when they are interrupted and just
emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".
We do this by grouping all the cpuidle code together into a new
.cpuidle.text section, and then checking the address of the interrupted
PC to see if it lies within that section.
This commit suitably tags x86 and tile idle routines, and only adds in
the minimal framework for other architectures.
Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm]
Tested-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull CPU hotplug updates from Thomas Gleixner:
"Yet another batch of cpu hotplug core updates and conversions:
- Provide core infrastructure for multi instance drivers so the
drivers do not have to keep custom lists.
- Convert custom lists to the new infrastructure. The block-mq custom
list conversion comes through the block tree and makes the diffstat
tip over to more lines removed than added.
- Handle unbalanced hotplug enable/disable calls more gracefully.
- Remove the obsolete CPU_STARTING/DYING notifier support.
- Convert another batch of notifier users.
The relayfs changes which conflicted with the conversion have been
shipped to me by Andrew.
The remaining lot is targeted for 4.10 so that we finally can remove
the rest of the notifiers"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
cpufreq: Fix up conversion to hotplug state machine
blk/mq: Reserve hotplug states for block multiqueue
x86/apic/uv: Convert to hotplug state machine
s390/mm/pfault: Convert to hotplug state machine
mips/loongson/smp: Convert to hotplug state machine
mips/octeon/smp: Convert to hotplug state machine
fault-injection/cpu: Convert to hotplug state machine
padata: Convert to hotplug state machine
cpufreq: Convert to hotplug state machine
ACPI/processor: Convert to hotplug state machine
virtio scsi: Convert to hotplug state machine
oprofile/timer: Convert to hotplug state machine
block/softirq: Convert to hotplug state machine
lib/irq_poll: Convert to hotplug state machine
x86/microcode: Convert to hotplug state machine
sh/SH-X3 SMP: Convert to hotplug state machine
ia64/mca: Convert to hotplug state machine
ARM/OMAP/wakeupgen: Convert to hotplug state machine
ARM/shmobile: Convert to hotplug state machine
arm64/FP/SIMD: Convert to hotplug state machine
...
Install the callbacks via the state machine.
v1…v2: - Use only CPUHP_CPUIDLE_DEAD (requested by Daniel Lezcano)
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-pm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160824091259.ozyslcopxvbfdqzy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We know that 'ret = 0' because it has been tested a few lines above.
So, if 'kzalloc' fails, 0 will be returned instead of an error code.
Return -ENOMEM instead.
Fixes: a0d46a3dfd ("ARM: cpuidle: Register per cpuidle device")
Signed-off-by: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: 4.1+ <stable@vger.kernel.org> # 4.1+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Highlights:
- PowerNV PCI hotplug support.
- Lots more Power9 support.
- eBPF JIT support on ppc64le.
- Lots of cxl updates.
- Boot code consolidation.
Bug fixes:
- Fix spin_unlock_wait() from Boqun Feng
- Fix stack pointer corruption in __tm_recheckpoint() from Michael Neuling
- Fix multiple bugs in memory_hotplug_max() from Bharata B Rao
- mm: Ensure "special" zones are empty from Oliver O'Halloran
- ftrace: Separate the heuristics for checking call sites from Michael Ellerman
- modules: Never restore r2 for a mprofile-kernel style mcount() call from Michael Ellerman
- Fix endianness when reading TCEs from Alexey Kardashevskiy
- start rtasd before PCI probing from Greg Kurz
- PCI: rpaphp: Fix slot registration for multiple slots under a PHB from Tyrel Datwyler
- powerpc/mm: Add memory barrier in __hugepte_alloc() from Sukadev Bhattiprolu
Cleanups & fixes:
- Drop support for MPIC in pseries from Rashmica Gupta
- Define and use PPC64_ELF_ABI_v2/v1 from Michael Ellerman
- Remove unused symbols in asm-offsets.c from Rashmica Gupta
- Fix SRIOV not building without EEH enabled from Russell Currey
- Remove kretprobe_trampoline_holder. from Thiago Jung Bauermann
- Reduce log level of PCI I/O space warning from Benjamin Herrenschmidt
- Add array bounds checking to crash_shutdown_handlers from Suraj Jitindar Singh
- Avoid -maltivec when using clang integrated assembler from Anton Blanchard
- Fix array overrun in ppc_rtas() syscall from Andrew Donnellan
- Fix error return value in cmm_mem_going_offline() from Rasmus Villemoes
- export cpu_to_core_id() from Mauricio Faria de Oliveira
- Remove old symbols from defconfigs from Andrew Donnellan
- Update obsolete comments in setup_32.c about entry conditions from Benjamin Herrenschmidt
- Add comment explaining the purpose of setup_kdump_trampoline() from Benjamin Herrenschmidt
- Merge the RELOCATABLE config entries for ppc32 and ppc64 from Kevin Hao
- Remove RELOCATABLE_PPC32 from Kevin Hao
- Fix .long's in tlb-radix.c to more meaningful from Balbir Singh
Minor cleanups & fixes:
- Andrew Donnellan, Anna-Maria Gleixner, Anton Blanchard, Benjamin
Herrenschmidt, Bharata B Rao, Christophe Leroy, Colin Ian King, Geliang
Tang, Greg Kurz, Madhavan Srinivasan, Michael Ellerman, Michael Ellerman,
Stephen Rothwell, Stewart Smith.
Freescale updates from Scott:
- "Highlights include more 8xx optimizations, device tree updates,
and MVME7100 support."
PowerNV PCI hotplug from Gavin Shan:
- PCI: Add pcibios_setup_bridge()
- Override pcibios_setup_bridge()
- Remove PCI_RESET_DELAY_US
- Move pnv_pci_ioda_setup_opal_tce_kill() around
- Increase PE# capacity
- Allocate PE# in reverse order
- Create PEs in pcibios_setup_bridge()
- Setup PE for root bus
- Extend PCI bridge resources
- Make pnv_ioda_deconfigure_pe() visible
- Dynamically release PE
- Update bridge windows on PCI plug
- Delay populating pdn
- Support PCI slot ID
- Use PCI slot reset infrastructure
- Introduce pnv_pci_get_slot_id()
- Functions to get/set PCI slot state
- PCI/hotplug: PowerPC PowerNV PCI hotplug driver
- Print correct PHB type names
Power9 idle support from Shreyas B. Prabhu:
- set power_save func after the idle states are initialized
- Use PNV_THREAD_WINKLE macro while requesting for winkle
- make hypervisor state restore a function
- Rename idle_power7.S to idle_book3s.S
- Rename reusable idle functions to hardware agnostic names
- Make pnv_powersave_common more generic
- abstraction for saving SPRs before entering deep idle states
- Add platform support for stop instruction
- cpuidle/powernv: Use CPUIDLE_STATE_MAX instead of MAX_POWERNV_IDLE_STATES
- cpuidle/powernv: cleanup cpuidle-powernv.c
- cpuidle/powernv: Add support for POWER ISA v3 idle states
- Use deepest stop state when cpu is offlined
Power9 PMU from Madhavan Srinivasan:
- factor out power8 pmu macros and defines
- factor out power8 pmu functions
- factor out power8 __init_pmu code
- Add power9 event list macros for generic and cache events
- Power9 PMU support
- Export Power9 generic and cache events to sysfs
Power9 preliminary interrupt & PCI support from Benjamin Herrenschmidt:
- Add XICS emulation APIs
- Move a few exception common handlers to make room
- Add support for HV virtualization interrupts
- Add mechanism to force a replay of interrupts
- Add ICP OPAL backend
- Discover IODA3 PHBs
- pci: Remove obsolete SW invalidate
- opal: Add real mode call wrappers
- Rename TCE invalidation calls
- Remove SWINV constants and obsolete TCE code
- Rework accessing the TCE invalidate register
- Fallback to OPAL for TCE invalidations
- Use the device-tree to get available range of M64's
- Check status of a PHB before using it
- pci: Don't try to allocate resources that will be reassigned
Other Power9:
- Send SIGBUS on unaligned copy and paste from Chris Smart
- Large Decrementer support from Oliver O'Halloran
- Load Monitor Register Support from Jack Miller
Performance improvements from Anton Blanchard:
- Avoid load hit store in __giveup_fpu() and __giveup_altivec()
- Avoid load hit store in setup_sigcontext()
- Remove assembly versions of strcpy, strcat, strlen and strcmp
- Align hot loops of some string functions
eBPF JIT from Naveen N. Rao:
- Fix/enhance 32-bit Load Immediate implementation
- Optimize 64-bit Immediate loads
- Introduce rotate immediate instructions
- A few cleanups
- Isolate classic BPF JIT specifics into a separate header
- Implement JIT compiler for extended BPF
Operator Panel driver from Suraj Jitindar Singh:
- devicetree/bindings: Add binding for operator panel on FSP machines
- Add inline function to get rc from an ASYNC_COMP opal_msg
- Add driver for operator panel on FSP machines
Sparse fixes from Daniel Axtens:
- make some things static
- Introduce asm-prototypes.h
- Include headers containing prototypes
- Use #ifdef __BIG_ENDIAN__ #else for REG_BYTE
- kvm: Clarify __user annotations
- Pass endianness to sparse
- Make ppc_md.{halt, restart} __noreturn
MM fixes & cleanups from Aneesh Kumar K.V:
- radix: Update LPCR HR bit as per ISA
- use _raw variant of page table accessors
- Compile out radix related functions if RADIX_MMU is disabled
- Clear top 16 bits of va only on older cpus
- Print formation regarding the the MMU mode
- hash: Update SDR1 size encoding as documented in ISA 3.0
- radix: Update PID switch sequence
- radix: Update machine call back to support new HCALL.
- radix: Add LPID based tlb flush helpers
- radix: Add a kernel command line to disable radix
- Cleanup LPCR defines
Boot code consolidation from Benjamin Herrenschmidt:
- Move epapr_paravirt_early_init() to early_init_devtree()
- cell: Don't use flat device-tree after boot
- ge_imp3a: Don't use the flat device-tree after boot
- mpc85xx_ds: Don't use the flat device-tree after boot
- mpc85xx_rdb: Don't use the flat device-tree after boot
- Don't test for machine type in rtas_initialize()
- Don't test for machine type in smp_setup_cpu_maps()
- dt: Add of_device_compatible_match()
- Factor do_feature_fixup calls
- Move 64-bit feature fixup earlier
- Move 64-bit memory reserves to setup_arch()
- Use a cachable DART
- Move FW feature probing out of pseries probe()
- Put exception configuration in a common place
- Remove early allocation of the SMU command buffer
- Move MMU backend selection out of platform code
- pasemi: Remove IOBMAP allocation from platform probe()
- mm/hash: Don't use machine_is() early during boot
- Don't test for machine type to detect HEA special case
- pmac: Remove spurrious machine type test
- Move hash table ops to a separate structure
- Ensure that ppc_md is empty before probing for machine type
- Move 64-bit probe_machine() to later in the boot process
- Move 32-bit probe() machine to later in the boot process
- Get rid of ppc_md.init_early()
- Move the boot time info banner to a separate function
- Move setting of {i,d}cache_bsize to initialize_cache_info()
- Move the content of setup_system() to setup_arch()
- Move cache info inits to a separate function
- Re-order the call to smp_setup_cpu_maps()
- Re-order setup_panic()
- Make a few boot functions __init
- Merge 32-bit and 64-bit setup_arch()
Other new features:
- tty/hvc: Use IRQF_SHARED for OPAL hvc consoles from Sam Mendoza-Jonas
- tty/hvc: Use opal irqchip interface if available from Sam Mendoza-Jonas
- powerpc: Add module autoloading based on CPU features from Alastair D'Silva
- crypto: vmx - Convert to CPU feature based module autoloading from Alastair D'Silva
- Wake up kopald polling thread before waiting for events from Benjamin Herrenschmidt
- xmon: Dump ISA 2.06 SPRs from Michael Ellerman
- xmon: Dump ISA 2.07 SPRs from Michael Ellerman
- Add a parameter to disable 1TB segs from Oliver O'Halloran
- powerpc/boot: Add OPAL console to epapr wrappers from Oliver O'Halloran
- Assign fixed PHB number based on device-tree properties from Guilherme G. Piccoli
- pseries: Add pseries hotplug workqueue from John Allen
- pseries: Add support for hotplug interrupt source from John Allen
- pseries: Use kernel hotplug queue for PowerVM hotplug events from John Allen
- pseries: Move property cloning into its own routine from Nathan Fontenot
- pseries: Dynamic add entires to associativity lookup array from Nathan Fontenot
- pseries: Auto-online hotplugged memory from Nathan Fontenot
- pseries: Remove call to memblock_add() from Nathan Fontenot
cxl:
- Add set and get private data to context struct from Michael Neuling
- make base more explicitly non-modular from Paul Gortmaker
- Use for_each_compatible_node() macro from Wei Yongjun
- Frederic Barrat
- Abstract the differences between the PSL and XSL
- Make vPHB device node match adapter's
- Philippe Bergheaud
- Add mechanism for delivering AFU driver specific events
- Ignore CAPI adapters misplaced in switched slots
- Refine slice error debug messages
- Andrew Donnellan
- static-ify variables to fix sparse warnings
- PCI/hotplug: pnv_php: export symbols and move struct types needed by cxl
- PCI/hotplug: pnv_php: handle OPAL_PCI_SLOT_OFFLINE power state
- Add cxl_check_and_switch_mode() API to switch bi-modal cards
- remove dead Kconfig options
- fix potential NULL dereference in free_adapter()
- Ian Munsie
- Update process element after allocating interrupts
- Add support for CAPP DMA mode
- Fix allowing bogus AFU descriptors with 0 maximum processes
- Fix allocating a minimum of 2 pages for the SPA
- Fix bug where AFU disable operation had no effect
- Workaround XSL bug that does not clear the RA bit after a reset
- Fix NULL pointer dereference on kernel contexts with no AFU interrupts
- powerpc/powernv: Split cxl code out into a separate file
- Add cxl_slot_is_supported API
- Enable bus mastering for devices using CAPP DMA mode
- Move cxl_afu_get / cxl_afu_put to base
- Allow a default context to be associated with an external pci_dev
- Do not create vPHB if there are no AFU configuration records
- powerpc/powernv: Add support for the cxl kernel api on the real phb
- Add support for using the kernel API with a real PHB
- Add kernel APIs to get & set the max irqs per context
- Add preliminary workaround for CX4 interrupt limitation
- Add support for interrupts on the Mellanox CX4
- Workaround PE=0 hardware limitation in Mellanox CX4
- powerpc/powernv: Fix pci-cxl.c build when CONFIG_MODULES=n
selftests:
- Test unaligned copy and paste from Chris Smart
- Load Monitor Register Tests from Jack Miller
- Cyril Bur
- exec() with suspended transaction
- Use signed long to read perf_event_paranoid
- Fix usage message in context_switch
- Fix generation of vector instructions/types in context_switch
- Michael Ellerman
- Use "Delta" rather than "Error" in normal output
- Import Anton's mmap & futex micro benchmarks
- Add a test for PROT_SAO
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXnWchAAoJEFHr6jzI4aWAe64P/36Vd9yJLptjkoyZp8/IQtu1
Cv8buQwGdKuSMzdkcUAOXcC3fe2u70ZWXMKKLfY3koIV1IAiqdWk5/XWRKMP2XmE
dG0LhSf0uu7uh+mE0WvQnRu46ImeKtQ+mPp4Hbs/s9SxMSeYjruv3vdWWmgUq0cl
Gac2qJSRtAMmgLuHWMjf7N5mxOTOnKejU4o2i9cJ+YHmWKOdCigv2Ge1UadOQFlC
E7tRPiUR3asfDfj+e+LVTTdToH6p8pk+mOUzIoZ8jIkQ+IXzi62UDl5+Rw9mqiuX
1CtqEMUXxo2qwX+d4TcV/QUOp0YKPuIcUZ9NMMS+S3lOyJ4NFt+j2Izk7QJp5kNP
gKVqB68TjDQsBuDr3P9ynlHbduxTIhZAqopbTrLe0FIg48nUe4n1yHJBVzqaVajX
rFBJSsSUffBLAARNPSXJJhIgc2C1/qOC8dgMeDMcR2kPirDHaQZ/lY1yEpq1yiqR
q6e3v5hvIAm4IjbYk0mF7TUxBrPGVE/ExyBINyASRoYxAJ1PyeD/iljZ9vI3asRA
s+hhxT8H3f7lnqTrmJqMjHgAdGkmag07EdmvFNX4xK4aADSy7Y6g4dw25ffRopo9
p9Jf9HX+dZv65Y3UjbV/6HuXcaSEBJJLSVWvii65PebqSN0LuHEFvNeIJ6Iblx0B
AWh/hd0Iin2gdkcG39Mr
=Z5kM
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights:
- PowerNV PCI hotplug support.
- Lots more Power9 support.
- eBPF JIT support on ppc64le.
- Lots of cxl updates.
- Boot code consolidation.
Bug fixes:
- Fix spin_unlock_wait() from Boqun Feng
- Fix stack pointer corruption in __tm_recheckpoint() from Michael
Neuling
- Fix multiple bugs in memory_hotplug_max() from Bharata B Rao
- mm: Ensure "special" zones are empty from Oliver O'Halloran
- ftrace: Separate the heuristics for checking call sites from
Michael Ellerman
- modules: Never restore r2 for a mprofile-kernel style mcount() call
from Michael Ellerman
- Fix endianness when reading TCEs from Alexey Kardashevskiy
- start rtasd before PCI probing from Greg Kurz
- PCI: rpaphp: Fix slot registration for multiple slots under a PHB
from Tyrel Datwyler
- powerpc/mm: Add memory barrier in __hugepte_alloc() from Sukadev
Bhattiprolu
Cleanups & fixes:
- Drop support for MPIC in pseries from Rashmica Gupta
- Define and use PPC64_ELF_ABI_v2/v1 from Michael Ellerman
- Remove unused symbols in asm-offsets.c from Rashmica Gupta
- Fix SRIOV not building without EEH enabled from Russell Currey
- Remove kretprobe_trampoline_holder from Thiago Jung Bauermann
- Reduce log level of PCI I/O space warning from Benjamin
Herrenschmidt
- Add array bounds checking to crash_shutdown_handlers from Suraj
Jitindar Singh
- Avoid -maltivec when using clang integrated assembler from Anton
Blanchard
- Fix array overrun in ppc_rtas() syscall from Andrew Donnellan
- Fix error return value in cmm_mem_going_offline() from Rasmus
Villemoes
- export cpu_to_core_id() from Mauricio Faria de Oliveira
- Remove old symbols from defconfigs from Andrew Donnellan
- Update obsolete comments in setup_32.c about entry conditions from
Benjamin Herrenschmidt
- Add comment explaining the purpose of setup_kdump_trampoline() from
Benjamin Herrenschmidt
- Merge the RELOCATABLE config entries for ppc32 and ppc64 from Kevin
Hao
- Remove RELOCATABLE_PPC32 from Kevin Hao
- Fix .long's in tlb-radix.c to more meaningful from Balbir Singh
Minor cleanups & fixes:
- Andrew Donnellan, Anna-Maria Gleixner, Anton Blanchard, Benjamin
Herrenschmidt, Bharata B Rao, Christophe Leroy, Colin Ian King,
Geliang Tang, Greg Kurz, Madhavan Srinivasan, Michael Ellerman,
Michael Ellerman, Stephen Rothwell, Stewart Smith.
Freescale updates from Scott:
- "Highlights include more 8xx optimizations, device tree updates,
and MVME7100 support."
PowerNV PCI hotplug from Gavin Shan:
- PCI: Add pcibios_setup_bridge()
- Override pcibios_setup_bridge()
- Remove PCI_RESET_DELAY_US
- Move pnv_pci_ioda_setup_opal_tce_kill() around
- Increase PE# capacity
- Allocate PE# in reverse order
- Create PEs in pcibios_setup_bridge()
- Setup PE for root bus
- Extend PCI bridge resources
- Make pnv_ioda_deconfigure_pe() visible
- Dynamically release PE
- Update bridge windows on PCI plug
- Delay populating pdn
- Support PCI slot ID
- Use PCI slot reset infrastructure
- Introduce pnv_pci_get_slot_id()
- Functions to get/set PCI slot state
- PCI/hotplug: PowerPC PowerNV PCI hotplug driver
- Print correct PHB type names
Power9 idle support from Shreyas B. Prabhu:
- set power_save func after the idle states are initialized
- Use PNV_THREAD_WINKLE macro while requesting for winkle
- make hypervisor state restore a function
- Rename idle_power7.S to idle_book3s.S
- Rename reusable idle functions to hardware agnostic names
- Make pnv_powersave_common more generic
- abstraction for saving SPRs before entering deep idle states
- Add platform support for stop instruction
- cpuidle/powernv: Use CPUIDLE_STATE_MAX instead of MAX_POWERNV_IDLE_STATES
- cpuidle/powernv: cleanup cpuidle-powernv.c
- cpuidle/powernv: Add support for POWER ISA v3 idle states
- Use deepest stop state when cpu is offlined
Power9 PMU from Madhavan Srinivasan:
- factor out power8 pmu macros and defines
- factor out power8 pmu functions
- factor out power8 __init_pmu code
- Add power9 event list macros for generic and cache events
- Power9 PMU support
- Export Power9 generic and cache events to sysfs
Power9 preliminary interrupt & PCI support from Benjamin Herrenschmidt:
- Add XICS emulation APIs
- Move a few exception common handlers to make room
- Add support for HV virtualization interrupts
- Add mechanism to force a replay of interrupts
- Add ICP OPAL backend
- Discover IODA3 PHBs
- pci: Remove obsolete SW invalidate
- opal: Add real mode call wrappers
- Rename TCE invalidation calls
- Remove SWINV constants and obsolete TCE code
- Rework accessing the TCE invalidate register
- Fallback to OPAL for TCE invalidations
- Use the device-tree to get available range of M64's
- Check status of a PHB before using it
- pci: Don't try to allocate resources that will be reassigned
Other Power9:
- Send SIGBUS on unaligned copy and paste from Chris Smart
- Large Decrementer support from Oliver O'Halloran
- Load Monitor Register Support from Jack Miller
Performance improvements from Anton Blanchard:
- Avoid load hit store in __giveup_fpu() and __giveup_altivec()
- Avoid load hit store in setup_sigcontext()
- Remove assembly versions of strcpy, strcat, strlen and strcmp
- Align hot loops of some string functions
eBPF JIT from Naveen N. Rao:
- Fix/enhance 32-bit Load Immediate implementation
- Optimize 64-bit Immediate loads
- Introduce rotate immediate instructions
- A few cleanups
- Isolate classic BPF JIT specifics into a separate header
- Implement JIT compiler for extended BPF
Operator Panel driver from Suraj Jitindar Singh:
- devicetree/bindings: Add binding for operator panel on FSP machines
- Add inline function to get rc from an ASYNC_COMP opal_msg
- Add driver for operator panel on FSP machines
Sparse fixes from Daniel Axtens:
- make some things static
- Introduce asm-prototypes.h
- Include headers containing prototypes
- Use #ifdef __BIG_ENDIAN__ #else for REG_BYTE
- kvm: Clarify __user annotations
- Pass endianness to sparse
- Make ppc_md.{halt, restart} __noreturn
MM fixes & cleanups from Aneesh Kumar K.V:
- radix: Update LPCR HR bit as per ISA
- use _raw variant of page table accessors
- Compile out radix related functions if RADIX_MMU is disabled
- Clear top 16 bits of va only on older cpus
- Print formation regarding the the MMU mode
- hash: Update SDR1 size encoding as documented in ISA 3.0
- radix: Update PID switch sequence
- radix: Update machine call back to support new HCALL.
- radix: Add LPID based tlb flush helpers
- radix: Add a kernel command line to disable radix
- Cleanup LPCR defines
Boot code consolidation from Benjamin Herrenschmidt:
- Move epapr_paravirt_early_init() to early_init_devtree()
- cell: Don't use flat device-tree after boot
- ge_imp3a: Don't use the flat device-tree after boot
- mpc85xx_ds: Don't use the flat device-tree after boot
- mpc85xx_rdb: Don't use the flat device-tree after boot
- Don't test for machine type in rtas_initialize()
- Don't test for machine type in smp_setup_cpu_maps()
- dt: Add of_device_compatible_match()
- Factor do_feature_fixup calls
- Move 64-bit feature fixup earlier
- Move 64-bit memory reserves to setup_arch()
- Use a cachable DART
- Move FW feature probing out of pseries probe()
- Put exception configuration in a common place
- Remove early allocation of the SMU command buffer
- Move MMU backend selection out of platform code
- pasemi: Remove IOBMAP allocation from platform probe()
- mm/hash: Don't use machine_is() early during boot
- Don't test for machine type to detect HEA special case
- pmac: Remove spurrious machine type test
- Move hash table ops to a separate structure
- Ensure that ppc_md is empty before probing for machine type
- Move 64-bit probe_machine() to later in the boot process
- Move 32-bit probe() machine to later in the boot process
- Get rid of ppc_md.init_early()
- Move the boot time info banner to a separate function
- Move setting of {i,d}cache_bsize to initialize_cache_info()
- Move the content of setup_system() to setup_arch()
- Move cache info inits to a separate function
- Re-order the call to smp_setup_cpu_maps()
- Re-order setup_panic()
- Make a few boot functions __init
- Merge 32-bit and 64-bit setup_arch()
Other new features:
- tty/hvc: Use IRQF_SHARED for OPAL hvc consoles from Sam Mendoza-Jonas
- tty/hvc: Use opal irqchip interface if available from Sam Mendoza-Jonas
- powerpc: Add module autoloading based on CPU features from Alastair D'Silva
- crypto: vmx - Convert to CPU feature based module autoloading from Alastair D'Silva
- Wake up kopald polling thread before waiting for events from Benjamin Herrenschmidt
- xmon: Dump ISA 2.06 SPRs from Michael Ellerman
- xmon: Dump ISA 2.07 SPRs from Michael Ellerman
- Add a parameter to disable 1TB segs from Oliver O'Halloran
- powerpc/boot: Add OPAL console to epapr wrappers from Oliver O'Halloran
- Assign fixed PHB number based on device-tree properties from Guilherme G. Piccoli
- pseries: Add pseries hotplug workqueue from John Allen
- pseries: Add support for hotplug interrupt source from John Allen
- pseries: Use kernel hotplug queue for PowerVM hotplug events from John Allen
- pseries: Move property cloning into its own routine from Nathan Fontenot
- pseries: Dynamic add entires to associativity lookup array from Nathan Fontenot
- pseries: Auto-online hotplugged memory from Nathan Fontenot
- pseries: Remove call to memblock_add() from Nathan Fontenot
cxl:
- Add set and get private data to context struct from Michael Neuling
- make base more explicitly non-modular from Paul Gortmaker
- Use for_each_compatible_node() macro from Wei Yongjun
- Frederic Barrat
- Abstract the differences between the PSL and XSL
- Make vPHB device node match adapter's
- Philippe Bergheaud
- Add mechanism for delivering AFU driver specific events
- Ignore CAPI adapters misplaced in switched slots
- Refine slice error debug messages
- Andrew Donnellan
- static-ify variables to fix sparse warnings
- PCI/hotplug: pnv_php: export symbols and move struct types needed by cxl
- PCI/hotplug: pnv_php: handle OPAL_PCI_SLOT_OFFLINE power state
- Add cxl_check_and_switch_mode() API to switch bi-modal cards
- remove dead Kconfig options
- fix potential NULL dereference in free_adapter()
- Ian Munsie
- Update process element after allocating interrupts
- Add support for CAPP DMA mode
- Fix allowing bogus AFU descriptors with 0 maximum processes
- Fix allocating a minimum of 2 pages for the SPA
- Fix bug where AFU disable operation had no effect
- Workaround XSL bug that does not clear the RA bit after a reset
- Fix NULL pointer dereference on kernel contexts with no AFU interrupts
- powerpc/powernv: Split cxl code out into a separate file
- Add cxl_slot_is_supported API
- Enable bus mastering for devices using CAPP DMA mode
- Move cxl_afu_get / cxl_afu_put to base
- Allow a default context to be associated with an external pci_dev
- Do not create vPHB if there are no AFU configuration records
- powerpc/powernv: Add support for the cxl kernel api on the real phb
- Add support for using the kernel API with a real PHB
- Add kernel APIs to get & set the max irqs per context
- Add preliminary workaround for CX4 interrupt limitation
- Add support for interrupts on the Mellanox CX4
- Workaround PE=0 hardware limitation in Mellanox CX4
- powerpc/powernv: Fix pci-cxl.c build when CONFIG_MODULES=n
selftests:
- Test unaligned copy and paste from Chris Smart
- Load Monitor Register Tests from Jack Miller
- Cyril Bur
- exec() with suspended transaction
- Use signed long to read perf_event_paranoid
- Fix usage message in context_switch
- Fix generation of vector instructions/types in context_switch
- Michael Ellerman
- Use "Delta" rather than "Error" in normal output
- Import Anton's mmap & futex micro benchmarks
- Add a test for PROT_SAO"
* tag 'powerpc-4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (263 commits)
powerpc/mm: Parenthesise IS_ENABLED() in if condition
tty/hvc: Use opal irqchip interface if available
tty/hvc: Use IRQF_SHARED for OPAL hvc consoles
selftests/powerpc: exec() with suspended transaction
powerpc: Improve comment explaining why we modify VRSAVE
powerpc/mm: Drop unused externs for hpte_init_beat[_v3]()
powerpc/mm: Rename hpte_init_lpar() and move the fallback to a header
powerpc/mm: Fix build break when PPC_NATIVE=n
crypto: vmx - Convert to CPU feature based module autoloading
powerpc: Add module autoloading based on CPU features
powerpc/powernv/ioda: Fix endianness when reading TCEs
powerpc/mm: Add memory barrier in __hugepte_alloc()
powerpc/modules: Never restore r2 for a mprofile-kernel style mcount() call
powerpc/ftrace: Separate the heuristics for checking call sites
powerpc: Merge 32-bit and 64-bit setup_arch()
powerpc/64: Make a few boot functions __init
powerpc: Re-order setup_panic()
powerpc: Re-order the call to smp_setup_cpu_maps()
powerpc/32: Move cache info inits to a separate function
powerpc/64: Move the content of setup_system() to setup_arch()
...
The function arm_enter_idle_state is exactly the same in both generic
ARM{32,64} CPUIdle driver and will be the same even on ARM64 backend
for ACPI processor idle driver. So we can unify it and move it to a
common place by introducing CPU_PM_CPU_IDLE_ENTER macro that can be
used in all places avoiding duplication.
This is in preparation of reuse of the generic cpuidle entry function
for ACPI LPI support on ARM64.
Suggested-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
POWER ISA v3 defines a new idle processor core mechanism. In summary,
a) new instruction named stop is added.
b) new per thread SPR named PSSCR is added which controls the behavior
of stop instruction.
Supported idle states and value to be written to PSSCR register to enter
any idle state is exposed via ibm,cpu-idle-state-names and
ibm,cpu-idle-state-psscr respectively. To enter an idle state,
platform provided power_stop() needs to be invoked with the appropriate
PSSCR value.
This patch adds support for this new mechanism in cpuidle powernv driver.
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Lorenzo Pieralisi <Lorenzo.Pieralisi@arm.com>
Cc: linux-pm@vger.kernel.org
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Use stack instead of kzalloc'ed memory for variables while probing
device tree for idle states.
- Set cap for number of idle states that can be added to
cpuidle_state_table
- Minor change in way we check of_property_read_u32_array for error
for sake of consistency
- Drop unnecessary "&" while assigning function pointer
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-pm@vger.kernel.org
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use cpuidle's CPUIDLE_STATE_MAX macro instead of powernv specific
MAX_POWERNV_IDLE_STATES.
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: linux-pm@vger.kernel.org
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Snooze is a poll idle state in powernv and pseries platforms. Snooze
has a timeout so that if a CPU stays in snooze for more than target
residency of the next available idle state, then it would exit
thereby giving chance to the cpuidle governor to re-evaluate and
promote the CPU to a deeper idle state. Therefore whenever snooze
exits due to this timeout, its last_residency will be target_residency
of the next deeper state.
Commit e93e59ce5b "cpuidle: Replace ktime_get() with local_clock()"
changed the math around last_residency calculation. Specifically,
while converting last_residency value from nano- to microseconds, it
carries out right shift by 10. Because of that, in snooze timeout
exit scenarios last_residency calculated is roughly 2.3% less than
target_residency of the next available state. This pattern is picked
up by get_typical_interval() in the menu governor and therefore
expected_interval in menu_select() is frequently less than the
target_residency of any state other than snooze.
Due to this we are entering snooze at a higher rate, thereby
affecting the single thread performance.
Fix this by using more precise division via ktime_us_delta().
Fixes: e93e59ce5b "cpuidle: Replace ktime_get() with local_clock()"
Reported-by: Anton Blanchard <anton@samba.org>
Bisected-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 0b89e9aa28 (cpuidle: delay enabling interrupts until all
coupled CPUs leave idle) rightfully fixed a regression by letting
the coupled idle state framework to handle local interrupt enabling
when the CPU is exiting an idle state.
The current code checks if the idle state is coupled and, if so, it
will let the coupled code to enable interrupts. This way, it can
decrement the ready-count before handling the interrupt. This
mechanism prevents the other CPUs from waiting for a CPU which is
handling interrupts.
But the check is done against the state index returned by the back
end driver's ->enter functions which could be different from the
initial index passed as parameter to the cpuidle_enter_state()
function.
entered_state = target_state->enter(dev, drv, index);
[ ... ]
if (!cpuidle_state_is_coupled(drv, entered_state))
local_irq_enable();
[ ... ]
If the 'index' is referring to a coupled idle state but the
'entered_state' is *not* coupled, then the interrupts are enabled
again. All CPUs blocked on the sync barrier may busy loop longer
if the CPU has interrupts to handle before decrementing the
ready-count. That's consuming more energy than saving.
Fixes: 0b89e9aa28 (cpuidle: delay enabling interrupts until all coupled CPUs leave idle)
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: 3.15+ <stable@vger.kernel.org> # 3.15+
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
arm_cpuidle_suspend() may return -EOPNOTSUPP, or any value returned
by the cpu_ops/cpuidle_ops suspend call. arm_enter_idle_state() doesn't
update 'ret' with this value, meaning we always signal success to
cpuidle_enter_state(), causing it to update the usage counters as if we
succeeded.
Fixes: 191de17aa3 ("ARM64: cpuidle: Replace cpu_suspend by the common ARM/ARM64 function")
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: 4.1+ <stable@vger.kernel.org> # 4.1+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The ktime_get() can have a non negligeable overhead, use local_clock()
instead.
In order to test the difference between ktime_get() and local_clock(),
a quick hack has been added to trigger, via debugfs, 10000 times a
call to ktime_get() and local_clock() and measure the elapsed time.
Then the average value, the min and max is computed for each call.
From userspace, the test above was called 100 times every 2 seconds.
So, ktime_get() and local_clock() have been called 1000000 times in
total.
The results are:
ktime_get():
============
* average: 101 ns (stddev: 27.4)
* maximum: 38313 ns
* minimum: 65 ns
local_clock():
==============
* average: 60 ns (stddev: 9.8)
* maximum: 13487 ns
* minimum: 46 ns
The local_clock() is faster and more stable.
Even if it is a drop in the ocean, changing the ktime_get() by the
local_clock() allows to save 80ns at idle time (entry + exit). And
in some circumstances, especially when there are several CPUs racing
for the clock access, we save tens of microseconds.
The idle duration resulting from a diff is converted from nanosec to
microsec. This could be done with integer division (div 1000) - which is
an expensive operation or by 10 bits shifting (div 1024) - which is fast
but unprecise.
The following table gives some results at the limits.
------------------------------------------
| nsec | div(1000) | div(1024) |
------------------------------------------
| 1e3 | 1 usec | 976 nsec |
------------------------------------------
| 1e6 | 1000 usec | 976 usec |
------------------------------------------
| 1e9 | 1000000 usec | 976562 usec |
------------------------------------------
There is a linear deviation of 2.34%. This loss of precision is acceptable
in the context of the resulting diff which is used for statistics. These
ones are processed to guess estimate an approximation of the duration of the
next idle period which ends up into an idle state selection. The selection
criteria takes into account the next duration based on large intervals,
represented by the idle state's target residency.
The 2^10 division is enough because the approximation regarding the 1e3
division is lost in all the approximations done for the next idle duration
computation.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Currently the 'registered' member of the cpuidle_device struct is set
to 1 during cpuidle_register_device. In this same function there are
checks to see if the device is already registered to prevent duplicate
calls to register the device, but this value is never set to 0 even on
unregister of the device. Because of this, any attempt to call
cpuidle_register_device after a call to cpuidle_unregister_device will
fail which shouldn't be the case.
To prevent this, set registered to 0 when the device is unregistered.
Fixes: c878a52d3c (cpuidle: Check if device is already registered)
Signed-off-by: Dave Gerlach <d-gerlach@ti.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit a9ceb78bc7 (cpuidle,menu: use interactivity_req to disable
polling) changed the behavior of the fallback state selection part
of menu_select() so it looks at interactivity_req instead of
data->next_timer_us when it makes its decision. That effectively
caused polling to be used more often as fallback idle which led to
significant increases of energy consumption in some cases.
Commit e132b9b3bc (cpuidle: menu: use high confidence factors
only when considering polling) changed that logic again to be more
predictable, but that didn't help with the increased energy
consumption problem.
For this reason, go back to making decisions on which state to fall
back to based on data->next_timer_us which is the time we know for
sure something will happen rather than a prediction (which may be
inaccurate and turns out to be so often enough to be problematic).
However, take the target residency of the first proper idle state
(C1) into account, so that state is not used as the fallback one
if its target residency is greater than data->next_timer_us.
Fixes: a9ceb78bc7 (cpuidle,menu: use interactivity_req to disable polling)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
The menu governor uses five different factors to pick the
idle state:
- the user configured latency_req
- the time until the next timer (next_timer_us)
- the typical sleep interval, as measured recently
- an estimate of sleep time by dividing next_timer_us by an observed factor
- a load corrected version of the above, divided again by load
Only the first three items are known with enough confidence that
we can use them to consider polling, instead of an actual CPU
idle state, because the cost of being wrong about polling can be
excessive power use.
The latter two are used in the menu governor's main selection
loop, and can result in choosing a shallower idle state when
the system is expected to be busy again soon.
This pushes a busy system in the "performance" direction of
the performance<>power tradeoff, when choosing between idle
states, but stays more strictly on the "power" state when
deciding between polling and C1.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
We know that the avg variable actually ends up holding a 32 bit
quantity, since it's an average of such numbers. It is only a u64
because it is temporarily used to hold the sum. Making it an actual
u32 allows gcc to generate slightly better code, e.g. when computing
the square, it can do a 32x32->64 multiply.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Computing the integer square root is a rather expensive operation, at
least compared to doing a 64x64 -> 64 multiply (avg*avg) and, on 64
bit platforms, doing an extra comparison to a constant (variance <=
U64_MAX/36).
On 64 bit platforms, this does mean that we add a restriction on the
range of the variance where we end up using the estimate (since
previously the stddev <= ULONG_MAX was a tautology), but on the other
hand, we extend the range quite substantially on 32 bit platforms - in
both cases, we now allow standard deviations up to 715 seconds, which
is for example guaranteed if all observations are less than 1430
seconds.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* pm-cpuidle:
cpuidle: coupled: remove unused define cpuidle_coupled_lock
cpuidle: fix fallback mechanism for suspend to idle in absence of enter_freeze
* pm-cpufreq:
cpufreq: cpufreq-dt: avoid uninitialized variable warnings:
cpufreq: pxa2xx: fix pxa_cpufreq_change_voltage prototype
cpufreq: Use list_is_last() to check last entry of the policy list
cpufreq: Fix NULL reference crash while accessing policy->governor_data
* pm-domains:
PM / Domains: Fix typo in comment
PM / Domains: Fix potential deadlock while adding/removing subdomains
PM / domains: fix lockdep issue for all subdomains
* pm-sleep:
PM: APM_EMULATION does not depend on PM
This was found with the -RT patch enabled, but the fix should apply to
non-RT also.
Used multi_v7_defconfig+PREEMPT_RT_FULL=y and this caused a compilation
warning without this fix:
../drivers/cpuidle/coupled.c:122:21: warning: 'cpuidle_coupled_lock'
defined but not used [-Wunused-variable]
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 51164251f5 "sched / idle: Drop default_idle_call() fallback
from call_cpuidle()" made find_deepest_state() return non-negative
value and check all the states with index > 0. Also as a result,
find_deepest_state() returns 0 even when enter_freeze callbacks are not
implemented and enter_freeze_proper() is called which ends up crashing
the kernel.
This patch updates the check for index > 0 in cpuidle_enter_freeze and
cpuidle_idle_call(when idle_should_freeze is true) to restore the
suspend-to-idle functionality in absence of enter_freeze callback.
Fixes: 51164251f5 "sched / idle: Drop default_idle_call() fallback from call_cpuidle()"
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Modify the driver core and the USB subsystem to allow USB devices
to stay suspended over system suspend/resume cycles if they have
been runtime-suspended already beforehand and fix some bugs on
top of these changes (Tomeu Vizoso, Rafael Wysocki).
- Update ACPICA to upstream revision 20160108, including updates
of the ACPICA's copyright notices, a code fixup resulting from
a regression fix that was necessary in the upstream code only
(the regression fixed by it has never been present in Linux)
and a compiler warning fix (Bob Moore, Lv Zheng).
- Fix a recent regression in the cpuidle menu governor that broke
it on practically all architectures other than x86 and make a
couple of optimizations on top of that fix (Rafael Wysocki).
- Clean up the selection of cpuidle governors depending on whether
or not the kernel is configured for tickless systems (Jean Delvare).
- Revert a recent commit that introduced a regression in the ACPI
backlight driver, address the problem it attempted to fix in a
different way and revert one more cosmetic change depending on
the problematic commit (Hans de Goede).
- Add two more ACPI backlight quirks (Hans de Goede).
- Fix a few minor problems in the core devfreq code, clean it up
a bit and update the MAINTAINERS information related to it
(Chanwoo Choi, MyungJoo Ham).
- Improve an error message in the ACPI fan driver (Andy Lutomirski).
- Fix a recent build regression in the cpupower tool (Shreyas Prabhu).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJWoCQ4AAoJEILEb/54YlRxLscQALEFVKSRnNaco72OqqRZs9Bu
1RI6TgHTpZxR+Ef0+QWqE1QMnDwfImGhKDbSRm/t3S2sMYYZbAOL8cu4y6GmkBv4
bOon/f9WEoPlQCFoo/6U4u8H45rNT5W9zX5+Bva8x+4Wu3n2J1QdvirnS5JHeHe1
o6tGLaHuZXSwX8SLnCk8gJYK1VhATxbubJtpcVtvlnAhO11qUAwsscCrkUmB60i7
5hLyrZb06hoa/hZVcIefGFuSd9qPhzDMQE2M20EohQ7UVkNJQdY9QNHMqCk2P42T
nMWCNSwGnwfiO1p9ByXqunOFBCmyL7P+KV/DHsz6TFCVjz+jeG53Kqey9SkSJ/2W
iaAE80K9MfOMvg8j7rib6fTn5uXBwRfqdeUDF/Hr64QqJoRn3R2LX4HmZe4L8ufb
zA1rece67o8FD+7p7GkNbT3rPV/kA62tn/moFk446X5N+b261Kz90t1DVci8kRVf
k+1gcvEdqO0GPpEHoirfXrBvQFixqkXakKj4r2aAob/DldQeLX7CkOUuRRJ1ykec
bxwI9R0v8MlVe5rDxg+rPB0I9EFxRDmxqxpU5j0MRWxKnMRzLvBtHuk8YNVS/eU1
xwyJOGcwF6yI0PaCFggPqmhebSrWLE7wJxaK+3bC+yiDTvHYPjB+4MfQrmkRAwwM
azgb+ZgXDYx5wXeb8EjB
=bKJ9
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management and ACPI updates from Rafael Wysocki:
"This includes fixes on top of the previous batch of PM+ACPI updates
and some new material as well.
From the new material perspective the most significant are the driver
core changes that should allow USB devices to stay suspended over
system suspend/resume cycles if they have been runtime-suspended
already beforehand. Apart from that, ACPICA is updated to upstream
revision 20160108 (cosmetic mostly, but including one fixup on top of
the previous ACPICA update) and there are some devfreq updates the
didn't make it before (due to timing).
A few recent regressions are fixed, most importantly in the cpuidle
menu governor and in the ACPI backlight driver and some x86 platform
drivers depending on it.
Some more bugs are fixed and cleanups are made on top of that.
Specifics:
- Modify the driver core and the USB subsystem to allow USB devices
to stay suspended over system suspend/resume cycles if they have
been runtime-suspended already beforehand and fix some bugs on top
of these changes (Tomeu Vizoso, Rafael Wysocki).
- Update ACPICA to upstream revision 20160108, including updates of
the ACPICA's copyright notices, a code fixup resulting from a
regression fix that was necessary in the upstream code only (the
regression fixed by it has never been present in Linux) and a
compiler warning fix (Bob Moore, Lv Zheng).
- Fix a recent regression in the cpuidle menu governor that broke it
on practically all architectures other than x86 and make a couple
of optimizations on top of that fix (Rafael Wysocki).
- Clean up the selection of cpuidle governors depending on whether or
not the kernel is configured for tickless systems (Jean Delvare).
- Revert a recent commit that introduced a regression in the ACPI
backlight driver, address the problem it attempted to fix in a
different way and revert one more cosmetic change depending on the
problematic commit (Hans de Goede).
- Add two more ACPI backlight quirks (Hans de Goede).
- Fix a few minor problems in the core devfreq code, clean it up a
bit and update the MAINTAINERS information related to it (Chanwoo
Choi, MyungJoo Ham).
- Improve an error message in the ACPI fan driver (Andy Lutomirski).
- Fix a recent build regression in the cpupower tool (Shreyas
Prabhu)"
* tag 'pm+acpi-4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
cpuidle: menu: Avoid pointless checks in menu_select()
sched / idle: Drop default_idle_call() fallback from call_cpuidle()
cpupower: Fix build error in cpufreq-info
cpuidle: Don't enable all governors by default
cpuidle: Default to ladder governor on ticking systems
time: nohz: Expose tick_nohz_enabled
ACPICA: Update version to 20160108
ACPICA: Silence a -Wbad-function-cast warning when acpi_uintptr_t is 'uintptr_t'
ACPICA: Additional 2016 copyright changes
ACPICA: Reduce regression fix divergence from upstream ACPICA
ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Satellite R830
ACPI / video: Revert "thinkpad_acpi: Use acpi_video_handles_brightness_key_presses()"
ACPI / video: Document acpi_video_handles_brightness_key_presses() a bit
ACPI / video: Fix using an uninitialized mutex / list_head in acpi_video_handles_brightness_key_presses()
ACPI / video: Revert "ACPI / video: driver must be registered before checking for keypresses"
ACPI / fan: Improve acpi_device_update_power error message
ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Portege R700
cpuidle: menu: Fix menu_select() for CPUIDLE_DRIVER_STATE_START == 0
MAINTAINERS: Add devfreq-event entry
MAINTAINERS: Add missing git repository and directory for devfreq
...
If menu_select() cannot find a suitable state to return, it will
return the state index stored in data->last_state_idx. This
means that it is pointless to look at the states whose indices
are less than or equal to data->last_state_idx in the main loop,
so don't do that.
Given that those checks are done on every idle state selection, this
change can save quite a bit of completely unnecessary overhead.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
After commit 9c4b2867ed (cpuidle: menu: Fix menu_select() for
CPUIDLE_DRIVER_STATE_START == 0) it is clear that menu_select()
cannot return negative values. Moreover, ladder_select_state()
will never return a negative value too, so make find_deepest_state()
return non-negative values too and drop the default_idle_call()
fallback from call_cpuidle().
This eliminates one branch from the idle loop and makes the governors
and find_deepest_state() handle the case when all states have been
disabled from sysfs consistently.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Since commit d6f346f2d2 (cpuidle: improve governor Kconfig options)
the best cpuidle governor is selected automatically. There is little
point in additionally selecting the other one as it will not be used,
so don't select both governors by default.
Being able to select more than one governor is still good for
developers booting with cpuidle_sysfs_switch, though.
This fixes the second half of kernel BZ #65531.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=65531
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor is currently the default on all systems. However the
documentation claims that the ladder governor is preferred on ticking
systems. So bump the rating of the ladder governor when NO_HZ is
disabled, or when booting with nohz=off.
This fixes the first half of kernel BZ #65531.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=65531
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
- Optimise FP/VMX/VSX context switching from Anton Blanchard
- Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica Gupta,
Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling, Andrew Donnellan
- Allow wrapper to work on non-english system from Laurent Vivier
- Add rN aliases to the pt_regs_offset table from Rashmica Gupta
- Fix module autoload for rackmeter & axonram drivers from Luis de Bethencourt
- Include KVM guest test in all interrupt vectors from Paul Mackerras
- Fix DSCR inheritance over fork() from Anton Blanchard
- Make value-returning atomics & {cmp}xchg* & their atomic_ versions fully ordered from Boqun Feng
- Print MSR TM bits in oops messages from Michael Neuling
- Add TM signal return & invalid stack selftests from Michael Neuling
- Limit EPOW reset event warnings from Vipin K Parashar
- Remove the Cell QPACE code from Rashmica Gupta
- Append linux_banner to exception information in xmon from Rashmica Gupta
- Add selftest to check if VSRs are corrupted from Rashmica Gupta
- Remove broken GregorianDay() from Daniel Axtens
- Import Anton's context_switch2 benchmark into selftests from Michael Ellerman
- Add selftest script to test HMI functionality from Daniel Axtens
- Remove obsolete OPAL v2 support from Stewart Smith
- Make enter_rtas() private from Michael Ellerman
- PPR exception cleanups from Michael Ellerman
- Add page soft dirty tracking from Laurent Dufour
- Add support for Nvlink NPUs from Alistair Popple
- Add support for kexec on 476fpe from Alistair Popple
- Enable kernel CPU dlpar from sysfs from Nathan Fontenot
- Copy only required pieces of the mm_context_t to the paca from Michael Neuling
- Add a kmsg_dumper that flushes OPAL console output on panic from Russell Currey
- Implement save_stack_trace_regs() to enable kprobe stack tracing from Steven Rostedt
- Add HWCAP bits for Power9 from Michael Ellerman
- Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
- Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
- scripts/recordmcount.pl: support data in text section on powerpc from Ulrich Weigand
- Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
- cxl: Fix possible idr warning when contexts are released from Vaibhav Jain
- cxl: use correct operator when writing pcie config space values from Andrew Donnellan
- cxl: Fix DSI misses when the context owning task exits from Vaibhav Jain
- cxl: fix build for GCC 4.6.x from Brian Norris
- cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
- cxl: Enable PCI device ID for future IBM CXL adapter from Uma Krishnan
- Freescale updates from Scott: Highlights include moving QE code out of
arch/powerpc (to be shared with arm), device tree updates, and minor fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWmIxeAAoJEFHr6jzI4aWAA+cQAIXAw4WfVWJ2V4ZK+1eKfB57
fdXG71PuXG+WYIWy71ly8keLHdzzD1NQ2OUB64bUVRq202nRgVc15ZYKRJ/FE/sP
SkxaQ2AG/2kI2EflWshOi0Lu9qaZ+LMHJnszIqE/9lnGSB2kUI/cwsSXgziiMKXR
XNci9v14SdDd40YV/6BSZXoxApwyq9cUbZ7rnzFLmz4hrFuKmB/L3LABDF8QcpH7
sGt/YaHGOtqP0UX7h5KQTFLGe1OPvK6NWixSXeZKQ71ED6cho1iKUEOtBA9EZeIN
QM5JdHFWgX8MMRA0OHAgidkSiqO38BXjmjkVYWoIbYz7Zax3ThmrDHB4IpFwWnk3
l7WBykEXY7KEqpZzbh0GFGehZWzVZvLnNgDdvpmpk/GkPzeYKomBj7ZZfm3H1yGD
BTHPwuWCTX+/K75yEVNO8aJO12wBg7DRl4IEwBgqhwU8ga4FvUOCJkm+SCxA1Dnn
qlpS7qPwTXNIEfKMJcxp5X0KiwDY1EoOotd4glTN0jbeY5GEYcxe+7RQ302GrYxP
zcc8EGLn8h6BtQvV3ypNHF5l6QeTW/0ZlO9c236tIuUQ5gQU39SQci7jQKsYjSzv
BB1XdLHkbtIvYDkmbnr1elbeJCDbrWL9rAXRUTRyfuCzaFWTfZmfVNe8c8qwDMLk
TUxMR/38aI7bLcIQjwj9
=R5bX
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Core:
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
- Optimise FP/VMX/VSX context switching from Anton Blanchard
Misc:
- Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica
Gupta, Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling,
Andrew Donnellan
- Allow wrapper to work on non-english system from Laurent Vivier
- Add rN aliases to the pt_regs_offset table from Rashmica Gupta
- Fix module autoload for rackmeter & axonram drivers from Luis de
Bethencourt
- Include KVM guest test in all interrupt vectors from Paul Mackerras
- Fix DSCR inheritance over fork() from Anton Blanchard
- Make value-returning atomics & {cmp}xchg* & their atomic_ versions
fully ordered from Boqun Feng
- Print MSR TM bits in oops messages from Michael Neuling
- Add TM signal return & invalid stack selftests from Michael Neuling
- Limit EPOW reset event warnings from Vipin K Parashar
- Remove the Cell QPACE code from Rashmica Gupta
- Append linux_banner to exception information in xmon from Rashmica
Gupta
- Add selftest to check if VSRs are corrupted from Rashmica Gupta
- Remove broken GregorianDay() from Daniel Axtens
- Import Anton's context_switch2 benchmark into selftests from
Michael Ellerman
- Add selftest script to test HMI functionality from Daniel Axtens
- Remove obsolete OPAL v2 support from Stewart Smith
- Make enter_rtas() private from Michael Ellerman
- PPR exception cleanups from Michael Ellerman
- Add page soft dirty tracking from Laurent Dufour
- Add support for Nvlink NPUs from Alistair Popple
- Add support for kexec on 476fpe from Alistair Popple
- Enable kernel CPU dlpar from sysfs from Nathan Fontenot
- Copy only required pieces of the mm_context_t to the paca from
Michael Neuling
- Add a kmsg_dumper that flushes OPAL console output on panic from
Russell Currey
- Implement save_stack_trace_regs() to enable kprobe stack tracing
from Steven Rostedt
- Add HWCAP bits for Power9 from Michael Ellerman
- Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
- Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
- scripts/recordmcount.pl: support data in text section on powerpc
from Ulrich Weigand
- Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
cxl:
- cxl: Fix possible idr warning when contexts are released from
Vaibhav Jain
- cxl: use correct operator when writing pcie config space values
from Andrew Donnellan
- cxl: Fix DSI misses when the context owning task exits from Vaibhav
Jain
- cxl: fix build for GCC 4.6.x from Brian Norris
- cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
- cxl: Enable PCI device ID for future IBM CXL adapter from Uma
Krishnan
Freescale:
- Freescale updates from Scott: Highlights include moving QE code out
of arch/powerpc (to be shared with arm), device tree updates, and
minor fixes"
* tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (149 commits)
powerpc/module: Handle R_PPC64_ENTRY relocations
scripts/recordmcount.pl: support data in text section on powerpc
powerpc/powernv: Fix OPAL_CONSOLE_FLUSH prototype and usages
powerpc/mm: fix _PAGE_SWP_SOFT_DIRTY breaking swapoff
powerpc/mm: Fix _PAGE_PTE breaking swapoff
cxl: Enable PCI device ID for future IBM CXL adapter
cxl: use -Werror only with CONFIG_PPC_WERROR
cxl: fix build for GCC 4.6.x
powerpc: Add HWCAP bits for Power9
powerpc/powernv: Reserve PE#0 on NPU
powerpc/powernv: Change NPU PE# assignment
powerpc/powernv: Fix update of NVLink DMA mask
powerpc/powernv: Remove misleading comment in pci.c
powerpc: Implement save_stack_trace_regs() to enable kprobe stack tracing
powerpc: Fix build break due to paca mm_context_t changes
cxl: Fix DSI misses when the context owning task exits
MAINTAINERS: Update Scott Wood's e-mail address
powerpc/powernv: Fix minor off-by-one error in opal_mce_check_early_recovery()
powerpc: Fix style of self-test config prompts
powerpc/powernv: Only delay opal_rtc_read() retry when necessary
...
Commit a9ceb78bc7 (cpuidle,menu: use interactivity_req to disable
polling) exposed a bug in menu_select() causing it to return -1
on systems with CPUIDLE_DRIVER_STATE_START equal to zero, although
it should have returned 0. As a result, idle states are not entered
by CPUs on those systems.
Namely, on the systems in question data->last_state_idx is initially
equal to -1 and the above commit modified the condition that would
have caused it to be changed to 0 to be less likely to trigger which
exposed the problem. However, setting data->last_state_idx initially
to -1 doesn't make sense at all and on the affected systems it should
always be set to CPUIDLE_DRIVER_STATE_START (ie. 0) unconditionally,
so make that happen.
Fixes: a9ceb78bc7 (cpuidle,menu: use interactivity_req to disable polling)
Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Long ago, only in the lab, there was OPALv1 and OPALv2. Now there is
just OPALv3, with nobody ever expecting anything on pre-OPALv3 to
be cared about or supported by mainline kernels.
So, let's remove FW_FEATURE_OPALv3 and instead use FW_FEATURE_OPAL
exclusively.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The Kconfig currently controlling compilation of this code is:
cpuidle/Kconfig.arm:config ARM_EXYNOS_CPUIDLE
cpuidle/Kconfig.arm: bool "Cpu Idle Driver for the Exynos processors"
...meaning that it currently is not being built as a module by anyone.
Lets remove the couple traces of modularity so that when reading the
driver there is no doubt it is builtin-only.
Since module_platform_driver() uses the same init level priority as
builtin_platform_driver() the init ordering remains unchanged with
this commit.
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Kconfig currently controlling compilation of this code is:
cpuidle/Kconfig.arm:config ARM_U8500_CPUIDLE
cpuidle/Kconfig.arm: bool "Cpu Idle Driver for the ST-E u8500 processors"
...meaning that it currently is not being built as a module by anyone.
Lets remove the couple traces of modularity so that when reading the
driver there is no doubt it is builtin-only.
Since module_platform_driver() uses the same init level priority as
builtin_platform_driver() the init ordering remains unchanged with
this commit.
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The Kconfig currently controlling compilation of this code is:
drivers/cpuidle/Kconfig.arm:config ARM_CLPS711X_CPUIDLE
drivers/cpuidle/Kconfig.arm: bool "CPU Idle Driver for CLPS711X processors"
...meaning that it currently is not being built as a module by anyone.
Lets remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.
Since module_platform_driver() uses the same init level priority as
builtin_platform_driver() the init ordering remains unchanged with
this commit.
Also note that MODULE_DEVICE_TABLE is a no-op for non-modular code.
We also delete the MODULE_LICENSE tag etc. since all that information
is already contained at the top of the file in the comments.
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpuidle state tables contain the maximum exit latency for each
cpuidle state. On x86, that is the exit latency for when the entire
package goes into that same idle state.
However, a lot of the time we only go into the core idle state,
not the package idle state. This means we see a much smaller exit
latency.
We have no way to detect whether we went into the core or package
idle state while idle, and that is ok.
However, the current menu_update logic does have the potential to
trip up the repeating pattern detection in get_typical_interval.
If the system is experiencing an exit latency near the idle state's
exit latency, some of the samples will have exit_us subtracted,
while others will not. This turns a repeating pattern into mush,
potentially breaking get_typical_interval.
Furthermore, for smaller sleep intervals, we know the chance that
all the cores in the package went to the same idle state are fairly
small. Dividing the measured_us by two, instead of subtracting the
full exit latency when hitting a small measured_us, will reduce the
error.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The menu governor carefully figures out how much time we typically
sleep for an estimated sleep interval, or whether there is a repeating
pattern going on, and corrects that estimate for the CPU load.
Then it proceeds to ignore that information when determining whether
or not to consider polling. This is not a big deal on most x86 CPUs,
which have very low C1 latencies, and the patch should not have any
effect on those CPUs.
However, certain CPUs (eg. Atom) have much higher C1 latencies, and
it would be good to not waste performance and power on those CPUs if
we are expecting a very low wakeup latency.
Disable polling based on the estimated interactivity requirement, not
on the time to the next timer interrupt.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The cpuidle menu governor has a forced cut-off for polling at 5us,
in order to deal with firmware that gives the OS bad information
on cpuidle states, leading to the system spending way too much time
in polling.
However, at least one x86 CPU family (Atom) has chips that have
a 20us break-even point for C1. Forcing the polling cut-off to
less than that wastes performance and power.
Increase the polling cut-off to 20us.
Systems with a lower C1 latency will be found in the states table by
the menu governor, which will pick those states as appropriate.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
As the driver doesn't support unbinding, nor does it support arbitary
binding of devices, disable the bind/unbind attributes for this driver.
Also, as the driver has no remove function, it can never be modular,
so use builtin_platform_driver() to avoid the module exit boilerplate.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
There's no need to use multiple platform drivers, especially when we
want to do something different in the probe, but we still use a common
probe function.
We can use the platform ID system to only register one platform driver,
but have it match several devices, and give us the CPU idle driver via
the ID's driver_data.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
- Build fix for the new Mediatek MT8173 cpufreq driver (Guenter Roeck).
- Generic power domains framework fixes (power on error code
path, subdomain removal) and cleanup of a deprecated API user
(Geert Uytterhoeven, Jon Hunter, Ulf Hansson).
- cpufreq-dt driver fixes including two fixes for bugs related to
the new Operating Performance Points Device Tree bindings
introduced recently (Viresh Kumar).
- Suspend frequency support for the cpufreq-dt driver
(Bartlomiej Zolnierkiewicz, Viresh Kumar).
- cpufreq core cleanups (Viresh Kumar).
- intel_pstate driver fixes (Chen Yu, Kristen Carlson Accardi).
- Additional sanity check in the cpuidle core (Xunlei Pang).
- Fix for a comment related to CPU power management (Lina Iyer).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJV80XwAAoJEILEb/54YlRxIAkP/RrT4mDO9QZkocO+XYDBB739
oi1aaSDmmVVKlFtFiyX4njS7S+U5UC8PYn5o/fFtVnP1kD9yuZkoak65U3t7bsKK
+rjun1SzJeooaLXW4ZAJKyPldi/RrsfRJ2JOaEt2vK03ppjx2mouK1VpjNj+qmYd
wwFb4q5S5Rcpph0Sx+xxhpZKoOgBmXk7nWIe5bWDxuORdx4yoxp4lus/P223wFJy
r0kJnNyY55mnY4Yx5e3L8W1rMC/Sf6mt8pyMvqRpsKdBHaEbhmGsiERSi0eCZJwr
AmL666TVgPk6zNM/yc3mFOTRZeAF3soSWF5R6+n2TswWvWFH+ksM14g9ndb7fTv3
FUuOU3jyfSUj2T4xr0qUZLaQUFB3xNrnWBNnm/CzPwjJtH1u1SK8uBSRqdZnEdGq
EeqwnwDLFeSAbG6lQVsKVVAPZHolDCnJPsvZebKdnJH09kjgUdm60BLOnif4QDgw
ULPZ6NToFZkrQSMAX1Uqm42iv/qWcHIgS5x43jAmblY3TsuByqRrI7gBCy+hbGSN
cQfeCxTdVPi/tUPYlJFmWn43dgEsnceSFKvlQRXGrZJM29BFvhVzJQjTFC8Oj79q
OP/Rv28+VeV5J531BQu+w8cvg73GJ3DDab+i91ZkIsd7ZNjRYQKXBfR9TImTSD4l
BPjvaPXwDg6EPhwjcpK/
=C7+4
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management and ACPI updates from Rafael Wysocki:
"These are mostly fixes and cleanups on top of the previous PM+ACPI
pull request (cpufreq core and drivers, cpuidle, generic power domains
framework). Some of them didn't make to that pull request and some
fix issues introduced by it.
The only really new thing is the support for suspend frequency in the
cpufreq-dt driver, but it is needed to fix an issue with Exynos
platforms.
Specifics:
- build fix for the new Mediatek MT8173 cpufreq driver (Guenter
Roeck).
- generic power domains framework fixes (power on error code path,
subdomain removal) and cleanup of a deprecated API user (Geert
Uytterhoeven, Jon Hunter, Ulf Hansson).
- cpufreq-dt driver fixes including two fixes for bugs related to the
new Operating Performance Points Device Tree bindings introduced
recently (Viresh Kumar).
- suspend frequency support for the cpufreq-dt driver (Bartlomiej
Zolnierkiewicz, Viresh Kumar).
- cpufreq core cleanups (Viresh Kumar).
- intel_pstate driver fixes (Chen Yu, Kristen Carlson Accardi).
- additional sanity check in the cpuidle core (Xunlei Pang).
- fix for a comment related to CPU power management (Lina Iyer)"
* tag 'pm+acpi-4.3-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
intel_pstate: fix PCT_TO_HWP macro
intel_pstate: Fix user input of min/max to legal policy region
PM / OPP: Return suspend_opp only if it is enabled
cpufreq-dt: add suspend frequency support
cpufreq: allow cpufreq_generic_suspend() to work without suspend frequency
PM / OPP: add dev_pm_opp_get_suspend_opp() helper
staging: board: Migrate away from __pm_genpd_name_add_device()
cpufreq: Use __func__ to print function's name
cpufreq: staticize cpufreq_cpu_get_raw()
PM / Domains: Ensure subdomain is not in use before removing
cpufreq: Add ARM_MT8173_CPUFREQ dependency on THERMAL
cpuidle/coupled: Add sanity check for safe_state_index
PM / Domains: Try power off masters in error path of __pm_genpd_poweron()
cpufreq: dt: Tolerance applies on both sides of target voltage
cpufreq: dt: Print error on failing to mark OPPs as shared
cpufreq: dt: Check OPP count before marking them shared
kernel/cpu_pm: fix cpu_cluster_pm_exit comment