IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
- Standardize on prefixing scheduler-internal functions defined
in <linux/sched.h> with sched_*() prefix. scheduler_tick() was
the only function using the scheduler_ prefix. Harmonize it.
- The other reason to rename it is the NOHZ scheduler tick
handling functions are already named sched_tick_*().
Make the 'git grep sched_tick' more meaningful.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20240308111819.1101550-3-mingo@kernel.org
run_rebalance_domains() is a misnomer, as it doesn't only
run rebalance_domains(), but since the introduction of the
NOHZ code it also runs nohz_idle_balance().
Rename it to sched_balance_softirq(), reflecting its more
generic purpose and that it's a softirq handler.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20240308111819.1101550-2-mingo@kernel.org
The first sentence of the comment explaining run_rebalance_domains()
is historic and not true anymore:
* run_rebalance_domains is triggered when needed from the scheduler tick.
... contradicted/modified by the second sentence:
* Also triggered for NOHZ idle balancing (with NOHZ_BALANCE_KICK set).
Avoid that kind of confusion straight away and explain from what
places sched_balance_softirq() is triggered.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20240308105901.1096078-9-mingo@kernel.org
Fix two typos:
- There's no such thing as 'nohz_balancing_kick', the
flag is named 'BALANCE' and is capitalized: NOHZ_BALANCE_KICK.
- Likewise there's no such thing as a 'pending nohz_balance_kick'
either, the NOHZ_BALANCE_KICK flag is all upper-case.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240308105901.1096078-8-mingo@kernel.org
So the scheduler has two such comment blocks, with '=' used as a double underline:
/*
* VRUNTIME
* ========
*
'========' also happens to be a Git conflict marker, throwing off a simple
search in an editor for this pattern.
Change them to '-------' type of underline instead - it looks just as good.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240308105901.1096078-7-mingo@kernel.org
We changed the order of definitions within 'enum cpu_idle_type',
which changed the order of [CPU_MAX_IDLE_TYPES] columns in
show_schedstat().
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Link: https://lore.kernel.org/r/20240308105901.1096078-5-mingo@kernel.org
The cpu_idle_type enum has the confusingly inverted property
that 'not idle' is 1, and 'idle' is '0'.
This resulted in a number of unnecessary complications in the code.
Reverse the order, remove the CPU_NOT_IDLE type, and convert
all code to a natural boolean form.
It's much more readable:
- enum cpu_idle_type idle = this_rq->idle_balance ?
- CPU_IDLE : CPU_NOT_IDLE;
-
+ enum cpu_idle_type idle = this_rq->idle_balance;
--------------------------------
- if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+ if (!env->idle || !busiest->sum_nr_running)
--------------------------------
And gets rid of the double negation in these usages:
- if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1)
+ if (env->idle && env->src_rq->nr_running <= 1)
Furthermore, this makes code much more obvious where there's
differentiation between CPU_IDLE and CPU_NEWLY_IDLE.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Link: https://lore.kernel.org/r/20240308105901.1096078-4-mingo@kernel.org
show_schedstat() output breaks and doesn't print all entries
if the ordering of the definitions in 'enum cpu_idle_type' is changed,
because show_schedstat() assumes that 'CPU_IDLE' is 0.
Fix it before we change the definition order & values.
[ mingo: Added changelog. ]
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240308105901.1096078-3-mingo@kernel.org
The 'balancing' spinlock added in:
08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
... is taken when the SD_SERIALIZE flag is set in a domain, but in reality it
is a glorified global atomic flag serializing the load-balancing of
those domains.
It doesn't have any explicit locking semantics per se: we just
spin_trylock() it.
Turn it into a ... global atomic flag. This makes it more
clear what is going on here, and reduces overhead and code
size a bit:
# kernel/sched/fair.o: [x86-64 defconfig]
text data bss dec hex filename
60730 2721 104 63555 f843 fair.o.before
60718 2721 104 63543 f837 fair.o.after
Also document the flag a bit.
No change in functionality intended.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://lore.kernel.org/r/20240308105901.1096078-2-mingo@kernel.org
- The biggest change is the rework of the percpu code,
to support the 'Named Address Spaces' GCC feature,
by Uros Bizjak:
- This allows C code to access GS and FS segment relative
memory via variables declared with such attributes,
which allows the compiler to better optimize those accesses
than the previous inline assembly code.
- The series also includes a number of micro-optimizations
for various percpu access methods, plus a number of
cleanups of %gs accesses in assembly code.
- These changes have been exposed to linux-next testing for
the last ~5 months, with no known regressions in this area.
- Fix/clean up __switch_to()'s broken but accidentally
working handling of FPU switching - which also generates
better code.
- Propagate more RIP-relative addressing in assembly code,
to generate slightly better code.
- Rework the CPU mitigations Kconfig space to be less idiosyncratic,
to make it easier for distros to follow & maintain these options.
- Rework the x86 idle code to cure RCU violations and
to clean up the logic.
- Clean up the vDSO Makefile logic.
- Misc cleanups and fixes.
[ Please note that there's a higher number of merge commits in
this branch (three) than is usual in x86 topic trees. This happened
due to the long testing lifecycle of the percpu changes that
involved 3 merge windows, which generated a longer history
and various interactions with other core x86 changes that we
felt better about to carry in a single branch. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXvB0gRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jUqRAAqnEQPiabF5acQlHrwviX+cjSobDlqtH5
9q2AQy9qaEHapzD0XMOxvFye6XIvehGOGxSPvk6CoviSxBND8rb56lvnsEZuLeBV
Bo5QSIL2x42Zrvo11iPHwgXZfTIusU90sBuKDRFkYBAxY3HK2naMDZe8MAsYCUE9
nwgHF8DDc/NYiSOXV8kosWoWpNIkoK/STyH5bvTQZMqZcwyZ49AIeP1jGZb/prbC
e/rbnlrq5Eu6brpM7xo9kELO0Vhd34urV14KrrIpdkmUKytW2KIsyvW8D6fqgDBj
NSaQLLcz0pCXbhF+8Nqvdh/1coR4L7Ymt08P1rfEjCsQgb/2WnSAGUQuC5JoGzaj
ngkbFcZllIbD9gNzMQ1n4Aw5TiO+l9zxCqPC/r58Uuvstr+K9QKlwnp2+B3Q73Ft
rojIJ04NJL6lCHdDgwAjTTks+TD2PT/eBWsDfJ/1pnUWttmv9IjMpnXD5sbHxoiU
2RGGKnYbxXczYdq/ALYDWM6JXpfnJZcXL3jJi0IDcCSsb92xRvTANYFHnTfyzGfw
EHkhbF4e4Vy9f6QOkSP3CvW5H26BmZS9DKG0J9Il5R3u2lKdfbb5vmtUmVTqHmAD
Ulo5cWZjEznlWCAYSI/aIidmBsp9OAEvYd+X7Z5SBIgTfSqV7VWHGt0BfA1heiVv
F/mednG0gGc=
=3v4F
-----END PGP SIGNATURE-----
Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
- The biggest change is the rework of the percpu code, to support the
'Named Address Spaces' GCC feature, by Uros Bizjak:
- This allows C code to access GS and FS segment relative memory
via variables declared with such attributes, which allows the
compiler to better optimize those accesses than the previous
inline assembly code.
- The series also includes a number of micro-optimizations for
various percpu access methods, plus a number of cleanups of %gs
accesses in assembly code.
- These changes have been exposed to linux-next testing for the
last ~5 months, with no known regressions in this area.
- Fix/clean up __switch_to()'s broken but accidentally working handling
of FPU switching - which also generates better code
- Propagate more RIP-relative addressing in assembly code, to generate
slightly better code
- Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
make it easier for distros to follow & maintain these options
- Rework the x86 idle code to cure RCU violations and to clean up the
logic
- Clean up the vDSO Makefile logic
- Misc cleanups and fixes
* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
x86/idle: Select idle routine only once
x86/idle: Let prefer_mwait_c1_over_halt() return bool
x86/idle: Cleanup idle_setup()
x86/idle: Clean up idle selection
x86/idle: Sanitize X86_BUG_AMD_E400 handling
sched/idle: Conditionally handle tick broadcast in default_idle_call()
x86: Increase brk randomness entropy for 64-bit systems
x86/vdso: Move vDSO to mmap region
x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
x86/retpoline: Ensure default return thunk isn't used at runtime
x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
x86/vdso: Use $(addprefix ) instead of $(foreach )
x86/vdso: Simplify obj-y addition
x86/vdso: Consolidate targets and clean-files
x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK
x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO
x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY
x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY
x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS
...
- Fix inconsistency in misfit task load-balancing
- Fix CPU isolation bugs in the task-wakeup logic
- Rework & unify the sched_use_asym_prio() and sched_asym_prefer() logic
- Clean up & simplify ->avg_* accesses
- Misc cleanups & fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmXu9V0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gqWBAAvqPlJx/jwNTePiXtxsObmtTnTStnVSM8
8SRxb2uznSFjYj73RdMDUzeYOfweE48elJoUAN7IGX2fgCFjxeDgpPnAyvnU0jFE
X/gJXEO2xCCYsvDnMg1huNSxEJ1ZQl6YJgdd6eLGjBK6l75pkgLJLOSmeFfTShgw
gMk4yIaUrxd/yc/bBvK39gMW1JDXiFIwmHuzfEl0/5k+abzVOU0ZfqFir2OH/GT9
YH8ZNsKKn88i01mp2qzo9LouF7mmOH4dZYd9k0SueH+rW8Z+goSuVF8O3igodL0T
TM5sqqG7qd1WC8SN0zng+OGODmJ+PrN7soKbTZC5NsC+LvipjVZ1Y92dLyS1xhgn
Bpm+NjVNrz9ZWhZiC5LiIF+zDZHu51RDejcOgt1Va6qBIY229GFKLgxFSis/TzzD
7xFpi7ApGCS/Rp9VeIDC69V8ZVfsCPJ7D1oxo5wmLzGe17nThxMeE1AmoWOXOUp8
M9ISbvete8i/8uS8jJQQMylrFceQkzumTVK7p+LqEdlaH0fF/fNKyeH81ZLZMwpM
0pfc7OVFpxd3Rt4wq+db00ilStdfV4yKkVAJiOLfVPyh+tZusvxkKjqXIMrm3RI/
DkZu6/3KYompfVcfkVXbW57Zu+kfgi6kQVt+6yEGrnLcIPkaPR08inEB7vtf6T+R
EBncKVtt1Rs=
=3CZV
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Fix inconsistency in misfit task load-balancing
- Fix CPU isolation bugs in the task-wakeup logic
- Rework and unify the sched_use_asym_prio() and sched_asym_prefer()
logic
- Clean up and simplify ->avg_* accesses
- Misc cleanups and fixes
* tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC
sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio()
sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer()
sched/fair: Remove unused parameter from sched_asym()
sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS
sched/fair: Simplify the update_sd_pick_busiest() logic
sched/fair: Do strict inequality check for busiest misfit task group
sched/fair: Remove unnecessary goto in update_sd_lb_stats()
sched/fair: Take the scheduling domain into account in select_idle_core()
sched/fair: Take the scheduling domain into account in select_idle_smt()
sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq
sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dl
sched/core: Simplify code by removing duplicate #ifdefs
- The hierarchical timer pull model
When timer wheel timers are armed they are placed into the timer wheel
of a CPU which is likely to be busy at the time of expiry. This is done
to avoid wakeups on potentially idle CPUs.
This is wrong in several aspects:
1) The heuristics to select the target CPU are wrong by
definition as the chance to get the prediction right is close
to zero.
2) Due to #1 it is possible that timers are accumulated on a
single target CPU
3) The required computation in the enqueue path is just overhead for
dubious value especially under the consideration that the vast
majority of timer wheel timers are either canceled or rearmed
before they expire.
The timer pull model avoids the above by removing the target
computation on enqueue and queueing timers always on the CPU on which
they get armed.
This is achieved by having separate wheels for CPU pinned timers and
global timers which do not care about where they expire.
As long as a CPU is busy it handles both the pinned and the global
timers which are queued on the CPU local timer wheels.
When a CPU goes idle it evaluates its own timer wheels:
- If the first expiring timer is a pinned timer, then the global
timers can be ignored as the CPU will wake up before they expire.
- If the first expiring timer is a global timer, then the expiry time
is propagated into the timer pull hierarchy and the CPU makes sure
to wake up for the first pinned timer.
The timer pull hierarchy organizes CPUs in groups of eight at the
lowest level and at the next levels groups of eight groups up to the
point where no further aggregation of groups is required, i.e. the
number of levels is log8(NR_CPUS). The magic number of eight has been
established by experimention, but can be adjusted if needed.
In each group one busy CPU acts as the migrator. It's only one CPU to
avoid lock contention on remote timer wheels.
The migrator CPU checks in its own timer wheel handling whether there
are other CPUs in the group which have gone idle and have global timers
to expire. If there are global timers to expire, the migrator locks the
remote CPU timer wheel and handles the expiry.
Depending on the group level in the hierarchy this handling can require
to walk the hierarchy downwards to the CPU level.
Special care is taken when the last CPU goes idle. At this point the
CPU is the systemwide migrator at the top of the hierarchy and it
therefore cannot delegate to the hierarchy. It needs to arm its own
timer device to expire either at the first expiring timer in the
hierarchy or at the first CPU local timer, which ever expires first.
This completely removes the overhead from the enqueue path, which is
e.g. for networking a true hotpath and trades it for a slightly more
complex idle path.
This has been in development for a couple of years and the final series
has been extensively tested by various teams from silicon vendors and
ran through extensive CI.
There have been slight performance improvements observed on network
centric workloads and an Intel team confirmed that this allows them to
power down a die completely on a mult-die socket for the first time in
a mostly idle scenario.
There is only one outstanding ~1.5% regression on a specific overloaded
netperf test which is currently investigated, but the rest is either
positive or neutral performance wise and positive on the power
management side.
- Fixes for the timekeeping interpolation code for cross-timestamps:
cross-timestamps are used for PTP to get snapshots from hardware timers
and interpolated them back to clock MONOTONIC. The changes address a
few corner cases in the interpolation code which got the math and logic
wrong.
- Simplifcation of the clocksource watchdog retry logic to automatically
adjust to handle larger systems correctly instead of having more
incomprehensible command line parameters.
- Treewide consolidation of the VDSO data structures.
- The usual small improvements and cleanups all over the place.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmXuAN0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoVKXEADIR45rjR1Xtz32js7B53Y65O4WNoOQ
6/ycWcswuGzg/h4QUpPSJ6gOGVmKSWwZi4n0P/VadCiXGSPPm0aUKsoRUt9DZsPY
mtj2wjCSXKXiyhTl9OtrZME86ZAIGO1dQXa/sOHsiP5PCjgQkD0b5CYi1+B6eHDt
1/Uo2Tb9g8VAPppq20V5Uo93GrPf642oyi3FCFrR1M112Uuak5DmqHJYiDpreNcG
D5SgI+ykSiaUaVyHifvqijoJk0rYXkqEC6evl02477lJ/X0vVo2/M8XPS95BxHST
s5Iruo4rP+qeAy8QvhZpoPX59fO0m/AgA7cf77XXAtOpVdLH+bs4ILsEbouAIOtv
lsmRkcYt+TpvrZFHPAxks+6g3afuROiDtxD5sXXpVWxvofi8FwWqubdlqdsbw9MP
ZCTNyzNyKL47QeDwBfSynYUL1RSyqsphtIwk4oeQklH9rwMAnW21hi30z15hQ0pQ
FOVkmcwi79JNvl/G+jRkDzw7r8/zcHshWdSjyUM04CDjjnCDjQOFWSIjEPwbQjjz
S4HXpJKJW963dBgs9Z84/Ctw1GwoBk1qedDWDJE1257Qvmo/Wpe/7GddWcazOGnN
RRFMzGPbOqBDbjtErOKGU+iCisgNEvz2XK+TI16uRjWde7DxZpiTVYgNDrZ+/Pyh
rQ23UBms6ZRR+A==
=iQlu
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A large set of updates and features for timers and timekeeping:
- The hierarchical timer pull model
When timer wheel timers are armed they are placed into the timer
wheel of a CPU which is likely to be busy at the time of expiry.
This is done to avoid wakeups on potentially idle CPUs.
This is wrong in several aspects:
1) The heuristics to select the target CPU are wrong by
definition as the chance to get the prediction right is
close to zero.
2) Due to #1 it is possible that timers are accumulated on
a single target CPU
3) The required computation in the enqueue path is just overhead
for dubious value especially under the consideration that the
vast majority of timer wheel timers are either canceled or
rearmed before they expire.
The timer pull model avoids the above by removing the target
computation on enqueue and queueing timers always on the CPU on
which they get armed.
This is achieved by having separate wheels for CPU pinned timers
and global timers which do not care about where they expire.
As long as a CPU is busy it handles both the pinned and the global
timers which are queued on the CPU local timer wheels.
When a CPU goes idle it evaluates its own timer wheels:
- If the first expiring timer is a pinned timer, then the global
timers can be ignored as the CPU will wake up before they
expire.
- If the first expiring timer is a global timer, then the expiry
time is propagated into the timer pull hierarchy and the CPU
makes sure to wake up for the first pinned timer.
The timer pull hierarchy organizes CPUs in groups of eight at the
lowest level and at the next levels groups of eight groups up to
the point where no further aggregation of groups is required, i.e.
the number of levels is log8(NR_CPUS). The magic number of eight
has been established by experimention, but can be adjusted if
needed.
In each group one busy CPU acts as the migrator. It's only one CPU
to avoid lock contention on remote timer wheels.
The migrator CPU checks in its own timer wheel handling whether
there are other CPUs in the group which have gone idle and have
global timers to expire. If there are global timers to expire, the
migrator locks the remote CPU timer wheel and handles the expiry.
Depending on the group level in the hierarchy this handling can
require to walk the hierarchy downwards to the CPU level.
Special care is taken when the last CPU goes idle. At this point
the CPU is the systemwide migrator at the top of the hierarchy and
it therefore cannot delegate to the hierarchy. It needs to arm its
own timer device to expire either at the first expiring timer in
the hierarchy or at the first CPU local timer, which ever expires
first.
This completely removes the overhead from the enqueue path, which
is e.g. for networking a true hotpath and trades it for a slightly
more complex idle path.
This has been in development for a couple of years and the final
series has been extensively tested by various teams from silicon
vendors and ran through extensive CI.
There have been slight performance improvements observed on network
centric workloads and an Intel team confirmed that this allows them
to power down a die completely on a mult-die socket for the first
time in a mostly idle scenario.
There is only one outstanding ~1.5% regression on a specific
overloaded netperf test which is currently investigated, but the
rest is either positive or neutral performance wise and positive on
the power management side.
- Fixes for the timekeeping interpolation code for cross-timestamps:
cross-timestamps are used for PTP to get snapshots from hardware
timers and interpolated them back to clock MONOTONIC. The changes
address a few corner cases in the interpolation code which got the
math and logic wrong.
- Simplifcation of the clocksource watchdog retry logic to
automatically adjust to handle larger systems correctly instead of
having more incomprehensible command line parameters.
- Treewide consolidation of the VDSO data structures.
- The usual small improvements and cleanups all over the place"
* tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
timer/migration: Fix quick check reporting late expiry
tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
vdso/datapage: Quick fix - use asm/page-def.h for ARM64
timers: Assert no next dyntick timer look-up while CPU is offline
tick: Assume timekeeping is correctly handed over upon last offline idle call
tick: Shut down low-res tick from dying CPU
tick: Split nohz and highres features from nohz_mode
tick: Move individual bit features to debuggable mask accesses
tick: Move got_idle_tick away from common flags
tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
tick: Start centralizing tick related CPU hotplug operations
tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
tick: Use IS_ENABLED() whenever possible
tick/sched: Remove useless oneshot ifdeffery
tick/nohz: Remove duplicate between lowres and highres handlers
tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
hrtimer: Select housekeeping CPU during migration
...
Memoryless nodes do not have any memory to migrate to, so, as an
optimization, stop trying it.
Link: https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com
Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
Fixes: c574bbe91703 ("NUMA balancing: optimize page placement for memory tiering system")
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Benjamin Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The x86 architecture has an idle routine for AMD CPUs which are affected
by erratum 400. On the affected CPUs the local APIC timer stops in the
C1E halt state.
It therefore requires tick broadcasting. The invocation of
tick_broadcast_enter()/exit() from this function violates the RCU
constraints because it can end up in lockdep or tracing, which
rightfully triggers a warning.
tick_broadcast_enter()/exit() must be invoked before ct_cpuidle_enter()
and after ct_cpuidle_exit() in default_idle_call().
Add a static branch conditional invocation of tick_broadcast_enter()/exit()
into this function to allow X86 to replace the AMD specific idle code. It's
guarded by a config switch which will be selected by x86. Otherwise it's
a NOOP.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240229142248.266708822@linutronix.de
SD_SHARE_PKG_RESOURCES is a bit of a misnomer: its naming suggests that
it's sharing all 'package resources' - while in reality it's specifically
for sharing the LLC only.
Rename it to SD_SHARE_LLC to reduce confusion.
[ mingo: Rewrote the confusing changelog as well. ]
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Link: https://lore.kernel.org/r/20240210113924.1130448-5-alexs@kernel.org
sched_use_asym_prio() checks whether CPU priorities should be used. It
makes sense to check for the SD_ASYM_PACKING() inside the function.
Since both sched_asym() and sched_group_asym() use sched_use_asym_prio(),
remove the now superfluous checks for the flag in various places.
Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240210113924.1130448-4-alexs@kernel.org
sched_use_asym_prio() and sched_asym_prefer() are used together in various
places. Consolidate them into a single function sched_asym().
The existing sched_asym() function is only used when collecting statistics
of a scheduling group. Rename it as sched_group_asym(), and remove the
obsolete function description.
This makes the code easier to read. No functional changes.
Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240210113924.1130448-3-alexs@kernel.org
The 'sds' argument is not used in the sched_asym() function anymore, remove it.
Fixes: c9ca07886aaa ("sched/fair: Do not even the number of busy CPUs via asym_packing")
Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240210113924.1130448-2-alexs@kernel.org
These flags are already documented in include/linux/sched/sd_flags.h.
Also, add missing SD_CLUSTER and keep the comment on SD_ASYM_PACKING
as it is a special case.
Suggested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240210113924.1130448-1-alexs@kernel.org
When comparing the current struct sched_group with the yet-busiest
domain in update_sd_pick_busiest(), if the two groups have the same
group type, we're currently doing a bit of unnecessary work for any
group >= group_misfit_task. We're comparing the two groups, and then
returning only if false (the group in question is not the busiest).
Otherwise, we break out, do an extra unnecessary conditional check that's
vacuously false for any group type > group_fully_busy, and then always
return true.
Let's just return directly in the switch statement instead. This doesn't
change the size of vmlinux with llvm 17 (not surprising given that all
of this is inlined in load_balance()), but it does shrink load_balance()
by 88 bytes on x86. Given that it also improves readability, this seems
worth doing.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20240206043921.850302-4-void@manifault.com
In update_sd_pick_busiest(), when comparing two sched groups that are
both of type group_misfit_task, we currently consider the new group as
busier than the current busiest group even if the new group has the
same misfit task load as the current busiest group. We can avoid some
unnecessary writes if we instead only consider the newest group to be
the busiest if it has a higher load than the current busiest. This
matches the behavior of other group types where we compare load, such as
two groups that are both overloaded.
Let's update the group_misfit_task type comparison to also only update
the busiest group in the event of strict inequality.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20240206043921.850302-3-void@manifault.com
In update_sd_lb_stats(), when we're iterating over the sched groups that
comprise a sched domain, we're skipping the call to
update_sd_pick_busiest() for the sched group that contains the local /
destination CPU. We use a goto to skip the call, but we could just as
easily check !local_group, as there's no other logic that we need to
skip with the goto. Let's remove the goto, and check for !local_group in
the if statement instead.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20240206043921.850302-2-void@manifault.com
When picking a CPU on task wakeup, select_idle_core() has to take
into account the scheduling domain where the function looks for the CPU.
This is because the "isolcpus" kernel command line option can remove CPUs
from the domain to isolate them from other SMT siblings.
This change replaces the set of CPUs allowed to run the task from
p->cpus_ptr by the intersection of p->cpus_ptr and sched_domain_span(sd)
which is stored in the 'cpus' argument provided by select_idle_cpu().
Fixes: 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240110131707.437301-2-keisuke.nishimura@inria.fr
When picking a CPU on task wakeup, select_idle_smt() has to take
into account the scheduling domain of @target. This is because the
"isolcpus" kernel command line option can remove CPUs from the domain to
isolate them from other SMT siblings.
This fix checks if the candidate CPU is in the target scheduling domain.
Commit:
df3cb4ea1fb6 ("sched/fair: Fix wrong cpu selecting from isolated domain")
... originally introduced this fix by adding the check of the scheduling
domain in the loop.
However, commit:
3e6efe87cd5cc ("sched/fair: Remove redundant check in select_idle_smt()")
... accidentally removed the check. Bring it back.
Fixes: 3e6efe87cd5c ("sched/fair: Remove redundant check in select_idle_smt()")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240110131707.437301-1-keisuke.nishimura@inria.fr
Use existing helper function cpu_util_irq() instead of open-coding
access to ->avg_irq.
During review it was noted that ->avg_irq could be updated by a
different CPU than the one which is trying to access it.
->avg_irq is updated with WRITE_ONCE(), use READ_ONCE to access it
in order to avoid any compiler optimizations.
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240101154624.100981-3-sshegde@linux.vnet.ibm.com
There are helper functions called cpu_util_dl() and cpu_util_rt() which give
the average utilization of DL and RT respectively. But there are a few
places in code where access to these variables is open-coded.
Instead use the helper function so that code becomes simpler and easier to
maintain later on.
No functional changes intended.
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240101154624.100981-2-sshegde@linux.vnet.ibm.com
The timekeeping duty is handed over from the outgoing CPU on stop
machine, then the oneshot tick is stopped right after. Therefore it's
guaranteed that the current CPU isn't the timekeeper upon its last call
to idle.
Besides, calling tick_nohz_idle_stop_tick() while the dying CPU goes
into idle suggests that the tick is going to be stopped while it is
actually stopped already from the appropriate CPU hotplug state.
Remove the confusing call and the obsolete case handling and convert it
to a sanity check that verifies the above assumption.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240225225508.11587-16-frederic@kernel.org
The new helper function is needed to help blk-mq check if it needs to
dispatch the softirq on another CPU to match the performance level the
IO requester is running at. This is important on HMP systems where not
all CPUs have the same compute capacity.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240223155749.2958009-2-qyousef@layalina.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
On some systems, sys_membarrier can be very expensive, causing overall
slowdowns for everything. So put a lock on the path in order to
serialize the accesses to prevent the ability for this to be called at
too high of a frequency and saturate the machine.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Fixes: 22e4ebb97582 ("membarrier: Provide expedited private command")
Fixes: c5f58bd58f43 ("membarrier: Provide GLOBAL_EXPEDITED command")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a few cases of nested #ifdefs in the scheduler code
that can be simplified:
#ifdef DEFINE_A
...code block...
#ifdef DEFINE_A <-- This is a duplicate.
...code block...
#endif
#else
#ifndef DEFINE_A <-- This is also duplicate.
...code block...
#endif
#endif
More details about the script and methods used to find these code
patterns can be found at:
https://lore.kernel.org/all/20240118080326.13137-1-sshegde@linux.ibm.com/
No change in functionality intended.
[ mingo: Clarified the changelog. ]
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20240216061433.535522-1-sshegde@linux.ibm.com
RISC-V uses xRET instructions on return from interrupt and to go back
to user-space; the xRET instruction is not core serializing.
Use FENCE.I for providing core serialization as follows:
- by calling sync_core_before_usermode() on return from interrupt (cf.
ipi_sync_core()),
- via switch_mm() and sync_core_before_usermode() (respectively, for
uthread->uthread and kthread->uthread transitions) before returning
to user-space.
On RISC-V, the serialization in switch_mm() is activated by resetting
the icache_stale_mask of the mm at prepare_sync_core_cmd().
Suggested-by: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/20240131144936.29190-5-parri.andrea@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Introduce an architecture function that architectures can use to set
up ("prepare") SYNC_CORE commands.
The function will be used by RISC-V to update its "deferred icache-
flush" data structures (icache_stale_mask).
Architectures defining prepare_sync_core_cmd() static inline need to
select ARCH_HAS_PREPARE_SYNC_CORE_CMD.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/20240131144936.29190-4-parri.andrea@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
To gather the architecture requirements of the "private/global
expedited" membarrier commands. The file will be expanded to
integrate further information about the membarrier syscall (as
needed/desired in the future). While at it, amend some related
inline comments in the membarrier codebase.
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/20240131144936.29190-3-parri.andrea@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The membarrier system call requires a full memory barrier after storing
to rq->curr, before going back to user-space. The barrier is only
needed when switching between processes: the barrier is implied by
mmdrop() when switching from kernel to userspace, and it's not needed
when switching from userspace to kernel.
Rely on the feature/mechanism ARCH_HAS_MEMBARRIER_CALLBACKS and on the
primitive membarrier_arch_switch_mm(), already adopted by the PowerPC
architecture, to insert the required barrier.
Fixes: fab957c11efe2f ("RISC-V: Atomic and Locking Code")
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/20240131144936.29190-2-parri.andrea@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Mark the task as having a cached timestamp when set assign it, so we
can efficiently check if it needs updating post being scheduled back in.
This covers both the actual schedule out case, which would've flushed
the plug, and the preemption case which doesn't touch the plugged
requests (for many reasons, one of them being then we'd need to have
preemption disabled around plug state manipulation).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
where the CPU would remain at the lowest frequency, degrading
performance substantially.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWpM0sRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1giEg/+Mn9hdLqgE7xPPvCa8UWoJzFGTIYgTT3O
gma5Ras/kqB6cJTb1zn/HocAIj1Y2gZAsRU/U3IpOfPzklwIKQLBID1PE+d0izAc
NC9N0LuPau+XbMY5U+G0YNQZzDW+Zioe/9I6uDRKRTtLTdZAk8Plk9yh+tRtpSG8
aEswyoDOJfvkLbl7kJGymHgxDiDtmXEcz6j2pNlFtcEdHFjiSHo2Jq09DMia9sHr
W563FSvO7DVBMOosKH8sq7sSPdCBi0zshaWDiyz2M7Ry2uBsqJvx+9qxDnloafTp
Yqp5rkSVzOxtQwxjtYD+WWy+AgwQqo+O5FHsm0JmoiGVkmpB95bdhQxk2gtshSCo
IwUt2Gqsndd0JM4v5gOn4G/qCPxFUA/Tx1OMWM89nQUVp3OmIwm8z99f5gFxoSYa
DFn2P2Ku/A/fiKfWcNDOCyMgYcJNmqRKSjWEh+mfFeexiuWR3jPrQ4GKbSl9Gusw
vLmBM9pMSyGvivptu+ALXERDDm95wEVVkULgxlcUgpuT8jjpmovbtFj2xYcnzvc4
EKOgJ0FmXCM/B6QFnnbzgMzu2IThoQpL8Ud3JlMeGDRLGDvZip9AA+0RsnirURwX
+EuE7fHcDzfAA+Fv9sGosaFmxD1dUh1EJL41XrFZSYfMsZzzzlj+k9PWf9ABCE4R
6gEHuRza+rU=
=c7Ib
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2024-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
"Fix a cpufreq related performance regression on certain systems, where
the CPU would remain at the lowest frequency, degrading performance
substantially"
* tag 'sched-urgent-2024-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix frequency selection for non-invariant case
Linus reported a ~50% performance regression on single-threaded
workloads on his AMD Ryzen system, and bisected it to:
9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
When frequency invariance is not enabled, get_capacity_ref_freq(policy)
is supposed to return the current frequency and the performance margin
applied by map_util_perf(), enabling the utilization to go above the
maximum compute capacity and to select a higher frequency than the current one.
After the changes in 9c0b4bb7f630, the performance margin was applied
earlier in the path to take into account utilization clampings and
we couldn't get a utilization higher than the maximum compute capacity,
and the CPU remained 'stuck' at lower frequencies.
To fix this, we must use a frequency above the current frequency to
get a chance to select a higher OPP when the current one becomes fully used.
Apply the same margin and return a frequency 25% higher than the current
one in order to switch to the next OPP before we fully use the CPU
at the current one.
[ mingo: Clarified the changelog. ]
Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wyes Karny <wkarny@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Wyes Karny <wkarny@gmail.com>
Link: https://lore.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org
The goal is to get sched.h down to a type only header, so the main thing
happening in this patchset is splitting out various _types.h headers and
dependency fixups, as well as moving some things out of sched.h to
better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdepencencies.
Testing - it's been in -next, and fixes from pretty much all
architectures have percolated in - nothing major.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmWfBwwACgkQE6szbY3K
bnZPwBAAmuRojXaeWxi01IPIOehSGDe68vw44PR9glEMZvxdnZuPOdvE4/+245/L
bRKU2WBCjBUokUbV9msIShwRkFTZAmEMPNfPAAsFMA+VXeDYHKB+ZRdwTggNAQ+I
SG6fZgh5m0HsewCDxU8oqVHkjVq4fXn0cy+aL6xLEd9gu67GoBzX2pDieS2Kvy6j
jnyoKTxFwb+LTQgph0P4EIpq5I2umAsdLwdSR8EJ+8e9NiNvMo1pI00Lx/ntAnFZ
JftWUJcMy3TQ5u1GkyfQN9y/yThX1bZK5GvmHS9SJ2Dkacaus5d+xaKCHtRuFS1I
7C6b8PsNgRczUMumBXus44HdlNfNs1yU3lvVxFvBIPE1qC9pYRHrkWIXXIocXLLC
oxTEJ6B2G3BQZVQgLIA4fOaxMVhmvKffi/aEZLi9vN9VVosd1a6XNKI6KbyRnXFp
GSs9qDqszhn5I3GYNlDNQTc/8UsRlhPFgS6nS0By6QnvxtGi9QkU2tBRBsXvqwCy
cLoCYIhc2tvugHvld70dz26umiJ4rnmxGlobStNoigDvIKAIUt1UmIdr1so8P8eH
xehnL9ZcOX6xnANDL0AqMFFHV6I58CJynhFdUoXfVQf/DWLGX48mpi9LVNsYBzsI
CAwVOAQ0UjGrpdWmJ9ueY/ABYqg9vRjzaDEXQ+MhAYO55CLaVsg=
=3tyT
-----END PGP SIGNATURE-----
Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs
Pull header cleanups from Kent Overstreet:
"The goal is to get sched.h down to a type only header, so the main
thing happening in this patchset is splitting out various _types.h
headers and dependency fixups, as well as moving some things out of
sched.h to better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdepencencies"
* tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits)
Kill sched.h dependency on rcupdate.h
kill unnecessary thread_info.h include
Kill unnecessary kernel.h include
preempt.h: Kill dependency on list.h
rseq: Split out rseq.h from sched.h
LoongArch: signal.c: add header file to fix build error
restart_block: Trim includes
lockdep: move held_lock to lockdep_types.h
sem: Split out sem_types.h
uidgid: Split out uidgid_types.h
seccomp: Split out seccomp_types.h
refcount: Split out refcount_types.h
uapi/linux/resource.h: fix include
x86/signal: kill dependency on time.h
syscall_user_dispatch.h: split out *_types.h
mm_types_task.h: Trim dependencies
Split out irqflags_types.h
ipc: Kill bogus dependency on spinlock.h
shm: Slim down dependencies
workqueue: Split out workqueue_types.h
...
many places. The notable patch series are:
- nilfs2 folio conversion from Matthew Wilcox in "nilfs2: Folio
conversions for file paths".
- Additional nilfs2 folio conversion from Ryusuke Konishi in "nilfs2:
Folio conversions for directory paths".
- IA64 remnant removal in Heiko Carstens's "Remove unused code after
IA-64 removal".
- Arnd Bergmann has enabled the -Wmissing-prototypes warning everywhere
in "Treewide: enable -Wmissing-prototypes". This had some followup
fixes:
- Nathan Chancellor has cleaned up the hexagon build in the series
"hexagon: Fix up instances of -Wmissing-prototypes".
- Nathan also addressed some s390 warnings in "s390: A couple of
fixes for -Wmissing-prototypes".
- Arnd Bergmann addresses the same warnings for MIPS in his series
"mips: address -Wmissing-prototypes warnings".
- Baoquan He has made kexec_file operate in a top-down-fitting manner
similar to kexec_load in the series "kexec_file: Load kernel at top of
system RAM if required"
- Baoquan He has also added the self-explanatory "kexec_file: print out
debugging message if required".
- Some checkstack maintenance work from Tiezhu Yang in the series
"Modify some code about checkstack".
- Douglas Anderson has disentangled the watchdog code's logging when
multiple reports are occurring simultaneously. The series is "watchdog:
Better handling of concurrent lockups".
- Yuntao Wang has contributed some maintenance work on the crash code in
"crash: Some cleanups and fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZ2R6AAKCRDdBJ7gKXxA
juCVAP4t76qUISDOSKugB/Dn5E4Nt9wvPY9PcufnmD+xoPsgkQD+JVl4+jd9+gAV
vl6wkJDiJO5JZ3FVtBtC3DFA/xHtVgk=
=kQw+
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Quite a lot of kexec work this time around. Many singleton patches in
many places. The notable patch series are:
- nilfs2 folio conversion from Matthew Wilcox in 'nilfs2: Folio
conversions for file paths'.
- Additional nilfs2 folio conversion from Ryusuke Konishi in 'nilfs2:
Folio conversions for directory paths'.
- IA64 remnant removal in Heiko Carstens's 'Remove unused code after
IA-64 removal'.
- Arnd Bergmann has enabled the -Wmissing-prototypes warning
everywhere in 'Treewide: enable -Wmissing-prototypes'. This had
some followup fixes:
- Nathan Chancellor has cleaned up the hexagon build in the series
'hexagon: Fix up instances of -Wmissing-prototypes'.
- Nathan also addressed some s390 warnings in 's390: A couple of
fixes for -Wmissing-prototypes'.
- Arnd Bergmann addresses the same warnings for MIPS in his series
'mips: address -Wmissing-prototypes warnings'.
- Baoquan He has made kexec_file operate in a top-down-fitting manner
similar to kexec_load in the series 'kexec_file: Load kernel at top
of system RAM if required'
- Baoquan He has also added the self-explanatory 'kexec_file: print
out debugging message if required'.
- Some checkstack maintenance work from Tiezhu Yang in the series
'Modify some code about checkstack'.
- Douglas Anderson has disentangled the watchdog code's logging when
multiple reports are occurring simultaneously. The series is
'watchdog: Better handling of concurrent lockups'.
- Yuntao Wang has contributed some maintenance work on the crash code
in 'crash: Some cleanups and fixes'"
* tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (157 commits)
crash_core: fix and simplify the logic of crash_exclude_mem_range()
x86/crash: use SZ_1M macro instead of hardcoded value
x86/crash: remove the unused image parameter from prepare_elf_headers()
kdump: remove redundant DEFAULT_CRASH_KERNEL_LOW_SIZE
scripts/decode_stacktrace.sh: strip unexpected CR from lines
watchdog: if panicking and we dumped everything, don't re-enable dumping
watchdog/hardlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
watchdog/softlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
watchdog/hardlockup: adopt softlockup logic avoiding double-dumps
kexec_core: fix the assignment to kimage->control_page
x86/kexec: fix incorrect end address passed to kernel_ident_mapping_init()
lib/trace_readwrite.c:: replace asm-generic/io with linux/io
nilfs2: cpfile: fix some kernel-doc warnings
stacktrace: fix kernel-doc typo
scripts/checkstack.pl: fix no space expression between sp and offset
x86/kexec: fix incorrect argument passed to kexec_dprintk()
x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs
nilfs2: add missing set_freezable() for freezable kthread
kernel: relay: remove relay_file_splice_read dead code, doesn't work
docs: submit-checklist: remove all of "make namespacecheck"
...
When a CPU is taken offline, the contribution of its cfs_rqs to task_groups'
load may remain and will negatively impact the calculation of the share of
the online CPUs.
To fix this bug, clear the contribution of an offlining CPU to task groups'
load and skip its contribution while it is inactive.
Here's the reproducer of the anomaly, by Imran Khan:
"So far I have encountered only one rather lengthy way of reproducing this issue,
which is as follows:
1. Take a KVM guest (booted with 4 CPUs and can be scaled up to 124 CPUs) and
create 2 custom cgroups: /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/
cpu/test_group_2
2. Assign a CPU intensive workload to each of these cgroups and start the
workload.
For my tests I am using following app:
int main(int argc, char *argv[])
{
unsigned long count, i, val;
if (argc != 2) {
printf("usage: ./a.out <number of random nums to generate> \n");
return 0;
}
count = strtoul(argv[1], NULL, 10);
printf("Generating %lu random numbers \n", count);
for (i = 0; i < count; i++) {
val = rand();
val = val % 2;
//usleep(1);
}
printf("Generated %lu random numbers \n", count);
return 0;
}
Also since the system is booted with 4 CPUs, in order to completely load the
system I am also launching 4 instances of same test app under:
/sys/fs/cgroup/cpu/
3. We can see that both of the cgroups get similar CPU time:
# systemd-cgtop --depth 1
Path Tasks %CPU Memory Input/s Output/s
/ 659 - 5.5G - -
/system.slice - - 5.7G - -
/test_group_1 4 - - - -
/test_group_2 3 - - - -
/user.slice 31 - 56.5M - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.6 5.5G - -
/test_group_2 3 65.7 - - -
/user.slice 29 55.1 48.0M - -
/test_group_1 4 47.3 - - -
/system.slice - 2.2 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.8 5.5G - -
/test_group_1 4 62.9 - - -
/user.slice 28 44.9 54.2M - -
/test_group_2 3 44.7 - - -
/system.slice - 0.9 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.4 5.5G - -
/test_group_2 3 58.8 - - -
/test_group_1 4 51.9 - - -
/user.slice 30 39.3 59.6M - -
/system.slice - 1.9 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 394.7 5.5G - -
/test_group_1 4 60.9 - - -
/test_group_2 3 57.9 - - -
/user.slice 28 43.5 36.9M - -
/system.slice - 3.0 5.7G - -
Path Tasks %CPU Memory Input/s Output/s
/ 659 395.0 5.5G - -
/test_group_1 4 66.8 - - -
/test_group_2 3 56.3 - - -
/user.slice 29 43.1 51.8M - -
/system.slice - 0.7 5.7G - -
4. Now move systemd-udevd to one of these test groups, say test_group_1, and
perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
host side.
5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu
and one instance of CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and
/sys/fs/cgroup/test_group_2.
It can be seen that test_group_1 (the one where systemd-udevd was moved) is getting
much less CPU time than the test_group_2, even though at this point of time both of
these groups have only CPU hogger running:
# systemd-cgtop --depth 1
Path Tasks %CPU Memory Input/s Output/s
/ 1219 - 5.4G - -
/system.slice - - 5.6G - -
/test_group_1 4 - - - -
/test_group_2 3 - - - -
/user.slice 26 - 91.3M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 394.3 5.4G - -
/test_group_2 3 82.7 - - -
/test_group_1 4 14.3 - - -
/system.slice - 0.8 5.6G - -
/user.slice 26 0.4 91.2M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 394.6 5.4G - -
/test_group_2 3 67.4 - - -
/system.slice - 24.6 5.6G - -
/test_group_1 4 12.5 - - -
/user.slice 26 0.4 91.2M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.2 5.4G - -
/test_group_2 3 60.9 - - -
/system.slice - 27.9 5.6G - -
/test_group_1 4 12.2 - - -
/user.slice 26 0.4 91.2M - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.2 5.4G - -
/test_group_2 3 69.4 - - -
/test_group_1 4 13.9 - - -
/user.slice 28 1.6 92.0M - -
/system.slice - 1.0 5.6G - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.6 5.4G - -
/test_group_2 3 59.3 - - -
/test_group_1 4 14.1 - - -
/user.slice 28 1.3 92.2M - -
/system.slice - 0.7 5.6G - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.5 5.4G - -
/test_group_2 3 67.2 - - -
/test_group_1 4 11.5 - - -
/user.slice 28 1.3 92.5M - -
/system.slice - 0.6 5.6G - -
Path Tasks %CPU Memory Input/s Output/s
/ 1221 395.1 5.4G - -
/test_group_2 3 76.8 - - -
/test_group_1 4 12.9 - - -
/user.slice 28 1.3 92.8M - -
/system.slice - 1.2 5.6G - -
From sched_debug data it can be seen that in bad case the load.weight of per-CPU
sched entities corresponding to test_group_1 has reduced significantly and
also load_avg of test_group_1 remains much higher than that of test_group_2,
even though systemd-udevd stopped running long time back and at this point of
time both cgroups just have the CPU hogger app as running entity."
[ mingo: Added details from the original discussion, plus minor edits to the patch. ]
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Imran Khan <imran.f.khan@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lore.kernel.org/r/20231223111545.62135-1-vincent.guittot@linaro.org
We're trying to get sched.h down to more or less just types only, not
code - rseq can live in its own header.
This helps us kill the dependency on preempt.h in sched.h.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This variable became unused in:
5e963f2bd465 ("sched/fair: Commit to EEVDF")
Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/202312141319+0800-wangjinchao@xfusion.com
Running N CPU-bound tasks on an N CPUs platform:
- with asymmetric CPU capacity
- not being a DynamIq system (i.e. having a PKG level sched domain
without the SD_SHARE_PKG_RESOURCES flag set)
.. might result in a task placement where two tasks run on a big CPU
and none on a little CPU. This placement could be more optimal by
using all CPUs.
Testing platform:
Juno-r2:
- 2 big CPUs (1-2), maximum capacity of 1024
- 4 little CPUs (0,3-5), maximum capacity of 383
Testing workload ([1]):
Spawn 6 CPU-bound tasks. During the first 100ms (step 1), each tasks
is affine to a CPU, except for:
- one little CPU which is left idle.
- one big CPU which has 2 tasks affine.
After the 100ms (step 2), remove the cpumask affinity.
Behavior before the patch:
During step 2, the load balancer running from the idle CPU tags sched
domains as:
- little CPUs: 'group_has_spare'. Cf. group_has_capacity() and
group_is_overloaded(), 3 CPU-bound tasks run on a 4 CPUs
sched-domain, and the idle CPU provides enough spare capacity
regarding the imbalance_pct
- big CPUs: 'group_overloaded'. Indeed, 3 tasks run on a 2 CPUs
sched-domain, so the following path is used:
group_is_overloaded()
\-if (sgs->sum_nr_running <= sgs->group_weight) return true;
The following path which would change the migration type to
'migrate_task' is not taken:
calculate_imbalance()
\-if (env->idle != CPU_NOT_IDLE && env->imbalance == 0)
as the local group has some spare capacity, so the imbalance
is not 0.
The migration type requested is 'migrate_util' and the busiest
runqueue is the big CPU's runqueue having 2 tasks (each having a
utilization of 512). The idle little CPU cannot pull one of these
task as its capacity is too small for the task. The following path
is used:
detach_tasks()
\-case migrate_util:
\-if (util > env->imbalance) goto next;
After the patch:
As the number of failed balancing attempts grows (with
'nr_balance_failed'), progressively make it easier to migrate
a big task to the idling little CPU. A similar mechanism is
used for the 'migrate_load' migration type.
Improvement:
Running the testing workload [1] with the step 2 representing
a ~10s load for a big CPU:
Before patch: ~19.3s
After patch: ~18s (-6.7%)
Similar issue reported at:
https://lore.kernel.org/lkml/20230716014125.139577-1-qyousef@layalina.io/
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Qais Yousef <qyousef@layalina.io>
Link: https://lore.kernel.org/r/20231206090043.634697-1-pierre.gondois@arm.com
With UTIL_EST_FASTUP now being permanent, we can take advantage of the
fact that the ewma jumps directly to a higher utilization at dequeue to
simplify util_est and remove the enqueued field.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/20231201161652.1241695-3-vincent.guittot@linaro.org