IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
In function sched_core_update_cookie(), a task will enqueue into the
core tree only when it enqueued before, that is, if an uncookied task
is cookied, it will not enqueue into the core tree until it enqueue
again, which will result in unnecessary force idle.
Here follows the scenario:
CPU x and CPU y are a pair of SMT siblings.
1. Start task a running on CPU x without sleeping, and task b and
task c running on CPU y without sleeping.
2. We create a cookie and share it to task a and task b, and then
we create another cookie and share it to task c.
3. Simpling core_forceidle_sum of task a and b from /proc/PID/sched
And we will find out that core_forceidle_sum of task a takes 30%
time of the sampling period, which shouldn't happen as task a and b
have the same cookie.
Then we migrate task a to CPU x', migrate task b and c to CPU y', where
CPU x' and CPU y' are a pair of SMT siblings, and sampling again, we
will found out that core_forceidle_sum of task a and b are almost zero.
To solve this problem, we enqueue the task into the core tree if it's
on rq.
Fixes: 6e33cad0af49("sched: Trivial core scheduling cookie management")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1656403045-100840-2-git-send-email-CruzZhao@linux.alibaba.com
dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having
called sched_update_tick_dependency() preventing it from re-enabling the
tick on systems that no longer have pending SCHED_RT tasks but have
multiple runnable SCHED_OTHER tasks:
dequeue_task_rt()
dequeue_rt_entity()
dequeue_rt_stack()
dequeue_top_rt_rq()
sub_nr_running() // decrements rq->nr_running
sched_update_tick_dependency()
sched_can_stop_tick() // checks rq->rt.rt_nr_running,
...
__dequeue_rt_entity()
dec_rt_tasks() // decrements rq->rt.rt_nr_running
...
Every other scheduler class performs the operation in the opposite
order, and sched_update_tick_dependency() expects the values to be
updated as such. So avoid the misbehaviour by inverting the order in
which the above operations are performed in the RT scheduler.
Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model")
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):
INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
Workqueue: writeback wb_workfn (flush-7:0)
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
+(rt_mutex_slowlock.constprop.0+0xac/0x174)
[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
Exception stack(0xc10e3fb0 to 0xc10e3ff8)
3fa0: 00000000 00000000 00000000 00000000
3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
INFO: task tar:2083 blocked for more than 491 seconds.
Not tainted 5.15.49-rt46 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000
[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc
Here the kworker is waiting on msdos_sb_info::s_lock which is held by
tar which is in turn waiting for a buffer which is locked waiting to be
flushed, but this operation is plugged in the kworker.
The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.
It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag. But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.
Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
("sched/core: Rework the __schedule() preempt argument").
As described in [1]:
The schedule process must distinguish between blocking on a regular
sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
and rwlock):
- rwsem and mutex must flush block requests (blk_schedule_flush_plug())
even if blocked on a lock. This can not deadlock because this also
happens for non-RT.
There should be a warning if the scheduling point is within a RCU read
section.
- spinlock and rwlock must not flush block requests. This will deadlock
if the callback attempts to acquire a lock which is already acquired.
Similarly to being preempted, there should be no warning if the
scheduling point is within a RCU read section.
and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches
Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
The capacity of the CPU available for CFS tasks can be reduced because of
other activities running on the latter. In such case, it's worth trying to
move CFS tasks on a CPU with more available capacity.
The rework of the load balance has filtered the case when the CPU is
classified to be fully busy but its capacity is reduced.
Check if CPU's capacity is reduced while gathering load balance statistic
and classify it group_misfit_task instead of group_fully_busy so we can
try to move the load on another CPU.
Reported-by: David Chen <david.chen@nutanix.com>
Reported-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: David Chen <david.chen@nutanix.com>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Link: https://lkml.kernel.org/r/20220708154401.21411-1-vincent.guittot@linaro.org
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in
set_nr_{and_not,if}_polling. x86 cmpxchg returns success in ZF flag,
so this change saves a compare after cmpxchg.
The definition of cmpxchg based fetch_or was changed in the
same way as atomic_fetch_##op definitions were changed
in e6790e4b5d5e97dc287f3496dd2cf2dbabdfdb35.
Also declare these two functions as inline to ensure inlining. In the
case of set_nr_and_not_polling, the compiler (gcc) tries to outsmart
itself by constructing the boolean return value with logic operations
on the fetched value, and these extra operations enlarge the function
over the inlining threshold value.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220629151552.6015-1-ubizjak@gmail.com
4feee7d1260 previously added per-task forced idle accounting. This patch
extends this to also include cgroups.
rstat is used for cgroup accounting, except for the root, which uses
kcpustat in order to bypass the need for doing an rstat flush when
reading root stats.
Only cgroup v2 is supported. Similar to the task accounting, the cgroup
accounting requires that schedstats is enabled.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lkml.kernel.org/r/20220629211426.3329954-1-joshdon@google.com
find_energy_efficient_cpu() integrates a margin to protect tasks from
bouncing back and forth from a CPU to another. This margin is set as being
6% of the total current energy estimated on the system. This however does
not work for two reasons:
1. The energy estimation is not a good absolute value:
compute_energy() used in feec() is a good estimation for task placement as
it allows to compare the energy with and without a task. The computed
delta will give a good overview of the cost for a certain task placement.
It, however, doesn't work as an absolute estimation for the total energy
of the system. First it adds the contribution to idle CPUs into the
energy, second it mixes util_avg with util_est values. util_avg contains
the near history for a CPU usage, it doesn't tell at all what the current
utilization is. A system that has been quite busy in the near past will
hold a very high energy and then a high margin preventing any task
migration to a lower capacity CPU, wasting energy. It even creates a
negative feedback loop: by holding the tasks on a less efficient CPU, the
margin contributes in keeping the energy high.
2. The margin handicaps small tasks:
On a system where the workload is composed mostly of small tasks (which is
often the case on Android), the overall energy will be high enough to
create a margin none of those tasks can cross. On a Pixel4, a small
utilization of 5% on all the CPUs creates a global estimated energy of 140
joules, as per the Energy Model declaration of that same device. This
means, after applying the 6% margin that any migration must save more than
8 joules to happen. No task with a utilization lower than 40 would then be
able to migrate away from the biggest CPU of the system.
The 6% of the overall system energy was brought by the following patch:
(eb92692b2544 sched/fair: Speed-up energy-aware wake-ups)
It was previously 6% of the prev_cpu energy. Also, the following one
made this margin value conditional on the clusters where the task fits:
(8d4c97c105ca sched/fair: Only compute base_energy_pd if necessary)
We could simply revert that margin change to what it was, but the original
version didn't have strong grounds neither and as demonstrated in (1.) the
estimated energy isn't a good absolute value. Instead, removing it
completely. It is indeed, made possible by recent changes that improved
energy estimation comparison fairness (sched/fair: Remove task_util from
effective utilization in feec()) (PM: EM: Increase energy calculation
precision) and task utilization stabilization (sched/fair: Decay task
util_avg during migration)
Without a margin, we could have feared bouncing between CPUs. But running
LISA's eas_behaviour test coverage on three different platforms (Hikey960,
RB-5 and DB-845) showed no issue.
Removing the energy margin enables more energy-optimized placements for a
more energy efficient system.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-8-vdonnefort@google.com
The energy estimation in find_energy_efficient_cpu() (feec()) relies on
the computation of the effective utilization for each CPU of a perf domain
(PD). This effective utilization is then used as an estimation of the busy
time for this pd. The function effective_cpu_util() which gives this value,
scales the utilization relative to IRQ pressure on the CPU to take into
account that the IRQ time is hidden from the task clock. The IRQ scaling is
as follow:
effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util
Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of
the CPU and irq the IRQ avg time.
If now we take as an example a task placement which doesn't raise the OPP
on the candidate CPU, we can write the energy delta as:
delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) -
effective_cpu_util(cpu_util))
= OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util
We end-up with an energy delta depending on the IRQ avg time, which is a
problem: first the time spent on IRQs by a CPU has no effect on the
additional energy that would be consumed by a task. Second, we don't want
to favour a CPU with a higher IRQ avg time value.
Nonetheless, we need to take the IRQ avg time into account. If a task
placement raises the PD's frequency, it will increase the energy cost for
the entire time where the CPU is busy. A solution is to only use
effective_cpu_util() with the CPU contribution part. The task contribution
is added separately and scaled according to prev_cpu's IRQ time.
No change for the FREQUENCY_UTIL component of the energy estimation. We
still want to get the actual frequency that would be selected after the
task placement.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-7-vdonnefort@google.com
The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays
invariant after Energy Model creation, i.e. it is not updated after
CPU hotplug operations.
That's why the PD mask is used in conjunction with the cpu_online_mask
(or Sched Domain cpumask). Thereby the cpu_online_mask is fetched
multiple times (in compute_energy()) during a run-queue selection
for a task.
cpu_online_mask may change during this time which can lead to wrong
energy calculations.
To be able to avoid this, use the select_rq_mask per-cpu cpumask to
create a cpumask out of PD cpumask and cpu_online_mask and pass it
through the function calls of the EAS run-queue selection path.
The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection
(find_energy_efficient_cpu()) is now ANDed not only with the SD mask
but also with the cpu_online_mask. This is fine since this cpumask
has to be in syc with the one used for energy computation
(compute_energy()).
An exclusive cpuset setup with at least one asymmetric CPU capacity
island (hence the additional AND with the SD cpumask) is the obvious
exception here.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-6-vdonnefort@google.com
On 21/06/2022 11:04, Vincent Donnefort wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered
that this patch doesn't build anymore (on tip sched/core or linux-next)
because of commit f5b2eeb499910 ("sched/fair: Consider CPU affinity when
allowing NUMA imbalance in find_idlest_group()").
New version of [PATCH v11 4/7] sched/fair: Rename select_idle_mask to
select_rq_mask below.
-- >8 --
Decouple the name of the per-cpu cpumask select_idle_mask from its usage
in select_idle_[cpu/capacity]() of the CFS run-queue selection
(select_task_rq_fair()).
This is to support the reuse of this cpumask in the Energy Aware
Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue
selection.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com
effective_cpu_util() already has a `int cpu' parameter which allows to
retrieve the CPU capacity scale factor (or maximum CPU capacity) inside
this function via an arch_scale_cpu_capacity(cpu).
A lot of code calling effective_cpu_util() (or the shim
sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call
arch_scale_cpu_capacity() already.
But not having to pass it into effective_cpu_util() will make the EAS
wake-up code easier, especially when the maximum CPU capacity reduced
by the thermal pressure is passed through the EAS wake-up functions.
Due to the asymmetric CPU capacity support of arm/arm64 architectures,
arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via
per_cpu(cpu_scale, cpu) on such a system.
On all other architectures it is a a compile-time constant
(SCHED_CAPACITY_SCALE).
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com
Before being migrated to a new CPU, a task sees its PELT values
synchronized with rq last_update_time. Once done, that same task will also
have its sched_avg last_update_time reset. This means the time between
the migration and the last clock update will not be accounted for in
util_avg and a discontinuity will appear. This issue is amplified by the
PELT clock scaling. It takes currently one tick after the CPU being idle
to let clock_pelt catching up clock_task.
This is especially problematic for asymmetric CPU capacity systems which
need stable util_avg signals for task placement and energy estimation.
Ideally, this problem would be solved by updating the runqueue clocks
before the migration. But that would require taking the runqueue lock
which is quite expensive [1]. Instead estimate the missing time and update
the task util_avg with that value.
To that end, we need sched_clock_cpu() but it is a costly function. Limit
the usage to the case where the source CPU is idle as we know this is when
the clock is having the biggest risk of being outdated.
See comment in migrate_se_pelt_lag() for more details about how the PELT
value is estimated. Notice though this estimation doesn't take into account
IRQ and Paravirt time.
[1] https://lkml.kernel.org/r/20190709115759.10451-1-chris.redpath@arm.com
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-3-vdonnefort@google.com
Introducing macro helpers u64_u32_{store,load}() to factorize lockless
accesses to u64 variables for 32-bits architectures.
Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To
accommodate the later where the copy lies outside of the structure
(cfs_rq.last_udpate_time_copy instead of sched_avg.last_update_time_copy),
use the _copy() version of those helpers.
Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and
therefore, have a small penalty for 32-bits machines in set_task_rq_fair()
and init_cfs_rq().
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://lkml.kernel.org/r/20220621090414.433602-2-vdonnefort@google.com
[Problem Statement]
select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded.
The following histogram is the time spent in select_idle_cpu(),
when running 224 instances of netperf on a system with 112 CPUs
per LLC domain:
@usecs:
[0] 533 | |
[1] 5495 | |
[2, 4) 12008 | |
[4, 8) 239252 | |
[8, 16) 4041924 |@@@@@@@@@@@@@@ |
[16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 4507667 |@@@@@@@@@@@@@@@ |
[512, 1K) 2600472 |@@@@@@@@@ |
[1K, 2K) 927912 |@@@ |
[2K, 4K) 218720 | |
[4K, 8K) 98161 | |
[8K, 16K) 37722 | |
[16K, 32K) 6715 | |
[32K, 64K) 477 | |
[64K, 128K) 7 | |
netperf latency usecs:
=======
case load Lat_99th std%
TCP_RR thread-224 257.39 ( 0.21)
The time spent in select_idle_cpu() is visible to netperf and might have a negative
impact.
[Symptom analysis]
The patch [1] from Mel Gorman has been applied to track the efficiency
of select_idle_sibling. Copy the indicators here:
SIS Search Efficiency(se_eff%):
A ratio expressed as a percentage of runqueues scanned versus
idle CPUs found. A 100% efficiency indicates that the target,
prev or recent CPU of a task was idle at wakeup. The lower the
efficiency, the more runqueues were scanned before an idle CPU
was found.
SIS Domain Search Efficiency(dom_eff%):
Similar, except only for the slower SIS
patch.
SIS Fast Success Rate(fast_rate%):
Percentage of SIS that used target, prev or
recent CPUs.
SIS Success rate(success_rate%):
Percentage of scans that found an idle CPU.
The test is based on Aubrey's schedtests tool, including netperf, hackbench,
schbench and tbench.
Test on vanilla kernel:
schedstat_parse.py -f netperf_vanilla.log
case load se_eff% dom_eff% fast_rate% success_rate%
TCP_RR 28 threads 99.978 18.535 99.995 100.000
TCP_RR 56 threads 99.397 5.671 99.964 100.000
TCP_RR 84 threads 21.721 6.818 73.632 100.000
TCP_RR 112 threads 12.500 5.533 59.000 100.000
TCP_RR 140 threads 8.524 4.535 49.020 100.000
TCP_RR 168 threads 6.438 3.945 40.309 99.999
TCP_RR 196 threads 5.397 3.718 32.320 99.982
TCP_RR 224 threads 4.874 3.661 25.775 99.767
UDP_RR 28 threads 99.988 17.704 99.997 100.000
UDP_RR 56 threads 99.528 5.977 99.970 100.000
UDP_RR 84 threads 24.219 6.992 76.479 100.000
UDP_RR 112 threads 13.907 5.706 62.538 100.000
UDP_RR 140 threads 9.408 4.699 52.519 100.000
UDP_RR 168 threads 7.095 4.077 44.352 100.000
UDP_RR 196 threads 5.757 3.775 35.764 99.991
UDP_RR 224 threads 5.124 3.704 28.748 99.860
schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case load se_eff% dom_eff% fast_rate% success_rate%
normal 1 mthread 99.152 6.400 99.941 100.000
normal 2 mthreads 97.844 4.003 99.908 100.000
normal 3 mthreads 96.395 2.118 99.917 99.998
normal 4 mthreads 55.288 1.451 98.615 99.804
normal 5 mthreads 7.004 1.870 45.597 61.036
normal 6 mthreads 3.354 1.346 20.777 34.230
normal 7 mthreads 2.183 1.028 11.257 21.055
normal 8 mthreads 1.653 0.825 7.849 15.549
schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case load se_eff% dom_eff% fast_rate% success_rate%
process-pipe 1 group 99.991 7.692 99.999 100.000
process-pipe 2 groups 99.934 4.615 99.997 100.000
process-pipe 3 groups 99.597 3.198 99.987 100.000
process-pipe 4 groups 98.378 2.464 99.958 100.000
process-pipe 5 groups 27.474 3.653 89.811 99.800
process-pipe 6 groups 20.201 4.098 82.763 99.570
process-pipe 7 groups 16.423 4.156 77.398 99.316
process-pipe 8 groups 13.165 3.920 72.232 98.828
process-sockets 1 group 99.977 5.882 99.999 100.000
process-sockets 2 groups 99.927 5.505 99.996 100.000
process-sockets 3 groups 99.397 3.250 99.980 100.000
process-sockets 4 groups 79.680 4.258 98.864 99.998
process-sockets 5 groups 7.673 2.503 63.659 92.115
process-sockets 6 groups 4.642 1.584 58.946 88.048
process-sockets 7 groups 3.493 1.379 49.816 81.164
process-sockets 8 groups 3.015 1.407 40.845 75.500
threads-pipe 1 group 99.997 0.000 100.000 100.000
threads-pipe 2 groups 99.894 2.932 99.997 100.000
threads-pipe 3 groups 99.611 4.117 99.983 100.000
threads-pipe 4 groups 97.703 2.624 99.937 100.000
threads-pipe 5 groups 22.919 3.623 87.150 99.764
threads-pipe 6 groups 18.016 4.038 80.491 99.557
threads-pipe 7 groups 14.663 3.991 75.239 99.247
threads-pipe 8 groups 12.242 3.808 70.651 98.644
threads-sockets 1 group 99.990 6.667 99.999 100.000
threads-sockets 2 groups 99.940 5.114 99.997 100.000
threads-sockets 3 groups 99.469 4.115 99.977 100.000
threads-sockets 4 groups 87.528 4.038 99.400 100.000
threads-sockets 5 groups 6.942 2.398 59.244 88.337
threads-sockets 6 groups 4.359 1.954 49.448 87.860
threads-sockets 7 groups 2.845 1.345 41.198 77.102
threads-sockets 8 groups 2.871 1.404 38.512 74.312
schedstat_parse.py -f tbench_vanilla.log
case load se_eff% dom_eff% fast_rate% success_rate%
loopback 28 threads 99.976 18.369 99.995 100.000
loopback 56 threads 99.222 7.799 99.934 100.000
loopback 84 threads 19.723 6.819 70.215 100.000
loopback 112 threads 11.283 5.371 55.371 99.999
loopback 140 threads 0.000 0.000 0.000 0.000
loopback 168 threads 0.000 0.000 0.000 0.000
loopback 196 threads 0.000 0.000 0.000 0.000
loopback 224 threads 0.000 0.000 0.000 0.000
According to the test above, if the system becomes busy, the
SIS Search Efficiency(se_eff%) drops significantly. Although some
benchmarks would finally find an idle CPU(success_rate% = 100%), it is
doubtful whether it is worth it to search the whole LLC domain.
[Proposal]
It would be ideal to have a crystal ball to answer this question:
How many CPUs must a wakeup path walk down, before it can find an idle
CPU? Many potential metrics could be used to predict the number.
One candidate is the sum of util_avg in this LLC domain. The benefit
of choosing util_avg is that it is a metric of accumulated historic
activity, which seems to be smoother than instantaneous metrics
(such as rq->nr_running). Besides, choosing the sum of util_avg
would help predict the load of the LLC domain more precisely, because
SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
time.
In summary, the lower the util_avg is, the more select_idle_cpu()
should scan for idle CPU, and vice versa. When the sum of util_avg
in this LLC domain hits 85% or above, the scan stops. The reason to
choose 85% as the threshold is that this is the imbalance_pct(117)
when a LLC sched group is overloaded.
Introduce the quadratic function:
y = SCHED_CAPACITY_SCALE - p * x^2
and y'= y / SCHED_CAPACITY_SCALE
x is the ratio of sum_util compared to the CPU capacity:
x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
y' is the ratio of CPUs to be scanned in the LLC domain,
and the number of CPUs to scan is calculated by:
nr_scan = llc_weight * y'
Choosing quadratic function is because:
[1] Compared to the linear function, it scans more aggressively when the
sum_util is low.
[2] Compared to the exponential function, it is easier to calculate.
[3] It seems that there is no accurate mapping between the sum of util_avg
and the number of CPUs to be scanned. Use heuristic scan for now.
For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
scan_nr 112 111 108 102 93 81 65 47 25 1 0 ...
For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
scan_nr 16 15 15 14 13 11 9 6 3 0 0 ...
Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance.
As mentioned by Abel, on a platform with 112 CPUs per LLC, the
sum_util calculated by periodic load balance after 112 ms would
decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
in reflecting the latest utilization. But it is a trade-off.
Checking the util_avg in newidle load balance would be more frequent,
but it brings overhead - multiple CPUs write/read the per-LLC shared
variable and introduces cache contention. Tim also mentioned that,
it is allowed to be non-optimal in terms of scheduling for the
short-term variations, but if there is a long-term trend in the load
behavior, the scheduler can adjust for that.
When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan
calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
Mel suggested, SIS_UTIL should be enabled by default.
This patch is based on the util_avg, which is very sensitive to the
CPU frequency invariance. There is an issue that, when the max frequency
has been clamp, the util_avg would decay insanely fast when
the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo
in frequency invariance") could be used to mitigate this symptom, by adjusting
the arch_max_freq_ratio when turbo is disabled. But this issue is still
not thoroughly fixed, because the current code is unaware of the user-specified
max CPU frequency.
[Test result]
netperf and tbench were launched with 25% 50% 75% 100% 125% 150%
175% 200% of CPU number respectively. Hackbench and schbench were launched
by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.
The following is the benchmark result comparison between
baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
indicates better performance.
Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
netperf.throughput
=======
case load baseline(std%) compare%( std%)
TCP_RR 28 threads 1.00 ( 0.34) -0.16 ( 0.40)
TCP_RR 56 threads 1.00 ( 0.19) -0.02 ( 0.20)
TCP_RR 84 threads 1.00 ( 0.39) -0.47 ( 0.40)
TCP_RR 112 threads 1.00 ( 0.21) -0.66 ( 0.22)
TCP_RR 140 threads 1.00 ( 0.19) -0.69 ( 0.19)
TCP_RR 168 threads 1.00 ( 0.18) -0.48 ( 0.18)
TCP_RR 196 threads 1.00 ( 0.16) +194.70 ( 16.43)
TCP_RR 224 threads 1.00 ( 0.16) +197.30 ( 7.85)
UDP_RR 28 threads 1.00 ( 0.37) +0.35 ( 0.33)
UDP_RR 56 threads 1.00 ( 11.18) -0.32 ( 0.21)
UDP_RR 84 threads 1.00 ( 1.46) -0.98 ( 0.32)
UDP_RR 112 threads 1.00 ( 28.85) -2.48 ( 19.61)
UDP_RR 140 threads 1.00 ( 0.70) -0.71 ( 14.04)
UDP_RR 168 threads 1.00 ( 14.33) -0.26 ( 11.16)
UDP_RR 196 threads 1.00 ( 12.92) +186.92 ( 20.93)
UDP_RR 224 threads 1.00 ( 11.74) +196.79 ( 18.62)
Take the 224 threads as an example, the SIS search metrics changes are
illustrated below:
vanilla patched
4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg
38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg
128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg
5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg
There is -87.9% less CPU scans after patched, which indicates lower overhead.
Besides, with this patch applied, there is -13% less rq lock contention
in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement - Because this patch allows
the waking task to remain on the previous CPU, rather than grabbing other CPUs'
lock.
Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
hackbench.throughput
=========
case load baseline(std%) compare%( std%)
process-pipe 1 group 1.00 ( 1.29) +0.57 ( 0.47)
process-pipe 2 groups 1.00 ( 0.27) +0.77 ( 0.81)
process-pipe 4 groups 1.00 ( 0.26) +1.17 ( 0.02)
process-pipe 8 groups 1.00 ( 0.15) -4.79 ( 0.02)
process-sockets 1 group 1.00 ( 0.63) -0.92 ( 0.13)
process-sockets 2 groups 1.00 ( 0.03) -0.83 ( 0.14)
process-sockets 4 groups 1.00 ( 0.40) +5.20 ( 0.26)
process-sockets 8 groups 1.00 ( 0.04) +3.52 ( 0.03)
threads-pipe 1 group 1.00 ( 1.28) +0.07 ( 0.14)
threads-pipe 2 groups 1.00 ( 0.22) -0.49 ( 0.74)
threads-pipe 4 groups 1.00 ( 0.05) +1.88 ( 0.13)
threads-pipe 8 groups 1.00 ( 0.09) -4.90 ( 0.06)
threads-sockets 1 group 1.00 ( 0.25) -0.70 ( 0.53)
threads-sockets 2 groups 1.00 ( 0.10) -0.63 ( 0.26)
threads-sockets 4 groups 1.00 ( 0.19) +11.92 ( 0.24)
threads-sockets 8 groups 1.00 ( 0.08) +4.31 ( 0.11)
Each tbench test is a:
tbench -t 100 $job 127.0.0.1
tbench.throughput
======
case load baseline(std%) compare%( std%)
loopback 28 threads 1.00 ( 0.06) -0.14 ( 0.09)
loopback 56 threads 1.00 ( 0.03) -0.04 ( 0.17)
loopback 84 threads 1.00 ( 0.05) +0.36 ( 0.13)
loopback 112 threads 1.00 ( 0.03) +0.51 ( 0.03)
loopback 140 threads 1.00 ( 0.02) -1.67 ( 0.19)
loopback 168 threads 1.00 ( 0.38) +1.27 ( 0.27)
loopback 196 threads 1.00 ( 0.11) +1.34 ( 0.17)
loopback 224 threads 1.00 ( 0.11) +1.67 ( 0.22)
Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000
schbench.latency_90%_us
========
case load baseline(std%) compare%( std%)
normal 1 mthread 1.00 ( 31.22) -7.36 ( 20.25)*
normal 2 mthreads 1.00 ( 2.45) -0.48 ( 1.79)
normal 4 mthreads 1.00 ( 1.69) +0.45 ( 0.64)
normal 8 mthreads 1.00 ( 5.47) +9.81 ( 14.28)
*Consider the Standard Deviation, this -7.36% regression might not be valid.
Also, a OLTP workload with a commercial RDBMS has been tested, and there
is no significant change.
There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
util_avg 1024 1024 1024 1024 1024 1024 1024 0
Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%,
select_idle_cpu() will not scan, thus CPU7 is undetected during scan.
But according to Mel, it is unlikely the CPU7 will be idle all the time
because CPU7 could pull some tasks via CPU_NEWLY_IDLE.
lkp(kernel test robot) has reported a regression on stress-ng.sock on a
very busy system. According to the sched_debug statistics, it might be caused
by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this
might introduce more context switch, especially involuntary preemption, which
impacts a busy stress-ng. This regression has shown that, not all benchmarks
in every scenario benefit from idle CPU scan limit, and it needs further
investigation.
Besides, there is slight regression in hackbench's 16 groups case when the
LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively
in an LLC domain with 16 CPUs. Because the cost to search for an idle one
among 16 CPUs is negligible. The current patch aims to propose a generic
solution and only considers the util_avg. Something like the below could
be applied on top of the current patch to fulfill the requirement:
if (llc_weight <= 16)
nr_scan = nr_scan * 32 / llc_weight;
For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large.
The smaller the CPU number this LLC domain has, the larger nr_scan will be
expanded. This needs further investigation.
There is also ongoing work[2] from Abel to filter out the busy CPUs during
wakeup, to further speed up the idle CPU scan. And it could be a following-up
optimization on top of this change.
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
sched_setattr(2) issues via kernel/sched/core.c:__sched_setscheduler()
a CAP_SYS_NICE audit event unconditionally, even when the requested
operation does not require that capability / is unprivileged, i.e. for
reducing niceness.
This is relevant in connection with SELinux, where a capability check
results in a policy decision and by default a denial message on
insufficient permission is issued.
It can lead to three undesired cases:
1. A denial message is generated, even in case the operation was an
unprivileged one and thus the syscall succeeded, creating noise.
2. To avoid the noise from 1. the policy writer adds a rule to ignore
those denial messages, hiding future syscalls, where the task
performs an actual privileged operation, leading to hidden limited
functionality of that task.
3. To avoid the noise from 1. the policy writer adds a rule to allow
the task the capability CAP_SYS_NICE, while it does not need it,
violating the principle of least privilege.
Conduct privilged/unprivileged categorization first and perform a
capable test (and at most once) only if needed.
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220615152505.310488-1-cgzones@googlemail.com
When checking for libc rseq support in the library constructor, don't
only depend on the symbols presence, check that the registration was
completed.
This targets a scenario where the libc has rseq support but it is not
wired for the current architecture in 'bits/rseq.h', we want to fallback
to our internal registration mechanism.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/20220614154830.1367382-4-mjeanson@efficios.com
This header is also used in librseq where it can be included in C++
code, add a space between literals and string macros.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/20220614154830.1367382-3-mjeanson@efficios.com
Make the RISC-V rseq selftests compatible with glibc-2.35 by using the
rseq_get_abi() helper.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/r/20220614154830.1367382-2-mjeanson@efficios.com
Wakelist can help avoid cache bouncing and offload the overhead of waker
cpu. So far, using wakelist within the same llc only happens on
WF_ON_CPU, and this limitation could be removed to further improve
wakeup performance.
The commit 518cd6234178 ("sched: Only queue remote wakeups when
crossing cache boundaries") disabled queuing tasks on wakelist when
the cpus share llc. This is because, at that time, the scheduler must
send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
supports TIF_POLLING, so this is not a problem now when the wakee cpu is
in idle polling.
Benefits:
Queuing the task on idle cpu can help improving performance on waker cpu
and utilization on wakee cpu, and further improve locality because
the wakee cpu can handle its own rq. This patch helps improving rt on
our real java workloads where wakeup happens frequently.
Consider the normal condition (CPU0 and CPU1 share same llc)
Before this patch:
CPU0 CPU1
select_task_rq() idle
rq_lock(CPU1->rq)
enqueue_task(CPU1->rq)
notify CPU1 (by sending IPI or CPU1 polling)
resched()
After this patch:
CPU0 CPU1
select_task_rq() idle
add to wakelist of CPU1
notify CPU1 (by sending IPI or CPU1 polling)
rq_lock(CPU1->rq)
enqueue_task(CPU1->rq)
resched()
We see CPU0 can finish its work earlier. It only needs to put task to
wakelist and return.
While CPU1 is idle, so let itself handle its own runqueue data.
This patch brings no difference about IPI.
This patch only takes effect when the wakee cpu is:
1) idle polling
2) idle not polling
For 1), there will be no IPI with or without this patch.
For 2), there will always be an IPI before or after this patch.
Before this patch: waker cpu will enqueue task and check preempt. Since
"idle" will be sure to be preempted, waker cpu must send a resched IPI.
After this patch: waker cpu will put the task to the wakelist of wakee
cpu, and send an IPI.
Benchmark:
We've tested schbench, unixbench, and hachbench on both x86 and arm64.
On x86 (Intel Xeon Platinum 8269CY):
schbench -m 2 -t 8
Latency percentiles (usec) before after
50.0000th: 8 6
75.0000th: 10 7
90.0000th: 11 8
95.0000th: 12 8
*99.0000th: 13 10
99.5000th: 15 11
99.9000th: 18 14
Unixbench with full threads (104)
before after
Dhrystone 2 using register variables 3011862938 3009935994 -0.06%
Double-Precision Whetstone 617119.3 617298.5 0.03%
Execl Throughput 27667.3 27627.3 -0.14%
File Copy 1024 bufsize 2000 maxblocks 785871.4 784906.2 -0.12%
File Copy 256 bufsize 500 maxblocks 210113.6 212635.4 1.20%
File Copy 4096 bufsize 8000 maxblocks 2328862.2 2320529.1 -0.36%
Pipe Throughput 145535622.8 145323033.2 -0.15%
Pipe-based Context Switching 3221686.4 3583975.4 11.25%
Process Creation 101347.1 103345.4 1.97%
Shell Scripts (1 concurrent) 120193.5 123977.8 3.15%
Shell Scripts (8 concurrent) 17233.4 17138.4 -0.55%
System Call Overhead 5300604.8 5312213.6 0.22%
hackbench -g 1 -l 100000
before after
Time 3.246 2.251
On arm64 (Ampere Altra):
schbench -m 2 -t 8
Latency percentiles (usec) before after
50.0000th: 14 10
75.0000th: 19 14
90.0000th: 22 16
95.0000th: 23 16
*99.0000th: 24 17
99.5000th: 24 17
99.9000th: 28 25
Unixbench with full threads (80)
before after
Dhrystone 2 using register variables 3536194249 3537019613 0.02%
Double-Precision Whetstone 629383.6 629431.6 0.01%
Execl Throughput 65920.5 65846.2 -0.11%
File Copy 1024 bufsize 2000 maxblocks 1063722.8 1064026.8 0.03%
File Copy 256 bufsize 500 maxblocks 322684.5 318724.5 -1.23%
File Copy 4096 bufsize 8000 maxblocks 2348285.3 2328804.8 -0.83%
Pipe Throughput 133542875.3 131619389.8 -1.44%
Pipe-based Context Switching 3215356.1 3576945.1 11.25%
Process Creation 108520.5 120184.6 10.75%
Shell Scripts (1 concurrent) 122636.3 121888 -0.61%
Shell Scripts (8 concurrent) 17462.1 17381.4 -0.46%
System Call Overhead 4429998.9 4435006.7 0.11%
hackbench -g 1 -l 100000
before after
Time 4.217 2.916
Our patch has improvement on schbench, hackbench
and Pipe-based Context Switching of unixbench
when there exists idle cpus,
and no obvious regression on other tests of unixbench.
This can help improve rt in scenes where wakeup happens frequently.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com
The commit 2ebb17717550 ("sched/core: Offload wakee task activation if it
the wakee is descheduling") checked rq->nr_running <= 1 to avoid task
stacking when WF_ON_CPU.
Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu
(WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through
the deactivate_task() in __schedule(), thus p has been accounted out of
rq->nr_running. As such, the task being the only runnable task on the rq
implies reading rq->nr_running == 0 at that point.
The benchmark result is in [1].
[1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/
Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com
While doing newidle load balancing, it is possible for new tasks to
arrive, such as with pending wakeups. newidle_balance() already accounts
for this by exiting the sched_domain load_balance() iteration if it
detects these cases. This is very important for minimizing wakeup
latency.
However, if we are already in load_balance(), we may stay there for a
while before returning back to newidle_balance(). This is most
exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A
very straightforward workaround to this is to adjust should_we_balance()
to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are
detected.
This was tested with the following reproduction:
- two threads that take turns sleeping and waking each other up are
affined to two cores
- a large number of threads with 100% utilization are pinned to all
other cores
Without this patch, wakeup latency was ~120us for the pair of threads,
almost entirely spent in load_balance(). With this patch, wakeup latency
is ~6us.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com
sysctl_sched_dl_period_max and sysctl_sched_dl_period_min are unsigned
integer, but proc_dointvec() wouldn't return error even if we set a
negative number.
Use proc_douintvec_minmax() instead of proc_dointvec(). Add extra1 for
sysctl_sched_dl_period_max and extra2 for sysctl_sched_dl_period_min.
It's just an optimization for match data and proc_handler in struct
ctl_table. The 'if (period < min || period > max)' in __checkparam_dl()
will work fine even if there hasn't this patch.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Link: https://lore.kernel.org/r/20220607101807.249965-1-yajun.deng@linux.dev
We notice the rq leaf_cfs_rq_list has two problems when do bugfix
backports and some test profiling.
1. cfs_rqs under throttled subtree could be added to the list, and
make their fully decayed ancestors on the list, even though not needed.
2. #1 also make the leaf_cfs_rq_list management complex and error prone,
this is the list of related bugfix so far:
commit 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()")
commit fe61468b2cbc ("sched/fair: Fix enqueue_task_fair warning")
commit b34cb07dde7c ("sched/fair: Fix enqueue_task_fair() warning some more")
commit 39f23ce07b93 ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list")
commit 0258bdfaff5b ("sched/fair: Fix unfairness caused by missing load decay")
commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
commit fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling")
commit 2630cde26711 ("sched/fair: Add ancestors of unthrottled undecayed cfs_rq")
commit 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()")
delete every cfs_rq under throttled subtree from rq->leaf_cfs_rq_list,
and delete the throttled_hierarchy() test in update_blocked_averages(),
which optimized update_blocked_averages().
But those later bugfix add cfs_rqs under throttled subtree back to
rq->leaf_cfs_rq_list again, with their fully decayed ancestors, for
the integrity of rq->leaf_cfs_rq_list.
This patch takes another method, skip all cfs_rqs under throttled
hierarchy when list_add_leaf_cfs_rq(), to completely make cfs_rqs
under throttled subtree off the leaf_cfs_rq_list.
So we don't need to consider throttled related things in
enqueue_entity(), unthrottle_cfs_rq() and enqueue_task_fair(),
which simplify the code a lot. Also optimize update_blocked_averages()
since cfs_rqs under throttled hierarchy and their ancestors
won't be on the leaf_cfs_rq_list.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220601021848.76943-1-zhouchengming@bytedance.com
In the case of systems containing multiple LLCs per socket, like
AMD Zen systems, users want to spread bandwidth hungry applications
across multiple LLCs. Stream is one such representative workload where
the best performance is obtained by limiting one stream thread per LLC.
To ensure this, users are known to pin the tasks to a specify a subset
of the CPUs consisting of one CPU per LLC while running such bandwidth
hungry tasks.
Suppose we kickstart a multi-threaded task like stream with 8 threads
using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3
server where each socket contains 128 CPUs
(0-63,128-191 in one socket, 64-127,192-255 in another socket)
Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8
Here each CPU in the list is from a different LLC and 4 of those LLCs
are on one socket, while the other 4 are on another socket.
Ideally we would prefer that each stream thread runs on a different
CPU from the allowed list of CPUs. However, the current heuristics in
find_idlest_group() do not allow this during the initial placement.
Suppose the first socket (0-63,128-191) is our local group from which
we are kickstarting the stream tasks. The first four stream threads
will be placed in this socket. When it comes to placing the 5th
thread, all the allowed CPUs are from the local group (0,16,32,48)
would have been taken.
However, the current scheduler code simply checks if the number of
tasks in the local group is fewer than the allowed numa-imbalance
threshold. This threshold was previously 25% of the NUMA domain span
(in this case threshold = 32) but after the v6 of Mel's patchset
"Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip,
Commit: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance
when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs
in the NUMA domain, for processors with multiple LLCs.
(in this case threshold = 8).
For this example, the number of tasks will always be within threshold
and thus all the 8 stream threads will be woken up on the first socket
thereby resulting in sub-optimal performance.
The following sched_wakeup_new tracepoint output shows the initial
placement of tasks in the current tip/sched/core on the Zen3 machine:
stream-5313 [016] d..2. 627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
stream-5313 [016] d..2. 627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
stream-5313 [016] d..2. 627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
stream-5313 [016] d..2. 627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
stream-5313 [016] d..2. 627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016
Once the first four threads are distributed among the allowed CPUs of
socket one, the rest of the treads start piling on these same CPUs
when clearly there are CPUs on the second socket that can be used.
Following the initial pile up on a small number of CPUs, though the
load-balancer eventually kicks in, it takes a while to get to {4}{4}
and even {4}{4} isn't stable as we observe a bunch of ping ponging
between {4}{4} to {5}{3} and back before a stable state is reached
much later (1 Stream thread per allowed CPU) and no more migration is
required.
We can detect this piling and avoid it by checking if the number of
allowed CPUs in the local group are fewer than the number of tasks
running in the local group and use this information to spread the
5th task out into the next socket (after all, the goal in this
slowpath is to find the idlest group and the idlest CPU during the
initial placement!).
The following sched_wakeup_new tracepoint output shows the initial
placement of tasks after adding this fix on the Zen3 machine:
stream-4485 [016] d..2. 230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
stream-4485 [016] d..2. 230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
stream-4485 [016] d..2. 230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
stream-4485 [016] d..2. 230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
stream-4485 [016] d..2. 230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
stream-4485 [016] d..2. 230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
stream-4485 [016] d..2. 230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064
We see that threads are using all of the allowed CPUs and there is
no pileup.
No output is generated for tracepoint sched_migrate_task with this
patch due to a perfect initial placement which removes the need
for balancing later on - both across NUMA boundaries and within
NUMA boundaries for stream.
Following are the results from running 8 Stream threads with and
without pinning on a dual socket Zen3 Machine (2 x 64C/128T):
During the testing of this patch, the tip sched/core was at
commit: 089c02ae2771 "ftrace: Use preemption model accessors for trace
header printout"
Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8
5.18.0-rc1 5.18.0-rc1 5.18.0-rc1
tip sched/core tip sched/core tip sched/core
(no pinning) + pinning + this-patch
+ pinning
Copy: 109364.74 (0.00 pct) 94220.50 (-13.84 pct) 158301.28 (44.74 pct)
Scale: 109670.26 (0.00 pct) 90210.59 (-17.74 pct) 149525.64 (36.34 pct)
Add: 129029.01 (0.00 pct) 101906.00 (-21.02 pct) 186658.17 (44.66 pct)
Triad: 127260.05 (0.00 pct) 106051.36 (-16.66 pct) 184327.30 (44.84 pct)
Pinning currently hurts the performance compared to unbound case on
tip/sched/core. With the addition of this patch, we are able to
outperform tip/sched/core by a good margin with pinning.
Following are the results from running 16 Stream threads with and
without pinning on a dual socket IceLake Machine (2 x 32C/64T):
NUMA Topology of Intel Skylake machine:
Node 1: 0,2,4,6 ... 126 (Even numbers)
Node 2: 1,3,5,7 ... 127 (Odd numbers)
Pinning is done using: numactl -C 0-15 ./stream16
5.18.0-rc1 5.18.0-rc1 5.18.0-rc1
tip sched/core tip sched/core tip sched/core
(no pinning) +pinning + this-patch
+ pinning
Copy: 85815.31 (0.00 pct) 149819.21 (74.58 pct) 156807.48 (82.72 pct)
Scale: 64795.60 (0.00 pct) 97595.07 (50.61 pct) 99871.96 (54.13 pct)
Add: 71340.68 (0.00 pct) 111549.10 (56.36 pct) 114598.33 (60.63 pct)
Triad: 68890.97 (0.00 pct) 111635.16 (62.04 pct) 114589.24 (66.33 pct)
In case of Icelake machine, with single LLC per socket, pinning across
the two sockets reduces cache contention, thus showing great
improvement in pinned case which is further benefited by this patch.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20220407111222.22649-1-kprateek.nayak@amd.com
For a single LLC per node, a NUMA imbalance is allowed up until 25%
of CPUs sharing a node could be active. One intent of the cut-off is
to avoid an imbalance of memory channels but there is no topological
information based on active memory channels. Furthermore, there can
be differences between nodes depending on the number of populated
DIMMs.
A cut-off of 25% was arbitrary but generally worked. It does have a severe
corner cases though when an parallel workload is using 25% of all available
CPUs over-saturates memory channels. This can happen due to the initial
forking of tasks that get pulled more to one node after early wakeups
(e.g. a barrier synchronisation) that is not quickly corrected by the
load balancer. The LB may fail to act quickly as the parallel tasks are
considered to be poor migrate candidates due to locality or cache hotness.
On a range of modern Intel CPUs, 12.5% appears to be a better cut-off
assuming all memory channels are populated and is used as the new cut-off
point. A minimum of 1 is specified to allow a communicating pair to
remain local even for CPUs with low numbers of cores. For modern AMDs,
there are multiple LLCs and are not affected.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-5-mgorman@techsingularity.net
The imbalance limitations are applied inconsistently at fork time
and at runtime. At fork, a new task can remain local until there are
too many running tasks even if the degree of imbalance is larger than
NUMA_IMBALANCE_MIN which is different to runtime. Secondly, the imbalance
figure used during load balancing is different to the one used at NUMA
placement. Load balancing uses the number of tasks that must move to
restore imbalance where as NUMA balancing uses the total imbalance.
In combination, it is possible for a parallel workload that uses a small
number of CPUs without applying scheduler policies to have very variable
run-to-run performance.
[lkp@intel.com: Fix build breakage for arc-allyesconfig]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-4-mgorman@techsingularity.net
If a destination node has spare capacity but there is an imbalance then
two tasks are selected for swapping. If the tasks have no numa group
or are within the same NUMA group, it's simply shuffling tasks around
without having any impact on the compute imbalance. Instead, it's just
punishing one task to help another.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-3-mgorman@techsingularity.net
On clone, numa_migrate_retry is inherited from the parent which means
that the first NUMA placement of a task is non-deterministic. This
affects when load balancing recognises numa tasks and whether to
migrate "regular", "remote" or "all" tasks between NUMA scheduler
domains.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220520103519.1863-2-mgorman@techsingularity.net
Highlights:
- Fix hp-wmi regression on HP Omen laptops introduced in 5.18
- Several hardware-id additions
- A couple of other tiny fixes
The following is an automated git shortlog grouped by driver:
barco-p50-gpio:
- Add check for platform_driver_register
gigabyte-wmi:
- Add support for B450M DS3H-CF
- Add Z690M AORUS ELITE AX DDR4 support
hp-wmi:
- Use zero insize parameter only when supported
- Resolve WMI query failures on some devices
platform/mellanox:
- Add static in struct declaration.
- Spelling s/platfom/platform/
platform/x86/intel:
- hid: Add Surface Go to VGBS allow list
- pmc: Support Intel Raptorlake P
- Fix pmt_crashlog array reference
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEEuvA7XScYQRpenhd+kuxHeUQDJ9wFAmKl3+0UHGhkZWdvZWRl
QHJlZGhhdC5jb20ACgkQkuxHeUQDJ9zU3wf/YJUH3Obp0kAUssrOxT9QShHH6iFL
Gh6CN38DYLEbV7FNBH/iA+ZW7DSsymQm1M3jj1QYB0Sd2X/hSsUEh2q6F0XNjTKH
X8vVrSRAL+DStpgyKerBJHKpzO3ESUlipmIvR/JuGVHQAylR8ielHvAx+XhA1PKz
8/ezppIa9bxZ1+LwQNPE40jJv+wlNctGrx3TEwqPep3STWWFEjkeYpR+3XIJ4Ads
5P7FXxpM2cRFKBcI9jVIBDMIZulEfkXxBDcLCHOYYitxcpnrFFghn14ys5Exh9zL
pdaVHnlP6pEFpVhCWa/ICpSyrLhNYqTUJ92dyFeakbGBj9QlwnfeceMLAw==
=SaRl
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
"Highlights:
- Fix hp-wmi regression on HP Omen laptops introduced in 5.18
- Several hardware-id additions
- A couple of other tiny fixes"
* tag 'platform-drivers-x86-v5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86/intel: hid: Add Surface Go to VGBS allow list
platform/x86: hp-wmi: Use zero insize parameter only when supported
platform/x86: hp-wmi: Resolve WMI query failures on some devices
platform/x86: gigabyte-wmi: Add support for B450M DS3H-CF
platform/x86: gigabyte-wmi: Add Z690M AORUS ELITE AX DDR4 support
platform/x86: barco-p50-gpio: Add check for platform_driver_register
platform/x86/intel: pmc: Support Intel Raptorlake P
platform/x86/intel: Fix pmt_crashlog array reference
platform/mellanox: Add static in struct declaration.
platform/mellanox: Spelling s/platfom/platform/
Tetsuo's patch to trigger build warnings if system-wide wq's are flushed
along with a TP type update and trivial comment update.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYqUyqQ4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGQPtAQCQZuNFoWhCtdpjW/MWuGdY1pGGPMVl+60xwvew
Ad8gegD/eoAsXP1XAzJ9Z1BPqr/IxncfOgGGDGHbR1Ll39qLlwE=
=i3Sx
-----END PGP SIGNATURE-----
Merge tag 'wq-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
"Tetsuo's patch to trigger build warnings if system-wide wq's are
flushed along with a TP type update and trivial comment update"
* tag 'wq-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Switch to new kerneldoc syntax for named variable macro argument
workqueue: Fix type of cpu in trace event
workqueue: Wrap flush_workqueue() using a macro
- Make the *.mod build rule portable for POSIX awk
- Fix regression of 'make nsdeps'
- Make scripts/check-local-export working for older bash versions
- Fix scripts/gdb to extract the .config data from vmlinux
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmKk7n0VHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGFv0P/RNPUu+hhNkQB41nIyDkn977KtM4
QqEdXUaGVEXEMwcBYn7bo4LR/pxbRX73WJ2o3ONwSTnpmbwDkSwmJX0JADuUhFKh
je7E/+Sv9wJsyZ93Tow/nTt13xCcox9QSxx2wZrS06GQz6EmU34N+sXx82tffH8L
ONQq36p31JKCwTuaK9oiAz+FAH6ap9sStL6WBrS+HSpinO0s5P9xTtTrdZHTYmn2
KgFofMZvGAMEjGZsi2H3nobjH52/yZoUyfDanFS4DAS2ZMMQ5Vw8MpCxuKz8G2oN
/VCo3HgK9D+0uaAwPag3z+hugHoMBU5cjNhsvjiO6veB/VEHxm5KoEATo4xmU2OZ
W7GGp3FCsxrZkSkXtKO6FAJatN/DlgHYoJ647xo/yNOxULgzVCZgAs/cPUqUW9l2
bFNeDJ4YIiFDCwi65/KOAyLOpQT8j/wMEWmrrk57HBkrSaBrHhkNV31d3Y5HUnr2
XhjqGk4dX+YX8WAEN0LrjLzUQNntqxk2yvITi0m4OaIvblu0XU2B0F70WRqhXxaf
cmQFysAhmcaFSrlWxiNOnLPENNQjSy5jeKU/f6ZpGcYu7MU09ay3P+eoqFoPGq24
dLBqdLd52QDJQAo91y05cUJrj6IRYNsEK7HYlvokmvfmGHjfclxdM4+rT6dBwHpf
ih8aP2eTxRMIwqgY
=ka85
-----END PGP SIGNATURE-----
Merge tag 'kbuild-fixes-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Make the *.mod build rule portable for POSIX awk
- Fix regression of 'make nsdeps'
- Make scripts/check-local-export working for older bash versions
- Fix scripts/gdb to extract the .config data from vmlinux
* tag 'kbuild-fixes-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
scripts/gdb: change kernel config dumping method
scripts/check-local-export: avoid 'wait $!' for process substitution
scripts/nsdeps: adjust to the format change of *.mod files
kbuild: avoid regex RS for POSIX awk
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmKkwyEACgkQiiy9cAdy
T1EqyAv/d8aDey0rQzGy918wzJd91gZrNFOJUpzVhIs3O5MakBgeoYn+S6rySl1+
xs6lXTQdSEyiL0edqTIq8iqA+iuhLPCBW2BWa/Zw089yHM/Ho3tjc5gBl5w38OcF
7NpFUInkg+yoBYWY9cCwjL83YaPxhcLKGY7S6WWptUxzf5Sg6eUqXCkMS7eUV6hb
YniMa5uWZSJtqY4F6qw/NOw90QekodEmfL4lLU/GXOnDxlJ8v5Ztf3aGHITWNwsd
ovhutUSai/tZz9fYHp6yOZYDcl4i0brOa3dIyU2tr52TdtzS73he8rE+Th4bu+uM
XTXvrDCTwsnOTiRFyyBJcaVDF+6LpqPEcURqLEbVOf0xXHyoEQ4zEwVFQJIBYOP4
Oy8XeXQePRxCnToI2cFZaw85IkLikoZ+4PggFkbsaFdJfkboR7b+XVhkfGVr5jnn
A6Unwrn3f6LS6MLbsDAjUpfzftwRyhbcvYCukeYKWz836xAx5tyOBIZA3m6keXah
LyhX4qSb
=mHWJ
-----END PGP SIGNATURE-----
Merge tag '5.19-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs client fixes from Steve French:
"Three reconnect fixes, all for stable as well.
One of these three reconnect fixes does address a problem with
multichannel reconnect, but this does not include the additional
fix (still being tested) for dynamically detecting multichannel
adapter changes which will improve those reconnect scenarios even
more"
* tag '5.19-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: populate empty hostnames for extra channels
cifs: return errors during session setup during reconnects
cifs: fix reconnect on smb3 mount types
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmKktaEACgkQSfxwEqXe
A65Q+RAAkyLfX6kxQlTZbglTTiapbGwpk7+VhM0pJfc3guXhIe8kNZ5cX/mfkGeZ
f/ebUmWSJhqaNz1OIZeBZQ98ESwh8imfWaxWDkqsFh4c+hGsSp2xwIszMn3Hg+7L
Sm/0Q71eZaSnRBGxWRVBbz3tTppUBS4nJxvFj8iM3jWWUffZa0m/w1lMvqc8kNJu
kM1ONqb+CEuHOJyltUaH2qEXD6fQE3IOpPdC6PdsFqFX8qLN/pwZO6tpY8tYPt3j
AUubp8F3eR4Y7WcmMi1b7BmiRgg47jdsS18aqRSH8CuYvIBbXHNM+tuK54zh/888
d+s9Bj6ALR4z1/a8HXqtudCazYU+1VozWxVIELcpDWTX4wKgqVZ0HLEz7sAEfCgV
wSIcIY8TRJlOTL43KenbJqbyIOsvAidqWEYz5ogF9WYUlaD82s2j8pUMj4wQoD5w
VJqF2CaoewU0BOGK7rZFnElN5rPlfEJtNBZQOEo16BzA2tXilPgRFoOmKs9TMRQo
dGotfp62rUuS14b9x7zc9be0QmtnmwxQKO9U6SRVd5X2HXU/E5PsqTsbt0PKlSbA
qemw2CehPB+uFs0cqbbeI5VBpVnPQJkaflTfOVr04h7623KJ5Pnblv7/8iSedOOh
L4ACxPO5I5MrM8rJAR4g85SYVg3A/HaTufA5XtR5QB/qmEaErXg=
=1Lf0
-----END PGP SIGNATURE-----
Merge tag 'random-5.19-rc2-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator fixes from Jason Donenfeld:
- A fix for a 5.19 regression for a case in which early device tree
initializes the RNG, which flips a static branch.
On most plaforms, jump labels aren't initialized until much later, so
this caused splats. On a few mailing list threads, we cooked up easy
fixes for arm64, arm32, and risc-v. But then things looked slightly
more involved for xtensa, powerpc, arc, and mips. And at that point,
when we're patching 7 architectures in a place before the console is
even available, it seems like the cost/risk just wasn't worth it.
So random.c works around it now by checking the already exported
`static_key_initialized` boolean, as though somebody already ran into
this issue in the past. I'm not super jazzed about that; it'd be
prettier to not have to complicate downstream code. But I suppose
it's practical.
- A few small code nits and adding a missing __init annotation.
- A change to the default config values to use the cpu and bootloader's
seeds for initializing the RNG earlier.
This brings them into line with what all the distros do (Fedora/RHEL,
Debian, Ubuntu, Gentoo, Arch, NixOS, Alpine, SUSE, and Void... at
least), and moreover will now give us test coverage in various test
beds that might have caught the above device tree bug earlier.
- A change to WireGuard CI's configuration to increase test coverage
around the RNG.
- A documentation comment fix to unrelated maintainerless CRC code that
I was asked to take, I guess because it has to do with polynomials
(which the RNG thankfully no longer uses).
* tag 'random-5.19-rc2-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
wireguard: selftests: use maximum cpu features and allow rng seeding
random: remove rng_has_arch_random()
random: credit cpu and bootloader seeds by default
random: do not use jump labels before they are initialized
random: account for arch randomness in bits
random: mark bootloader randomness code as __init
random: avoid checking crng_ready() twice in random_init()
crc-itu-t: fix typo in CRC ITU-T polynomial comment
The Surface Go reports Chassis Type 9 (Laptop,) so the device needs to be
added to dmi_vgbs_allow_list to enable tablet mode when an attached Type
Cover is folded back.
BugLink: https://github.com/linux-surface/linux-surface/issues/837
Signed-off-by: Duke Lee <krnhotwings@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lore.kernel.org/r/20220607213654.5567-1-krnhotwings@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
commit be9d73e64957 ("platform/x86: hp-wmi: Fix 0x05 error code reported by
several WMI calls") and commit 12b19f14a21a ("platform/x86: hp-wmi: Fix
hp_wmi_read_int() reporting error (0x05)") cause ACPI BIOS Error (bug):
Attempt to CreateField of length zero (20211217/dsopcode-133) because of
the ACPI method HWMC, which unconditionally creates a Field of
size (insize*8) bits:
CreateField (Arg1, 0x80, (Local5 * 0x08), DAIN)
In cases where args->insize = 0, the Field size is 0, resulting in
an error.
Fix this by using zero insize only if 0x5 error code is returned
Tested on Omen 15 AMD (2020) board ID: 8786.
Fixes: be9d73e64957 ("platform/x86: hp-wmi: Fix 0x05 error code reported by several WMI calls")
Signed-off-by: Bedant Patnaik <bedant.patnaik@gmail.com>
Tested-by: Jorge Lopez <jorge.lopez2@hp.com>
Link: https://lore.kernel.org/r/41be46743d21c78741232a47bbb5f1cdbcc3d21e.camel@gmail.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
WMI queries fail on some devices where the ACPI method HWMC
unconditionally attempts to create Fields beyond the buffer
if the buffer is too small, this breaks essential features
such as power profiles:
CreateByteField (Arg1, 0x10, D008)
CreateByteField (Arg1, 0x11, D009)
CreateByteField (Arg1, 0x12, D010)
CreateDWordField (Arg1, 0x10, D032)
CreateField (Arg1, 0x80, 0x0400, D128)
In cases where args->data had zero length, ACPI BIOS Error
(bug): AE_AML_BUFFER_LIMIT, Field [D008] at bit
offset/length 128/8 exceeds size of target Buffer (128 bits)
(20211217/dsopcode-198) was obtained.
ACPI BIOS Error (bug): AE_AML_BUFFER_LIMIT, Field [D009] at bit
offset/length 136/8 exceeds size of target Buffer (136bits)
(20211217/dsopcode-198)
The original code created a buffer size of 128 bytes regardless if
the WMI call required a smaller buffer or not. This particular
behavior occurs in older BIOS and reproduced in OMEN laptops. Newer
BIOS handles buffer sizes properly and meets the latest specification
requirements. This is the reason why testing with a dynamically
allocated buffer did not uncover any failures with the test systems at
hand.
This patch was tested on several OMEN, Elite, and Zbooks. It was
confirmed the patch resolves HPWMI_FAN GET/SET calls in an OMEN
Laptop 15-ek0xxx. No problems were reported when testing on several Elite
and Zbooks notebooks.
Fixes: 4b4967cbd268 ("platform/x86: hp-wmi: Changing bios_args.data to be dynamically allocated")
Signed-off-by: Jorge Lopez <jorge.lopez2@hp.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Link: https://lore.kernel.org/r/20220608212923.8585-2-jorge.lopez2@hp.com
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
The syntax without dots is available since commit 43756e347f21
("scripts/kernel-doc: Add support for named variable macro arguments").
The same HTML output is produced with and without this patch.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
- make irq_chip structs immutable in several Diolan and intel drivers to
get rid of the new warning we emit when fiddling with irq chips
- don't print error messages on probe deferral in gpio-dwapb
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEFp3rbAvDxGAT0sefEacuoBRx13IFAmKk+XAACgkQEacuoBRx
13Krng//XvlVTqp0nRr7zl/9NaA9iwjTENEkIp0eL9PNA6LjEDz8v1WCW35wIrkW
h/Ve+udXHTbtNEU5dSSW7zrFPU3Ma7oPBHlddE4AzwMgzj+NYXjULhx6Fdvg2aRI
B/pDyArqW3ndJ+PLI01bFS92Sb62ZGBgMngfBV6z9Unc+L/JvK6Hlm/Vn1EQ+oOF
pzzGKgtxW1jE895SyB8TlVoCxtmxGohmOD8lpDlZetq/59q6epEgSKj6nR6k4fTB
E6dk/K0qn4/8N1AU/yZkitxea6Sk59AuS5l5bs+th2ZNaesdyfqMbJSvqqNMNbr1
NAEZ37MDhCdQPS0W0/77SX00E4lblObXnlv4JLM7klpGX78AiIOEY11ILfp5y4wp
n9m+pBiW/6DQNggn+uF/bly4I3wQPqEnL43NOpSmiTYtJVY3KBjiLS08sF48//f1
fJuLjJ4dRJJGWBCeR6xesfhXwkspYAkPEsP/xhqMNdehCMBlK7W6Vw8PH7AAzKGK
l3qubKVlS2U0wcu8ISAeHyt2WDhnY0Wo0G/27+jt6Nx+h87ChGNVhN2E3PFGM05i
W4iYI6tADjT8TGqpslyjEwAREVT3J90oEXx/t3hDsgkt7h5DPPgijMKxM7NXhM5Z
bBDjYgij3Q586L65O4Nwstgv+tolsM+i4nYuqwfGeRr9r0XeiQ8=
=oL8o
-----END PGP SIGNATURE-----
Merge tag 'gpio-fixes-for-v5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"A set of fixes. Most address the new warning we emit at build time
when irq chips are not immutable with some additional tweaks to
gpio-crystalcove from Andy and a small tweak to gpio-dwapd.
- make irq_chip structs immutable in several Diolan and intel drivers
to get rid of the new warning we emit when fiddling with irq chips
- don't print error messages on probe deferral in gpio-dwapb"
* tag 'gpio-fixes-for-v5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: dwapb: Don't print error on -EPROBE_DEFER
gpio: dln2: make irq_chip immutable
gpio: sch: make irq_chip immutable
gpio: merrifield: make irq_chip immutable
gpio: wcove: make irq_chip immutable
gpio: crystalcove: Join function declarations and long lines
gpio: crystalcove: Use specific type and API for IRQ number
gpio: crystalcove: make irq_chip immutable
13 driver and 1 core patch. Nine of the driver patches are minor
fixes and reworks to lpfc and the rest are trivial and minor fixes.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYqSIMyYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishf5hAP4rj6mu
qqCQmh7V+zMst2ijZ+BBJBHX5eURcOwMWxd48QD/dLp/hX7djKS8i8hju2AznOyx
F8g4PqPvnoJJ0adlpb0=
=GTnk
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Driver fixes and and one core patch.
Nine of the driver patches are minor fixes and reworks to lpfc and the
rest are trivial and minor fixes elsewhere"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: pmcraid: Fix missing resource cleanup in error case
scsi: ipr: Fix missing/incorrect resource cleanup in error case
scsi: mpt3sas: Fix out-of-bounds compiler warning
scsi: lpfc: Update lpfc version to 14.2.0.4
scsi: lpfc: Allow reduced polling rate for nvme_admin_async_event cmd completion
scsi: lpfc: Add more logging of cmd and cqe information for aborted NVMe cmds
scsi: lpfc: Fix port stuck in bypassed state after LIP in PT2PT topology
scsi: lpfc: Resolve NULL ptr dereference after an ELS LOGO is aborted
scsi: lpfc: Address NULL pointer dereference after starget_to_rport()
scsi: lpfc: Resolve some cleanup issues following SLI path refactoring
scsi: lpfc: Resolve some cleanup issues following abort path refactoring
scsi: lpfc: Correct BDE type for XMIT_SEQ64_WQE in lpfc_ct_reject_event()
scsi: vmw_pvscsi: Expand vcpuHint to 16 bits
scsi: sd: Fix interpretation of VPD B9h length
Fixes all over the place, most notably fixes for latent
bugs in drivers that got exposed by suppressing
interrupts before DRIVER_OK, which in turn has been
done by 8b4ec69d7e09 ("virtio: harden vring IRQ").
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmKkIwUPHG1zdEByZWRo
YXQuY29tAAoJECgfDbjSjVRpo/8H/3JwrGv0k5vj1wbHXymmBCRJrQ90Wg5/lwGj
8419DWHidPqf8X/ZAEHoWMr3FM7Jsj7KuTl3sQZODxQbQ4wERvEljTRosxImJ3iI
k2+T7HGf4pAIp8zzSp+1xL1feyctHT3atWKlpjXGqsDNCONexBIfG7/M63MkArOg
SzrbcTsZzabdI2eKDSipiywnrGlQH6mq7YXynaKWBdL7RtJC+PHAYLksJufPz4Lw
vsHpNEfWkoss+hsYKhH73lU4B++eQnHKbjiMktPkNLusxWW035unP97kqJKvxFgu
NDF9PDa/4e47nJqqcibvo6VEZ5q6hLjaDCqRPjzEdKl44V8VeNE=
=Q+zn
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Fixes all over the place, most notably fixes for latent bugs in
drivers that got exposed by suppressing interrupts before DRIVER_OK,
which in turn has been done by 8b4ec69d7e09 ("virtio: harden vring
IRQ")"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
um: virt-pci: set device ready in probe()
vdpa: make get_vq_group and set_group_asid optional
virtio: Fix all occurences of the "the the" typo
vduse: Fix NULL pointer dereference on sysfs access
vringh: Fix loop descriptors check in the indirect cases
vdpa/mlx5: clean up indenting in handle_ctrl_vlan()
vdpa/mlx5: fix error code for deleting vlan
virtio-mmio: fix missing put_device() when vm_cmdline_parent registration failed
vdpa/mlx5: Fix syntax errors in comments
virtio-rng: make device ready before making request
Commit 6c77676645ad ("iov_iter: Fix iter_xarray_get_pages{,_alloc}()")
introduced a problem on some 32-bit architectures (at least arm, xtensa,
csky,sparc and mips), that have a 'size_t' that is 'unsigned int'.
The reason is that we now do
min(nr * PAGE_SIZE - offset, maxsize);
where 'nr' and 'offset' and both 'unsigned int', and PAGE_SIZE is
'unsigned long'. As a result, the normal C type rules means that the
first argument to 'min()' ends up being 'unsigned long'.
In contrast, 'maxsize' is of type 'size_t'.
Now, 'size_t' and 'unsigned long' are always the same physical type in
the kernel, so you'd think this doesn't matter, and from an actual
arithmetic standpoint it doesn't.
But on 32-bit architectures 'size_t' is commonly 'unsigned int', even if
it could also be 'unsigned long'. In that situation, both are unsigned
32-bit types, but they are not the *same* type.
And as a result 'min()' will complain about the distinct types (ignore
the "pointer types" part of the error message: that's an artifact of the
way we have made 'min()' check types for being the same):
lib/iov_iter.c: In function 'iter_xarray_get_pages':
include/linux/minmax.h:20:35: error: comparison of distinct pointer types lacks a cast [-Werror]
20 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
| ^~
lib/iov_iter.c:1464:16: note: in expansion of macro 'min'
1464 | return min(nr * PAGE_SIZE - offset, maxsize);
| ^~~
This was not visible on 64-bit architectures (where we always define
'size_t' to be 'unsigned long').
Force these cases to use 'min_t(size_t, x, y)' to make the type explicit
and avoid the issue.
[ Nit-picky note: technically 'size_t' doesn't have to match 'unsigned
long' arithmetically. We've certainly historically seen environments
with 16-bit address spaces and 32-bit 'unsigned long'.
Similarly, even in 64-bit modern environments, 'size_t' could be its
own type distinct from 'unsigned long', even if it were arithmetically
identical.
So the above type commentary is only really descriptive of the kernel
environment, not some kind of universal truth for the kinds of wild
and crazy situations that are allowed by the C standard ]
Reported-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Link: https://lore.kernel.org/all/YqRyL2sIqQNDfky2@debian/
Cc: Jeff Layton <jlayton@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By forcing the maximum CPU that QEMU has available, we expose additional
capabilities, such as the RNDR instruction, which increases test
coverage. This then allows the CI to skip the fake seeding step in some
cases. Also enable STRICT_KERNEL_RWX to catch issues related to early
jump labels when the RNG is initialized at boot.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
MAGIC_START("IKCFG_ST") and MAGIC_END("IKCFG_ED") are moved out
from the kernel_config_data variable.
Thus, we parse kernel_config_data directly instead of considering
offset of MAGIC_START and MAGIC_END.
Fixes: 13610aa908dc ("kernel/configs: use .incbin directive to embed config_data.gz")
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Call virtio_device_ready() to make this driver work after commit
b4ec69d7e09 ("virtio: harden vring IRQ"), since the driver uses the
virtqueues in the probe function. (The virtio core sets the device
ready when probe returns.)
Fixes: 8b4ec69d7e09 ("virtio: harden vring IRQ")
Fixes: 68f5d3f3b654 ("um: add PCI over virtio emulation driver")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Message-Id: <20220610151203.3492541-1-vincent.whitchurch@axis.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Johannes Berg <johannes@sipsolutions.net>
- There is now a backup maintainer for NFSD
Notable fixes:
- Prevent array overruns in svc_rdma_build_writes()
- Prevent buffer overruns when encoding NFSv3 READDIR results
- Fix a potential UAF in nfsd_file_put()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmKjhbsACgkQM2qzM29m
f5cgig/6A9gC2c9v4lR2fH6ufiCWvJBfuVaBbToubwktJHaDLvqH56JcvS3s/gKL
PKGmbQTI/6lgmVgJqQSxJUnfe6wzHx8G1MdjlZEIwi3pUeiV+LpiprZz9TOhgYYV
YDnXRGhb4wKOm75w8rb6X8k106XdKwdBaRQwb88FDawffWoEY0XNYrlmNmmWi8To
ELlOlIwRCBbKJoJ6yEEWQRrBuVBXapbsn29tipZXbdo58g+vL0yDQq9s97b0mHhi
C2apAN2+k18FiBJsA7b7pW1l/P6k9FNEeetvgWyN8OSMpPNmt0vz1HvKaIstPgg1
BX6rgWe5eQBFEk2KNvSGHrV3R+wAp7jeuVpHUMjxXvzmfj4exJV/H8lu+qZJNDGN
ybCJatomR4APFxk+s1kptlzNo7zfyPz15L80HmWIngYJ/lrBOoKPIIi3bwPQcBwW
q2Rc+SlvpqbJvEcomgF/lqQN6inmx44J+KpOSA/S8qSIdSkz0iaZsDahFxgZNe82
h+X/i1maRtnSIvWdGMR7O6kEFT5jky35WlTv/VutTOsUwA4mUU9vZUnufBBHJH07
nOdLMi/QS/O5GOnlyegrODtN75wi+IeKt+WMNmnN+JB8Tsg0kZwjOsc/dbQfyJqP
PrQJ5AUP0TMm90B1873z8yKmhWeXtB71vgAI/d53aadBG4ZEstU=
=6ugk
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Notable changes:
- There is now a backup maintainer for NFSD
Notable fixes:
- Prevent array overruns in svc_rdma_build_writes()
- Prevent buffer overruns when encoding NFSv3 READDIR results
- Fix a potential UAF in nfsd_file_put()"
* tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Remove pointer type casts from xdr_get_next_encode_buffer()
SUNRPC: Clean up xdr_get_next_encode_buffer()
SUNRPC: Clean up xdr_commit_encode()
SUNRPC: Optimize xdr_reserve_space()
SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer()
SUNRPC: Trap RDMA segment overflows
NFSD: Fix potential use-after-free in nfsd_file_put()
MAINTAINERS: reciprocal co-maintainership for file locking and nfsd
Currently, the secondary channels of a multichannel session
also get hostname populated based on the info in primary channel.
However, this will end up with a wrong resolution of hostname to
IP address during reconnect.
This change fixes this by not populating hostname info for all
secondary channels.
Fixes: 5112d80c162f ("cifs: populate server_hostname for extra channels")
Cc: stable@vger.kernel.org
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>