1205 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Huang Ying
|
c959924b0d |
memory tiering: adjust hot threshold automatically
The promotion hot threshold is workload and system configuration dependent. So in this patch, a method to adjust the hot threshold automatically is implemented. The basic idea is to control the number of the candidate promotion pages to match the promotion rate limit. If the hint page fault latency of a page is less than the hot threshold, we will try to promote the page, and the page is called the candidate promotion page. If the number of the candidate promotion pages in the statistics interval is much more than the promotion rate limit, the hot threshold will be decreased to reduce the number of the candidate promotion pages. Otherwise, the hot threshold will be increased to increase the number of the candidate promotion pages. To make the above method works, in each statistics interval, the total number of the pages to check (on which the hint page faults occur) and the hot/cold distribution need to be stable. Because the page tables are scanned linearly in NUMA balancing, but the hot/cold distribution isn't uniform along the address usually, the statistics interval should be larger than the NUMA balancing scan period. So in the patch, the max scan period is used as statistics interval and it works well in our tests. Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: osalvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
c6833e1000 |
memory tiering: rate limit NUMA migration throughput
In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patch as the page promotion rate limit mechanism. The number of the candidate pages to be promoted to the fast memory node via NUMA balancing is counted, if the count exceeds the limit specified by the users, the NUMA balancing promotion will be stopped until the next second. A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added for the users to specify the limit. Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: osalvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
33024536ba |
memory tiering: hot page selection with hint page fault latency
Patch series "memory tiering: hot page selection", v4. To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory nodes need to be identified. Essentially, the original NUMA balancing implementation selects the mostly recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages. Because the pages with quite low access frequency may be accessed eventually given the NUMA balancing page table scanning period could be quite long (e.g. 60 seconds). So in this patchset, we implement a new hot page identification algorithm based on the latency between NUMA balancing page table scanning and hint page fault. Which is a kind of mostly frequently accessed (MFU) algorithm. In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patchset as the page promotion rate limit mechanism. The promotion hot threshold is workload and system configuration dependent. So in this patchset, a method to adjust the hot threshold automatically is implemented. The basic idea is to control the number of the candidate promotion pages to match the promotion rate limit. We used the pmbench memory accessing benchmark tested the patchset on a 2-socket server system with DRAM and PMEM installed. The test results are as follows, pmbench score promote rate (accesses/s) MB/s ------------- ------------ base 146887704.1 725.6 hot selection 165695601.2 544.0 rate limit 162814569.8 165.2 auto adjustment 170495294.0 136.9 From the results above, With hot page selection patch [1/3], the pmbench score increases about 12.8%, and promote rate (overhead) decreases about 25.0%, compared with base kernel. With rate limit patch [2/3], pmbench score decreases about 1.7%, and promote rate decreases about 69.6%, compared with hot page selection patch. With threshold auto adjustment patch [3/3], pmbench score increases about 4.7%, and promote rate decrease about 17.1%, compared with rate limit patch. Baolin helped to test the patchset with MySQL on a machine which contains 1 DRAM node (30G) and 1 PMEM node (126G). sysbench /usr/share/sysbench/oltp_read_write.lua \ ...... --tables=200 \ --table-size=1000000 \ --report-interval=10 \ --threads=16 \ --time=120 The tps can be improved about 5%. This patch (of 3): To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory node need to be identified. Essentially, the original NUMA balancing implementation selects the mostly recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages. Because the pages with quite low access frequency may be accessed eventually given the NUMA balancing page table scanning period could be quite long (e.g. 60 seconds). The most frequently accessed (MFU) algorithm is better. So, in this patch we implemented a better hot page selection algorithm. Which is based on NUMA balancing page table scanning and hint page fault as follows, - When the page tables of the processes are scanned to change PTE/PMD to be PROT_NONE, the current time is recorded in struct page as scan time. - When the page is accessed, hint page fault will occur. The scan time is gotten from the struct page. And The hint page fault latency is defined as hint page fault time - scan time The shorter the hint page fault latency of a page is, the higher the probability of their access frequency to be higher. So the hint page fault latency is a better estimation of the page hot/cold. It's hard to find some extra space in struct page to hold the scan time. Fortunately, we can reuse some bits used by the original NUMA balancing. NUMA balancing uses some bits in struct page to store the page accessing CPU and PID (referring to page_cpupid_xchg_last()). Which is used by the multi-stage node selection algorithm to avoid to migrate pages shared accessed by the NUMA nodes back and forth. But for pages in the slow memory node, even if they are shared accessed by multiple NUMA nodes, as long as the pages are hot, they need to be promoted to the fast memory node. So the accessing CPU and PID information are unnecessary for the slow memory pages. We can reuse these bits in struct page to record the scan time. For the fast memory pages, these bits are used as before. For the hot threshold, the default value is 1 second, which works well in our performance test. All pages with hint page fault latency < hot threshold will be considered hot. It's hard for users to determine the hot threshold. So we don't provide a kernel ABI to set it, just provide a debugfs interface for advanced users to experiment. We will continue to work on a hot threshold automatic adjustment mechanism. The downside of the above method is that the response time to the workload hot spot changing may be much longer. For example, - A previous cold memory area becomes hot - The hint page fault will be triggered. But the hint page fault latency isn't shorter than the hot threshold. So the pages will not be promoted. - When the memory area is scanned again, maybe after a scan period, the hint page fault latency measured will be shorter than the hot threshold and the pages will be promoted. To mitigate this, if there are enough free space in the fast memory node, the hot threshold will not be used, all pages will be promoted upon the hint page fault for fast response. Thanks Zhong Jiang reported and tested the fix for a bug when disabling memory tiering mode dynamically. Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: osalvador <osalvador@suse.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vincent Guittot
|
c82a69629c |
sched/fair: fix case with reduced capacity CPU
The capacity of the CPU available for CFS tasks can be reduced because of other activities running on the latter. In such case, it's worth trying to move CFS tasks on a CPU with more available capacity. The rework of the load balance has filtered the case when the CPU is classified to be fully busy but its capacity is reduced. Check if CPU's capacity is reduced while gathering load balance statistic and classify it group_misfit_task instead of group_fully_busy so we can try to move the load on another CPU. Reported-by: David Chen <david.chen@nutanix.com> Reported-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: David Chen <david.chen@nutanix.com> Tested-by: Zhang Qiao <zhangqiao22@huawei.com> Link: https://lkml.kernel.org/r/20220708154401.21411-1-vincent.guittot@linaro.org |
||
Vincent Donnefort
|
b812fc9768 |
sched/fair: Remove the energy margin in feec()
find_energy_efficient_cpu() integrates a margin to protect tasks from bouncing back and forth from a CPU to another. This margin is set as being 6% of the total current energy estimated on the system. This however does not work for two reasons: 1. The energy estimation is not a good absolute value: compute_energy() used in feec() is a good estimation for task placement as it allows to compare the energy with and without a task. The computed delta will give a good overview of the cost for a certain task placement. It, however, doesn't work as an absolute estimation for the total energy of the system. First it adds the contribution to idle CPUs into the energy, second it mixes util_avg with util_est values. util_avg contains the near history for a CPU usage, it doesn't tell at all what the current utilization is. A system that has been quite busy in the near past will hold a very high energy and then a high margin preventing any task migration to a lower capacity CPU, wasting energy. It even creates a negative feedback loop: by holding the tasks on a less efficient CPU, the margin contributes in keeping the energy high. 2. The margin handicaps small tasks: On a system where the workload is composed mostly of small tasks (which is often the case on Android), the overall energy will be high enough to create a margin none of those tasks can cross. On a Pixel4, a small utilization of 5% on all the CPUs creates a global estimated energy of 140 joules, as per the Energy Model declaration of that same device. This means, after applying the 6% margin that any migration must save more than 8 joules to happen. No task with a utilization lower than 40 would then be able to migrate away from the biggest CPU of the system. The 6% of the overall system energy was brought by the following patch: (eb92692b2544 sched/fair: Speed-up energy-aware wake-ups) It was previously 6% of the prev_cpu energy. Also, the following one made this margin value conditional on the clusters where the task fits: (8d4c97c105ca sched/fair: Only compute base_energy_pd if necessary) We could simply revert that margin change to what it was, but the original version didn't have strong grounds neither and as demonstrated in (1.) the estimated energy isn't a good absolute value. Instead, removing it completely. It is indeed, made possible by recent changes that improved energy estimation comparison fairness (sched/fair: Remove task_util from effective utilization in feec()) (PM: EM: Increase energy calculation precision) and task utilization stabilization (sched/fair: Decay task util_avg during migration) Without a margin, we could have feared bouncing between CPUs. But running LISA's eas_behaviour test coverage on three different platforms (Hikey960, RB-5 and DB-845) showed no issue. Removing the energy margin enables more energy-optimized placements for a more energy efficient system. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-8-vdonnefort@google.com |
||
Vincent Donnefort
|
3e8c6c9aac |
sched/fair: Remove task_util from effective utilization in feec()
The energy estimation in find_energy_efficient_cpu() (feec()) relies on the computation of the effective utilization for each CPU of a perf domain (PD). This effective utilization is then used as an estimation of the busy time for this pd. The function effective_cpu_util() which gives this value, scales the utilization relative to IRQ pressure on the CPU to take into account that the IRQ time is hidden from the task clock. The IRQ scaling is as follow: effective_cpu_util = irq + (cpu_cap - irq)/cpu_cap * util Where util is the sum of CFS/RT/DL utilization, cpu_cap the capacity of the CPU and irq the IRQ avg time. If now we take as an example a task placement which doesn't raise the OPP on the candidate CPU, we can write the energy delta as: delta = OPPcost/cpu_cap * (effective_cpu_util(cpu_util + task_util) - effective_cpu_util(cpu_util)) = OPPcost/cpu_cap * (cpu_cap - irq)/cpu_cap * task_util We end-up with an energy delta depending on the IRQ avg time, which is a problem: first the time spent on IRQs by a CPU has no effect on the additional energy that would be consumed by a task. Second, we don't want to favour a CPU with a higher IRQ avg time value. Nonetheless, we need to take the IRQ avg time into account. If a task placement raises the PD's frequency, it will increase the energy cost for the entire time where the CPU is busy. A solution is to only use effective_cpu_util() with the CPU contribution part. The task contribution is added separately and scaled according to prev_cpu's IRQ time. No change for the FREQUENCY_UTIL component of the energy estimation. We still want to get the actual frequency that would be selected after the task placement. Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-7-vdonnefort@google.com |
||
Dietmar Eggemann
|
9b340131a4 |
sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
The Perf Domain (PD) cpumask (struct em_perf_domain.cpus) stays invariant after Energy Model creation, i.e. it is not updated after CPU hotplug operations. That's why the PD mask is used in conjunction with the cpu_online_mask (or Sched Domain cpumask). Thereby the cpu_online_mask is fetched multiple times (in compute_energy()) during a run-queue selection for a task. cpu_online_mask may change during this time which can lead to wrong energy calculations. To be able to avoid this, use the select_rq_mask per-cpu cpumask to create a cpumask out of PD cpumask and cpu_online_mask and pass it through the function calls of the EAS run-queue selection path. The PD cpumask for max_spare_cap_cpu/compute_prev_delta selection (find_energy_efficient_cpu()) is now ANDed not only with the SD mask but also with the cpu_online_mask. This is fine since this cpumask has to be in syc with the one used for energy computation (compute_energy()). An exclusive cpuset setup with at least one asymmetric CPU capacity island (hence the additional AND with the SD cpumask) is the obvious exception here. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-6-vdonnefort@google.com |
||
Dietmar Eggemann
|
ec4fc801a0 |
sched/fair: Rename select_idle_mask to select_rq_mask
On 21/06/2022 11:04, Vincent Donnefort wrote: > From: Dietmar Eggemann <dietmar.eggemann@arm.com> https://lkml.kernel.org/r/202206221253.ZVyGQvPX-lkp@intel.com discovered that this patch doesn't build anymore (on tip sched/core or linux-next) because of commit f5b2eeb499910 ("sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()"). New version of [PATCH v11 4/7] sched/fair: Rename select_idle_mask to select_rq_mask below. -- >8 -- Decouple the name of the per-cpu cpumask select_idle_mask from its usage in select_idle_[cpu/capacity]() of the CFS run-queue selection (select_task_rq_fair()). This is to support the reuse of this cpumask in the Energy Aware Scheduling (EAS) path (find_energy_efficient_cpu()) of the CFS run-queue selection. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/250691c7-0e2b-05ab-bedf-b245c11d9400@arm.com |
||
Dietmar Eggemann
|
bb44799949 |
sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
effective_cpu_util() already has a `int cpu' parameter which allows to retrieve the CPU capacity scale factor (or maximum CPU capacity) inside this function via an arch_scale_cpu_capacity(cpu). A lot of code calling effective_cpu_util() (or the shim sched_cpu_util()) needs the maximum CPU capacity, i.e. it will call arch_scale_cpu_capacity() already. But not having to pass it into effective_cpu_util() will make the EAS wake-up code easier, especially when the maximum CPU capacity reduced by the thermal pressure is passed through the EAS wake-up functions. Due to the asymmetric CPU capacity support of arm/arm64 architectures, arch_scale_cpu_capacity(int cpu) is a per-CPU variable read access via per_cpu(cpu_scale, cpu) on such a system. On all other architectures it is a a compile-time constant (SCHED_CAPACITY_SCALE). Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-4-vdonnefort@google.com |
||
Vincent Donnefort
|
e2f3e35f1f |
sched/fair: Decay task PELT values during wakeup migration
Before being migrated to a new CPU, a task sees its PELT values synchronized with rq last_update_time. Once done, that same task will also have its sched_avg last_update_time reset. This means the time between the migration and the last clock update will not be accounted for in util_avg and a discontinuity will appear. This issue is amplified by the PELT clock scaling. It takes currently one tick after the CPU being idle to let clock_pelt catching up clock_task. This is especially problematic for asymmetric CPU capacity systems which need stable util_avg signals for task placement and energy estimation. Ideally, this problem would be solved by updating the runqueue clocks before the migration. But that would require taking the runqueue lock which is quite expensive [1]. Instead estimate the missing time and update the task util_avg with that value. To that end, we need sched_clock_cpu() but it is a costly function. Limit the usage to the case where the source CPU is idle as we know this is when the clock is having the biggest risk of being outdated. See comment in migrate_se_pelt_lag() for more details about how the PELT value is estimated. Notice though this estimation doesn't take into account IRQ and Paravirt time. [1] https://lkml.kernel.org/r/20190709115759.10451-1-chris.redpath@arm.com Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-3-vdonnefort@google.com |
||
Vincent Donnefort
|
d05b43059d |
sched/fair: Provide u64 read for 32-bits arch helper
Introducing macro helpers u64_u32_{store,load}() to factorize lockless accesses to u64 variables for 32-bits architectures. Users are for now cfs_rq.min_vruntime and sched_avg.last_update_time. To accommodate the later where the copy lies outside of the structure (cfs_rq.last_udpate_time_copy instead of sched_avg.last_update_time_copy), use the _copy() version of those helpers. Those new helpers encapsulate smp_rmb() and smp_wmb() synchronization and therefore, have a small penalty for 32-bits machines in set_task_rq_fair() and init_cfs_rq(). Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lkml.kernel.org/r/20220621090414.433602-2-vdonnefort@google.com |
||
Chen Yu
|
70fb5ccf2e |
sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
[Problem Statement] select_idle_cpu() might spend too much time searching for an idle CPU, when the system is overloaded. The following histogram is the time spent in select_idle_cpu(), when running 224 instances of netperf on a system with 112 CPUs per LLC domain: @usecs: [0] 533 | | [1] 5495 | | [2, 4) 12008 | | [4, 8) 239252 | | [8, 16) 4041924 |@@@@@@@@@@@@@@ | [16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [256, 512) 4507667 |@@@@@@@@@@@@@@@ | [512, 1K) 2600472 |@@@@@@@@@ | [1K, 2K) 927912 |@@@ | [2K, 4K) 218720 | | [4K, 8K) 98161 | | [8K, 16K) 37722 | | [16K, 32K) 6715 | | [32K, 64K) 477 | | [64K, 128K) 7 | | netperf latency usecs: ======= case load Lat_99th std% TCP_RR thread-224 257.39 ( 0.21) The time spent in select_idle_cpu() is visible to netperf and might have a negative impact. [Symptom analysis] The patch [1] from Mel Gorman has been applied to track the efficiency of select_idle_sibling. Copy the indicators here: SIS Search Efficiency(se_eff%): A ratio expressed as a percentage of runqueues scanned versus idle CPUs found. A 100% efficiency indicates that the target, prev or recent CPU of a task was idle at wakeup. The lower the efficiency, the more runqueues were scanned before an idle CPU was found. SIS Domain Search Efficiency(dom_eff%): Similar, except only for the slower SIS patch. SIS Fast Success Rate(fast_rate%): Percentage of SIS that used target, prev or recent CPUs. SIS Success rate(success_rate%): Percentage of scans that found an idle CPU. The test is based on Aubrey's schedtests tool, including netperf, hackbench, schbench and tbench. Test on vanilla kernel: schedstat_parse.py -f netperf_vanilla.log case load se_eff% dom_eff% fast_rate% success_rate% TCP_RR 28 threads 99.978 18.535 99.995 100.000 TCP_RR 56 threads 99.397 5.671 99.964 100.000 TCP_RR 84 threads 21.721 6.818 73.632 100.000 TCP_RR 112 threads 12.500 5.533 59.000 100.000 TCP_RR 140 threads 8.524 4.535 49.020 100.000 TCP_RR 168 threads 6.438 3.945 40.309 99.999 TCP_RR 196 threads 5.397 3.718 32.320 99.982 TCP_RR 224 threads 4.874 3.661 25.775 99.767 UDP_RR 28 threads 99.988 17.704 99.997 100.000 UDP_RR 56 threads 99.528 5.977 99.970 100.000 UDP_RR 84 threads 24.219 6.992 76.479 100.000 UDP_RR 112 threads 13.907 5.706 62.538 100.000 UDP_RR 140 threads 9.408 4.699 52.519 100.000 UDP_RR 168 threads 7.095 4.077 44.352 100.000 UDP_RR 196 threads 5.757 3.775 35.764 99.991 UDP_RR 224 threads 5.124 3.704 28.748 99.860 schedstat_parse.py -f schbench_vanilla.log (each group has 28 tasks) case load se_eff% dom_eff% fast_rate% success_rate% normal 1 mthread 99.152 6.400 99.941 100.000 normal 2 mthreads 97.844 4.003 99.908 100.000 normal 3 mthreads 96.395 2.118 99.917 99.998 normal 4 mthreads 55.288 1.451 98.615 99.804 normal 5 mthreads 7.004 1.870 45.597 61.036 normal 6 mthreads 3.354 1.346 20.777 34.230 normal 7 mthreads 2.183 1.028 11.257 21.055 normal 8 mthreads 1.653 0.825 7.849 15.549 schedstat_parse.py -f hackbench_vanilla.log (each group has 28 tasks) case load se_eff% dom_eff% fast_rate% success_rate% process-pipe 1 group 99.991 7.692 99.999 100.000 process-pipe 2 groups 99.934 4.615 99.997 100.000 process-pipe 3 groups 99.597 3.198 99.987 100.000 process-pipe 4 groups 98.378 2.464 99.958 100.000 process-pipe 5 groups 27.474 3.653 89.811 99.800 process-pipe 6 groups 20.201 4.098 82.763 99.570 process-pipe 7 groups 16.423 4.156 77.398 99.316 process-pipe 8 groups 13.165 3.920 72.232 98.828 process-sockets 1 group 99.977 5.882 99.999 100.000 process-sockets 2 groups 99.927 5.505 99.996 100.000 process-sockets 3 groups 99.397 3.250 99.980 100.000 process-sockets 4 groups 79.680 4.258 98.864 99.998 process-sockets 5 groups 7.673 2.503 63.659 92.115 process-sockets 6 groups 4.642 1.584 58.946 88.048 process-sockets 7 groups 3.493 1.379 49.816 81.164 process-sockets 8 groups 3.015 1.407 40.845 75.500 threads-pipe 1 group 99.997 0.000 100.000 100.000 threads-pipe 2 groups 99.894 2.932 99.997 100.000 threads-pipe 3 groups 99.611 4.117 99.983 100.000 threads-pipe 4 groups 97.703 2.624 99.937 100.000 threads-pipe 5 groups 22.919 3.623 87.150 99.764 threads-pipe 6 groups 18.016 4.038 80.491 99.557 threads-pipe 7 groups 14.663 3.991 75.239 99.247 threads-pipe 8 groups 12.242 3.808 70.651 98.644 threads-sockets 1 group 99.990 6.667 99.999 100.000 threads-sockets 2 groups 99.940 5.114 99.997 100.000 threads-sockets 3 groups 99.469 4.115 99.977 100.000 threads-sockets 4 groups 87.528 4.038 99.400 100.000 threads-sockets 5 groups 6.942 2.398 59.244 88.337 threads-sockets 6 groups 4.359 1.954 49.448 87.860 threads-sockets 7 groups 2.845 1.345 41.198 77.102 threads-sockets 8 groups 2.871 1.404 38.512 74.312 schedstat_parse.py -f tbench_vanilla.log case load se_eff% dom_eff% fast_rate% success_rate% loopback 28 threads 99.976 18.369 99.995 100.000 loopback 56 threads 99.222 7.799 99.934 100.000 loopback 84 threads 19.723 6.819 70.215 100.000 loopback 112 threads 11.283 5.371 55.371 99.999 loopback 140 threads 0.000 0.000 0.000 0.000 loopback 168 threads 0.000 0.000 0.000 0.000 loopback 196 threads 0.000 0.000 0.000 0.000 loopback 224 threads 0.000 0.000 0.000 0.000 According to the test above, if the system becomes busy, the SIS Search Efficiency(se_eff%) drops significantly. Although some benchmarks would finally find an idle CPU(success_rate% = 100%), it is doubtful whether it is worth it to search the whole LLC domain. [Proposal] It would be ideal to have a crystal ball to answer this question: How many CPUs must a wakeup path walk down, before it can find an idle CPU? Many potential metrics could be used to predict the number. One candidate is the sum of util_avg in this LLC domain. The benefit of choosing util_avg is that it is a metric of accumulated historic activity, which seems to be smoother than instantaneous metrics (such as rq->nr_running). Besides, choosing the sum of util_avg would help predict the load of the LLC domain more precisely, because SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle time. In summary, the lower the util_avg is, the more select_idle_cpu() should scan for idle CPU, and vice versa. When the sum of util_avg in this LLC domain hits 85% or above, the scan stops. The reason to choose 85% as the threshold is that this is the imbalance_pct(117) when a LLC sched group is overloaded. Introduce the quadratic function: y = SCHED_CAPACITY_SCALE - p * x^2 and y'= y / SCHED_CAPACITY_SCALE x is the ratio of sum_util compared to the CPU capacity: x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE) y' is the ratio of CPUs to be scanned in the LLC domain, and the number of CPUs to scan is calculated by: nr_scan = llc_weight * y' Choosing quadratic function is because: [1] Compared to the linear function, it scans more aggressively when the sum_util is low. [2] Compared to the exponential function, it is easier to calculate. [3] It seems that there is no accurate mapping between the sum of util_avg and the number of CPUs to be scanned. Use heuristic scan for now. For a platform with 112 CPUs per LLC, the number of CPUs to scan is: sum_util% 0 5 15 25 35 45 55 65 75 85 86 ... scan_nr 112 111 108 102 93 81 65 47 25 1 0 ... For a platform with 16 CPUs per LLC, the number of CPUs to scan is: sum_util% 0 5 15 25 35 45 55 65 75 85 86 ... scan_nr 16 15 15 14 13 11 9 6 3 0 0 ... Furthermore, to minimize the overhead of calculating the metrics in select_idle_cpu(), borrow the statistics from periodic load balance. As mentioned by Abel, on a platform with 112 CPUs per LLC, the sum_util calculated by periodic load balance after 112 ms would decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in reflecting the latest utilization. But it is a trade-off. Checking the util_avg in newidle load balance would be more frequent, but it brings overhead - multiple CPUs write/read the per-LLC shared variable and introduces cache contention. Tim also mentioned that, it is allowed to be non-optimal in terms of scheduling for the short-term variations, but if there is a long-term trend in the load behavior, the scheduler can adjust for that. When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and Mel suggested, SIS_UTIL should be enabled by default. This patch is based on the util_avg, which is very sensitive to the CPU frequency invariance. There is an issue that, when the max frequency has been clamp, the util_avg would decay insanely fast when the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo in frequency invariance") could be used to mitigate this symptom, by adjusting the arch_max_freq_ratio when turbo is disabled. But this issue is still not thoroughly fixed, because the current code is unaware of the user-specified max CPU frequency. [Test result] netperf and tbench were launched with 25% 50% 75% 100% 125% 150% 175% 200% of CPU number respectively. Hackbench and schbench were launched by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times. The following is the benchmark result comparison between baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare% indicates better performance. Each netperf test is a: netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100 netperf.throughput ======= case load baseline(std%) compare%( std%) TCP_RR 28 threads 1.00 ( 0.34) -0.16 ( 0.40) TCP_RR 56 threads 1.00 ( 0.19) -0.02 ( 0.20) TCP_RR 84 threads 1.00 ( 0.39) -0.47 ( 0.40) TCP_RR 112 threads 1.00 ( 0.21) -0.66 ( 0.22) TCP_RR 140 threads 1.00 ( 0.19) -0.69 ( 0.19) TCP_RR 168 threads 1.00 ( 0.18) -0.48 ( 0.18) TCP_RR 196 threads 1.00 ( 0.16) +194.70 ( 16.43) TCP_RR 224 threads 1.00 ( 0.16) +197.30 ( 7.85) UDP_RR 28 threads 1.00 ( 0.37) +0.35 ( 0.33) UDP_RR 56 threads 1.00 ( 11.18) -0.32 ( 0.21) UDP_RR 84 threads 1.00 ( 1.46) -0.98 ( 0.32) UDP_RR 112 threads 1.00 ( 28.85) -2.48 ( 19.61) UDP_RR 140 threads 1.00 ( 0.70) -0.71 ( 14.04) UDP_RR 168 threads 1.00 ( 14.33) -0.26 ( 11.16) UDP_RR 196 threads 1.00 ( 12.92) +186.92 ( 20.93) UDP_RR 224 threads 1.00 ( 11.74) +196.79 ( 18.62) Take the 224 threads as an example, the SIS search metrics changes are illustrated below: vanilla patched 4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg 38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg 128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg 5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg There is -87.9% less CPU scans after patched, which indicates lower overhead. Besides, with this patch applied, there is -13% less rq lock contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested .try_to_wake_up.default_wake_function.woken_wake_function. This might help explain the performance improvement - Because this patch allows the waking task to remain on the previous CPU, rather than grabbing other CPUs' lock. Each hackbench test is a: hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100 hackbench.throughput ========= case load baseline(std%) compare%( std%) process-pipe 1 group 1.00 ( 1.29) +0.57 ( 0.47) process-pipe 2 groups 1.00 ( 0.27) +0.77 ( 0.81) process-pipe 4 groups 1.00 ( 0.26) +1.17 ( 0.02) process-pipe 8 groups 1.00 ( 0.15) -4.79 ( 0.02) process-sockets 1 group 1.00 ( 0.63) -0.92 ( 0.13) process-sockets 2 groups 1.00 ( 0.03) -0.83 ( 0.14) process-sockets 4 groups 1.00 ( 0.40) +5.20 ( 0.26) process-sockets 8 groups 1.00 ( 0.04) +3.52 ( 0.03) threads-pipe 1 group 1.00 ( 1.28) +0.07 ( 0.14) threads-pipe 2 groups 1.00 ( 0.22) -0.49 ( 0.74) threads-pipe 4 groups 1.00 ( 0.05) +1.88 ( 0.13) threads-pipe 8 groups 1.00 ( 0.09) -4.90 ( 0.06) threads-sockets 1 group 1.00 ( 0.25) -0.70 ( 0.53) threads-sockets 2 groups 1.00 ( 0.10) -0.63 ( 0.26) threads-sockets 4 groups 1.00 ( 0.19) +11.92 ( 0.24) threads-sockets 8 groups 1.00 ( 0.08) +4.31 ( 0.11) Each tbench test is a: tbench -t 100 $job 127.0.0.1 tbench.throughput ====== case load baseline(std%) compare%( std%) loopback 28 threads 1.00 ( 0.06) -0.14 ( 0.09) loopback 56 threads 1.00 ( 0.03) -0.04 ( 0.17) loopback 84 threads 1.00 ( 0.05) +0.36 ( 0.13) loopback 112 threads 1.00 ( 0.03) +0.51 ( 0.03) loopback 140 threads 1.00 ( 0.02) -1.67 ( 0.19) loopback 168 threads 1.00 ( 0.38) +1.27 ( 0.27) loopback 196 threads 1.00 ( 0.11) +1.34 ( 0.17) loopback 224 threads 1.00 ( 0.11) +1.67 ( 0.22) Each schbench test is a: schbench -m $job -t 28 -r 100 -s 30000 -c 30000 schbench.latency_90%_us ======== case load baseline(std%) compare%( std%) normal 1 mthread 1.00 ( 31.22) -7.36 ( 20.25)* normal 2 mthreads 1.00 ( 2.45) -0.48 ( 1.79) normal 4 mthreads 1.00 ( 1.69) +0.45 ( 0.64) normal 8 mthreads 1.00 ( 5.47) +9.81 ( 14.28) *Consider the Standard Deviation, this -7.36% regression might not be valid. Also, a OLTP workload with a commercial RDBMS has been tested, and there is no significant change. There were concerns that unbalanced tasks among CPUs would cause problems. For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to CPU0~CPU6, while CPU7 is idle: CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 util_avg 1024 1024 1024 1024 1024 1024 1024 0 Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%, select_idle_cpu() will not scan, thus CPU7 is undetected during scan. But according to Mel, it is unlikely the CPU7 will be idle all the time because CPU7 could pull some tasks via CPU_NEWLY_IDLE. lkp(kernel test robot) has reported a regression on stress-ng.sock on a very busy system. According to the sched_debug statistics, it might be caused by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this might introduce more context switch, especially involuntary preemption, which impacts a busy stress-ng. This regression has shown that, not all benchmarks in every scenario benefit from idle CPU scan limit, and it needs further investigation. Besides, there is slight regression in hackbench's 16 groups case when the LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively in an LLC domain with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is negligible. The current patch aims to propose a generic solution and only considers the util_avg. Something like the below could be applied on top of the current patch to fulfill the requirement: if (llc_weight <= 16) nr_scan = nr_scan * 32 / llc_weight; For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large. The smaller the CPU number this LLC domain has, the larger nr_scan will be expanded. This needs further investigation. There is also ongoing work[2] from Abel to filter out the busy CPUs during wakeup, to further speed up the idle CPU scan. And it could be a following-up optimization on top of this change. Suggested-by: Tim Chen <tim.c.chen@intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com |
||
Zhang Qiao
|
fb95a5a04d |
sched/fair: Remove redundant word " *"
" *" is redundant. so remove it. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220617181151.29980-2-zhangqiao22@huawei.com |
||
Josh Don
|
792b9f65a5 |
sched: Allow newidle balancing to bail out of load_balance
While doing newidle load balancing, it is possible for new tasks to arrive, such as with pending wakeups. newidle_balance() already accounts for this by exiting the sched_domain load_balance() iteration if it detects these cases. This is very important for minimizing wakeup latency. However, if we are already in load_balance(), we may stay there for a while before returning back to newidle_balance(). This is most exacerbated if we enter a 'goto redo' loop in the LBF_ALL_PINNED case. A very straightforward workaround to this is to adjust should_we_balance() to bail out if we're doing a CPU_NEWLY_IDLE balance and new tasks are detected. This was tested with the following reproduction: - two threads that take turns sleeping and waking each other up are affined to two cores - a large number of threads with 100% utilization are pinned to all other cores Without this patch, wakeup latency was ~120us for the pair of threads, almost entirely spent in load_balance(). With this patch, wakeup latency is ~6us. Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220609025515.2086253-1-joshdon@google.com |
||
Chengming Zhou
|
51bf903b64 |
sched/fair: Optimize and simplify rq leaf_cfs_rq_list
We notice the rq leaf_cfs_rq_list has two problems when do bugfix backports and some test profiling. 1. cfs_rqs under throttled subtree could be added to the list, and make their fully decayed ancestors on the list, even though not needed. 2. #1 also make the leaf_cfs_rq_list management complex and error prone, this is the list of related bugfix so far: commit 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()") commit fe61468b2cbc ("sched/fair: Fix enqueue_task_fair warning") commit b34cb07dde7c ("sched/fair: Fix enqueue_task_fair() warning some more") commit 39f23ce07b93 ("sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list") commit 0258bdfaff5b ("sched/fair: Fix unfairness caused by missing load decay") commit a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") commit fdaba61ef8a2 ("sched/fair: Ensure that the CFS parent is added after unthrottling") commit 2630cde26711 ("sched/fair: Add ancestors of unthrottled undecayed cfs_rq") commit 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()") delete every cfs_rq under throttled subtree from rq->leaf_cfs_rq_list, and delete the throttled_hierarchy() test in update_blocked_averages(), which optimized update_blocked_averages(). But those later bugfix add cfs_rqs under throttled subtree back to rq->leaf_cfs_rq_list again, with their fully decayed ancestors, for the integrity of rq->leaf_cfs_rq_list. This patch takes another method, skip all cfs_rqs under throttled hierarchy when list_add_leaf_cfs_rq(), to completely make cfs_rqs under throttled subtree off the leaf_cfs_rq_list. So we don't need to consider throttled related things in enqueue_entity(), unthrottle_cfs_rq() and enqueue_task_fair(), which simplify the code a lot. Also optimize update_blocked_averages() since cfs_rqs under throttled hierarchy and their ancestors won't be on the leaf_cfs_rq_list. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220601021848.76943-1-zhouchengming@bytedance.com |
||
K Prateek Nayak
|
f5b2eeb499 |
sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()
In the case of systems containing multiple LLCs per socket, like AMD Zen systems, users want to spread bandwidth hungry applications across multiple LLCs. Stream is one such representative workload where the best performance is obtained by limiting one stream thread per LLC. To ensure this, users are known to pin the tasks to a specify a subset of the CPUs consisting of one CPU per LLC while running such bandwidth hungry tasks. Suppose we kickstart a multi-threaded task like stream with 8 threads using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3 server where each socket contains 128 CPUs (0-63,128-191 in one socket, 64-127,192-255 in another socket) Eg: numactl -C 0,16,32,48,64,80,96,112 ./stream8 Here each CPU in the list is from a different LLC and 4 of those LLCs are on one socket, while the other 4 are on another socket. Ideally we would prefer that each stream thread runs on a different CPU from the allowed list of CPUs. However, the current heuristics in find_idlest_group() do not allow this during the initial placement. Suppose the first socket (0-63,128-191) is our local group from which we are kickstarting the stream tasks. The first four stream threads will be placed in this socket. When it comes to placing the 5th thread, all the allowed CPUs are from the local group (0,16,32,48) would have been taken. However, the current scheduler code simply checks if the number of tasks in the local group is fewer than the allowed numa-imbalance threshold. This threshold was previously 25% of the NUMA domain span (in this case threshold = 32) but after the v6 of Mel's patchset "Adjust NUMA imbalance for multiple LLCs", got merged in sched-tip, Commit: e496132ebedd ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs") it is now equal to number of LLCs in the NUMA domain, for processors with multiple LLCs. (in this case threshold = 8). For this example, the number of tasks will always be within threshold and thus all the 8 stream threads will be woken up on the first socket thereby resulting in sub-optimal performance. The following sched_wakeup_new tracepoint output shows the initial placement of tasks in the current tip/sched/core on the Zen3 machine: stream-5313 [016] d..2. 627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032 stream-5313 [016] d..2. 627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048 stream-5313 [016] d..2. 627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000 stream-5313 [016] d..2. 627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016 stream-5313 [016] d..2. 627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016 stream-5313 [016] d..2. 627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016 stream-5313 [016] d..2. 627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016 Once the first four threads are distributed among the allowed CPUs of socket one, the rest of the treads start piling on these same CPUs when clearly there are CPUs on the second socket that can be used. Following the initial pile up on a small number of CPUs, though the load-balancer eventually kicks in, it takes a while to get to {4}{4} and even {4}{4} isn't stable as we observe a bunch of ping ponging between {4}{4} to {5}{3} and back before a stable state is reached much later (1 Stream thread per allowed CPU) and no more migration is required. We can detect this piling and avoid it by checking if the number of allowed CPUs in the local group are fewer than the number of tasks running in the local group and use this information to spread the 5th task out into the next socket (after all, the goal in this slowpath is to find the idlest group and the idlest CPU during the initial placement!). The following sched_wakeup_new tracepoint output shows the initial placement of tasks after adding this fix on the Zen3 machine: stream-4485 [016] d..2. 230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032 stream-4485 [016] d..2. 230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048 stream-4485 [016] d..2. 230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000 stream-4485 [016] d..2. 230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112 stream-4485 [016] d..2. 230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096 stream-4485 [016] d..2. 230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080 stream-4485 [016] d..2. 230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064 We see that threads are using all of the allowed CPUs and there is no pileup. No output is generated for tracepoint sched_migrate_task with this patch due to a perfect initial placement which removes the need for balancing later on - both across NUMA boundaries and within NUMA boundaries for stream. Following are the results from running 8 Stream threads with and without pinning on a dual socket Zen3 Machine (2 x 64C/128T): During the testing of this patch, the tip sched/core was at commit: 089c02ae2771 "ftrace: Use preemption model accessors for trace header printout" Pinning is done using: numactl -C 0,16,32,48,64,80,96,112 ./stream8 5.18.0-rc1 5.18.0-rc1 5.18.0-rc1 tip sched/core tip sched/core tip sched/core (no pinning) + pinning + this-patch + pinning Copy: 109364.74 (0.00 pct) 94220.50 (-13.84 pct) 158301.28 (44.74 pct) Scale: 109670.26 (0.00 pct) 90210.59 (-17.74 pct) 149525.64 (36.34 pct) Add: 129029.01 (0.00 pct) 101906.00 (-21.02 pct) 186658.17 (44.66 pct) Triad: 127260.05 (0.00 pct) 106051.36 (-16.66 pct) 184327.30 (44.84 pct) Pinning currently hurts the performance compared to unbound case on tip/sched/core. With the addition of this patch, we are able to outperform tip/sched/core by a good margin with pinning. Following are the results from running 16 Stream threads with and without pinning on a dual socket IceLake Machine (2 x 32C/64T): NUMA Topology of Intel Skylake machine: Node 1: 0,2,4,6 ... 126 (Even numbers) Node 2: 1,3,5,7 ... 127 (Odd numbers) Pinning is done using: numactl -C 0-15 ./stream16 5.18.0-rc1 5.18.0-rc1 5.18.0-rc1 tip sched/core tip sched/core tip sched/core (no pinning) +pinning + this-patch + pinning Copy: 85815.31 (0.00 pct) 149819.21 (74.58 pct) 156807.48 (82.72 pct) Scale: 64795.60 (0.00 pct) 97595.07 (50.61 pct) 99871.96 (54.13 pct) Add: 71340.68 (0.00 pct) 111549.10 (56.36 pct) 114598.33 (60.63 pct) Triad: 68890.97 (0.00 pct) 111635.16 (62.04 pct) 114589.24 (66.33 pct) In case of Icelake machine, with single LLC per socket, pinning across the two sockets reduces cache contention, thus showing great improvement in pinned case which is further benefited by this patch. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Link: https://lkml.kernel.org/r/20220407111222.22649-1-kprateek.nayak@amd.com |
||
Mel Gorman
|
cb29a5c19d |
sched/numa: Apply imbalance limitations consistently
The imbalance limitations are applied inconsistently at fork time and at runtime. At fork, a new task can remain local until there are too many running tasks even if the degree of imbalance is larger than NUMA_IMBALANCE_MIN which is different to runtime. Secondly, the imbalance figure used during load balancing is different to the one used at NUMA placement. Load balancing uses the number of tasks that must move to restore imbalance where as NUMA balancing uses the total imbalance. In combination, it is possible for a parallel workload that uses a small number of CPUs without applying scheduler policies to have very variable run-to-run performance. [lkp@intel.com: Fix build breakage for arc-allyesconfig] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220520103519.1863-4-mgorman@techsingularity.net |
||
Mel Gorman
|
13ede33150 |
sched/numa: Do not swap tasks between nodes when spare capacity is available
If a destination node has spare capacity but there is an imbalance then two tasks are selected for swapping. If the tasks have no numa group or are within the same NUMA group, it's simply shuffling tasks around without having any impact on the compute imbalance. Instead, it's just punishing one task to help another. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220520103519.1863-3-mgorman@techsingularity.net |
||
Mel Gorman
|
70ce3ea9aa |
sched/numa: Initialise numa_migrate_retry
On clone, numa_migrate_retry is inherited from the parent which means that the first NUMA placement of a task is non-deterministic. This affects when load balancing recognises numa tasks and whether to migrate "regular", "remote" or "all" tasks between NUMA scheduler domains. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220520103519.1863-2-mgorman@techsingularity.net |
||
Linus Torvalds
|
1ec6574a3c |
This set of changes updates init and user mode helper tasks to be
ordinary user mode tasks. In commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") caused init and the user mode helper threads that call kernel_execve to have struct kthread allocated for them. This struct kthread going away during execve in turned made a use after free of struct kthread possible. The commit 343f4c49f243 ("kthread: Don't allocate kthread_struct for init and umh") is enough to fix the use after free and is simple enough to be backportable. The rest of the changes pass struct kernel_clone_args to clean things up and cause the code to make sense. In making init and the user mode helpers tasks purely user mode tasks I ran into two complications. The function task_tick_numa was detecting tasks without an mm by testing for the presence of PF_KTHREAD. The initramfs code in populate_initrd_image was using flush_delayed_fput to ensuere the closing of all it's file descriptors was complete, and flush_delayed_fput does not work in a userspace thread. I have looked and looked and more complications and in my code review I have not found any, and neither has anyone else with the code sitting in linux-next. Link: https://lkml.kernel.org/r/87mtfu4up3.fsf@email.froward.int.ebiederm.org Eric W. Biederman (8): kthread: Don't allocate kthread_struct for init and umh fork: Pass struct kernel_clone_args into copy_thread fork: Explicity test for idle tasks in copy_thread fork: Generalize PF_IO_WORKER handling init: Deal with the init process being a user mode process fork: Explicitly set PF_KTHREAD fork: Stop allowing kthreads to call execve sched: Update task_tick_numa to ignore tasks without an mm arch/alpha/kernel/process.c | 13 ++++++------ arch/arc/kernel/process.c | 13 ++++++------ arch/arm/kernel/process.c | 12 ++++++----- arch/arm64/kernel/process.c | 12 ++++++----- arch/csky/kernel/process.c | 15 ++++++------- arch/h8300/kernel/process.c | 10 ++++----- arch/hexagon/kernel/process.c | 12 ++++++----- arch/ia64/kernel/process.c | 15 +++++++------ arch/m68k/kernel/process.c | 12 ++++++----- arch/microblaze/kernel/process.c | 12 ++++++----- arch/mips/kernel/process.c | 13 ++++++------ arch/nios2/kernel/process.c | 12 ++++++----- arch/openrisc/kernel/process.c | 12 ++++++----- arch/parisc/kernel/process.c | 18 +++++++++------- arch/powerpc/kernel/process.c | 15 +++++++------ arch/riscv/kernel/process.c | 12 ++++++----- arch/s390/kernel/process.c | 12 ++++++----- arch/sh/kernel/process_32.c | 12 ++++++----- arch/sparc/kernel/process_32.c | 12 ++++++----- arch/sparc/kernel/process_64.c | 12 ++++++----- arch/um/kernel/process.c | 15 +++++++------ arch/x86/include/asm/fpu/sched.h | 2 +- arch/x86/include/asm/switch_to.h | 8 +++---- arch/x86/kernel/fpu/core.c | 4 ++-- arch/x86/kernel/process.c | 18 +++++++++------- arch/xtensa/kernel/process.c | 17 ++++++++------- fs/exec.c | 8 ++++--- include/linux/sched/task.h | 8 +++++-- init/initramfs.c | 2 ++ init/main.c | 2 +- kernel/fork.c | 46 +++++++++++++++++++++++++++++++++------- kernel/sched/fair.c | 2 +- kernel/umh.c | 6 +++--- 33 files changed, 234 insertions(+), 160 deletions(-) Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmKaR/MACgkQC/v6Eiaj j0Aayg/7Bx66872d9c6igkJ+MPCTuh+v9QKCGwiYEmiU4Q5sVAFB0HPJO27qC14u 630X0RFNZTkPzNNEJNIW4kw6Dj8s8YRKf+FgQAVt4SzdRwT7eIPDjk1nGraopPJ3 O04pjvuTmUyidyViRyFcf2ptx/pnkrwP8jUSc+bGTgfASAKAgAokqKE5ecjewbBc Y/EAkQ6QW7KxPjeSmpAHwI+t3BpBev9WEC4PbhRhsBCQFO2+PJiklvqdhVNBnIjv qUezll/1xv9UYgniB15Q4Nb722SmnWSU3r8as1eFPugzTHizKhufrrpyP+KMK1A0 tdtEJNs5t2DZF7ZbGTFSPqJWmyTYLrghZdO+lOmnaSjHxK4Nda1d4NzbefJ0u+FE tutewowvHtBX6AFIbx+H3O+DOJM2IgNMf+ReQDU/TyNyVf3wBrTbsr9cLxypIJIp zze8npoLMlB7B4yxVo5ES5e63EXfi3iHl0L3/1EhoGwriRz1kWgVLUX/VZOUpscL RkJHsW6bT8sqxPWAA5kyWjEN+wNR2PxbXi8OE4arT0uJrEBMUgDCzydzOv5tJB00 mSQdytxH9LVdsmxBKAOBp5X6WOLGA4yb1cZ6E/mEhlqXMpBDF1DaMfwbWqxSYi4q sp5zU3SBAW0qceiZSsWZXInfbjrcQXNV/DkDRDO9OmzEZP4m1j0= =x6fy -----END PGP SIGNATURE----- Merge tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull kthread updates from Eric Biederman: "This updates init and user mode helper tasks to be ordinary user mode tasks. Commit 40966e316f86 ("kthread: Ensure struct kthread is present for all kthreads") caused init and the user mode helper threads that call kernel_execve to have struct kthread allocated for them. This struct kthread going away during execve in turned made a use after free of struct kthread possible. Here, commit 343f4c49f243 ("kthread: Don't allocate kthread_struct for init and umh") is enough to fix the use after free and is simple enough to be backportable. The rest of the changes pass struct kernel_clone_args to clean things up and cause the code to make sense. In making init and the user mode helpers tasks purely user mode tasks I ran into two complications. The function task_tick_numa was detecting tasks without an mm by testing for the presence of PF_KTHREAD. The initramfs code in populate_initrd_image was using flush_delayed_fput to ensuere the closing of all it's file descriptors was complete, and flush_delayed_fput does not work in a userspace thread. I have looked and looked and more complications and in my code review I have not found any, and neither has anyone else with the code sitting in linux-next" * tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: sched: Update task_tick_numa to ignore tasks without an mm fork: Stop allowing kthreads to call execve fork: Explicitly set PF_KTHREAD init: Deal with the init process being a user mode process fork: Generalize PF_IO_WORKER handling fork: Explicity test for idle tasks in copy_thread fork: Pass struct kernel_clone_args into copy_thread kthread: Don't allocate kthread_struct for init and umh |
||
Linus Torvalds
|
44d35720c9 |
sysctl changes for v5.19-rc1
For two kernel releases now kernel/sysctl.c has been being cleaned up slowly, since the tables were grossly long, sprinkled with tons of #ifdefs and all this caused merge conflicts with one susbystem or another. This tree was put together to help try to avoid conflicts with these cleanups going on different trees at time. So nothing exciting on this pull request, just cleanups. I actually had this sysctl-next tree up since v5.18 but I missed sending a pull request for it on time during the last merge window. And so these changes have been being soaking up on sysctl-next and so linux-next for a while. The last change was merged May 4th. Most of the compile issues were reported by 0day and fixed. To help avoid a conflict with bpf folks at Daniel Borkmann's request I merged bpf-next/pr/bpf-sysctl into sysctl-next to get the effor which moves the BPF sysctls from kernel/sysctl.c to BPF core. Possible merge conflicts and known resolutions as per linux-next: bfp: https://lkml.kernel.org/r/20220414112812.652190b5@canb.auug.org.au rcu: https://lkml.kernel.org/r/20220420153746.4790d532@canb.auug.org.au powerpc: https://lkml.kernel.org/r/20220520154055.7f964b76@canb.auug.org.au -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmKOq8ASHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoinDAkQAJVo5YVM9f74UwYp4PQhTpjxJBCjRoZD z1u9bp5rMj2ujTC8Fr7VmzKaHrb8+r1C1WvCvZtIzemYNB4lZUrHpVDYfXuXiPRB ihPmEjhlPO5PFBx6cVCpI3cu9bEhG00rLc1QXnABx/pXwNPcOTJAGZJVamZvqubk chjgZrb7N+adHPfvS55v1+zpwdeKfpp5U3zuu5qlT/nn0GS0HCVzOj5fj4oC4wtJ IqfUubo+FX50Ga58yQABWNrjaPD9Crykz5ohVazy3ElQl0hJ4VsK65ct3blqc2vz 1Bb8kPpWuv6aZ5nr1lCVE8qvF4ZIL33ySvpg5BSdWLQEDrBbSpzvJe9Yn7wgR+eq y7fhpO24+zRM82EoDMEvyxX9u1n1RsvoXRtf3ds9BGf63MUxk8a1cgjlU6vuyO2U JhDmfM1xzdKvPoY4COOnHzcAiIqzItTqKd09N5y0cahmYstROU8lvp9huhTAHqk1 SjQMbLIZG7OnX8ZeQcR1EB8sq/IOPZT48ejj0iJmQ8FyMaep71MOQLYyLPAq4lgh JHXm8P6QdB57jfJbqAeNSyZoK0qdxOUR/83Zcah7Jjns6vkju1DNatEsaEEI2y2M 4n7/rkHeZ3TyFHBUX4e9FomKvGLsAalDBRiqsuxLSOPMU8rGrNLAslOAtKwvp90X 4ht3M2VP098l =btwh -----END PGP SIGNATURE----- Merge tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "For two kernel releases now kernel/sysctl.c has been being cleaned up slowly, since the tables were grossly long, sprinkled with tons of #ifdefs and all this caused merge conflicts with one susbystem or another. This tree was put together to help try to avoid conflicts with these cleanups going on different trees at time. So nothing exciting on this pull request, just cleanups. Thanks a lot to the Uniontech and Huawei folks for doing some of this nasty work" * tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (28 commits) sched: Fix build warning without CONFIG_SYSCTL reboot: Fix build warning without CONFIG_SYSCTL kernel/kexec_core: move kexec_core sysctls into its own file sysctl: minor cleanup in new_dir() ftrace: fix building with SYSCTL=y but DYNAMIC_FTRACE=n fs/proc: Introduce list_for_each_table_entry for proc sysctl mm: fix unused variable kernel warning when SYSCTL=n latencytop: move sysctl to its own file ftrace: fix building with SYSCTL=n but DYNAMIC_FTRACE=y ftrace: Fix build warning ftrace: move sysctl_ftrace_enabled to ftrace.c kernel/do_mount_initrd: move real_root_dev sysctls to its own file kernel/delayacct: move delayacct sysctls to its own file kernel/acct: move acct sysctls to its own file kernel/panic: move panic sysctls to its own file kernel/lockdep: move lockdep sysctls to its own file mm: move page-writeback sysctls to their own file mm: move oom_kill sysctls to their own file kernel/reboot: move reboot sysctls to its own file sched: Move energy_aware sysctls to topology.c ... |
||
Eric W. Biederman
|
b3f9916d81 |
sched: Update task_tick_numa to ignore tasks without an mm
Qian Cai <quic_qiancai@quicinc.com> wrote: > Reverting the last 3 commits of the series fixed a boot crash. > > 1b2552cbdbe0 fork: Stop allowing kthreads to call execve > 753550eb0ce1 fork: Explicitly set PF_KTHREAD > 68d85f0a33b0 init: Deal with the init process being a user mode process > > BUG: KASAN: null-ptr-deref in task_nr_scan_windows.isra.0 > arch_atomic_long_read at ./include/linux/atomic/atomic-long.h:29 > (inlined by) atomic_long_read at ./include/linux/atomic/atomic-instrumented.h:1266 > (inlined by) get_mm_counter at ./include/linux/mm.h:1996 > (inlined by) get_mm_rss at ./include/linux/mm.h:2049 > (inlined by) task_nr_scan_windows at kernel/sched/fair.c:1123 > Read of size 8 at addr 00000000000003d0 by task swapper/0/1 With the change to init and the user mode helper processes to not have PF_KTHREAD set before they call kernel_execve the PF_KTHREAD test in task_tick_numa became insufficient to detect all tasks that have "->mm == NULL". Correct that by testing for "->mm == NULL" directly. Reported-by: Qian Cai <quic_qiancai@quicinc.com> Tested-by: Qian Cai <quic_qiancai@quicinc.com> Fixes: 1b2552cbdbe0 ("fork: Stop allowing kthreads to call execve") Link: https://lkml.kernel.org/r/87r150ug1l.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
Ingo Molnar
|
d70522fc54 |
Linux 5.18-rc5
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmJu9FYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGAyEH/16xtJSpLmLwrQzG o+4ToQxSQ+/9UHyu0RTEvHg2THm9/8emtIuYyc/5FgdoWctcSa3AaDcveWmuWmkS KYcdhfJsaEqjNHS3OPYXN84fmo9Hel7263shu5+IYmP/sN0DfQp6UWTryX1q4B3Q 4Pdutkuq63Uwd8nBZ5LXQBumaBrmkkuMgWEdT4+6FOo1mPzwdIGBxCuz1UsNNl5k chLWxkQfe2eqgWbYJrgCQfrVdORXVtoU2fGilZUNrHRVGkkldXkkz5clJfapyZD3 odmZCEbrE4GPKgZwCmDERMfD1hzhZDtYKiHfOQ506szH5ykJjPBcOjHed7dA60eB J3+wdek= =39Ca -----END PGP SIGNATURE----- Merge tag 'v5.18-rc5' into sched/core to pull in fixes & to resolve a conflict - sched/core is on a pretty old -rc1 base - refresh it to include recent fixes. - this also allows up to resolve a (trivial) .mailmap conflict Conflicts: .mailmap Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Thomas Gleixner
|
d664e39912 |
sched: Fix missing prototype warnings
A W=1 build emits more than a dozen missing prototype warnings related to scheduler and scheduler specific includes. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220413133024.249118058@linutronix.de |
||
Dietmar Eggemann
|
97956dd278 |
sched/fair: Remove cfs_rq_tg_path()
cfs_rq_tg_path() is used by a tracepoint-to traceevent (tp-2-te) converter to format the path of a taskgroup or autogroup respectively. It doesn't have any in-kernel users after the removal of the sched_trace_cfs_rq_path() helper function. cfs_rq_tg_path() can be coded in a tp-2-te converter. Remove it from kernel/sched/fair.c. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220428144338.479094-3-qais.yousef@arm.com |
||
Dietmar Eggemann
|
50e7b416d2 |
sched/fair: Remove sched_trace_*() helper functions
We no longer need them as we can use DWARF debug info or BTF + pahole to re-generate the required structs to compile against them for a given kernel. This moves the burden of maintaining these helper functions to the module. https://github.com/qais-yousef/sched_tp Note that pahole v1.15 is required at least for using DWARF. And for BTF v1.23 which is not yet released will be required. There's alignment problem that will lead to crashes in earlier versions when used with BTF. We should have enough infrastructure to make these helper functions now obsolete, so remove them. [Rewrote commit message to reflect the new alternative] Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220428144338.479094-2-qais.yousef@arm.com |
||
Dietmar Eggemann
|
4e3c7d338a |
sched/fair: Refactor cpu_util_without()
Except the 'task has no contribution or is new' condition at the beginning of cpu_util_without(), which it shares with the load and runnable counterpart functions, a cpu_util_next(..., dst_cpu = -1) call can replace the rest of it. The UTIL_EST specific check that task util_est has to be subtracted from the CPU one in case of an enqueued (or current (to cater for the wakeup - lb race)) task has to be moved to cpu_util_next(). This was initially introduced by commit c469933e7721 ("sched/fair: Fix cpu_util_wake() for 'execl' type workloads"). UnixBench's `execl` throughput tests were run on the dual socket 40 CPUs Intel E5-2690 v2 to make sure it doesn't regress again. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220318163656.954440-1-dietmar.eggemann@arm.com |
||
Tao Zhou
|
a658353167 |
sched/fair: Revise comment about lb decision matrix
If busiest group type is group_misfit_task, the local group type must be group_has_spare according to below code in update_sd_pick_busiest(): if (sgs->group_type == group_misfit_task && (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) || sds->local_stat.group_type != group_has_spare)) return false; group type imbalanced and overloaded and fully_busy are filtered in here. misfit and asym are filtered before in update_sg_lb_stats(). So, change the decision matrix to: busiest \ local has_spare fully_busy misfit asym imbalanced overloaded has_spare nr_idle balanced N/A N/A balanced balanced fully_busy nr_idle nr_idle N/A N/A balanced balanced misfit_task force N/A N/A N/A *N/A* *N/A* asym_packing force force N/A N/A force force imbalanced force force N/A N/A force force overloaded force force N/A N/A force avg_load Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()") Signed-off-by: Tao Zhou <tao.zhou@linux.dev> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20220415095505.7765-1-tao.zhou@linux.dev |
||
Chengming Zhou
|
0a00a35464 |
sched/fair: Delete useless condition in tg_unthrottle_up()
We have tested cfs_rq->load.weight in cfs_rq_is_decayed(), the first condition "!cfs_rq_is_decayed(cfs_rq)" is enough to cover the second condition "cfs_rq->nr_running". Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220408115309.81603-2-zhouchengming@bytedance.com |
||
Chengming Zhou
|
64eaf50731 |
sched/fair: Fix cfs_rq_clock_pelt() for throttled cfs_rq
Since commit 23127296889f ("sched/fair: Update scale invariance of PELT") change to use rq_clock_pelt() instead of rq_clock_task(), we should also use rq_clock_pelt() for throttled_clock_task_time and throttled_clock_task accounting to get correct cfs_rq_clock_pelt() of throttled cfs_rq. And rename throttled_clock_task(_time) to be clock_pelt rather than clock_task. Fixes: 23127296889f ("sched/fair: Update scale invariance of PELT") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20220408115309.81603-1-zhouchengming@bytedance.com |
||
zgpeng
|
0635490078 |
sched/fair: Move calculate of avg_load to a better location
In calculate_imbalance function, when the value of local->avg_load is greater than or equal to busiest->avg_load, the calculated sds->avg_load is not used. So this calculation can be placed in a more appropriate position. Signed-off-by: zgpeng <zgpeng@tencent.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Samuel Liao <samuelliao@tencent.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/1649239025-10010-1-git-send-email-zgpeng@tencent.com |
||
kuyo chang
|
40f5aa4c5e |
sched/pelt: Fix attach_entity_load_avg() corner case
The warning in cfs_rq_is_decayed() triggered: SCHED_WARN_ON(cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || cfs_rq->avg.runnable_avg) There exists a corner case in attach_entity_load_avg() which will cause load_sum to be zero while load_avg will not be. Consider se_weight is 88761 as per the sched_prio_to_weight[] table. Further assume the get_pelt_divider() is 47742, this gives: se->avg.load_avg is 1. However, calculating load_sum: se->avg.load_sum = div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se)); se->avg.load_sum = 1*47742/88761 = 0. Then enqueue_load_avg() adds this to the cfs_rq totals: cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum; Resulting in load_avg being 1 with load_sum is 0, which will trigger the WARN. Fixes: f207934fb79d ("sched/fair: Align PELT windows between cfs_rq and its se") Signed-off-by: kuyo chang <kuyo.chang@mediatek.com> [peterz: massage changelog] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lkml.kernel.org/r/20220414090229.342-1-kuyo.chang@mediatek.com |
||
Zhen Ni
|
d4ae80ffa6 |
sched: Move cfs_bandwidth_slice sysctls to fair.c
move cfs_bandwidth_slice sysctls to fair.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> |
||
Zhen Ni
|
a60707d74b |
sched: Move child_runs_first sysctls to fair.c
move child_runs_first sysctls to fair.c and use the new register_sysctl_init() to register the sysctl interface. Signed-off-by: Zhen Ni <nizhen@uniontech.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> |
||
Linus Torvalds
|
1930a6e739 |
ptrace: Cleanups for v5.18
This set of changes removes tracehook.h, moves modification of all of the ptrace fields inside of siglock to remove races, adds a missing permission check to ptrace.c The removal of tracehook.h is quite significant as it has been a major source of confusion in recent years. Much of that confusion was around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the semantics clearer). For people who don't know tracehook.h is a vestiage of an attempt to implement uprobes like functionality that was never fully merged, and was later superseeded by uprobes when uprobes was merged. For many years now we have been removing what tracehook functionaly a little bit at a time. To the point where now anything left in tracehook.h is some weird strange thing that is difficult to understand. Eric W. Biederman (15): ptrace: Move ptrace_report_syscall into ptrace.h ptrace/arm: Rename tracehook_report_syscall report_syscall ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h ptrace: Remove arch_syscall_{enter,exit}_tracehook ptrace: Remove tracehook_signal_handler task_work: Remove unnecessary include from posix_timers.h task_work: Introduce task_work_pending task_work: Call tracehook_notify_signal from get_signal on all architectures task_work: Decouple TIF_NOTIFY_SIGNAL and task_work signal: Move set_notify_signal and clear_notify_signal into sched/signal.h resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume resume_user_mode: Move to resume_user_mode.h tracehook: Remove tracehook.h ptrace: Move setting/clearing ptrace_message into ptrace_stop ptrace: Return the signal to continue with from ptrace_stop Jann Horn (1): ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE Yang Li (1): ptrace: Remove duplicated include in ptrace.c MAINTAINERS | 1 - arch/Kconfig | 5 +- arch/alpha/kernel/ptrace.c | 5 +- arch/alpha/kernel/signal.c | 4 +- arch/arc/kernel/ptrace.c | 5 +- arch/arc/kernel/signal.c | 4 +- arch/arm/kernel/ptrace.c | 12 +- arch/arm/kernel/signal.c | 4 +- arch/arm64/kernel/ptrace.c | 14 +-- arch/arm64/kernel/signal.c | 4 +- arch/csky/kernel/ptrace.c | 5 +- arch/csky/kernel/signal.c | 4 +- arch/h8300/kernel/ptrace.c | 5 +- arch/h8300/kernel/signal.c | 4 +- arch/hexagon/kernel/process.c | 4 +- arch/hexagon/kernel/signal.c | 1 - arch/hexagon/kernel/traps.c | 6 +- arch/ia64/kernel/process.c | 4 +- arch/ia64/kernel/ptrace.c | 6 +- arch/ia64/kernel/signal.c | 1 - arch/m68k/kernel/ptrace.c | 5 +- arch/m68k/kernel/signal.c | 4 +- arch/microblaze/kernel/ptrace.c | 5 +- arch/microblaze/kernel/signal.c | 4 +- arch/mips/kernel/ptrace.c | 5 +- arch/mips/kernel/signal.c | 4 +- arch/nds32/include/asm/syscall.h | 2 +- arch/nds32/kernel/ptrace.c | 5 +- arch/nds32/kernel/signal.c | 4 +- arch/nios2/kernel/ptrace.c | 5 +- arch/nios2/kernel/signal.c | 4 +- arch/openrisc/kernel/ptrace.c | 5 +- arch/openrisc/kernel/signal.c | 4 +- arch/parisc/kernel/ptrace.c | 7 +- arch/parisc/kernel/signal.c | 4 +- arch/powerpc/kernel/ptrace/ptrace.c | 8 +- arch/powerpc/kernel/signal.c | 4 +- arch/riscv/kernel/ptrace.c | 5 +- arch/riscv/kernel/signal.c | 4 +- arch/s390/include/asm/entry-common.h | 1 - arch/s390/kernel/ptrace.c | 1 - arch/s390/kernel/signal.c | 5 +- arch/sh/kernel/ptrace_32.c | 5 +- arch/sh/kernel/signal_32.c | 4 +- arch/sparc/kernel/ptrace_32.c | 5 +- arch/sparc/kernel/ptrace_64.c | 5 +- arch/sparc/kernel/signal32.c | 1 - arch/sparc/kernel/signal_32.c | 4 +- arch/sparc/kernel/signal_64.c | 4 +- arch/um/kernel/process.c | 4 +- arch/um/kernel/ptrace.c | 5 +- arch/x86/kernel/ptrace.c | 1 - arch/x86/kernel/signal.c | 5 +- arch/x86/mm/tlb.c | 1 + arch/xtensa/kernel/ptrace.c | 5 +- arch/xtensa/kernel/signal.c | 4 +- block/blk-cgroup.c | 2 +- fs/coredump.c | 1 - fs/exec.c | 1 - fs/io-wq.c | 6 +- fs/io_uring.c | 11 +- fs/proc/array.c | 1 - fs/proc/base.c | 1 - include/asm-generic/syscall.h | 2 +- include/linux/entry-common.h | 47 +------- include/linux/entry-kvm.h | 2 +- include/linux/posix-timers.h | 1 - include/linux/ptrace.h | 81 ++++++++++++- include/linux/resume_user_mode.h | 64 ++++++++++ include/linux/sched/signal.h | 17 +++ include/linux/task_work.h | 5 + include/linux/tracehook.h | 226 ----------------------------------- include/uapi/linux/ptrace.h | 2 +- kernel/entry/common.c | 19 +-- kernel/entry/kvm.c | 9 +- kernel/exit.c | 3 +- kernel/livepatch/transition.c | 1 - kernel/ptrace.c | 47 +++++--- kernel/seccomp.c | 1 - kernel/signal.c | 62 +++++----- kernel/task_work.c | 4 +- kernel/time/posix-cpu-timers.c | 1 + mm/memcontrol.c | 2 +- security/apparmor/domain.c | 1 - security/selinux/hooks.c | 1 - 85 files changed, 372 insertions(+), 495 deletions(-) Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmJCQkoACgkQC/v6Eiaj j0DCWQ/5AZVFU+hX32obUNCLackHTwgcCtSOs3JNBmNA/zL/htPiYYG0ghkvtlDR Dw5J5DnxC6P7PVAdAqrpvx2uX2FebHYU0bRlyLx8LYUEP5dhyNicxX9jA882Z+vw Ud0Ue9EojwGWS76dC9YoKUj3slThMATbhA2r4GVEoof8fSNJaBxQIqath44t0FwU DinWa+tIOvZANGBZr6CUUINNIgqBIZCH/R4h6ArBhMlJpuQ5Ufk2kAaiWFwZCkX4 0LuuAwbKsCKkF8eap5I2KrIg/7zZVgxAg9O3cHOzzm8OPbKzRnNnQClcDe8perqp S6e/f3MgpE+eavd1EiLxevZ660cJChnmikXVVh8ZYYoefaMKGqBaBSsB38bNcLjY 3+f2dB+TNBFRnZs1aCujK3tWBT9QyjZDKtCBfzxDNWBpXGLhHH6j6lA5Lj+Cef5K /HNHFb+FuqedlFZh5m1Y+piFQ70hTgCa2u8b+FSOubI2hW9Zd+WzINV0ANaZ2LvZ 4YGtcyDNk1q1+c87lxP9xMRl/xi6rNg+B9T2MCo4IUnHgpSVP6VEB3osgUmrrrN0 eQlUI154G/AaDlqXLgmn1xhRmlPGfmenkxpok1AuzxvNJsfLKnpEwQSc13g3oiZr disZQxNY0kBO2Nv3G323Z6PLinhbiIIFez6cJzK5v0YJ2WtO3pY= =uEro -----END PGP SIGNATURE----- Merge tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ptrace cleanups from Eric Biederman: "This set of changes removes tracehook.h, moves modification of all of the ptrace fields inside of siglock to remove races, adds a missing permission check to ptrace.c The removal of tracehook.h is quite significant as it has been a major source of confusion in recent years. Much of that confusion was around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the semantics clearer). For people who don't know tracehook.h is a vestiage of an attempt to implement uprobes like functionality that was never fully merged, and was later superseeded by uprobes when uprobes was merged. For many years now we have been removing what tracehook functionaly a little bit at a time. To the point where anything left in tracehook.h was some weird strange thing that was difficult to understand" * tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ptrace: Remove duplicated include in ptrace.c ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE ptrace: Return the signal to continue with from ptrace_stop ptrace: Move setting/clearing ptrace_message into ptrace_stop tracehook: Remove tracehook.h resume_user_mode: Move to resume_user_mode.h resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume signal: Move set_notify_signal and clear_notify_signal into sched/signal.h task_work: Decouple TIF_NOTIFY_SIGNAL and task_work task_work: Call tracehook_notify_signal from get_signal on all architectures task_work: Introduce task_work_pending task_work: Remove unnecessary include from posix_timers.h ptrace: Remove tracehook_signal_handler ptrace: Remove arch_syscall_{enter,exit}_tracehook ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h ptrace/arm: Rename tracehook_report_syscall report_syscall ptrace: Move ptrace_report_syscall into ptrace.h |
||
Huang, Ying
|
ab31c7fd2d |
sched/numa: Fix boot crash on arm64 systems
Qian Cai reported a boot crash on arm64 systems, caused by: 0fb3978b0aac ("sched/numa: Fix NUMA topology for systems with CPU-less nodes") The bug is that node_state() must be supplied a valid node_states[] array index, but in task_numa_placement() the max_nid search can fail with NUMA_NO_NODE, which is not a valid index. Fix it by checking that max_nid is a valid index. [ mingo: Added changelog. ] Fixes: 0fb3978b0aac ("sched/numa: Fix NUMA topology for systems with CPU-less nodes") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Tested-by: Qian Cai <quic_qiancai@quicinc.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Ingo Molnar
|
c4ad6fcb67 |
sched/headers: Reorganize, clean up and optimize kernel/sched/fair.c dependencies
Use all generic headers from kernel/sched/sched.h that are required for it to build. Sort the sections alphabetically. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org> |
||
Ingo Molnar
|
b9e9c6ca6e |
sched/headers: Standardize kernel/sched/sched.h header dependencies
kernel/sched/sched.h is a weird mix of ad-hoc headers included in the middle of the header. Two of them rely on being included in the middle of kernel/sched/sched.h, due to definitions they require: - "stat.h" needs the rq definitions. - "autogroup.h" needs the task_group definition. Move the inclusion of these two files out of kernel/sched/sched.h, and include them in all files that require them. Move of the rest of the header dependencies to the top of the kernel/sched/sched.h file. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org> |
||
Ingo Molnar
|
6255b48aeb |
Linux 5.17-rc5
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmISrYgeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg20IAKDZr7rfSHBopjQV Cocw744tom0XuxpvSZpp2GGOOXF+tkswcNNaRIrbGOl1mkyxA7eBZCTMpDeDS9aQ wB0D0Gxx8QBAJp4KgB1W7TB+hIGes/rs8Ve+6iO4ulLLdCVWX/q2boI0aZ7QX9O9 qNi8OsoZQtk6falRvciZFHwV5Av1p2Sy1AW57udQ7DvJ4H98AfKf1u8/z208WWW8 1ixC+qJxQcUcM9vI+7P9Tt7NbFSKv8SvAmqjFY7P+DxQAsVw6KXoqVXykDzeOv0t fUNOE/t0oFZafwtn8h7KBQnwS9lH03+3KkslVZs+iMFyUj/Bar+NVVyKoDhWXtVg /PuMhEg= =eU1o -----END PGP SIGNATURE----- Merge tag 'v5.17-rc5' into sched/core, to resolve conflicts New conflicts in sched/core due to the following upstream fixes: 44585f7bc0cb ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n") a06247c6804f ("psi: Fix uaf issue when psi trigger is destroyed while being polled") Conflicts: include/linux/psi_types.h kernel/sched/psi.c Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Frederic Weisbecker
|
04d4e665a6 |
sched/isolation: Use single feature type while referring to housekeeping cpumask
Refer to housekeeping APIs using single feature types instead of flags. This prevents from passing multiple isolation features at once to housekeeping interfaces, which soon won't be possible anymore as each isolation features will have their own cpumask. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org |
||
Huang Ying
|
5c7b1aaf13 |
sched/numa: Avoid migrating task to CPU-less node
In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA nodes. But if the number of the hint page faults on a PMEM node is the max for a task, The current NUMA balancing policy may try to place the task on the PMEM node instead of DRAM node. This is unreasonable, because there's no CPU in PMEM NUMA nodes. To fix this, CPU-less nodes are ignored when searching the migration target node for a task in this patch. To test the patch, we run a workload that accesses more memory in PMEM node than memory in DRAM node. Without the patch, the PMEM node will be chosen as preferred node in task_numa_placement(). While the DRAM node will be chosen instead with the patch. Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com |
||
Huang Ying
|
0fb3978b0a |
sched/numa: Fix NUMA topology for systems with CPU-less nodes
The NUMA topology parameters (sched_numa_topology_type, sched_domains_numa_levels, and sched_max_numa_distance, etc.) identified by scheduler may be wrong for systems with CPU-less nodes. For example, the ACPI SLIT of a system with CPU-less persistent memory (Intel Optane DCPMM) nodes is as follows, [000h 0000 4] Signature : "SLIT" [System Locality Information Table] [004h 0004 4] Table Length : 0000042C [008h 0008 1] Revision : 01 [009h 0009 1] Checksum : 59 [00Ah 0010 6] Oem ID : "XXXX" [010h 0016 8] Oem Table ID : "XXXXXXX" [018h 0024 4] Oem Revision : 00000001 [01Ch 0028 4] Asl Compiler ID : "INTL" [020h 0032 4] Asl Compiler Revision : 20091013 [024h 0036 8] Localities : 0000000000000004 [02Ch 0044 4] Locality 0 : 0A 15 11 1C [030h 0048 4] Locality 1 : 15 0A 1C 11 [034h 0052 4] Locality 2 : 11 1C 0A 1C [038h 0056 4] Locality 3 : 1C 11 1C 0A While the `numactl -H` output is as follows, available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 0 size: 64136 MB node 0 free: 5981 MB node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 1 size: 64466 MB node 1 free: 10415 MB node 2 cpus: node 2 size: 253952 MB node 2 free: 253920 MB node 3 cpus: node 3 size: 253952 MB node 3 free: 253951 MB node distances: node 0 1 2 3 0: 10 21 17 28 1: 21 10 28 17 2: 17 28 10 28 3: 28 17 28 10 In this system, there are only 2 sockets. In each memory controller, both DRAM and PMEM DIMMs are installed. Although the physical NUMA topology is simple, the logical NUMA topology becomes a little complex. Because both the distance(0, 1) and distance (1, 3) are less than the distance (0, 3), it appears that node 1 sits between node 0 and node 3. And the whole system appears to be a glueless mesh NUMA topology type. But it's definitely not, there is even no CPU in node 3. This isn't a practical problem now yet. Because the PMEM nodes (node 2 and node 3 in example system) are offlined by default during system boot. So init_numa_topology_type() called during system boot will ignore them and set sched_numa_topology_type to NUMA_DIRECT. And init_numa_topology_type() is only called at runtime when a CPU of a never-onlined-before node gets plugged in. And there's no CPU in the PMEM nodes. But it appears better to fix this to make the code more robust. To test the potential problem. We have used a debug patch to call init_numa_topology_type() when the PMEM node is onlined (in __set_migration_target_nodes()). With that, the NUMA parameters identified by scheduler is as follows, sched_numa_topology_type: NUMA_GLUELESS_MESH sched_domains_numa_levels: 4 sched_max_numa_distance: 28 To fix the issue, the CPU-less nodes are ignored when the NUMA topology parameters are identified. Because a node may become CPU-less or not at run time because of CPU hotplug, the NUMA topology parameters need to be re-initialized at runtime for CPU hotplug too. With the patch, the NUMA parameters identified for the example system above is as follows, sched_numa_topology_type: NUMA_DIRECT sched_domains_numa_levels: 2 sched_max_numa_distance: 21 Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com |
||
Mel Gorman
|
e496132ebe |
sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. Zen* has multiple LLCs per node with local memory channels and due to the allowed imbalance, it's far harder to tune some workloads to run optimally than it is on hardware that has 1 LLC per node. This patch allows an imbalance to exist up to the point where LLCs should be balanced between nodes. On a Zen3 machine running STREAM parallelised with OMP to have on instance per LLC the results and without binding, the results are 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v6 MB/sec copy-16 162596.94 ( 0.00%) 580559.74 ( 257.05%) MB/sec scale-16 136901.28 ( 0.00%) 374450.52 ( 173.52%) MB/sec add-16 157300.70 ( 0.00%) 564113.76 ( 258.62%) MB/sec triad-16 151446.88 ( 0.00%) 564304.24 ( 272.61%) STREAM can use directives to force the spread if the OpenMP is new enough but that doesn't help if an application uses threads and it's not known in advance how many threads will be created. Coremark is a CPU and cache intensive benchmark parallelised with threads. When running with 1 thread per core, the vanilla kernel allows threads to contend on cache. With the patch; 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v5 Min Score-16 368239.36 ( 0.00%) 389816.06 ( 5.86%) Hmean Score-16 388607.33 ( 0.00%) 427877.08 * 10.11%* Max Score-16 408945.69 ( 0.00%) 481022.17 ( 17.62%) Stddev Score-16 15247.04 ( 0.00%) 24966.82 ( -63.75%) CoeffVar Score-16 3.92 ( 0.00%) 5.82 ( -48.48%) It can also make a big difference for semi-realistic workloads like specjbb which can execute arbitrary numbers of threads without advance knowledge of how they should be placed. Even in cases where the average performance is neutral, the results are more stable. 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v6 Hmean tput-1 71631.55 ( 0.00%) 73065.57 ( 2.00%) Hmean tput-8 582758.78 ( 0.00%) 556777.23 ( -4.46%) Hmean tput-16 1020372.75 ( 0.00%) 1009995.26 ( -1.02%) Hmean tput-24 1416430.67 ( 0.00%) 1398700.11 ( -1.25%) Hmean tput-32 1687702.72 ( 0.00%) 1671357.04 ( -0.97%) Hmean tput-40 1798094.90 ( 0.00%) 2015616.46 * 12.10%* Hmean tput-48 1972731.77 ( 0.00%) 2333233.72 ( 18.27%) Hmean tput-56 2386872.38 ( 0.00%) 2759483.38 ( 15.61%) Hmean tput-64 2909475.33 ( 0.00%) 2925074.69 ( 0.54%) Hmean tput-72 2585071.36 ( 0.00%) 2962443.97 ( 14.60%) Hmean tput-80 2994387.24 ( 0.00%) 3015980.59 ( 0.72%) Hmean tput-88 3061408.57 ( 0.00%) 3010296.16 ( -1.67%) Hmean tput-96 3052394.82 ( 0.00%) 2784743.41 ( -8.77%) Hmean tput-104 2997814.76 ( 0.00%) 2758184.50 ( -7.99%) Hmean tput-112 2955353.29 ( 0.00%) 2859705.09 ( -3.24%) Hmean tput-120 2889770.71 ( 0.00%) 2764478.46 ( -4.34%) Hmean tput-128 2871713.84 ( 0.00%) 2750136.73 ( -4.23%) Stddev tput-1 5325.93 ( 0.00%) 2002.53 ( 62.40%) Stddev tput-8 6630.54 ( 0.00%) 10905.00 ( -64.47%) Stddev tput-16 25608.58 ( 0.00%) 6851.16 ( 73.25%) Stddev tput-24 12117.69 ( 0.00%) 4227.79 ( 65.11%) Stddev tput-32 27577.16 ( 0.00%) 8761.05 ( 68.23%) Stddev tput-40 59505.86 ( 0.00%) 2048.49 ( 96.56%) Stddev tput-48 168330.30 ( 0.00%) 93058.08 ( 44.72%) Stddev tput-56 219540.39 ( 0.00%) 30687.02 ( 86.02%) Stddev tput-64 121750.35 ( 0.00%) 9617.36 ( 92.10%) Stddev tput-72 223387.05 ( 0.00%) 34081.13 ( 84.74%) Stddev tput-80 128198.46 ( 0.00%) 22565.19 ( 82.40%) Stddev tput-88 136665.36 ( 0.00%) 27905.97 ( 79.58%) Stddev tput-96 111925.81 ( 0.00%) 99615.79 ( 11.00%) Stddev tput-104 146455.96 ( 0.00%) 28861.98 ( 80.29%) Stddev tput-112 88740.49 ( 0.00%) 58288.23 ( 34.32%) Stddev tput-120 186384.86 ( 0.00%) 45812.03 ( 75.42%) Stddev tput-128 78761.09 ( 0.00%) 57418.48 ( 27.10%) Similarly, for embarassingly parallel problems like NPB-ep, there are improvements due to better spreading across LLC when the machine is not fully utilised. vanilla sched-numaimb-v6 Min ep.D 31.79 ( 0.00%) 26.11 ( 17.87%) Amean ep.D 31.86 ( 0.00%) 26.17 * 17.86%* Stddev ep.D 0.07 ( 0.00%) 0.05 ( 24.41%) CoeffVar ep.D 0.22 ( 0.00%) 0.20 ( 7.97%) Max ep.D 31.93 ( 0.00%) 26.21 ( 17.91%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net |
||
Mel Gorman
|
2cfb7a1b03 |
sched/fair: Improve consistency of allowed NUMA balance calculations
There are inconsistencies when determining if a NUMA imbalance is allowed that should be corrected. o allow_numa_imbalance changes types and is not always examining the destination group so both the type should be corrected as well as the naming. o find_idlest_group uses the sched_domain's weight instead of the group weight which is different to find_busiest_group o find_busiest_group uses the source group instead of the destination which is different to task_numa_find_cpu o Both find_idlest_group and find_busiest_group should account for the number of running tasks if a move was allowed to be consistent with task_numa_find_cpu Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net |
||
Honglei Wang
|
12bf8a7eb8 |
sched/numa: initialize numa statistics when forking new task
The child processes will inherit numa_pages_migrated and total_numa_faults from the parent. It means even if there is no numa fault happen on the child, the statistics in /proc/$pid of the child process might show huge amount. This is a bit weird. Let's initialize them when do fork. Signed-off-by: Honglei Wang <wanghonglei@didichuxing.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220113133920.49900-1-wanghonglei@didichuxing.com |
||
Randy Dunlap
|
a315da5e68 |
sched/fair: Fix all kernel-doc warnings
Quieten all kernel-doc warnings in kernel/sched/fair.c: kernel/sched/fair.c:3663: warning: No description found for return value of 'update_cfs_rq_load_avg' kernel/sched/fair.c:8601: warning: No description found for return value of 'asym_smt_can_pull_tasks' kernel/sched/fair.c:8673: warning: Function parameter or member 'sds' not described in 'update_sg_lb_stats' kernel/sched/fair.c:9483: warning: contents before sections Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20211218055900.2704-1-rdunlap@infradead.org |
||
Vincent Guittot
|
2d02fa8cc2 |
sched/pelt: Relax the sync of load_sum with load_avg
Similarly to util_avg and util_sum, don't sync load_sum with the low bound of load_avg but only ensure that load_sum stays in the correct range. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Link: https://lkml.kernel.org/r/20220111134659.24961-5-vincent.guittot@linaro.org |
||
Vincent Guittot
|
95246d1ec8 |
sched/pelt: Relax the sync of runnable_sum with runnable_avg
Similarly to util_avg and util_sum, don't sync runnable_sum with the low bound of runnable_avg but only ensure that runnable_sum stays in the correct range. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Link: https://lkml.kernel.org/r/20220111134659.24961-4-vincent.guittot@linaro.org |
||
Vincent Guittot
|
7ceb771030 |
sched/pelt: Continue to relax the sync of util_sum with util_avg
Rick reported performance regressions in bugzilla because of cpu frequency being lower than before: https://bugzilla.kernel.org/show_bug.cgi?id=215045 He bisected the problem to: commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") This commit forces util_sum to be synced with the new util_avg after removing the contribution of a task and before the next periodic sync. By doing so util_sum is rounded to its lower bound and might lost up to LOAD_AVG_MAX-1 of accumulated contribution which has not yet been reflected in util_avg. update_tg_cfs_util() is not the only place where we round util_sum and lost some accumulated contributions that are not already reflected in util_avg. Modify update_tg_cfs_util() and detach_entity_load_avg() to not sync util_sum with the new util_avg. Instead of always setting util_sum to the low bound of util_avg, which can significantly lower the utilization, we propagate the difference. In addition, we also check that cfs's util_sum always stays above the lower bound for a given util_avg as it has been observed that sched_entity's util_sum is sometimes above cfs one. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Link: https://lkml.kernel.org/r/20220111134659.24961-3-vincent.guittot@linaro.org |
||
Vincent Guittot
|
98b0d89022 |
sched/pelt: Relax the sync of util_sum with util_avg
Rick reported performance regressions in bugzilla because of cpu frequency being lower than before: https://bugzilla.kernel.org/show_bug.cgi?id=215045 He bisected the problem to: commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") This commit forces util_sum to be synced with the new util_avg after removing the contribution of a task and before the next periodic sync. By doing so util_sum is rounded to its lower bound and might lost up to LOAD_AVG_MAX-1 of accumulated contribution which has not yet been reflected in util_avg. Instead of always setting util_sum to the low bound of util_avg, which can significantly lower the utilization of root cfs_rq after propagating the change down into the hierarchy, we revert the change of util_sum and propagate the difference. In addition, we also check that cfs's util_sum always stays above the lower bound for a given util_avg as it has been observed that sched_entity's util_sum is sometimes above cfs one. Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent") Reported-by: Rick Yiu <rickyiu@google.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Link: https://lkml.kernel.org/r/20220111134659.24961-2-vincent.guittot@linaro.org |