vmstat: skip periodic vmstat update for isolated CPUs

Problem: The interruption caused by vmstat_update is undesirable for
certain applications.

This matters for workloads running on isolated CPUs in nohz_full mode,
which is meant to shield them from any kernel interruption. For example,
a VM running a time-sensitive application with a 50us maximum acceptable
interruption (use case: soft PLC).

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

	oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, where the vmstat_update
interruption happens in the host (on a qemu-kvm vCPU thread), the latency
penalty observed in the guest is higher than 50us, violating the
acceptable latency threshold.

The isolated vCPU can perform operations that modify per-CPU page
counters, for example to complete I/O operations:

CPU 11/KVM-9540 [001] dNh1. 2314.248584: mod_zone_page_state <-__folio_end_writeback
CPU 11/KVM-9540 [001] dNh1. 2314.248585: <stack trace>
 => 0xffffffffc042b083
 => mod_zone_page_state
 => __folio_end_writeback
 => folio_end_writeback
 => iomap_finish_ioend
 => blk_mq_end_request_batch
 => nvme_irq
 => __handle_irq_event_percpu
 => handle_irq_event
 => handle_edge_irq
 => __common_interrupt
 => common_interrupt
 => asm_common_interrupt
 => vmx_do_interrupt_nmi_irqoff
 => vmx_handle_exit_irqoff
 => vcpu_enter_guest
 => vcpu_run
 => kvm_arch_vcpu_ioctl_run
 => kvm_vcpu_ioctl
 => __x64_sys_ioctl
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

In-kernel users of vmstat counters either require the precise value, in
which case they use the zone_page_state_snapshot interface, or they can
live with some imprecision, since the regular flushing can happen at an
arbitrary time and the cumulative error can grow (see
calculate_normal_threshold).

From that point of view, the regular flushing can be postponed for CPUs
that have been isolated from kernel interference without critical
infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
for all isolated CPUs to avoid interference with the isolated workload.

Suggested by Michal Hocko.

Link: https://lkml.kernel.org/r/ZIDoV/zxFKVmQl7W@tpad
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
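
The distinction the message draws between precise and imprecise readers can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: struct counter_model, COUNTER_MODEL_CPUS, read_approx, read_snapshot and flush_cpu are hypothetical names made up for the illustration, and details of the real interfaces (types, clamping, thresholds) are deliberately left out.

```c
#include <stddef.h>

#define COUNTER_MODEL_CPUS 4	/* hypothetical CPU count for this model */

/* Simplified stand-in for a single vmstat counter. */
struct counter_model {
	long global;				/* deltas already folded in */
	long pcp_diff[COUNTER_MODEL_CPUS];	/* per-CPU deltas not yet flushed */
};

/* Cheap read: ignores pending per-CPU deltas, so it may lag behind. */
static long read_approx(const struct counter_model *c)
{
	return c->global;
}

/* Precise read: also sums every pending per-CPU delta (snapshot-style). */
static long read_snapshot(const struct counter_model *c)
{
	long v = c->global;

	for (size_t cpu = 0; cpu < COUNTER_MODEL_CPUS; cpu++)
		v += c->pcp_diff[cpu];
	return v;
}

/* Fold one CPU's pending delta into the global value (models a flush). */
static void flush_cpu(struct counter_model *c, size_t cpu)
{
	c->global += c->pcp_diff[cpu];
	c->pcp_diff[cpu] = 0;
}
```

In this model, a CPU that is never flushed only affects read_approx. read_snapshot returns the same value whether or not flush_cpu has run, which is the property the commit message relies on when it postpones flushing for isolated CPUs.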
parent 56e0d1cb16
commit be5e015d10

mm/vmstat.c | 15
@@ -28,6 +28,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -2022,6 +2023,20 @@ static void vmstat_shepherd(struct work_struct *w)
 	for_each_online_cpu(cpu) {
 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
 
+		/*
+		 * In kernel users of vmstat counters either require the precise value and
+		 * they are using zone_page_state_snapshot interface or they can live with
+		 * an imprecision as the regular flushing can happen at arbitrary time and
+		 * cumulative error can grow (see calculate_normal_threshold).
+		 *
+		 * From that POV the regular flushing can be postponed for CPUs that have
+		 * been isolated from the kernel interference without critical
+		 * infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
+		 * for all isolated CPUs to avoid interference with the isolated workload.
+		 */
+		if (cpu_is_isolated(cpu))
+			continue;
+
 		if (!delayed_work_pending(dw) && need_update(cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 
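
The control flow of the patched loop can be sketched in user space as follows. This is an illustrative model only, under stated assumptions: struct cpu_model and shepherd_pass are hypothetical names, and the boolean fields merely stand in for for_each_online_cpu(), cpu_is_isolated(), delayed_work_pending() and need_update(). No real work is queued.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state for the model. */
struct cpu_model {
	bool online;		/* CPU would be visited by for_each_online_cpu() */
	bool isolated;		/* stands in for cpu_is_isolated(cpu) */
	bool work_pending;	/* stands in for delayed_work_pending(dw) */
	long pending_diff;	/* nonzero stands in for need_update(cpu) */
};

/*
 * One pass of the modelled shepherd. Returns how many CPUs had their
 * update work queued during this pass.
 */
static size_t shepherd_pass(struct cpu_model cpus[], size_t n)
{
	size_t queued = 0;

	for (size_t cpu = 0; cpu < n; cpu++) {
		if (!cpus[cpu].online)
			continue;

		/* The check added by the patch: leave isolated CPUs alone. */
		if (cpus[cpu].isolated)
			continue;

		if (!cpus[cpu].work_pending && cpus[cpu].pending_diff != 0) {
			cpus[cpu].work_pending = true;	/* models queue_delayed_work_on() */
			queued++;
		}
	}
	return queued;
}
```

In this model an isolated CPU keeps its pending_diff across passes, so the periodic flush for it is postponed rather than performed, matching the behaviour the comment in the hunk describes.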