linux/arch/powerpc/platforms/pseries/hotplug-cpu.c

// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * pseries CPU Hotplug infrastructure.
 *
 * Split out from arch/powerpc/platforms/pseries/setup.c
 * arch/powerpc/kernel/rtas.c, and arch/powerpc/platforms/pseries/smp.c
 *
 * Peter Bergner, IBM March 2001.
 * Copyright (C) 2001 IBM.
 * Dave Engebretsen, Peter Bergner, and
 * Mike Corrigan {engebret|bergner|mikec}@us.ibm.com
 * Plus various changes from other IBM teams...
 *
 * Copyright (C) 2006 Michael Ellerman, IBM Corporation
 */
#define pr_fmt(fmt) "pseries-hotplug-cpu: " fmt
#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/sched.h> /* for idle_task_exit */
#include <linux/sched/hotplug.h>
#include <linux/cpu.h>
#include <linux/of.h>
#include <linux/slab.h>
#include <asm/prom.h>
#include <asm/rtas.h>
#include <asm/firmware.h>
#include <asm/machdep.h>
#include <asm/vdso_datapage.h>
#include <asm/xics.h>
#include <asm/xive.h>
#include <asm/plpar_wrappers.h>
#include <asm/topology.h>
#include "pseries.h"
/* This version can't take the spinlock, because it never returns */
static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;

/*
 * Record the CPU ids used on each node.
 * Protected by cpu_add_remove_lock.
 */
static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];

static void rtas_stop_self(void)
{
	static struct rtas_args args;

	local_irq_disable();

	BUG_ON(rtas_stop_self_token == RTAS_UNKNOWN_SERVICE);

	rtas_call_unlocked(&args, rtas_stop_self_token, 0, 1, NULL);

	panic("Alas, I survived.\n");
}
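
/*
 * pseries_cpu_offline_self: Final path run by a CPU that is going offline.
 * It tears down the CPU's interrupt controller state, deregisters its SLB
 * shadow buffer and VPA with the hypervisor, and stops the CPU via the
 * RTAS stop-self call. It does not return.
 */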
static void pseries_cpu_offline_self(void)
{
	unsigned int hwcpu = hard_smp_processor_id();

	local_irq_disable();
	idle_task_exit();
	if (xive_enabled())
		xive_teardown_cpu();
	else
		xics_teardown_cpu();
	unregister_slb_shadow(hwcpu);
	unregister_vpa(hwcpu);

	rtas_stop_self();
	/* Should never get here... */
	BUG();
	for(;;);
}
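
/*
 * pseries_cpu_disable: Mark the current CPU offline, fix up boot_cpuid if
 * needed, and route its interrupts away before it is stopped.
 */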
static int pseries_cpu_disable(void)
{
	int cpu = smp_processor_id();

	set_cpu_online(cpu, false);
	vdso_data->processorCount--;

	/* fix boot_cpuid here */
	if (cpu == boot_cpuid)
		boot_cpuid = cpumask_any(cpu_online_mask);

	/* FIXME: abstract this to not be platform specific later on */
	if (xive_enabled())
		xive_smp_disable_cpu();
	else
		xics_migrate_irqs_away();

	cleanup_cpu_mmu_context();

	return 0;
}

/*
 * pseries_cpu_die: Wait for the cpu to die.
 * @cpu: logical processor id of the CPU whose death we're awaiting.
 *
 * This function is called from the context of the thread which is performing
 * the cpu-offline. Here we wait for long enough to allow the cpu in question
 * to self-destroy so that the cpu-offline thread can send the CPU_DEAD
 * notifications.
 *
 * OTOH, pseries_cpu_offline_self() is called by the @cpu when it wants to
 * self-destruct.
 */
static void pseries_cpu_die(unsigned int cpu)
{
	int cpu_status = 1;
	unsigned int pcpu = get_hard_smp_processor_id(cpu);
	unsigned long timeout = jiffies + msecs_to_jiffies(120000);

	while (true) {
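		/* Poll the CPU's state via the query-cpu-stopped-state RTAS call. */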
		cpu_status = smp_query_cpu_stopped(pcpu);
		if (cpu_status == QCSS_STOPPED ||
		    cpu_status == QCSS_HARDWARE_ERROR)
			break;

		if (time_after(jiffies, timeout)) {
			pr_warn("CPU %i (hwid %i) didn't die after 120 seconds\n",
				cpu, pcpu);
			timeout = jiffies + msecs_to_jiffies(120000);
		}

		cond_resched();
	}

	if (cpu_status == QCSS_HARDWARE_ERROR) {
		pr_warn("CPU %i (hwid %i) reported error while dying\n",
			cpu, pcpu);
	}
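
	/* Clear the secondary start flag in this CPU's paca. */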
	paca_ptrs[cpu]->cpu_start = 0;
}

/**
 * find_cpu_id_range - find a linear range of @nthreads free CPU ids.
 * @nthreads : the number of threads (cpu ids)
 * @assigned_node : the node it belongs to or NUMA_NO_NODE if free ids from any
 *                  node can be picked.
 * @cpu_mask: the returned CPU mask.
 *
 * Returns 0 on success.
 */
static int find_cpu_id_range(unsigned int nthreads, int assigned_node,
			     cpumask_var_t *cpu_mask)
{
	cpumask_var_t candidate_mask;
	unsigned int cpu, node;
	int rc = -ENOSPC;

	if (!zalloc_cpumask_var(&candidate_mask, GFP_KERNEL))
		return -ENOMEM;

	cpumask_clear(*cpu_mask);
	for (cpu = 0; cpu < nthreads; cpu++)
		cpumask_set_cpu(cpu, *cpu_mask);

	BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));

	/* Get a bitmap of unoccupied slots. */
	cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);

	if (assigned_node != NUMA_NO_NODE) {
		/*
		 * Remove free ids previously assigned on the other nodes. We
		 * can walk only online nodes because once a node has come
		 * online it is not turned offline again.
		 */
		for_each_online_node(node) {
			if (node == assigned_node)
				continue;

			cpumask_andnot(candidate_mask, candidate_mask,
				       node_recorded_ids_map[node]);
		}
	}

	if (cpumask_empty(candidate_mask))
		goto out;
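
	/*
	 * Starting from ids 0..nthreads-1, slide a window of @nthreads ids
	 * forward by @nthreads at a time until every id in the window is an
	 * unoccupied candidate.
	 */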
	while (!cpumask_empty(*cpu_mask)) {
		if (cpumask_subset(*cpu_mask, candidate_mask))
			/* Found a range where we can insert the new cpu(s) */
			break;
		cpumask_shift_left(*cpu_mask, *cpu_mask, nthreads);
	}

	if (!cpumask_empty(*cpu_mask))
		rc = 0;

out:
	free_cpumask_var(candidate_mask);
	return rc;
}

/*
 * Update cpu_present_mask and paca(s) for a new cpu node. The wrinkle
 * here is that a cpu device node may represent multiple logical cpus
 * in the SMT case. We must honor the assumption in other code that
 * the logical ids for sibling SMT threads x and y are adjacent, such
 * that x^1 == y and y^1 == x.
 */
static int pseries_add_processor(struct device_node *np)
{
int len, nthreads, node, cpu, assigned_node;
int rc = 0;
cpumask_var_t cpu_mask;
const __be32 *intserv;
intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);
if (!intserv)
return 0;
nthreads = len / sizeof(u32);
if (!alloc_cpumask_var(&cpu_mask, GFP_KERNEL))
return -ENOMEM;
/*
* Fetch from the DT nodes read by dlpar_configure_connector() the NUMA
* node id the added CPU belongs to.
*/
node = of_node_to_nid(np);
if (node < 0 || !node_possible(node))
node = first_online_node;
BUG_ON(node == NUMA_NO_NODE);
assigned_node = node;
cpu_maps_update_begin();
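	/*
	 * First try to find a range of free CPU ids that has not already
	 * been used on another node.
	 */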
rc = find_cpu_id_range(nthreads, node, &cpu_mask);
if (rc && nr_node_ids > 1) {
/*
* Try again, considering the free CPU ids from the other node.
*/
node = NUMA_NO_NODE;
rc = find_cpu_id_range(nthreads, NUMA_NO_NODE, &cpu_mask);
}
if (rc) {
pr_err("Cannot add cpu %pOF; this system configuration"
" supports %d logical cpus.\n", np, num_possible_cpus());
goto out;
}
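	/*
	 * Mark the chosen logical CPUs present and bind each one to its
	 * hardware thread id.
	 */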
for_each_cpu(cpu, cpu_mask) {
BUG_ON(cpu_present(cpu));
set_cpu_present(cpu, true);
set_hard_smp_processor_id(cpu, be32_to_cpu(*intserv++));
}
	/* Record the newly used CPU ids for the associated node. */
cpumask_or(node_recorded_ids_map[assigned_node],
node_recorded_ids_map[assigned_node], cpu_mask);
/*
	 * If node is set to NUMA_NO_NODE, CPU ids have been reused from
	 * another node; remove them from its mask.
*/
if (node == NUMA_NO_NODE) {
cpu = cpumask_first(cpu_mask);
pr_warn("Reusing free CPU ids %d-%d from another node\n",
cpu, cpu + nthreads - 1);
for_each_online_node(node) {
if (node == assigned_node)
continue;
cpumask_andnot(node_recorded_ids_map[node],
node_recorded_ids_map[node],
cpu_mask);
}
}
out:
cpu_maps_update_done();
free_cpumask_var(cpu_mask);
return rc;
}
/*
* Update the present map for a cpu node which is going away, and set
* the hard id in the paca(s) to -1 to be consistent with boot time
* convention for non-present cpus.
*/
static void pseries_remove_processor(struct device_node *np)
{
unsigned int cpu;
int len, nthreads, i;
const __be32 *intserv;
u32 thread;
intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);
if (!intserv)
return;
nthreads = len / sizeof(u32);
cpu_maps_update_begin();
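	/*
	 * Clear the present bit and reset the hard SMP id and NUMA lookup
	 * entry for each thread of the departing node.
	 */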
for (i = 0; i < nthreads; i++) {
thread = be32_to_cpu(intserv[i]);
for_each_present_cpu(cpu) {
if (get_hard_smp_processor_id(cpu) != thread)
continue;
BUG_ON(cpu_online(cpu));
set_cpu_present(cpu, false);
set_hard_smp_processor_id(cpu, -1);
update_numa_cpu_lookup_table(cpu, -1);
break;
}
if (cpu >= nr_cpu_ids)
printk(KERN_WARNING "Could not find cpu to remove "
"with physical id 0x%x\n", thread);
}
cpu_maps_update_done();
}
static int dlpar_offline_cpu(struct device_node *dn)
{
int rc = 0;
unsigned int cpu;
int len, nthreads, i;
const __be32 *intserv;
u32 thread;
intserv = of_get_property(dn, "ibm,ppc-interrupt-server#s", &len);
if (!intserv)
return -EINVAL;
nthreads = len / sizeof(u32);
cpu_maps_update_begin();
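	/*
	 * Offline each hardware thread listed in the node's
	 * ibm,ppc-interrupt-server#s property.
	 */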
for (i = 0; i < nthreads; i++) {
thread = be32_to_cpu(intserv[i]);
for_each_present_cpu(cpu) {
if (get_hard_smp_processor_id(cpu) != thread)
continue;
if (!cpu_online(cpu))
break;
/*
* device_offline() will return -EBUSY (via cpu_down()) if there
* is only one CPU left. Check it here to fail earlier and with a
* more informative error message, while also retaining the
* cpu_add_remove_lock to be sure that no CPUs are being
			 * onlined/offlined during this check.
*/
if (num_online_cpus() == 1) {
pr_warn("Unable to remove last online CPU %pOFn\n", dn);
rc = -EBUSY;
goto out_unlock;
}
cpu_maps_update_done();
rc = device_offline(get_cpu_device(cpu));
if (rc)
goto out;
cpu_maps_update_begin();
break;
}
if (cpu == num_possible_cpus()) {
pr_warn("Could not find cpu to offline with physical id 0x%x\n",
thread);
}
}
out_unlock:
cpu_maps_update_done();
out:
return rc;
}
static int dlpar_online_cpu(struct device_node *dn)
{
int rc = 0;
unsigned int cpu;
int len, nthreads, i;
const __be32 *intserv;
u32 thread;
intserv = of_get_property(dn, "ibm,ppc-interrupt-server#s", &len);
if (!intserv)
return -EINVAL;
nthreads = len / sizeof(u32);
cpu_maps_update_begin();
for (i = 0; i < nthreads; i++) {
thread = be32_to_cpu(intserv[i]);
for_each_present_cpu(cpu) {
if (get_hard_smp_processor_id(cpu) != thread)
continue;
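			/*
			 * Secondary threads are only onlined when SMT is
			 * enabled and the thread is allowed by the current
			 * SMT level.
			 */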
if (!topology_is_primary_thread(cpu)) {
if (cpu_smt_control != CPU_SMT_ENABLED)
break;
if (!topology_smt_thread_allowed(cpu))
break;
}
cpu_maps_update_done();
find_and_update_cpu_nid(cpu);
rc = device_online(get_cpu_device(cpu));
if (rc) {
dlpar_offline_cpu(dn);
goto out;
}
cpu_maps_update_begin();
break;
}
if (cpu == num_possible_cpus())
printk(KERN_WARNING "Could not find cpu to online "
"with physical id 0x%x\n", thread);
}
cpu_maps_update_done();
out:
return rc;
}
static bool dlpar_cpu_exists(struct device_node *parent, u32 drc_index)
{
struct device_node *child = NULL;
u32 my_drc_index;
bool found;
int rc;
/* Assume cpu doesn't exist */
found = false;
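	/* Scan the existing CPU child nodes for a matching ibm,my-drc-index. */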
for_each_child_of_node(parent, child) {
rc = of_property_read_u32(child, "ibm,my-drc-index",
&my_drc_index);
if (rc)
continue;
if (my_drc_index == drc_index) {
of_node_put(child);
found = true;
break;
}
}
return found;
}
static bool drc_info_valid_index(struct device_node *parent, u32 drc_index)
{
struct property *info;
struct of_drc_info drc;
const __be32 *value;
u32 index;
int count, i, j;
info = of_find_property(parent, "ibm,drc-info", NULL);
if (!info)
return false;
value = of_prop_next_u32(info, NULL, &count);
/* First value of ibm,drc-info is number of drc-info records */
if (value)
value++;
else
return false;
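	/*
	 * Walk the drc-info records; for each CPU record, check whether
	 * drc_index falls within its range of sequential indexes.
	 */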
for (i = 0; i < count; i++) {
if (of_read_drc_info_cell(&info, &value, &drc))
return false;
if (strncmp(drc.drc_type, "CPU", 3))
break;
if (drc_index > drc.last_drc_index)
continue;
index = drc.drc_index_start;
for (j = 0; j < drc.num_sequential_elems; j++) {
if (drc_index == index)
return true;
index += drc.sequential_inc;
}
}
return false;
}
static bool valid_cpu_drc_index(struct device_node *parent, u32 drc_index)
{
bool found = false;
int rc, index;
if (of_property_present(parent, "ibm,drc-info"))
return drc_info_valid_index(parent, drc_index);
/* Note that the format of the ibm,drc-indexes array is
* the number of entries in the array followed by the array
* of drc values so we start looking at index = 1.
*/
index = 1;
while (!found) {
u32 drc;
rc = of_property_read_u32_index(parent, "ibm,drc-indexes",
index++, &drc);
if (rc)
break;
if (drc == drc_index)
found = true;
}
return found;
}
static int pseries_cpuhp_attach_nodes(struct device_node *dn)
{
struct of_changeset cs;
int ret;
/*
* This device node is unattached but may have siblings; open-code the
* traversal.
*/
for (of_changeset_init(&cs); dn != NULL; dn = dn->sibling) {
ret = of_changeset_attach_node(&cs, dn);
if (ret)
goto out;
}
ret = of_changeset_apply(&cs);
out:
of_changeset_destroy(&cs);
return ret;
}
static ssize_t dlpar_cpu_add(u32 drc_index)
{
struct device_node *dn, *parent;
int rc, saved_rc;
pr_debug("Attempting to add CPU, drc index: %x\n", drc_index);
parent = of_find_node_by_path("/cpus");
if (!parent) {
pr_warn("Failed to find CPU root node \"/cpus\"\n");
return -ENODEV;
}
if (dlpar_cpu_exists(parent, drc_index)) {
of_node_put(parent);
pr_warn("CPU with drc index %x already exists\n", drc_index);
return -EINVAL;
}
if (!valid_cpu_drc_index(parent, drc_index)) {
of_node_put(parent);
pr_warn("Cannot find CPU (drc index %x) to add.\n", drc_index);
return -EINVAL;
}
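	/* Acquire the DRC from the platform before configuring the new CPU. */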
rc = dlpar_acquire_drc(drc_index);
if (rc) {
pr_warn("Failed to acquire DRC, rc: %d, drc index: %x\n",
rc, drc_index);
of_node_put(parent);
return -EINVAL;
}
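	/*
	 * Ask firmware to build the device tree nodes for the new CPU
	 * (and any sibling cache nodes) via configure-connector.
	 */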
dn = dlpar_configure_connector(cpu_to_be32(drc_index), parent);
if (!dn) {
pr_warn("Failed call to configure-connector, drc index: %x\n",
drc_index);
dlpar_release_drc(drc_index);
of_node_put(parent);
return -EINVAL;
}
rc = pseries_cpuhp_attach_nodes(dn);
/* Regardless we are done with parent now */
of_node_put(parent);
if (rc) {
saved_rc = rc;
pr_warn("Failed to attach node %pOFn, rc: %d, drc index: %x\n",
dn, rc, drc_index);
rc = dlpar_release_drc(drc_index);
if (!rc)
dlpar_free_cc_nodes(dn);
return saved_rc;
}
update_numa_distance(dn);
rc = dlpar_online_cpu(dn);
if (rc) {
saved_rc = rc;
pr_warn("Failed to online cpu %pOFn, rc: %d, drc index: %x\n",
dn, rc, drc_index);
rc = dlpar_detach_node(dn);
if (!rc)
dlpar_release_drc(drc_index);
return saved_rc;
}
pr_debug("Successfully added CPU %pOFn, drc index: %x\n", dn,
drc_index);
return rc;
}
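/*
 * Count how many CPU and cache nodes currently reference cachedn as their
 * next-level cache. A count greater than one means the cache node is still
 * shared and must not be detached.
 */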
static unsigned int pseries_cpuhp_cache_use_count(const struct device_node *cachedn)
{
unsigned int use_count = 0;
struct device_node *dn, *tn;
WARN_ON(!of_node_is_type(cachedn, "cache"));
for_each_of_cpu_node(dn) {
tn = of_find_next_cache_node(dn);
of_node_put(tn);
if (tn == cachedn)
use_count++;
}
for_each_node_by_type(dn, "cache") {
tn = of_find_next_cache_node(dn);
of_node_put(tn);
if (tn == cachedn)
use_count++;
}
return use_count;
}
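/*
 * Detach cpudn and walk its cache hierarchy, also detaching any cache node
 * that no other CPU or cache references, all in one changeset.
 */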
static int pseries_cpuhp_detach_nodes(struct device_node *cpudn)
{
struct device_node *dn;
struct of_changeset cs;
int ret = 0;
of_changeset_init(&cs);
ret = of_changeset_detach_node(&cs, cpudn);
if (ret)
goto out;
dn = cpudn;
while ((dn = of_find_next_cache_node(dn))) {
if (pseries_cpuhp_cache_use_count(dn) > 1) {
of_node_put(dn);
break;
}
ret = of_changeset_detach_node(&cs, dn);
of_node_put(dn);
if (ret)
goto out;
}
ret = of_changeset_apply(&cs);
out:
of_changeset_destroy(&cs);
return ret;
}
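/*
 * Hot-remove the given CPU: offline its threads, release its DRC and
 * detach its device tree nodes, undoing the earlier steps on failure.
 */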
static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
{
int rc;
pr_debug("Attempting to remove CPU %pOFn, drc index: %x\n",
dn, drc_index);
rc = dlpar_offline_cpu(dn);
if (rc) {
pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
return -EINVAL;
}
rc = dlpar_release_drc(drc_index);
if (rc) {
pr_warn("Failed to release drc (%x) for CPU %pOFn, rc: %d\n",
drc_index, dn, rc);
dlpar_online_cpu(dn);
return rc;
}
rc = pseries_cpuhp_detach_nodes(dn);
if (rc) {
int saved_rc = rc;
pr_warn("Failed to detach CPU %pOFn, rc: %d", dn, rc);
rc = dlpar_acquire_drc(drc_index);
if (!rc)
dlpar_online_cpu(dn);
return saved_rc;
}
pr_debug("Successfully removed CPU, drc index: %x\n", drc_index);
return 0;
}
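/*
 * Find the cpu node whose "ibm,my-drc-index" matches drc_index. Returns
 * the node with a reference held, or NULL if none is found.
 */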
static struct device_node *cpu_drc_index_to_dn(u32 drc_index)
{
struct device_node *dn;
u32 my_index;
int rc;
for_each_node_by_type(dn, "cpu") {
rc = of_property_read_u32(dn, "ibm,my-drc-index", &my_index);
if (rc)
continue;
if (my_index == drc_index)
break;
}
return dn;
}
static int dlpar_cpu_remove_by_index(u32 drc_index)
{
struct device_node *dn;
int rc;
dn = cpu_drc_index_to_dn(drc_index);
if (!dn) {
pr_warn("Cannot find CPU (drc index %x) to remove\n",
drc_index);
return -ENODEV;
}
rc = dlpar_cpu_remove(dn, drc_index);
of_node_put(dn);
return rc;
}
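/* Entry point for CPU hotplug requests from the DLPAR core. */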
int dlpar_cpu(struct pseries_hp_errorlog *hp_elog)
{
u32 drc_index;
int rc;
drc_index = hp_elog->_drc_u.drc_index;
lock_device_hotplug();
switch (hp_elog->action) {
case PSERIES_HP_ELOG_ACTION_REMOVE:
if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) {
rc = dlpar_cpu_remove_by_index(drc_index);
/*
* Setting the isolation state of an UNISOLATED/CONFIGURED
* device to UNISOLATE is a no-op, but the hypervisor can
* use it as a hint that the CPU removal failed.
*/
if (rc)
dlpar_unisolate_drc(drc_index);
}
else
rc = -EINVAL;
break;
case PSERIES_HP_ELOG_ACTION_ADD:
if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX)
rc = dlpar_cpu_add(drc_index);
else
rc = -EINVAL;
break;
default:
pr_err("Invalid action (%d) specified\n", hp_elog->action);
rc = -EINVAL;
break;
}
unlock_device_hotplug();
return rc;
}
#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
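/*
 * Handlers behind the CPU probe/release interface: probe parses a DRC
 * index from the buffer, release looks the CPU up by device tree path.
 */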
static ssize_t dlpar_cpu_probe(const char *buf, size_t count)
{
u32 drc_index;
int rc;
rc = kstrtou32(buf, 0, &drc_index);
if (rc)
return -EINVAL;
rc = dlpar_cpu_add(drc_index);
return rc ? rc : count;
}
static ssize_t dlpar_cpu_release(const char *buf, size_t count)
{
struct device_node *dn;
u32 drc_index;
int rc;
dn = of_find_node_by_path(buf);
if (!dn)
return -EINVAL;
rc = of_property_read_u32(dn, "ibm,my-drc-index", &drc_index);
if (rc) {
of_node_put(dn);
return -EINVAL;
}
rc = dlpar_cpu_remove(dn, drc_index);
of_node_put(dn);
return rc ? rc : count;
}
#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */
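/*
 * OF reconfiguration notifier: called when device tree nodes are attached
 * or detached, so newly added or removed cpu nodes can be accounted for.
 */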
static int pseries_smp_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
struct of_reconfig_data *rd = data;
int err = 0;
switch (action) {
case OF_RECONFIG_ATTACH_NODE:
err = pseries_add_processor(rd->dn);
break;
case OF_RECONFIG_DETACH_NODE:
pseries_remove_processor(rd->dn);
break;
}
return notifier_from_errno(err);
}
static struct notifier_block pseries_smp_nb = {
.notifier_call = pseries_smp_notifier,
};
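/*
 * Wire up the pseries CPU offline/die hooks, provided firmware exposes
 * the stop-self and query-cpu-stopped-state RTAS calls.
 */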
void __init pseries_cpu_hotplug_init(void)
{
int qcss_tok;
rtas_stop_self_token = rtas_function_token(RTAS_FN_STOP_SELF);
qcss_tok = rtas_function_token(RTAS_FN_QUERY_CPU_STOPPED_STATE);
if (rtas_stop_self_token == RTAS_UNKNOWN_SERVICE ||
qcss_tok == RTAS_UNKNOWN_SERVICE) {
printk(KERN_INFO "CPU Hotplug not supported by firmware "
"- disabling.\n");
return;
}
smp_ops->cpu_offline_self = pseries_cpu_offline_self;
smp_ops->cpu_disable = pseries_cpu_disable;
smp_ops->cpu_die = pseries_cpu_die;
}
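/*
 * Late initialisation: install the cpu probe/release hooks and, on LPAR,
 * record which CPU ids each NUMA node is using at boot.
 */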
static int __init pseries_dlpar_init(void)
{
unsigned int node;
#ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
ppc_md.cpu_probe = dlpar_cpu_probe;
ppc_md.cpu_release = dlpar_cpu_release;
#endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */
/* Processors can be added/removed only on LPAR */
powerpc/pseries: Prevent free CPU ids being reused on another node

When a CPU is hot added, the CPU ids are taken from the available mask from
the lower possible set. If that set of values was previously used for a CPU
attached to a different node, it appears to an application as if these CPUs
have migrated from one node to another node which is not expected.

To prevent this, it is needed to record the CPU ids used for each node and
to not reuse them on another node. However, to prevent CPU hot plug to fail,
in the case the CPU ids is starved on a node, the capability to reuse other
nodes’ free CPU ids is kept. A warning is displayed in such a case to warn
the user.

A new CPU bit mask (node_recorded_ids_map) is introduced for each possible
node. It is populated with the CPU onlined at boot time, and then when a CPU
is hot plugged to a node. The bits in that mask remain when the CPU is hot
unplugged, to remind this CPU ids have been used for this node.

If no id set was found, a retry is made without removing the ids used on the
other nodes to try reusing them. This is the way ids have been allocated
prior to this patch.

The effect of this patch can be seen by removing and adding CPUs using the
Qemu monitor. In the following case, the first CPU from the node 2 is
removed, then the first one from the node 1 is removed too. Later, the first
CPU of the node 2 is added back. Without that patch, the kernel will number
these CPUs using the first CPU ids available which are the ones freed when
removing the second CPU of the node 0. This leads to the CPU ids 16-23 to
move from the node 1 to the node 2. With the patch applied, the CPU ids
32-39 are used since they are the lowest free ones which have not been used
on another node.

At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47

Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210429174908.16613-1-ldufour@linux.ibm.com
2021-04-29 20:49:08 +03:00
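	/*
	 * node_recorded_ids_map[] remembers every CPU id that has ever been
	 * used on a node, so that ids freed by a hot-remove are not handed
	 * out again on a different node (to user space that would look like
	 * CPUs migrating between nodes). Conceptually, id selection for a
	 * hot-added CPU follows this sketch ("candidates" is illustrative,
	 * not a variable in this file):
	 *
	 *	candidates = free possible ids;
	 *	for_each_node(n)
	 *		if (n != target node)
	 *			candidates &= ~node_recorded_ids_map[n];
	 *	if (candidates is empty)
	 *		warn and retry with all free ids;
	 */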
if (firmware_has_feature(FW_FEATURE_LPAR)) {
for_each_node(node) {
if (!alloc_cpumask_var_node(&node_recorded_ids_map[node],
GFP_KERNEL, node))
return -ENOMEM;
/* Record ids of CPU added at boot time */
cpumask_copy(node_recorded_ids_map[node],
cpumask_of_node(node));
}
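
		/*
		 * Watch the device tree for CPU nodes added or removed by
		 * DLPAR so the logical CPU maps are kept up to date.
		 */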
of_reconfig_notifier_register(&pseries_smp_nb);
}
return 0;
}
machine_arch_initcall(pseries, pseries_dlpar_init);