// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * pseries CPU Hotplug infrastructure.
 *
 * Split out from arch/powerpc/platforms/pseries/setup.c
 *  arch/powerpc/kernel/rtas.c, and arch/powerpc/platforms/pseries/smp.c
 *
 * Peter Bergner, IBM	March 2001.
 * Copyright (C) 2001 IBM.
 * Dave Engebretsen, Peter Bergner, and
 * Mike Corrigan {engebret|bergner|mikec}@us.ibm.com
 * Plus various changes from other IBM teams...
 *
 * Copyright (C) 2006 Michael Ellerman, IBM Corporation
*/

#define pr_fmt(fmt)	"pseries-hotplug-cpu: " fmt

#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/sched.h>	/* for idle_task_exit */
#include <linux/sched/hotplug.h>
#include <linux/cpu.h>
#include <linux/of.h>
#include <linux/slab.h>
#include <asm/prom.h>
#include <asm/rtas.h>
#include <asm/firmware.h>
#include <asm/machdep.h>
#include <asm/vdso_datapage.h>
#include <asm/xics.h>
#include <asm/xive.h>
#include <asm/plpar_wrappers.h>
#include <asm/topology.h>

#include "pseries.h"

/* This version can't take the spinlock, because it never returns */
static int rtas_stop_self_token = RTAS_UNKNOWN_SERVICE;

/*
 * Record the CPU ids used on each node.
 * Protected by cpu_add_remove_lock.
 */
static cpumask_var_t node_recorded_ids_map[MAX_NUMNODES];

static void rtas_stop_self(void)
{
	static struct rtas_args args;

	local_irq_disable();

	BUG_ON(rtas_stop_self_token == RTAS_UNKNOWN_SERVICE);

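	/*
	 * Use the unlocked RTAS entry point: stop-self never returns, so this
	 * path must not take the RTAS lock (see the comment above
	 * rtas_stop_self_token).
	 */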
	rtas_call_unlocked(&args, rtas_stop_self_token, 0, 1, NULL);

	panic("Alas, I survived.\n");
}
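
/*
 * pseries_cpu_offline_self() runs on the CPU that is going offline: it tears
 * down the interrupt controller state for this thread, unregisters the SLB
 * shadow buffer and VPA from the hypervisor, and finally calls
 * rtas_stop_self(), which does not return.
 */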
static void pseries_cpu_offline_self(void)
{
	unsigned int hwcpu = hard_smp_processor_id();

	local_irq_disable();
	idle_task_exit();
	if (xive_enabled())
		xive_teardown_cpu();
	else
		xics_teardown_cpu();

	unregister_slb_shadow(hwcpu);
	unregister_vpa(hwcpu);
	rtas_stop_self();

	/* Should never get here... */
	BUG();
	for (;;);
}

static int pseries_cpu_disable(void)
{
	int cpu = smp_processor_id();

	set_cpu_online(cpu, false);
	vdso_data->processorCount--;

	/* Fix boot_cpuid here */
	if (cpu == boot_cpuid)
		boot_cpuid = cpumask_any(cpu_online_mask);

	/* FIXME: abstract this to not be platform specific later on */
	if (xive_enabled())
		xive_smp_disable_cpu();
	else
		xics_migrate_irqs_away();

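	/*
	 * Clear this CPU from all mm_cpumasks and flush its TLBs now, so that
	 * once offline it no longer has to be targeted by broadcast TLB
	 * invalidations.
	 */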
	cleanup_cpu_mmu_context();

	return 0;
}

/*
 * pseries_cpu_die: Wait for the cpu to die.
 * @cpu: logical processor id of the CPU whose death we're awaiting.
 *
 * This function is called from the context of the thread which is performing
 * the cpu-offline. Here we wait for long enough to allow the cpu in question
 * to self-destroy so that the cpu-offline thread can send the CPU_DEAD
 * notifications.
 *
 * OTOH, pseries_cpu_offline_self() is called by the @cpu when it wants to
 * self-destruct.
 */
static void pseries_cpu_die(unsigned int cpu)
{
	int cpu_status = 1;
	unsigned int pcpu = get_hard_smp_processor_id(cpu);
	unsigned long timeout = jiffies + msecs_to_jiffies(120000);
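
	/*
	 * Poll the stopped state indefinitely rather than giving up after a
	 * fixed number of tries: returning before the CPU has really stopped
	 * would let the offline path race with the dying CPU's own teardown
	 * (e.g. its XIVE queue cleanup). Reschedule between polls and warn
	 * every 120 seconds so a stuck CPU remains visible.
	 */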
	while (true) {
		cpu_status = smp_query_cpu_stopped(pcpu);
		if (cpu_status == QCSS_STOPPED ||
		    cpu_status == QCSS_HARDWARE_ERROR)
			break;

		if (time_after(jiffies, timeout)) {
			pr_warn("CPU %i (hwid %i) didn't die after 120 seconds\n",
				cpu, pcpu);
			timeout = jiffies + msecs_to_jiffies(120000);
		}

		cond_resched();
	}

	if (cpu_status == QCSS_HARDWARE_ERROR) {
		pr_warn("CPU %i (hwid %i) reported error while dying\n",
			cpu, pcpu);
	}

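	/*
	 * Clear cpu_start so that, if this thread is started again later, it
	 * spins in the early secondary hold loop until it is explicitly
	 * kicked.
	 */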
	paca_ptrs[cpu]->cpu_start = 0;
}

/**
 * find_cpu_id_range - find a linear range of @nthreads free CPU ids.
 * @nthreads: the number of threads (cpu ids)
 * @assigned_node: the node it belongs to or NUMA_NO_NODE if free ids from any
 *                 node can be picked.
 * @cpu_mask: the returned CPU mask.
 *
 * Returns 0 on success.
 */
static int find_cpu_id_range(unsigned int nthreads, int assigned_node,
			     cpumask_var_t *cpu_mask)
{
	cpumask_var_t candidate_mask;
	unsigned int cpu, node;
	int rc = -ENOSPC;

	if (!zalloc_cpumask_var(&candidate_mask, GFP_KERNEL))
		return -ENOMEM;

	cpumask_clear(*cpu_mask);
	for (cpu = 0; cpu < nthreads; cpu++)
		cpumask_set_cpu(cpu, *cpu_mask);

	BUG_ON(!cpumask_subset(cpu_present_mask, cpu_possible_mask));

	/* Get a bitmap of unoccupied slots. */
	cpumask_xor(candidate_mask, cpu_possible_mask, cpu_present_mask);

	if (assigned_node != NUMA_NO_NODE) {
		/*
		 * Remove free ids previously assigned on the other nodes. We
		 * can walk only the online nodes because once a node has come
		 * online it is never turned offline again.
		 */
		for_each_online_node(node) {
			if (node == assigned_node)
				continue;
			cpumask_andnot(candidate_mask, candidate_mask,
				       node_recorded_ids_map[node]);
		}
	}

	if (cpumask_empty(candidate_mask))
		goto out;

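	/*
	 * Scan nthreads-sized, nthreads-aligned windows of ids until one is
	 * entirely free, which preserves the assumption elsewhere that
	 * sibling SMT threads get adjacent logical ids.
	 */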
	while (!cpumask_empty(*cpu_mask)) {
		if (cpumask_subset(*cpu_mask, candidate_mask))
			/* Found a range where we can insert the new cpu(s) */
			break;
		cpumask_shift_left(*cpu_mask, *cpu_mask, nthreads);
	}

	if (!cpumask_empty(*cpu_mask))
		rc = 0;

out:
	free_cpumask_var(candidate_mask);
	return rc;
}

/*
 * Update cpu_present_mask and paca(s) for a new cpu node.  The wrinkle
 * here is that a cpu device node may represent multiple logical cpus
 * in the SMT case.  We must honor the assumption in other code that
 * the logical ids for sibling SMT threads x and y are adjacent, such
 * that x^1 == y and y^1 == x.
 */
static int pseries_add_processor(struct device_node *np)
{
	int len, nthreads, node, cpu, assigned_node;
	int rc = 0;
	cpumask_var_t cpu_mask;
	const __be32 *intserv;

	intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);
	if (!intserv)
		return 0;

	nthreads = len / sizeof(u32);

	if (!alloc_cpumask_var(&cpu_mask, GFP_KERNEL))
		return -ENOMEM;

	/*
	 * Fetch, from the DT node read by dlpar_configure_connector(), the
	 * NUMA node id the added CPU belongs to.
	 */
	node = of_node_to_nid(np);
	if (node < 0 || !node_possible(node))
		node = first_online_node;

	BUG_ON(node == NUMA_NO_NODE);
	assigned_node = node;

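	/*
	 * cpu_maps_update_begin() takes cpu_add_remove_lock, which also
	 * protects node_recorded_ids_map (see its declaration above).
	 */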
	cpu_maps_update_begin();

	rc = find_cpu_id_range(nthreads, node, &cpu_mask);
	if (rc && nr_node_ids > 1) {
		/*
		 * Try again, considering the free CPU ids from the other
		 * nodes.
		 */
node = NUMA_NO_NODE;
rc = find_cpu_id_range(nthreads, NUMA_NO_NODE, &cpu_mask);
2006-12-05 09:52:38 +03:00
}
2021-04-29 20:49:08 +03:00
if (rc) {
pr_err("Cannot add cpu %pOF; this system configuration "
"supports %d logical cpus.\n", np, num_possible_cpus());
goto out;
2006-12-05 09:52:38 +03:00
}
2021-04-29 20:49:08 +03:00
for_each_cpu(cpu, cpu_mask) {
2011-04-28 09:07:23 +04:00
BUG_ON(cpu_present(cpu));
2009-09-24 19:34:48 +04:00
set_cpu_present(cpu, true);
2014-09-17 00:15:45 +04:00
set_hard_smp_processor_id(cpu, be32_to_cpu(*intserv++));
2006-12-05 09:52:38 +03:00
}
2021-04-29 20:49:08 +03:00
/* Record the newly used CPU ids for the associated node. */
cpumask_or(node_recorded_ids_map[assigned_node],
node_recorded_ids_map[assigned_node], cpu_mask);
/*
* If node is set to NUMA_NO_NODE, CPU ids have been reused from
* another node, remove them from its mask.
*/
if (node == NUMA_NO_NODE) {
cpu = cpumask_first(cpu_mask);
pr_warn("Reusing free CPU ids %d-%d from another node\n",
cpu, cpu + nthreads - 1);
for_each_online_node(node) {
if (node == assigned_node)
continue;
cpumask_andnot(node_recorded_ids_map[node],
node_recorded_ids_map[node],
cpu_mask);
}
}
out:
2008-01-25 23:08:02 +03:00
cpu_maps_update_done();
2021-04-29 20:49:08 +03:00
free_cpumask_var(cpu_mask);
return rc;
2006-12-05 09:52:38 +03:00
}
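As an aside, the id-range selection described in the change above (prefer ids inside the target node's recorded set, only falling back to any free id) can be pictured with the minimal sketch below. It is illustrative only and is not the in-tree find_cpu_id_range() defined earlier in this file; the helper name is made up, while node_recorded_ids_map and the cpumask calls are the ones used by the code above.
/* Illustrative sketch only -- not the in-tree find_cpu_id_range(). */
static int find_cpu_id_range_sketch(unsigned int nthreads, int node,
				    cpumask_var_t *cpu_mask)
{
	cpumask_var_t candidate_mask;
	unsigned int cpu, t;
	int rc = -ENOSPC;
	int n;

	if (!zalloc_cpumask_var(&candidate_mask, GFP_KERNEL))
		return -ENOMEM;

	/* Start from every possible id that is not currently present. */
	cpumask_andnot(candidate_mask, cpu_possible_mask, cpu_present_mask);

	/* Prefer ids never recorded against any other node. */
	if (node != NUMA_NO_NODE) {
		for_each_online_node(n) {
			if (n == node)
				continue;
			cpumask_andnot(candidate_mask, candidate_mask,
				       node_recorded_ids_map[n]);
		}
	}

	/* Take the first run of nthreads consecutive candidate ids. */
	for_each_cpu(cpu, candidate_mask) {
		if (cpu + nthreads > nr_cpu_ids)
			break;
		cpumask_clear(*cpu_mask);
		for (t = 0; t < nthreads; t++)
			cpumask_set_cpu(cpu + t, *cpu_mask);
		if (cpumask_subset(*cpu_mask, candidate_mask)) {
			rc = 0;
			break;
		}
	}
	if (rc)
		cpumask_clear(*cpu_mask);

	free_cpumask_var(candidate_mask);
	return rc;
}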
/*
* Update the present map for a cpu node which is going away, and set
* the hard id in the paca(s) to -1 to be consistent with boot time
* convention for non-present cpus.
*/
2006-12-05 09:52:39 +03:00
static void pseries_remove_processor(struct device_node *np)
2006-12-05 09:52:38 +03:00
{
unsigned int cpu;
int len, nthreads, i;
2014-09-12 23:11:42 +04:00
const __be32 *intserv;
u32 thread;
2006-12-05 09:52:38 +03:00
2007-04-03 16:26:41 +04:00
intserv = of_get_property(np, "ibm,ppc-interrupt-server#s", &len);
2006-12-05 09:52:38 +03:00
if (!intserv)
return;
nthreads = len / sizeof(u32);
2008-01-25 23:08:02 +03:00
cpu_maps_update_begin();
2006-12-05 09:52:38 +03:00
for (i = 0; i < nthreads; i++) {
2014-09-12 23:11:42 +04:00
thread = be32_to_cpu(intserv[i]);
2006-12-05 09:52:38 +03:00
for_each_present_cpu(cpu) {
2014-09-12 23:11:42 +04:00
if (get_hard_smp_processor_id(cpu) != thread)
2006-12-05 09:52:38 +03:00
continue;
BUG_ON(cpu_online(cpu));
2009-09-24 19:34:48 +04:00
set_cpu_present(cpu, false);
2006-12-05 09:52:38 +03:00
set_hard_smp_processor_id(cpu, -1);
2018-01-26 22:41:59 +03:00
update_numa_cpu_lookup_table(cpu, -1);
2006-12-05 09:52:38 +03:00
break;
}
2010-04-26 19:32:42 +04:00
if (cpu >= nr_cpu_ids)
2006-12-05 09:52:38 +03:00
printk(KERN_WARNING "Could not find cpu to remove "
2014-09-12 23:11:42 +04:00
"with physical id 0x%x\n", thread);
2006-12-05 09:52:38 +03:00
}
2008-01-25 23:08:02 +03:00
cpu_maps_update_done();
2006-12-05 09:52:38 +03:00
}
powerpc/pseries: safely roll back failed DLPAR cpu add
dlpar_online_cpu() attempts to online all threads of a core that has
been added to an LPAR. If onlining a non-primary thread
fails (e.g. due to an allocation failure), the core is left with at
least one thread online. dlpar_cpu_add() attempts to roll back the
whole operation, releasing the core back to the platform. However,
since some threads of the core being removed are still online, the
BUG_ON(cpu_online(cpu)) in pseries_remove_processor() strikes:
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 3 PID: 8587 Comm: drmgr Not tainted 5.3.0-rc2-00190-g9b123d1ea237-dirty #46
NIP: c0000000000eeb2c LR: c0000000000eeac4 CTR: c0000000000ee9e0
REGS: c0000001f745b6c0 TRAP: 0700 Not tainted (5.3.0-rc2-00190-g9b123d1ea237-dirty)
MSR: 800000010282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 44002448 XER: 00000000
CFAR: c00000000195d718 IRQMASK: 0
GPR00: c0000000000eeac4 c0000001f745b950 c0000000032f6200 0000000000000008
GPR04: 0000000000000008 c000000003349c78 0000000000000040 00000000000001ff
GPR08: 0000000000000008 0000000000000000 0000000000000001 0007ffffffffffff
GPR12: 0000000084002844 c00000001ecacb80 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000008
GPR24: c000000003349ee0 c00000000334a2e4 c0000000fca4d7a8 c000000001d20048
GPR28: 0000000000000001 ffffffffffffffff ffffffffffffffff c0000000fca4d7c4
NIP [c0000000000eeb2c] pseries_smp_notifier+0x14c/0x2e0
LR [c0000000000eeac4] pseries_smp_notifier+0xe4/0x2e0
Call Trace:
[c0000001f745b950] [c0000000000eeac4] pseries_smp_notifier+0xe4/0x2e0 (unreliable)
[c0000001f745ba10] [c0000000001ac774] notifier_call_chain+0xb4/0x190
[c0000001f745bab0] [c0000000001ad62c] blocking_notifier_call_chain+0x7c/0xb0
[c0000001f745baf0] [c00000000167bda0] of_detach_node+0xc0/0x110
[c0000001f745bb50] [c0000000000e7ae4] dlpar_detach_node+0x64/0xa0
[c0000001f745bb80] [c0000000000edefc] dlpar_cpu_add+0x31c/0x360
[c0000001f745bc10] [c0000000000ee980] dlpar_cpu_probe+0x50/0xb0
[c0000001f745bc50] [c00000000002cf70] arch_cpu_probe+0x40/0x70
[c0000001f745bc70] [c000000000ccd808] cpu_probe_store+0x48/0x80
[c0000001f745bcb0] [c000000000cbcef8] dev_attr_store+0x38/0x60
[c0000001f745bcd0] [c00000000059c980] sysfs_kf_write+0x70/0xb0
[c0000001f745bd10] [c00000000059afb8] kernfs_fop_write+0xf8/0x280
[c0000001f745bd60] [c0000000004b437c] __vfs_write+0x3c/0x70
[c0000001f745bd80] [c0000000004b8710] vfs_write+0xd0/0x220
[c0000001f745bdd0] [c0000000004b8acc] ksys_write+0x7c/0x140
[c0000001f745be20] [c00000000000bbd8] system_call+0x5c/0x68
Move dlpar_offline_cpu() up in the file so that dlpar_online_cpu() can
use it to re-offline any threads that have been onlined when an error
is encountered.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Fixes: e666ae0b10aa ("powerpc/pseries: Update CPU hotplug error recovery")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191016183611.10867-3-nathanl@linux.ibm.com
2019-10-16 21:36:11 +03:00
static int dlpar_offline_cpu(struct device_node *dn)
{
int rc = 0;
unsigned int cpu;
int len, nthreads, i;
const __be32 *intserv;
u32 thread;
intserv = of_get_property(dn, "ibm,ppc-interrupt-server#s", &len);
if (!intserv)
return -EINVAL;
nthreads = len / sizeof(u32);
cpu_maps_update_begin();
for (i = 0; i < nthreads; i++) {
thread = be32_to_cpu(intserv[i]);
for_each_present_cpu(cpu) {
if (get_hard_smp_processor_id(cpu) != thread)
continue;
2020-06-12 08:12:21 +03:00
if (!cpu_online(cpu))
2019-10-16 21:36:11 +03:00
break;
powerpc/pseries/hotplug-cpu: Show 'last online CPU' error in dlpar_cpu_offline()
One of the reasons that dlpar_cpu_offline can fail is when attempting to
offline the last online CPU of the kernel. This can be observed in a
pseries QEMU guest that has hotplugged CPUs. If the user offlines all
other CPUs of the guest, and a hotplugged CPU is now the last online
CPU, trying to reclaim it will fail.
The current error message in this situation returns rc with -EBUSY and a
generic explanation, e.g.:
pseries-hotplug-cpu: Failed to offline CPU PowerPC,POWER9, rc: -16
EBUSY can be caused by other conditions, such as cpu_hotplug_disable
being true. Throwing a more specific error message for this case,
instead of just "Failed to offline CPU", makes it clearer that the error
is in fact a known error situation instead of other generic/unknown
cause.
This patch adds a 'last online' check in dlpar_cpu_offline() to catch
the 'last online CPU' offline error, returning a more informative error
message:
pseries-hotplug-cpu: Unable to remove last online CPU PowerPC,POWER9
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210323205056.52768-2-danielhb413@gmail.com
2021-03-23 23:50:56 +03:00
/*
* device_offline() will return -EBUSY (via cpu_down()) if there
* is only one CPU left. Check it here to fail earlier and with a
* more informative error message, while also retaining the
* cpu_add_remove_lock to be sure that no CPUs are being
* online/offlined during this check.
*/
if (num_online_cpus() == 1) {
pr_warn("Unable to remove last online CPU %pOFn\n", dn);
rc = -EBUSY;
goto out_unlock;
}
2020-06-12 08:12:21 +03:00
cpu_maps_update_done();
rc = device_offline(get_cpu_device(cpu));
if (rc)
goto out;
cpu_maps_update_begin();
2019-10-16 21:36:11 +03:00
break;
}
if (cpu == num_possible_cpus()) {
pr_warn("Could not find cpu to offline with physical id 0x%x\n",
thread);
}
}
2021-03-23 23:50:56 +03:00
out_unlock:
2019-10-16 21:36:11 +03:00
cpu_maps_update_done();
out:
return rc;
}
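The rollback behaviour described in the "safely roll back failed DLPAR cpu add" change above boils down to the pattern sketched here. This is a simplified illustration, not the in-tree code: the helper name and the cpus array are hypothetical, and the real logic (including dropping the cpu maps lock around device_online()) lives in dlpar_online_cpu() just below.
/* Simplified illustration of the rollback flow; not the in-tree code. */
static int online_core_or_rollback_sketch(struct device_node *dn,
					  const unsigned int *cpus,
					  int nthreads)
{
	int i, rc;

	for (i = 0; i < nthreads; i++) {
		rc = device_online(get_cpu_device(cpus[i]));
		if (rc) {
			/* Offline whatever was already onlined, then let the
			 * caller release the core back to the platform.
			 */
			dlpar_offline_cpu(dn);
			return rc;
		}
	}
	return 0;
}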
2015-12-16 23:50:21 +03:00
static int dlpar_online_cpu(struct device_node *dn)
{
int rc = 0;
unsigned int cpu;
int len, nthreads, i;
const __be32 *intserv;
u32 thread;
intserv = of_get_property(dn, "ibm,ppc-interrupt-server#s", &len);
if (!intserv)
return -EINVAL;
nthreads = len / sizeof(u32);
cpu_maps_update_begin();
for (i = 0; i < nthreads; i++) {
thread = be32_to_cpu(intserv[i]);
for_each_present_cpu(cpu) {
if (get_hard_smp_processor_id(cpu) != thread)
continue;
2023-07-05 17:51:43 +03:00
if (!topology_is_primary_thread(cpu)) {
if (cpu_smt_control != CPU_SMT_ENABLED)
break;
if (!topology_smt_thread_allowed(cpu))
break;
}
2015-12-16 23:50:21 +03:00
cpu_maps_update_done();
2022-04-11 10:49:34 +03:00
find_and_update_cpu_nid(cpu);
2015-12-16 23:50:21 +03:00
rc = device_online(get_cpu_device(cpu));
2019-10-16 21:36:11 +03:00
if (rc) {
dlpar_offline_cpu(dn);
2015-12-16 23:50:21 +03:00
goto out;
2019-10-16 21:36:11 +03:00
}
2015-12-16 23:50:21 +03:00
cpu_maps_update_begin();
break;
}
if (cpu == num_possible_cpus())
printk(KERN_WARNING "Could not find cpu to online "
"with physical id 0x%x\n", thread);
}
cpu_maps_update_done();
out:
return rc;
}
static bool dlpar_cpu_exists(struct device_node *parent, u32 drc_index)
{
struct device_node *child = NULL;
u32 my_drc_index;
bool found;
int rc;
/* Assume cpu doesn't exist */
found = false;
for_each_child_of_node(parent, child) {
rc = of_property_read_u32(child, "ibm,my-drc-index",
&my_drc_index);
if (rc)
continue;
if (my_drc_index == drc_index) {
of_node_put(child);
found = true;
break;
}
}
return found;
}
powerpc/pseries: Add cpu DLPAR support for drc-info property
Older firmwares provided information about Dynamic Reconfig
Connectors (DRC) through several device tree properties, namely
ibm,drc-types, ibm,drc-indexes, ibm,drc-names, and
ibm,drc-power-domains. New firmwares have the ability to present this
same information in a much condensed format through a device tree
property called ibm,drc-info.
The existing cpu DLPAR hotplug code only understands the older DRC
property format when validating the drc-index of a cpu during a
hotplug add. This updates those code paths to use the ibm,drc-info
property, when present, instead for validation.
Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1573449697-5448-4-git-send-email-tyreld@linux.ibm.com
2019-11-11 08:21:30 +03:00
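As a rough aid to reading the parsing loop below: each decoded ibm,drc-info record (struct of_drc_info, filled in by of_read_drc_info_cell()) describes a run of sequential DRC indexes, which could be checked in isolation roughly as sketched here. The helper name is made up; the field usage mirrors the code that follows.
/* Illustrative helper only: does one decoded record cover drc_index? */
static bool drc_record_covers_index_sketch(const struct of_drc_info *drc,
					   u32 drc_index)
{
	u32 index = drc->drc_index_start;
	u32 j;

	for (j = 0; j < drc->num_sequential_elems; j++) {
		if (index == drc_index)
			return true;
		index += drc->sequential_inc;
	}
	return false;
}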
static bool drc_info_valid_index(struct device_node *parent, u32 drc_index)
{
struct property *info;
struct of_drc_info drc;
const __be32 *value;
u32 index;
int count, i, j;
info = of_find_property(parent, "ibm,drc-info", NULL);
if (!info)
return false;
value = of_prop_next_u32(info, NULL, &count);
/* First value of ibm,drc-info is number of drc-info records */
if (value)
value++;
else
return false;
for (i = 0; i < count; i++) {
if (of_read_drc_info_cell(&info, &value, &drc))
return false;
if (strncmp(drc.drc_type, "CPU", 3))
break;
if (drc_index > drc.last_drc_index)
continue;
index = drc.drc_index_start;
for (j = 0; j < drc.num_sequential_elems; j++) {
if (drc_index == index)
return true;
index += drc.sequential_inc;
}
}
return false;
}
2015-12-16 23:55:07 +03:00
static bool valid_cpu_drc_index(struct device_node *parent, u32 drc_index)
{
bool found = false;
int rc, index;
2023-03-10 17:46:56 +03:00
if (of_property_present(parent, "ibm,drc-info"))
2019-11-11 08:21:30 +03:00
return drc_info_valid_index(parent, drc_index);
/* Note that the format of the ibm,drc-indexes array is
* the number of entries in the array followed by the array
* of drc values so we start looking at index = 1.
*/
index = 1;
2015-12-16 23:55:07 +03:00
while (!found) {
u32 drc;
rc = of_property_read_u32_index(parent, "ibm,drc-indexes",
index++, &drc);
2019-11-11 08:21:30 +03:00
2015-12-16 23:55:07 +03:00
if (rc)
break;
if (drc == drc_index)
found = true;
}
return found;
}
powerpc/pseries/cpuhp: cache node corrections
On pseries, cache nodes in the device tree can be added and removed by the
CPU DLPAR code as well as the partition migration (mobility) code. PowerVM
partitions in dedicated processor mode typically have L2 and L3 cache
nodes.
The CPU DLPAR code has the following shortcomings:
* Cache nodes returned as siblings of a new CPU node by
ibm,configure-connector are silently discarded; only the CPU node is
added to the device tree.
* Cache nodes which become unreferenced in the processor removal path are
not removed from the device tree. This can lead to duplicate nodes when
the post-migration device tree update code replaces cache nodes.
This is long-standing behavior. Presumably it has gone mostly unnoticed
because the two bugs have the property of obscuring each other in common
simple scenarios (e.g. remove a CPU and add it back). Likely you'd notice
only if you cared to inspect the device tree or the sysfs cacheinfo
information.
Booted with two processors:
$ pwd
/sys/firmware/devicetree/base/cpus
$ ls -1d */
l2-cache@2010/
l2-cache@2011/
l3-cache@3110/
l3-cache@3111/
PowerPC,POWER9@0/
PowerPC,POWER9@8/
$ lsprop */l2-cache
l2-cache@2010/l2-cache
00003110 (12560)
l2-cache@2011/l2-cache
00003111 (12561)
PowerPC,POWER9@0/l2-cache
00002010 (8208)
PowerPC,POWER9@8/l2-cache
00002011 (8209)
$ ls /sys/devices/system/cpu/cpu0/cache/
index0 index1 index2 index3
After DLPAR-adding PowerPC,POWER9@10, we see that its associated cache
nodes are absent, its threads' L2+L3 cacheinfo is unpopulated, and it is
missing a cache level in its sched domain hierarchy:
$ ls -1d */
l2-cache@2010/
l2-cache@2011/
l3-cache@3110/
l3-cache@3111/
PowerPC,POWER9@0/
PowerPC,POWER9@10/
PowerPC,POWER9@8/
$ lsprop PowerPC\,POWER9@10/l2-cache
PowerPC,POWER9@10/l2-cache
00002012 (8210)
$ ls /sys/devices/system/cpu/cpu16/cache/
index0 index1
$ grep . /sys/kernel/debug/sched/domains/cpu{0,8,16}/domain*/name
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu8/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu8/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu8/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu16/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu16/domain1/name:DIE
When removing PowerPC,POWER9@8, we see that its cache nodes are left
behind:
$ ls -1d */
l2-cache@2010/
l2-cache@2011/
l3-cache@3110/
l3-cache@3111/
PowerPC,POWER9@0/
When DLPAR is combined with VM migration, we can get duplicate nodes. E.g.
removing one processor, then migrating, adding a processor, and then
migrating again can result in warnings from the OF core during
post-migration device tree updates:
Duplicate name in cpus, renamed to "l2-cache@2011#1"
Duplicate name in cpus, renamed to "l3-cache@3111#1"
and nodes with duplicated phandles in the tree, making lookup behavior
unpredictable:
$ lsprop l[23]-cache@*/ibm,phandle
l2-cache@2010/ibm,phandle
00002010 (8208)
l2-cache@2011#1/ibm,phandle
00002011 (8209)
l2-cache@2011/ibm,phandle
00002011 (8209)
l3-cache@3110/ibm,phandle
00003110 (12560)
l3-cache@3111#1/ibm,phandle
00003111 (12561)
l3-cache@3111/ibm,phandle
00003111 (12561)
Address these issues by:
* Correctly processing siblings of the node returned from
dlpar_configure_connector().
* Removing cache nodes in the CPU remove path when it can be determined
that they are not associated with other CPUs or caches.
Use the of_changeset API in both cases, which allows us to keep the error
handling in this code from becoming more complex while ensuring that the
device tree cannot become inconsistent.
Fixes: ac71380071d1 ("powerpc/pseries: Add CPU dlpar remove functionality")
Fixes: 90edf184b9b7 ("powerpc/pseries: Add CPU dlpar add functionality")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210927201933.76786-2-nathanl@linux.ibm.com
2021-09-27 23:19:30 +03:00
static int pseries_cpuhp_attach_nodes(struct device_node *dn)
{
struct of_changeset cs;
int ret;
/*
* This device node is unattached but may have siblings; open-code the
* traversal.
*/
for (of_changeset_init(&cs); dn != NULL; dn = dn->sibling) {
ret = of_changeset_attach_node(&cs, dn);
if (ret)
goto out;
}
ret = of_changeset_apply(&cs);
out:
of_changeset_destroy(&cs);
return ret;
}
2015-12-16 23:51:26 +03:00
static ssize_t dlpar_cpu_add(u32 drc_index)
2015-12-16 23:50:21 +03:00
{
struct device_node *dn, *parent;
2015-12-16 23:52:39 +03:00
int rc, saved_rc;
pr_debug("Attempting to add CPU, drc index: %x\n", drc_index);
2015-12-16 23:50:21 +03:00
parent = of_find_node_by_path("/cpus");
2015-12-16 23:52:39 +03:00
if (!parent) {
pr_warn("Failed to find CPU root node \"/cpus\"\n");
2015-12-16 23:50:21 +03:00
return -ENODEV;
2015-12-16 23:52:39 +03:00
}
2015-12-16 23:50:21 +03:00
if (dlpar_cpu_exists(parent, drc_index)) {
of_node_put(parent);
2015-12-16 23:52:39 +03:00
pr_warn("CPU with drc index %x already exists\n", drc_index);
2015-12-16 23:50:21 +03:00
return -EINVAL;
}
2015-12-16 23:55:07 +03:00
if (!valid_cpu_drc_index(parent, drc_index)) {
of_node_put(parent);
pr_warn("Cannot find CPU (drc index %x) to add.\n", drc_index);
return -EINVAL;
}
2015-12-16 23:50:21 +03:00
rc = dlpar_acquire_drc(drc_index);
if (rc) {
2015-12-16 23:52:39 +03:00
pr_warn("Failed to acquire DRC, rc: %d, drc index: %x\n",
rc, drc_index);
2015-12-16 23:50:21 +03:00
of_node_put(parent);
return -EINVAL;
}
dn = dlpar_configure_connector(cpu_to_be32(drc_index), parent);
2015-12-16 23:52:39 +03:00
if (!dn) {
pr_warn("Failed call to configure-connector, drc index: %x\n",
drc_index);
dlpar_release_drc(drc_index);
2017-09-21 00:02:51 +03:00
of_node_put(parent);
2015-12-16 23:50:21 +03:00
return -EINVAL;
2015-12-16 23:52:39 +03:00
}
2015-12-16 23:50:21 +03:00
2021-09-27 23:19:30 +03:00
rc = pseries_cpuhp_attach_nodes(dn);
2017-09-21 00:02:51 +03:00
/* Regardless we are done with parent now */
of_node_put(parent);
2015-12-16 23:50:21 +03:00
if (rc) {
2015-12-16 23:52:39 +03:00
saved_rc = rc;
2018-08-28 04:52:07 +03:00
pr_warn("Failed to attach node %pOFn, rc: %d, drc index: %x\n",
dn, rc, drc_index);
2015-12-16 23:52:39 +03:00
rc = dlpar_release_drc(drc_index);
if (!rc)
dlpar_free_cc_nodes(dn);
return saved_rc;
2015-12-16 23:50:21 +03:00
}
2021-08-12 16:22:21 +03:00
update_numa_distance(dn);
2015-12-16 23:50:21 +03:00
rc = dlpar_online_cpu(dn);
2015-12-16 23:52:39 +03:00
if (rc) {
saved_rc = rc;
2018-08-28 04:52:07 +03:00
pr_warn("Failed to online cpu %pOFn, rc: %d, drc index: %x\n",
dn, rc, drc_index);
2015-12-16 23:52:39 +03:00
rc = dlpar_detach_node(dn);
if (!rc)
dlpar_release_drc(drc_index);
return saved_rc;
}
2018-08-28 04:52:07 +03:00
pr_debug("Successfully added CPU %pOFn, drc index: %x\n", dn,
2015-12-16 23:52:39 +03:00
drc_index);
2015-12-16 23:51:26 +03:00
return rc;
2015-12-16 23:50:21 +03:00
}
2021-09-27 23:19:30 +03:00
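The attach-side helper called from dlpar_cpu_add() above, pseries_cpuhp_attach_nodes(), is outside this hunk. A minimal sketch of the approach described in the change above, assuming the cache nodes returned by ibm,configure-connector are reachable from the new CPU node through its sibling pointers (illustrative only, not necessarily the in-tree body):

static int pseries_cpuhp_attach_nodes(struct device_node *dn)
{
	struct of_changeset cs;
	int ret;

	of_changeset_init(&cs);

	/* Queue the CPU node and each sibling (cache) node for attachment. */
	for (; dn != NULL; dn = dn->sibling) {
		ret = of_changeset_attach_node(&cs, dn);
		if (ret)
			goto out;
	}

	/* Apply atomically so the tree never holds a partial set of nodes. */
	ret = of_changeset_apply(&cs);
out:
	of_changeset_destroy(&cs);

	return ret;
}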
static unsigned int pseries_cpuhp_cache_use_count ( const struct device_node * cachedn )
{
unsigned int use_count = 0 ;
2022-06-21 14:17:01 +03:00
struct device_node * dn , * tn ;
2021-09-27 23:19:30 +03:00
WARN_ON ( ! of_node_is_type ( cachedn , "cache" ) ) ;
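/*
 * Count how many CPU nodes and cache nodes currently point at cachedn as
 * their next level of cache (the reference that of_find_next_cache_node()
 * follows). A count greater than one means the cache is still shared and
 * must stay in the device tree when a single CPU is removed.
 */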
for_each_of_cpu_node ( dn ) {
2022-06-21 14:17:01 +03:00
tn = of_find_next_cache_node ( dn ) ;
of_node_put ( tn ) ;
if ( tn == cachedn )
2021-09-27 23:19:30 +03:00
use_count++ ;
}
for_each_node_by_type ( dn , " cache " ) {
2022-06-21 14:17:01 +03:00
tn = of_find_next_cache_node ( dn ) ;
of_node_put ( tn ) ;
if ( tn == cachedn )
2021-09-27 23:19:30 +03:00
use_count++ ;
}
return use_count ;
}
static int pseries_cpuhp_detach_nodes ( struct device_node * cpudn )
{
struct device_node * dn ;
struct of_changeset cs ;
int ret = 0 ;
of_changeset_init ( & cs ) ;
ret = of_changeset_detach_node ( & cs , cpudn ) ;
if ( ret )
goto out ;
dn = cpudn ;
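/*
 * Walk the cache hierarchy upward from the outgoing CPU node. Detach each
 * cache node that would otherwise be left unreferenced; stop at the first
 * level that is still shared with another CPU or cache, since every level
 * above it is shared as well.
 */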
while ( ( dn = of_find_next_cache_node ( dn ) ) ) {
2022-06-21 14:17:01 +03:00
if ( pseries_cpuhp_cache_use_count ( dn ) > 1 ) {
of_node_put ( dn ) ;
2021-09-27 23:19:30 +03:00
break ;
2022-06-21 14:17:01 +03:00
}
2021-09-27 23:19:30 +03:00
ret = of_changeset_detach_node ( & cs , dn ) ;
2022-06-21 14:17:01 +03:00
of_node_put ( dn ) ;
2021-09-27 23:19:30 +03:00
if ( ret )
goto out ;
}
ret = of_changeset_apply ( & cs ) ;
out :
of_changeset_destroy ( & cs ) ;
return ret ;
}
2015-12-16 23:51:26 +03:00
static ssize_t dlpar_cpu_remove ( struct device_node * dn , u32 drc_index )
{
int rc ;
2018-08-28 04:52:07 +03:00
pr_debug ( " Attempting to remove CPU %pOFn, drc index: %x \n " ,
dn , drc_index ) ;
2015-12-16 23:52:39 +03:00
2015-12-16 23:51:26 +03:00
rc = dlpar_offline_cpu ( dn ) ;
2015-12-16 23:52:39 +03:00
if ( rc ) {
2018-08-28 04:52:07 +03:00
pr_warn ( " Failed to offline CPU %pOFn, rc: %d \n " , dn , rc ) ;
2015-12-16 23:51:26 +03:00
return - EINVAL ;
2015-12-16 23:52:39 +03:00
}
2015-12-16 23:51:26 +03:00
rc = dlpar_release_drc ( drc_index ) ;
2015-12-16 23:52:39 +03:00
if ( rc ) {
2018-08-28 04:52:07 +03:00
pr_warn ( " Failed to release drc (%x) for CPU %pOFn, rc: %d \n " ,
drc_index , dn , rc ) ;
2015-12-16 23:52:39 +03:00
dlpar_online_cpu ( dn ) ;
2015-12-16 23:51:26 +03:00
return rc ;
2015-12-16 23:52:39 +03:00
}
2015-12-16 23:51:26 +03:00
2021-09-27 23:19:30 +03:00
rc = pseries_cpuhp_detach_nodes ( dn ) ;
2015-12-16 23:52:39 +03:00
if ( rc ) {
int saved_rc = rc ;
2015-12-16 23:51:26 +03:00
2018-08-28 04:52:07 +03:00
pr_warn ( " Failed to detach CPU %pOFn, rc: %d " , dn , rc ) ;
2015-12-16 23:52:39 +03:00
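/*
 * Detaching the device tree nodes failed: undo the removal by re-acquiring
 * the DRC and bringing the CPU back online, then report the original error.
 */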
rc = dlpar_acquire_drc ( drc_index ) ;
if ( ! rc )
dlpar_online_cpu ( dn ) ;
return saved_rc ;
}
pr_debug ( " Successfully removed CPU, drc index: %x \n " , drc_index ) ;
return 0 ;
2015-12-16 23:51:26 +03:00
}
2015-12-16 23:54:05 +03:00
static struct device_node * cpu_drc_index_to_dn ( u32 drc_index )
{
struct device_node * dn ;
u32 my_index ;
int rc ;
for_each_node_by_type ( dn , " cpu " ) {
rc = of_property_read_u32 ( dn , " ibm,my-drc-index " , & my_index ) ;
if ( rc )
continue ;
if ( my_index = = drc_index )
break ;
}
return dn ;
}
static int dlpar_cpu_remove_by_index ( u32 drc_index )
{
struct device_node * dn ;
int rc ;
dn = cpu_drc_index_to_dn ( drc_index ) ;
if ( ! dn ) {
pr_warn ( " Cannot find CPU (drc index %x) to remove \n " ,
drc_index ) ;
return - ENODEV ;
}
rc = dlpar_cpu_remove ( dn , drc_index ) ;
of_node_put ( dn ) ;
return rc ;
}
int dlpar_cpu ( struct pseries_hp_errorlog * hp_elog )
{
2021-09-27 23:19:30 +03:00
u32 drc_index ;
2015-12-16 23:54:05 +03:00
int rc ;
drc_index = hp_elog->_drc_u.drc_index ;
lock_device_hotplug ( ) ;
switch ( hp_elog->action ) {
case PSERIES_HP_ELOG_ACTION_REMOVE :
2021-09-27 23:19:32 +03:00
if ( hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX ) {
2015-12-16 23:54:05 +03:00
rc = dlpar_cpu_remove_by_index ( drc_index ) ;
2021-04-17 00:02:16 +03:00
/*
* Setting the isolation state of an UNISOLATED/CONFIGURED
* device to UNISOLATE is a no-op, but the hypervisor can
* use it as a hint that the CPU removal failed.
*/
if ( rc )
dlpar_unisolate_drc ( drc_index ) ;
}
2015-12-16 23:54:05 +03:00
else
rc = - EINVAL ;
break ;
2015-12-16 23:55:07 +03:00
case PSERIES_HP_ELOG_ACTION_ADD :
2021-09-27 23:19:32 +03:00
if ( hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX )
2015-12-16 23:55:07 +03:00
rc = dlpar_cpu_add ( drc_index ) ;
else
rc = - EINVAL ;
break ;
2015-12-16 23:54:05 +03:00
default :
pr_err ( " Invalid action (%d) specified \n " , hp_elog - > action ) ;
rc = - EINVAL ;
break ;
}
unlock_device_hotplug ( ) ;
return rc ;
}
2015-12-16 23:51:26 +03:00
# ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
static ssize_t dlpar_cpu_probe ( const char * buf , size_t count )
{
u32 drc_index ;
int rc ;
rc = kstrtou32 ( buf , 0 , & drc_index ) ;
if ( rc )
return - EINVAL ;
rc = dlpar_cpu_add ( drc_index ) ;
return rc ? rc : count ;
}
2015-12-16 23:50:21 +03:00
static ssize_t dlpar_cpu_release ( const char * buf , size_t count )
{
struct device_node * dn ;
u32 drc_index ;
int rc ;
dn = of_find_node_by_path ( buf ) ;
if ( ! dn )
return - EINVAL ;
rc = of_property_read_u32 ( dn , "ibm,my-drc-index" , & drc_index ) ;
if ( rc ) {
of_node_put ( dn ) ;
return - EINVAL ;
}
2015-12-16 23:51:26 +03:00
rc = dlpar_cpu_remove ( dn , drc_index ) ;
2015-12-16 23:50:21 +03:00
of_node_put ( dn ) ;
2015-12-16 23:51:26 +03:00
return rc ? rc : count ;
2015-12-16 23:50:21 +03:00
}
# endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */
2006-12-05 09:52:39 +03:00
static int pseries_smp_notifier ( struct notifier_block * nb ,
2014-11-24 20:58:01 +03:00
unsigned long action , void * data )
2006-12-05 09:52:38 +03:00
{
2014-11-24 20:58:01 +03:00
struct of_reconfig_data * rd = data ;
2011-06-21 07:35:56 +04:00
int err = 0 ;
2006-12-05 09:52:38 +03:00
switch ( action ) {
2012-10-02 20:57:57 +04:00
case OF_RECONFIG_ATTACH_NODE :
2014-11-24 20:58:01 +03:00
err = pseries_add_processor ( rd->dn ) ;
2006-12-05 09:52:38 +03:00
break ;
2012-10-02 20:57:57 +04:00
case OF_RECONFIG_DETACH_NODE :
2014-11-24 20:58:01 +03:00
pseries_remove_processor ( rd->dn ) ;
2006-12-05 09:52:38 +03:00
break ;
}
2011-06-21 07:35:56 +04:00
return notifier_from_errno ( err ) ;
2006-12-05 09:52:38 +03:00
}
2006-12-05 09:52:39 +03:00
static struct notifier_block pseries_smp_nb = {
. notifier_call = pseries_smp_notifier ,
2006-12-05 09:52:38 +03:00
} ;
2023-07-05 17:51:41 +03:00
void __init pseries_cpu_hotplug_init ( void )
2006-12-05 09:52:36 +03:00
{
2010-04-28 17:39:41 +04:00
int qcss_tok ;
2015-12-16 23:50:21 +03:00
2023-02-10 21:42:08 +03:00
rtas_stop_self_token = rtas_function_token ( RTAS_FN_STOP_SELF ) ;
qcss_tok = rtas_function_token ( RTAS_FN_QUERY_CPU_STOPPED_STATE ) ;
2006-12-05 09:52:36 +03:00
2014-02-20 14:13:52 +04:00
if ( rtas_stop_self_token == RTAS_UNKNOWN_SERVICE ||
2006-12-05 09:52:38 +03:00
qcss_tok == RTAS_UNKNOWN_SERVICE ) {
printk ( KERN_INFO " CPU Hotplug not supported by firmware "
" - disabling. \n " ) ;
2023-07-05 17:51:41 +03:00
return ;
2006-12-05 09:52:38 +03:00
}
2006-12-05 09:52:37 +03:00
2020-08-19 04:56:34 +03:00
smp_ops->cpu_offline_self = pseries_cpu_offline_self ;
2006-12-05 09:52:39 +03:00
smp_ops->cpu_disable = pseries_cpu_disable ;
smp_ops->cpu_die = pseries_cpu_die ;
2023-07-05 17:51:41 +03:00
}
static int __init pseries_dlpar_init ( void )
{
unsigned int node ;
# ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
ppc_md . cpu_probe = dlpar_cpu_probe ;
ppc_md . cpu_release = dlpar_cpu_release ;
# endif /* CONFIG_ARCH_CPU_PROBE_RELEASE */
2006-12-05 09:52:38 +03:00
/* Processors can be added/removed only on LPAR */
powerpc/pseries: Prevent free CPU ids being reused on another node
When a CPU is hot added, its CPU ids are taken from the available mask,
starting from the lowest possible values. If those values were previously
used for a CPU attached to a different node, it appears to an application as
if these CPUs have migrated from one node to another, which is not expected.
To prevent this, the CPU ids used on each node are recorded and are not
reused on another node. However, so that CPU hot plug does not fail when a
node runs out of CPU ids, the ability to reuse other nodes' free CPU ids is
kept; a warning is displayed in that case to alert the user.
A new CPU bit mask (node_recorded_ids_map) is introduced for each
possible node. It is populated with the CPUs onlined at boot time, and then
updated when a CPU is hot plugged to a node. The bits in that mask remain set
when the CPU is hot unplugged, recording that these CPU ids have already been
used for this node.
If no suitable id set is found, a retry is made without excluding the ids
used on the other nodes, allowing them to be reused. This is how ids were
allocated prior to this patch.
The effect of this patch can be seen by removing and adding CPUs using
the Qemu monitor. In the following case, the first CPU from the node 2
is removed, then the first one from the node 1 is removed too. Later,
the first CPU of the node 2 is added back. Without that patch, the
kernel numbers these CPUs using the first available CPU ids, which are the
ones freed by the second removal (the first CPU of node 1). This causes CPU
ids 16-23 to move from node 1 to node 2. With the patch applied, CPU ids
32-39 are used, since they are the lowest free ids which have not been used
on another node.
At boot time:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Vanilla kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 16 17 18 19 20 21 22 23 40 41 42 43 44 45 46 47
Patched kernel, after the CPU hot unplug/plug operations:
[root@vm40 ~]# numactl -H | grep cpus
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 1 cpus: 24 25 26 27 28 29 30 31
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210429174908.16613-1-ldufour@linux.ibm.com
2021-04-29 20:49:08 +03:00
if ( firmware_has_feature ( FW_FEATURE_LPAR ) ) {
for_each_node ( node ) {
2021-11-05 16:29:23 +03:00
if ( ! alloc_cpumask_var_node ( & node_recorded_ids_map [ node ] ,
GFP_KERNEL , node ) )
return - ENOMEM ;
2021-04-29 20:49:08 +03:00
/* Record ids of CPU added at boot time */
2021-11-05 16:29:23 +03:00
cpumask_copy ( node_recorded_ids_map [ node ] ,
cpumask_of_node ( node ) ) ;
2021-04-29 20:49:08 +03:00
}
2012-10-02 20:57:57 +04:00
of_reconfig_notifier_register ( & pseries_smp_nb ) ;
2021-04-29 20:49:08 +03:00
}
2006-12-05 09:52:38 +03:00
2006-12-05 09:52:36 +03:00
return 0 ;
}
2023-07-05 17:51:41 +03:00
machine_arch_initcall ( pseries , pseries_dlpar_init ) ;
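For reference, a simplified sketch of the per-node CPU id bookkeeping described in the "powerpc/pseries: Prevent free CPU ids being reused on another node" change above. This is illustrative only; the function name is hypothetical and the in-tree allocator also handles whole thread ranges and locking.

/*
 * Pick an id for a CPU being added to "node", preferring ids never recorded
 * against any other node, and falling back (with a warning) to any free id
 * when the preferred set is exhausted.
 */
static int sketch_pick_cpu_id(unsigned int node, cpumask_var_t candidates)
{
	unsigned int cpu;
	int nid;

	/* Start from ids that are possible but not currently present. */
	cpumask_andnot(candidates, cpu_possible_mask, cpu_present_mask);

	/* Exclude ids already recorded as used on other nodes. */
	for_each_online_node(nid) {
		if (nid == node)
			continue;
		cpumask_andnot(candidates, candidates, node_recorded_ids_map[nid]);
	}

	cpu = cpumask_first(candidates);
	if (cpu >= nr_cpu_ids) {
		/* Starved on this node: allow reuse of other nodes' free ids. */
		pr_warn("Reusing free CPU ids of another node\n");
		cpumask_andnot(candidates, cpu_possible_mask, cpu_present_mask);
		cpu = cpumask_first(candidates);
	}

	if (cpu >= nr_cpu_ids)
		return -ENOSPC;

	/* Remember that this id has now been used on this node. */
	cpumask_set_cpu(cpu, node_recorded_ids_map[node]);

	return cpu;
}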