2011-01-24 18:42:41 +11:00
/*
2016-07-08 11:50:49 +05:30
* This file contains idle entry/exit functions for POWER7,
* POWER8 and POWER9 CPUs.
2011-01-24 18:42:41 +11:00
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
#include <linux/threads.h>
#include <asm/processor.h>
#include <asm/page.h>
#include <asm/cputable.h>
#include <asm/thread_info.h>
#include <asm/ppc_asm.h>
#include <asm/asm-offsets.h>
#include <asm/ppc-opcode.h>
powerpc: Rework lazy-interrupt handling
The current implementation of lazy interrupts handling has some
issues that this tries to address.
We don't do the various workarounds we need to do when re-enabling
interrupts in some cases such as when returning from an interrupt
and thus we may still lose or get delayed decrementer or doorbell
interrupts.
The current scheme also makes it much harder to handle the external
"edge" interrupts provided by some BookE processors when using the
EPR facility (External Proxy) and the Freescale Hypervisor.
Additionally, we tend to keep interrupts hard disabled in a number
of cases, such as decrementer interrupts, external interrupts, or
when a masked decrementer interrupt is pending. This is sub-optimal.
This is an attempt at fixing it all in one go by reworking the way
we do the lazy interrupt disabling from the ground up.
The base idea is to replace the "hard_enabled" field with a
"irq_happened" field in which we store a bit mask of what interrupt
occurred while soft-disabled.
When re-enabling, either via arch_local_irq_restore() or when returning
from an interrupt, we can now decide what to do by testing bits in that
field.
We then implement replaying of the missed interrupts either by
re-using the existing exception frame (in exception exit case) or via
the creation of a new one from an assembly trampoline (in the
arch_local_irq_enable case).
This removes the need to play with the decrementer to try to create
fake interrupts, among others.
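As a rough illustration of the bookkeeping described above, the following C sketch models an "irq_happened" bit mask that records interrupts while soft-disabled and hands them back for replay on re-enable. The bit names and values, the struct cpu_state type and both helper functions are hypothetical stand-ins chosen for this example; they are not the kernel's definitions, and the real replay is done in assembly and in arch_local_irq_restore().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
enum {
    IRQ_HAPPENED_DBELL = 0x02, /* doorbell arrived while soft-disabled */
    IRQ_HAPPENED_EE    = 0x04, /* external interrupt arrived while soft-disabled */
    IRQ_HAPPENED_DEC   = 0x08, /* decrementer fired while soft-disabled */
};

struct cpu_state {
    bool soft_enabled;     /* lazy (software) interrupt state */
    uint8_t irq_happened;  /* mask of interrupts seen while soft-disabled */
};

/* Called when an interrupt arrives. Returns true if it can be handled now. */
static bool irq_arrived(struct cpu_state *cpu, uint8_t bit)
{
    if (cpu->soft_enabled)
        return true;
    cpu->irq_happened |= bit;  /* remember it for later replay */
    return false;
}

/* Soft-enable and return the mask of interrupts that need replaying. */
static uint8_t irq_soft_enable(struct cpu_state *cpu)
{
    uint8_t pending = cpu->irq_happened;

    cpu->irq_happened = 0;
    cpu->soft_enabled = true;
    return pending;
}
```

A caller would test individual bits of the returned mask to decide which handlers to replay, which is the decision the text attributes to the re-enable path.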
In addition, this adds a few refinements:
- We no longer hard disable decrementer interrupts that occur
while soft-disabled. We now simply bump the decrementer back to max
(on BookS) or leave it stopped (on BookE) and continue with hard interrupts
enabled, which means that we'll potentially get better sample quality from
performance monitor interrupts.
- Timer, decrementer and doorbell interrupts now hard-enable
shortly after removing the source of the interrupt, which means
they no longer run entirely hard disabled. Again, this will improve
perf sample quality.
- On Book3E 64-bit, we now make the performance monitor interrupt
act as an NMI like Book3S (the necessary C code for that to work
appears to already be present in the FSL perf code, notably calling
nmi_enter instead of irq_enter). (This also fixes a bug where BookE
perfmon interrupts could clobber r14 ... oops)
- We could make "masked" decrementer interrupts act as NMIs when doing
timer-based perf sampling to improve the sample quality.
Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
v2:
- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
to retrigger an interrupt without preventing hard-enable
v3:
- Fix or vs. ori bug on Book3E
- Fix enabling of interrupts for some exceptions on Book3E
v4:
- Fix resend of doorbells on return from interrupt on Book3E
v5:
- Rebased on top of my latest series, which involves some significant
rework of some aspects of the patch.
v6:
- 32-bit compile fix
- more compile fixes with various .config combos
- factor out the asm code to soft-disable interrupts
- remove the C wrapper around preempt_schedule_irq
v7:
- Fix a bug with hard irq state tracking on native power7
2012-03-06 18:27:59 +11:00
#include <asm/hw_irq.h>
2012-02-03 00:54:17 +00:00
#include <asm/kvm_book3s_asm.h>
2014-02-26 05:38:43 +05:30
#include <asm/opal.h>
2014-12-10 00:26:52 +05:30
#include <asm/cpuidle.h>
2016-03-01 12:59:20 +05:30
#include <asm/book3s/64/mmu-hash.h>
2016-07-08 11:50:49 +05:30
#include <asm/mmu.h>
2011-01-24 18:42:41 +11:00
#undef DEBUG
2014-12-10 00:26:53 +05:30
/*
* Use unused space in the interrupt stack to save and restore
* registers for winkle support.
*/
#define _SDR1 GPR3
#define _RPR GPR4
#define _SPURR GPR5
#define _PURR GPR6
#define _TSCR GPR7
#define _DSCR GPR8
#define _AMOR GPR9
#define _WORT GPR10
#define _WORC GPR11
2016-07-08 11:50:49 +05:30
#define _PTCR GPR12
#define PSSCR_HV_TEMPLATE PSSCR_ESL | PSSCR_EC | \
PSSCR_PSLL_MASK | PSSCR_TR_MASK | \
PSSCR_MTL_MASK
2014-12-10 00:26:53 +05:30
2014-02-26 05:38:25 +05:30
.text
2016-07-08 11:50:48 +05:30
/*
* Used by threads before entering deep idle states. Saves SPRs
* in the interrupt stack frame.
*/
save_sprs_to_stack:
/*
* Note: all registers, i.e. per-core, per-subcore or per-thread, are saved
* here since any thread in the core might wake up first.
*/
2016-07-08 11:50:49 +05:30
BEGIN_FTR_SECTION
mfspr r3,SPRN_PTCR
std r3,_PTCR(r1)
/*
* Note - SDR1 is dropped in Power ISA v3. Hence SDR1 is not
* saved here.
*/
FTR_SECTION_ELSE
2016-07-08 11:50:48 +05:30
mfspr r3,SPRN_SDR1
std r3,_SDR1(r1)
2016-07-08 11:50:49 +05:30
ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
2016-07-08 11:50:48 +05:30
mfspr r3,SPRN_RPR
std r3,_RPR(r1)
mfspr r3,SPRN_SPURR
std r3,_SPURR(r1)
mfspr r3,SPRN_PURR
std r3,_PURR(r1)
mfspr r3,SPRN_TSCR
std r3,_TSCR(r1)
mfspr r3,SPRN_DSCR
std r3,_DSCR(r1)
mfspr r3,SPRN_AMOR
std r3,_AMOR(r1)
mfspr r3,SPRN_WORT
std r3,_WORT(r1)
mfspr r3,SPRN_WORC
std r3,_WORC(r1)
blr
powerpc/powernv: Fix race in updating core_idle_state
core_idle_state is maintained for each core. It uses bits 0-7 to track
whether a thread in the core has entered fastsleep or winkle. The 8th bit
is used as a lock bit.
The lock bit is set in these 2 scenarios:
- The thread is the first in the subcore to wake up from sleep/winkle.
- The thread is the last in the core about to enter sleep/winkle.
While the lock bit is set, if any other thread in the core wakes up, it
loops until the lock bit is cleared before proceeding in the wakeup
path. This helps prevent race conditions w.r.t fastsleep workaround and
prevents threads from switching to process context before core/subcore
resources are restored.
But, in the path to sleep/winkle entry, we currently don't check for the
lock bit. This exposes us to the following race when running with subcore
on:

First thread in the subcore            Another thread in the same
waking up                              core entering sleep/winkle

lwarx   r15,0,r14
ori     r15,r15,PNV_CORE_IDLE_LOCK_BIT
stwcx.  r15,0,r14
[Code to restore subcore state]
                                       lwarx   r15,0,r14
                                       [clear thread bit]
                                       stwcx.  r15,0,r14
andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
stw     r15,0(r14)
Here, after the thread entering sleep clears its thread bit in
core_idle_state, the value is overwritten by the thread waking up.
In such cases when the core enters fastsleep, code mistakes an idle
thread as running. Because of this, the first thread waking up from
fastsleep which is supposed to resync timebase skips it. So we can
end up having a core with stale timebase value.
This patch fixes the above race by looping on the lock bit even while
entering the idle states.
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus'
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
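The fix described in this commit message can be sketched in portable C11 atomics. This is only an illustrative model, not the kernel's code: the constant values, the function name idle_entry_clear_thread_bit() and the use of compare-exchange in place of lwarx/stwcx. are assumptions made for the example. The point it shows is that the entering thread refuses to update the state word while the lock bit is set, so a waker's later plain store cannot overwrite the cleared thread bit.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative layout: bits 0-7 are per-thread bits, the next bit is the lock. */
#define CORE_IDLE_LOCK_BIT    0x100u
#define CORE_IDLE_THREAD_BITS 0x0FFu

/*
 * Clear this thread's bit on the way into sleep/winkle and return the
 * new value of the state word. While the lock bit is set the function
 * only reloads the word and never attempts the update.
 */
static uint32_t idle_entry_clear_thread_bit(_Atomic uint32_t *state,
                                            uint32_t thread_mask)
{
    uint32_t old = atomic_load(state);

    for (;;) {
        if (old & CORE_IDLE_LOCK_BIT) {
            old = atomic_load(state);  /* lock held: spin and reload */
            continue;
        }
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(state, &old, old & ~thread_mask))
            return old & ~thread_mask;
    }
}
```

With the lock bit clear, a call such as idle_entry_clear_thread_bit(&state, 0x01) simply removes that thread's bit; with the lock bit set, it waits until the waking thread has stored the unlocked value.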
2015-07-07 01:39:23 +05:30
/*
* Used by threads when the lock bit of core_idle_state is set.
* Threads will spin in HMT_LOW until the lock bit is cleared.
* r14 - pointer to core_idle_state
* r15 - used to load contents of core_idle_state
*/
core_idle_lock_held:
HMT_LOW
3: lwz r15,0(r14)
andi. r15,r15,PNV_CORE_IDLE_LOCK_BIT
bne 3b
HMT_MEDIUM
lwarx r15,0,r14
blr
2014-02-26 05:38:25 +05:30
/*
* Pass requested state in r3:
2016-07-08 11:50:49 +05:30
* r3 - PNV_THREAD_NAP/SLEEP/WINKLE in POWER8
* - Requested STOP state in POWER9
2014-05-23 18:15:26 +10:00
*
* To check IRQ_HAPPENED in r4
* 0 - don't check
* 1 - check
2016-07-08 11:50:47 +05:30
*
* Address to 'rfid' to in r5
2014-02-26 05:38:25 +05:30
*/
2016-07-08 11:50:46 +05:30
_GLOBAL(pnv_powersave_common)
2014-02-26 05:38:25 +05:30
/* Use r3 to pass state nap/sleep/winkle */
2011-01-24 18:42:41 +11:00
/* NAP is a state loss, we create a regs frame on the
* stack, fill it up with the state we care about and
* stick a pointer to it in PACAR1. We really only
* need to save PC, some CR bits and the NV GPRs,
* but for now an interrupt frame will do.
*/
mflr r0
std r0,16(r1)
stdu r1,-INT_FRAME_SIZE(r1)
std r0,_LINK(r1)
std r0,_NIP(r1)
/* Hard disable interrupts */
mfmsr r9
rldicl r9,r9,48,1
rotldi r9,r9,16
mtmsrd r9,1 /* hard-disable interrupts */
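The rldicl/rotldi pair in the hard-disable sequence is a rotate-mask-rotate idiom for clearing a single MSR bit. The C sketch below models the arithmetic only; the helper names rotl64() and msr_clear_ee() are made up for this illustration, and it assumes the bit being cleared is the external-interrupt enable bit (mask 0x8000).

```c
#include <stdint.h>

#define MSR_EE_MASK 0x8000ull  /* external-interrupt enable bit of the MSR */

/* Rotate a 64-bit value left by n bits (0 < n < 64). */
static uint64_t rotl64(uint64_t v, unsigned n)
{
    return (v << n) | (v >> (64 - n));
}

/* Models: rldicl r9,r9,48,1 followed by rotldi r9,r9,16 */
static uint64_t msr_clear_ee(uint64_t msr)
{
    uint64_t t = rotl64(msr, 48);  /* the EE bit moves to the most significant bit */

    t &= ~(1ull << 63);            /* the mask keeps everything except the top bit */
    return rotl64(t, 16);          /* 48 + 16 = 64, so all other bits return home */
}
```

The net effect is the same as msr & ~MSR_EE_MASK, done without needing a separate register to hold the mask.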
2012-03-06 18:27:59 +11:00
/* Check if something happened while soft-disabled */
lbz r0,PACAIRQHAPPENED(r13)
powerpc/powernv: Don't call generic code on offline cpus
On PowerNV platforms, when a CPU is offline, we put it into nap mode.
It's possible that the CPU wakes up from nap mode while it is still
offline due to a stray IPI. A misdirected device interrupt could also
potentially cause it to wake up. In that circumstance, we need to clear
the interrupt so that the CPU can go back to nap mode.
In the past the clearing of the interrupt was accomplished by briefly
enabling interrupts and allowing the normal interrupt handling code
(do_IRQ() etc.) to handle the interrupt. This has the problem that
this code calls irq_enter() and irq_exit(), which call functions such
as account_system_vtime() which use RCU internally. Use of RCU is not
permitted on offline CPUs and will trigger errors if RCU checking is
enabled.
To avoid calling into any generic code which might use RCU, we adopt
a different method of clearing interrupts on offline CPUs. Since we
are on the PowerNV platform, we know that the system interrupt
controller is a XICS being driven directly (i.e. not via hcalls) by
the kernel. Hence this adds a new icp_native_flush_interrupt()
function to the native-mode XICS driver and arranges to call that
when an offline CPU is woken from nap. This new function reads the
interrupt from the XICS. If it is an IPI, it clears the IPI; if it
is a device interrupt, it prints a warning and disables the source.
Then it does the end-of-interrupt processing for the interrupt.
The other thing that briefly enabling interrupts did was to check and
clear the irq_happened flag in this CPU's PACA. Therefore, after
flushing the interrupt from the XICS, we also clear all bits except
the PACA_IRQ_HARD_DIS (interrupts are hard disabled) bit from the
irq_happened flag. The PACA_IRQ_HARD_DIS flag is set by power7_nap()
and is left set to indicate that interrupts are hard disabled. This
means we then have to ignore that flag in power7_nap(), which is
reasonable since it doesn't indicate that any interrupt event needs
servicing.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
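The two mask operations this commit message describes can be written out as a small C sketch. The constant values and both function names are assumptions for illustration rather than the kernel's definitions: one helper keeps only the hard-disabled flag after the interrupt has been flushed, and the other is the nap-entry check that ignores that flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only. */
#define IRQ_HARD_DIS 0x01u  /* interrupts are hard disabled */
#define IRQ_DBELL    0x02u  /* a doorbell is pending */
#define IRQ_EE       0x04u  /* an external interrupt is pending */

/* After flushing the interrupt controller: keep only the hard-disabled flag. */
static uint8_t flush_irq_happened(uint8_t irq_happened)
{
    return irq_happened & IRQ_HARD_DIS;
}

/* Nap entry check: is there anything to service, ignoring the hard-disabled flag? */
static bool irq_pending_before_nap(uint8_t irq_happened)
{
    return (irq_happened & (uint8_t)~IRQ_HARD_DIS) != 0;
}
```

The second helper corresponds to masking irq_happened with the complement of the hard-disabled flag before branching on the result.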
2014-09-02 14:23:16 +10:00
andi. r0,r0,~PACA_IRQ_HARD_DIS@l
2012-03-06 18:27:59 +11:00
beq 1f
2014-05-23 18:15:26 +10:00
cmpwi cr0,r4,0
beq 1f
2012-03-06 18:27:59 +11:00
addi r1,r1,INT_FRAME_SIZE
ld r0,16(r1)
2015-03-20 10:10:18 +11:00
li r3,0 /* Return 0 (no nap) */
2012-03-06 18:27:59 +11:00
mtlr r0
blr
1: /* We mark irqs hard disabled as this is the state we'll
* be in when returning and we need to tell arch_local_irq_restore()
* about it
*/
li r0,PACA_IRQ_HARD_DIS
stb r0,PACAIRQHAPPENED(r13)
/* We haven't lost state ... yet */
2011-01-24 18:42:41 +11:00
li r0 ,0
2011-12-05 19:47:26 +00:00
stb r0,PACA_NAPSTATELOST(r13)
2011-01-24 18:42:41 +11:00
/* Continue saving state */
SAVE_GPR(2, r1)
SAVE_NVGPRS(r1)
2014-02-26 05:38:25 +05:30
mfcr r4
std r4,_CCR(r1)
2011-01-24 18:42:41 +11:00
std r9,_MSR(r1)
std r1,PACAR1(r13)
2016-07-08 11:50:47 +05:30
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
/* Tell KVM we're entering idle */
2016-07-08 11:50:49 +05:30
li r4,KVM_HWTHREAD_IN_IDLE
2016-07-08 11:50:47 +05:30
stb r4,HSTATE_HWTHREAD_STATE(r13)
# endif
powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode
Currently, when going idle, we set the flag indicating that we are in
nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
(or sleep or rvwinkle) instruction, all with the MMU on. This is bad
for two reasons: (a) the architecture specifies that those instructions
must be executed with the MMU off, and in fact with only the SF, HV, ME
and possibly RI bits set, and (b) this introduces a race, because as
soon as we set the flag, another thread can switch the MMU to a guest
context. If the race is lost, this thread will typically start looping
on relocation-on ISIs at 0xc...4400.
This fixes it by setting the MSR as required by the architecture before
setting the flag or executing the nap/sleep/rvwinkle instruction.
Cc: stable@vger.kernel.org
[ shreyas@linux.vnet.ibm.com: Edited to handle LE ]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-10 00:26:50 +05:30
/*
* Go to real mode to do the nap, as required by the architecture.
* Also, we need to be in real mode before setting hwthread_state,
* because as soon as we do that, another thread can switch
* the MMU context to the guest.
*/
2016-07-08 11:50:47 +05:30
LOAD_REG_IMMEDIATE(r7, MSR_IDLE)
2014-12-10 00:26:50 +05:30
li r6, MSR_RI
andc r6, r9, r6
mtmsrd r6, 1 /* clear RI before setting SRR0/1 */
2016-07-08 11:50:47 +05:30
mtspr SPRN_SRR0, r5
mtspr SPRN_SRR1, r7
2014-12-10 00:26:50 +05:30
rfid
2016-07-08 11:50:46 +05:30
.globl pnv_enter_arch207_idle_mode
pnv_enter_arch207_idle_mode:
2014-12-10 00:26:52 +05:30
stb r3,PACA_THREAD_IDLE_STATE(r13)
2014-12-10 00:26:53 +05:30
cmpwi cr3,r3,PNV_THREAD_SLEEP
bge cr3,2f
2014-02-26 05:38:25 +05:30
IDLE_STATE_ENTER_SEQ(PPC_NAP)
/* No return */
2014-12-10 00:26:52 +05:30
2:
/* Sleep or winkle */
lbz r7,PACA_THREAD_MASK(r13)
ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
lwarx_loop1:
lwarx r15,0,r14
powerpc/powernv: Fix race in updating core_idle_state
core_idle_state is maintained for each core. It uses bits 0-7 to track
whether a thread in the core has entered fastsleep or winkle. Bit 8 is
used as a lock bit.
The lock bit is set in these 2 scenarios:
- The thread is the first in the subcore to wake up from sleep/winkle.
- The thread is the last in the core about to enter sleep/winkle.
While the lock bit is set, if any other thread in the core wakes up, it
loops until the lock bit is cleared before proceeding in the wakeup
path. This helps prevent race conditions w.r.t. the fastsleep workaround
and prevents threads from switching to process context before core/subcore
resources are restored.
But, in the path to sleep/winkle entry, we currently don't check for
the lock bit. This exposes us to the following race when running with
subcore on:

First thread in the subcore             Another thread in the same
waking up                               core entering sleep/winkle

lwarx   r15,0,r14
ori     r15,r15,PNV_CORE_IDLE_LOCK_BIT
stwcx.  r15,0,r14
[Code to restore subcore state]
                                        lwarx   r15,0,r14
                                        [clear thread bit]
                                        stwcx.  r15,0,r14
andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
stw     r15,0(r14)

Here, after the thread entering sleep clears its thread bit in
core_idle_state, the value is overwritten by the thread waking up.
In such cases, when the core enters fastsleep, the code mistakes an idle
thread for a running one. Because of this, the first thread waking up from
fastsleep, which is supposed to resync the timebase, skips it. So we can
end up with a core that has a stale timebase value.
This patch fixes the above race by looping on the lock bit even while
entering the idle states.
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Fixes: 7b54e9f213f76 'powernv/powerpc: Add winkle support for offline cpus'
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
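The lock-bit check on the idle-entry path can be modeled in portable C roughly as follows. This is only a sketch using C11 atomics in place of lwarx/stwcx.; the function name and the two constants are hypothetical, with values assumed from the description above (bits 0-7 for threads, bit 8 for the lock). It leaves out the fastsleep workaround that the last thread performs while holding the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bits 0-7 are per-thread bits, bit 8 is the lock bit. */
#define CORE_IDLE_THREAD_BITS 0x0FFu
#define CORE_IDLE_LOCK_BIT    0x100u

/*
 * Clear the caller's thread bit, but only while the lock bit is clear.
 * Returns true if no thread bits remain set, i.e. the caller is the
 * last thread in the core to go idle.
 */
static bool idle_entry_clear_thread_bit(_Atomic uint32_t *core_idle_state,
                                        uint32_t thread_mask)
{
    /* The mask must only name thread bits, never the lock bit. */
    assert((thread_mask & ~CORE_IDLE_THREAD_BITS) == 0);

    uint32_t old = atomic_load(core_idle_state);
    for (;;) {
        if (old & CORE_IDLE_LOCK_BIT) {
            /* Another thread holds the lock: re-read and keep waiting. */
            old = atomic_load(core_idle_state);
            continue;
        }
        uint32_t desired = old & ~thread_mask;
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(core_idle_state, &old, desired))
            return (desired & CORE_IDLE_THREAD_BITS) == 0;
    }
}
```

Because the update is retried whenever the lock bit is seen, a waking thread's plain store of the unlocked value can no longer overwrite a thread bit that was cleared in the meantime.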
2015-07-07 01:39:23 +05:30
andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
bnel core_idle_lock_held
2014-12-10 00:26:52 +05:30
andc r15,r15,r7 /* Clear thread bit */
andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
/*
 * If cr0 = 0, then current thread is the last thread of the core entering
 * sleep. Last thread needs to execute the hardware bug workaround code if
 * required by the platform.
 * Make the workaround call unconditionally here. The below branch call is
 * patched out when the idle states are discovered if the platform does not
 * require it.
 */
.global pnv_fastsleep_workaround_at_entry
pnv_fastsleep_workaround_at_entry:
beq fastsleep_workaround_at_entry
stwcx. r15,0,r14
bne- lwarx_loop1
isync
2014-12-10 00:26:53 +05:30
common_enter: /* common code for all the threads entering sleep or winkle */
bgt cr3,enter_winkle
2014-12-10 00:26:52 +05:30
IDLE_STATE_ENTER_SEQ(PPC_SLEEP)
fastsleep_workaround_at_entry:
ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
stwcx. r15,0,r14
bne- lwarx_loop1
isync
/* Fast sleep workaround */
li r3,1
li r4,1
2016-07-08 16:37:11 +10:00
bl opal_rm_config_cpu_idle_state
2014-12-10 00:26:52 +05:30
/* Clear Lock bit */
li r0,0
lwsync
stw r0,0(r14)
b common_enter
2014-12-10 00:26:53 +05:30
enter_winkle:
2016-07-08 11:50:48 +05:30
bl save_sprs_to_stack
2014-12-10 00:26:53 +05:30
IDLE_STATE_ENTER_SEQ(PPC_WINKLE)
2012-02-03 00:54:17 +00:00
2016-07-08 11:50:49 +05:30
/*
 * r3 - requested stop state
 */
power_enter_stop:
/*
 * Check if the requested state is a deep idle state.
 */
LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
ld r4,ADDROFF(pnv_first_deep_stop_state)(r5)
cmpd r3,r4
bge 2f
IDLE_STATE_ENTER_SEQ(PPC_STOP)
2:
/*
 * Entering deep idle state.
 * Clear thread bit in PACA_CORE_IDLE_STATE, save SPRs to
 * stack and enter stop
 */
lbz r7,PACA_THREAD_MASK(r13)
ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
lwarx_loop_stop:
lwarx r15,0,r14
andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
bnel core_idle_lock_held
andc r15,r15,r7 /* Clear thread bit */
stwcx. r15,0,r14
bne- lwarx_loop_stop
isync
bl save_sprs_to_stack
IDLE_STATE_ENTER_SEQ(PPC_STOP)
2014-02-26 05:38:25 +05:30
_GLOBAL(power7_idle)
/* Now check if user or arch enabled NAP mode */
LOAD_REG_ADDRBASE(r3,powersave_nap)
lwz r4,ADDROFF(powersave_nap)(r3)
cmpwi 0,r4,0
beqlr
2014-05-23 18:15:26 +10:00
li r3, 1
2014-02-26 05:38:25 +05:30
/* fall through */
_GLOBAL(power7_nap)
2014-05-23 18:15:26 +10:00
mr r4,r3
2014-12-10 00:26:52 +05:30
li r3,PNV_THREAD_NAP
2016-07-08 11:50:47 +05:30
LOAD_REG_ADDR(r5, pnv_enter_arch207_idle_mode)
2016-07-08 11:50:46 +05:30
b pnv_powersave_common
2014-02-26 05:38:25 +05:30
/* No return */
_GLOBAL(power7_sleep)
2014-12-10 00:26:52 +05:30
li r3,PNV_THREAD_SLEEP
2014-07-02 09:19:35 +05:30
li r4,1
2016-07-08 11:50:47 +05:30
LOAD_REG_ADDR(r5, pnv_enter_arch207_idle_mode)
2016-07-08 11:50:46 +05:30
b pnv_powersave_common
2014-02-26 05:38:25 +05:30
/* No return */
2011-01-24 18:42:41 +11:00
2014-12-10 00:26:53 +05:30
_GLOBAL(power7_winkle)
2016-07-08 11:50:43 +05:30
li r3,PNV_THREAD_WINKLE
2014-12-10 00:26:53 +05:30
li r4,1
2016-07-08 11:50:47 +05:30
LOAD_REG_ADDR(r5, pnv_enter_arch207_idle_mode)
2016-07-08 11:50:46 +05:30
b pnv_powersave_common
2014-12-10 00:26:53 +05:30
/* No return */
2014-07-29 18:40:13 +05:30
#define CHECK_HMI_INTERRUPT \
mfspr r0,SPRN_SRR1; \
BEGIN_FTR_SECTION_NESTED(66); \
rlwinm r0,r0,45-31,0xf; /* extract wake reason field (P8) */ \
FTR_SECTION_ELSE_NESTED(66); \
rlwinm r0,r0,45-31,0xe; /* P7 wake reason field is 3 bits */ \
ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66); \
cmpwi r0,0xa; /* Hypervisor maintenance ? */ \
bne 20f; \
/* Invoke opal call to handle hmi */ \
ld r2,PACATOC(r13); \
ld r1,PACAR1(r13); \
std r3,ORIG_GPR3(r1); /* Save original r3 */ \
KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt
When a guest is assigned to a core it converts the host Timebase (TB)
into guest TB by adding guest timebase offset before entering into
guest. During guest exit it restores the guest TB to host TB. This means
under certain conditions (Guest migration) host TB and guest TB can differ.
When we get an HMI for TB related issues the opal HMI handler would
try fixing errors and restore the correct host TB value. With no guest
running, we don't have any issues. But with guest running on the core
we run into TB corruption issues.
If we get an HMI while in the guest, the current HMI handler invokes opal
hmi handler before forcing guest to exit. The guest exit path subtracts
the guest TB offset from the current TB value which may have already
been restored with host value by opal hmi handler. This leads to incorrect
host and guest TB values.
With split-core, things become more complex. With split-core, TB also gets
split and each subcore gets its own TB register. When a hmi handler fixes
a TB error and restores the TB value, it affects all the TB values of
sibling subcores on the same core. On TB errors all the threads in the core
get an HMI. With existing code, the individual threads call the opal hmi
handler independently, which can easily throw TB out of sync if we have
guests running on subcores. Hence we need to co-ordinate with all the
threads before making the opal hmi handler call followed by TB resync.
This patch introduces a sibling subcore state structure (shared by all
threads in the core) in paca which holds information about whether sibling
subcores are in Guest mode or host mode. An array in_guest[] of size
MAX_SUBCORE_PER_CORE=4 is used to maintain the state of each subcore.
The subcore id is used as index into in_guest[] array. Only primary
thread entering/exiting the guest is responsible to set/unset its
designated array element.
On TB error, we get HMI interrupt on every thread on the core. Upon HMI,
this patch will now force guest to vacate the core/subcore. Primary
thread from each subcore will then turn off its respective bit
from the above bitmap during the guest exit path just after the
guest->host partition switch is complete.
All other threads that have just exited the guest OR were already in host
will wait until all other subcores clear their respective bit.
Once all the subcores turn off their respective bit, all threads will
make the call to the opal hmi handler.
It is not necessary that the opal hmi handler would resync the TB value for
every HMI interrupt. It would do so only for HMIs caused by
TB errors. For the rest, it would not touch the TB value. Hence to make things
simpler, primary thread would call TB resync explicitly once for each
core immediately after opal hmi handler instead of subtracting guest
offset from TB. TB resync call will restore the TB with host value.
Thus we can be sure about the TB state.
One of the primary threads exiting the guest will take up the
responsibility of calling TB resync. It will use one of the top bits
(bit 63) from subcore state flags bitmap to make the decision. The first
primary thread (among the subcores) that is able to set the bit will
have to call the TB resync. All other threads will wait until TB
resync is complete. Once TB resync is complete all threads will then
proceed.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2016-05-15 09:44:26 +05:30
li r3,0; /* NULL argument */ \
bl hmi_exception_realmode; \
nop; \
2014-07-29 18:40:13 +05:30
ld r3,ORIG_GPR3(r1); /* Restore original r3 */ \
20: nop;
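The wake-reason test in CHECK_HMI_INTERRUPT can be restated in C as a sketch. It assumes that the mask form of rlwinm rotates the low 32 bits of SRR1 left by 45-31 = 14 and then ANDs with the mask, which is the same as taking bits 18-21 counted from the least significant bit. The function names and the WAKE_REASON_HMI constant are hypothetical; only the mask values and the 0xa comparison come from the macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value the macro compares against with "cmpwi r0,0xa". */
#define WAKE_REASON_HMI 0xau

/* 32-bit rotate left, valid for shift counts 1..31. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/*
 * Mirrors "rlwinm r0,r0,45-31,mask": rotate the low word of SRR1 left
 * by 14, then keep 4 bits (mask 0xf) when arch_207s is true, or only
 * the upper 3 of those bits (mask 0xe) otherwise.
 */
static uint32_t srr1_wake_reason(uint64_t srr1, bool arch_207s)
{
    uint32_t mask = arch_207s ? 0xfu : 0xeu;
    return rotl32((uint32_t)srr1, 45 - 31) & mask;
}

/* True when the extracted field equals the HMI wake reason. */
static bool woke_for_hmi(uint64_t srr1, bool arch_207s)
{
    return srr1_wake_reason(srr1, arch_207s) == WAKE_REASON_HMI;
}
```

With the 0xe mask the lowest bit of the field is ignored, so the same 0xa comparison serves both the 3-bit and the 4-bit encodings.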
2016-07-08 11:50:49 +05:30
/*
 * r3 - requested stop state
 */
_GLOBAL(power9_idle_stop)
LOAD_REG_IMMEDIATE(r4, PSSCR_HV_TEMPLATE)
or r4,r4,r3
mtspr SPRN_PSSCR, r4
li r4, 1
LOAD_REG_ADDR(r5,power_enter_stop)
b pnv_powersave_common
/* No return */
2016-07-08 11:50:44 +05:30
/*
 * Called from reset vector. Check whether we have woken up with
 * hypervisor state loss. If yes, restore hypervisor state and return
 * back to reset vector.
 *
 * r13 - Contents of HSPRG0
 * cr3 - set to gt if waking up with partial/complete hypervisor state loss
 */
2016-07-08 11:50:46 +05:30
_GLOBAL(pnv_restore_hyp_resource)
2016-07-08 11:50:49 +05:30
BEGIN_FTR_SECTION
2016-08-05 19:13:12 +05:30
ld r2,PACATOC(r13);
2016-07-08 11:50:49 +05:30
/*
 * POWER ISA 3. Use PSSCR to determine if we
 * are waking up from deep idle state
 */
LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
ld r4,ADDROFF(pnv_first_deep_stop_state)(r5)
mfspr r5,SPRN_PSSCR
2016-07-08 11:50:44 +05:30
/*
2016-07-08 11:50:49 +05:30
 * 0-3 bits correspond to Power-Saving Level Status
 * which indicates the idle state we are waking up from
 */
rldicl r5,r5,4,60
cmpd cr4,r5,r4
bge cr4,pnv_wakeup_tb_loss
/*
 * Waking up without hypervisor state loss. Return to
 * reset vector
 */
blr
END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
/*
 * POWER ISA 2.07 or less.
2016-07-08 11:50:44 +05:30
 * Check if last bit of HSPRG0 is set. This indicates whether we are
 * waking up from winkle.
 */
clrldi r5,r13,63
clrrdi r13,r13,1
2016-08-05 19:13:12 +05:30
/* Now that we are sure r13 is corrected, load TOC */
ld r2,PACATOC(r13);
2016-07-08 11:50:44 +05:30
cmpwi cr4,r5,1
mtspr SPRN_HSPRG0,r13
lbz r0,PACA_THREAD_IDLE_STATE(r13)
cmpwi cr2,r0,PNV_THREAD_NAP
2016-07-08 11:50:46 +05:30
bgt cr2,pnv_wakeup_tb_loss /* Either sleep or Winkle */
2016-07-08 11:50:44 +05:30
/*
 * We fall through here if PACA_THREAD_IDLE_STATE shows we are waking
 * up from nap. At this stage CR3 shouldn't contain 'gt' since that
 * indicates we are waking with hypervisor state loss from nap.
 */
bgt cr3,.
blr /* Return back to System Reset vector from where
2016-07-08 11:50:46 +05:30
pnv_restore_hyp_resource was invoked */
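The two state-loss checks in pnv_restore_hyp_resource reduce to simple bit operations, sketched here in C. The helper and struct names are hypothetical. The sketch assumes that "rldicl r5,r5,4,60" yields the top four bits of PSSCR, and that "clrldi r5,r13,63" / "clrrdi r13,r13,1" split HSPRG0 into its lowest bit and the remaining PACA pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* rldicl r5,r5,4,60: Power-Saving Level Status, the top 4 bits of PSSCR. */
static uint64_t psscr_pls(uint64_t psscr)
{
    return psscr >> 60;
}

/* ISA 3.0 path: deep stop if PLS >= the first deep stop state. */
static bool isa300_woke_from_deep_stop(uint64_t psscr,
                                       uint64_t first_deep_stop_state)
{
    /* PLS is a 4-bit field, so a larger threshold could never match. */
    assert(first_deep_stop_state <= 0xf);
    return psscr_pls(psscr) >= first_deep_stop_state;
}

/* ISA 2.07 path: HSPRG0 carries the PACA pointer plus a winkle flag. */
struct hsprg0_fields {
    uint64_t paca;   /* HSPRG0 with the low bit cleared */
    bool winkle;     /* low bit of HSPRG0 */
};

static struct hsprg0_fields split_hsprg0(uint64_t hsprg0)
{
    struct hsprg0_fields f = {
        .paca = hsprg0 & ~UINT64_C(1),   /* clrrdi r13,r13,1 */
        .winkle = (hsprg0 & 1) != 0,     /* clrldi r5,r13,63 */
    };
    return f;
}
```

In the assembly the winkle flag ends up in cr4 and the corrected pointer is written back to HSPRG0; the sketch only shows how the two values are derived.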
2016-07-08 11:50:44 +05:30
2016-07-08 11:50:49 +05:30
/*
 * Called if waking up from idle state which can cause either partial or
 * complete hyp state loss.
 * In POWER8, called if waking up from fastsleep or winkle
 * In POWER9, called if waking up from stop state >= pnv_first_deep_stop_state
 *
 * r13 - PACA
 * cr3 - gt if waking up with partial/complete hypervisor state loss
 * cr4 - eq if waking up from complete hypervisor state loss.
 */
2016-07-08 11:50:46 +05:30
_GLOBAL(pnv_wakeup_tb_loss)
2014-02-26 05:38:43 +05:30
ld r1,PACAR1(r13)
2014-12-10 00:26:52 +05:30
/*
 * Before entering any idle state, the NVGPRs are saved in the stack
 * and they are restored before switching to the process context. Hence
 * until they are restored, they are free to be used.
 *
2016-07-08 11:50:44 +05:30
 * Save SRR1 and LR in NVGPRs as they might be clobbered in
2016-07-08 16:37:11 +10:00
 * opal_call() (called in CHECK_HMI_INTERRUPT). SRR1 is required
2016-07-08 11:50:44 +05:30
 * to determine the wakeup reason if we branch to kvm_start_guest. LR
 * is required to return back to reset vector after hypervisor state
 * restore is complete.
2014-12-10 00:26:52 +05:30
 */
2016-07-08 11:50:44 +05:30
mflr r17
2014-12-10 00:26:52 +05:30
mfspr r16,SPRN_SRR1
2014-07-29 18:40:13 +05:30
BEGIN_FTR_SECTION
CHECK_HMI_INTERRUPT
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
2014-12-10 00:26:52 +05:30
lbz r7,PACA_THREAD_MASK(r13)
ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
lwarx_loop2:
lwarx r15,0,r14
andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
/*
 * Lock bit is set in one of the 2 cases-
 * a. In the sleep/winkle enter path, the last thread is executing
 * fastsleep workaround code.
 * b. In the wake up path, another thread is executing fastsleep
 * workaround undo code or resyncing timebase or restoring context
 * In either case loop until the lock bit is cleared.
 */
2015-07-07 01:39:23 +05:30
bnel core_idle_lock_held
2014-12-10 00:26:52 +05:30
cmpwi cr2,r15,0
2014-12-10 00:26:53 +05:30
/*
 * At this stage
2016-07-08 11:50:49 +05:30
 * cr2 - eq if first thread to wakeup in core
 * cr3 - gt if waking up with partial/complete hypervisor state loss
 * cr4 - eq if waking up from complete hypervisor state loss.
2014-12-10 00:26:53 +05:30
 */
2014-12-10 00:26:52 +05:30
ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
stwcx. r15,0,r14
bne- lwarx_loop2
isync
2016-07-08 11:50:49 +05:30
BEGIN_FTR_SECTION
lbz r4,PACA_SUBCORE_SIBLING_MASK(r13)
and r4,r4,r15
cmpwi r4,0 /* Check if first in subcore */
or r15,r15,r7 /* Set thread bit */
beq first_thread_in_subcore
END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
or r15,r15,r7 /* Set thread bit */
beq cr2,first_thread_in_core
/* Not first thread in core or subcore to wake up */
b clear_lock
first_thread_in_subcore:
2014-12-10 00:26:53 +05:30
/*
 * If waking up from sleep, subcore state is not lost. Hence
 * skip subcore state restore
 */
bne cr4,subcore_state_restored
/* Restore per-subcore state */
ld r4,_SDR1(r1)
mtspr SPRN_SDR1,r4
2016-07-08 11:50:49 +05:30
2014-12-10 00:26:53 +05:30
ld r4,_RPR(r1)
mtspr SPRN_RPR,r4
ld r4,_AMOR(r1)
mtspr SPRN_AMOR,r4
subcore_state_restored:
/*
 * Check if the thread is also the first thread in the core. If not,
 * skip to clear_lock.
 */
bne cr2,clear_lock
first_thread_in_core:
2014-12-10 00:26:52 +05:30
/*
2016-07-08 11:50:49 +05:30
 * First thread in the core waking up from any state which can cause
 * partial or complete hypervisor state loss. It needs to
2014-12-10 00:26:52 +05:30
 * call the fastsleep workaround code if the platform requires it.
 * Call it unconditionally here. The below branch instruction will
2016-07-08 11:50:49 +05:30
 * be patched out if the platform does not have fastsleep or does not
 * require the workaround. Patching will be performed during the
 * discovery of idle-states.
2014-12-10 00:26:52 +05:30
 */
.global pnv_fastsleep_workaround_at_exit
pnv_fastsleep_workaround_at_exit:
b fastsleep_workaround_at_exit
timebase_resync:
2016-07-08 11:50:49 +05:30
/*
 * Use cr3 which indicates that we are waking up with at least partial
 * hypervisor state loss to determine if TIMEBASE RESYNC is needed.
 */
2014-12-10 00:26:52 +05:30
ble cr3,clear_lock
2014-02-26 05:38:43 +05:30
/* Time base re-sync */
2016-07-08 16:37:11 +10:00
bl opal_rm_resync_timebase;
2014-12-10 00:26:53 +05:30
/*
 * If waking up from sleep, per core state is not lost, skip to
 * clear_lock.
 */
bne cr4,clear_lock
2016-07-08 11:50:49 +05:30
/*
 * First thread in the core to wake up and it is waking up with
 * complete hypervisor state loss. Restore per core hypervisor
 * state.
 */
BEGIN_FTR_SECTION
ld r4,_PTCR(r1)
mtspr SPRN_PTCR,r4
ld r4,_RPR(r1)
mtspr SPRN_RPR,r4
END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
2014-12-10 00:26:53 +05:30
ld r4,_TSCR(r1)
mtspr SPRN_TSCR,r4
ld r4,_WORC(r1)
mtspr SPRN_WORC,r4
2014-12-10 00:26:52 +05:30
clear_lock:
andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
lwsync
stw r15,0(r14)
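The wakeup-side handling of core_idle_state (take the lock, note whether any thread bit was set, set the caller's bit, then release the lock with a plain store) can be sketched in C as below. This is an illustrative model under assumptions, not the kernel's code: the names are hypothetical, the constants assume bits 0-7 for threads and bit 8 for the lock, C11 atomics stand in for lwarx/stwcx., and a release store stands in for lwsync followed by stw.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bits 0-7 are per-thread bits, bit 8 is the lock bit. */
#define CORE_IDLE_THREAD_BITS 0x0FFu
#define CORE_IDLE_LOCK_BIT    0x100u

struct wake_token {
    uint32_t value;       /* register copy: lock bit plus updated thread bits */
    bool first_in_core;   /* no thread bit was set when the lock was taken */
};

/* Spin until the lock bit is clear, then set it atomically. */
static struct wake_token idle_exit_lock(_Atomic uint32_t *core_idle_state,
                                        uint32_t thread_mask)
{
    assert((thread_mask & ~CORE_IDLE_THREAD_BITS) == 0);

    uint32_t old = atomic_load(core_idle_state);
    for (;;) {
        if (old & CORE_IDLE_LOCK_BIT) {
            old = atomic_load(core_idle_state);   /* lock held: wait */
            continue;
        }
        if (atomic_compare_exchange_weak(core_idle_state, &old,
                                         old | CORE_IDLE_LOCK_BIT))
            break;
    }

    struct wake_token t = {
        /* The thread bit is set only in the local copy until unlock. */
        .value = old | CORE_IDLE_LOCK_BIT | thread_mask,
        .first_in_core = (old & CORE_IDLE_THREAD_BITS) == 0,
    };
    return t;
}

/* Publish the new thread bits and drop the lock in a single store. */
static void idle_exit_unlock(_Atomic uint32_t *core_idle_state,
                             struct wake_token t)
{
    atomic_store_explicit(core_idle_state,
                          t.value & CORE_IDLE_THREAD_BITS,
                          memory_order_release);
}
```

Any per-core restore work would run between the two calls, while other waking or sleeping threads spin on the lock bit.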
common_exit:
2014-12-10 00:26:53 +05:30
/*
 * Common to all threads.
 *
 * If waking up from sleep, hypervisor state is not lost. Hence
 * skip hypervisor state restore.
 */
bne cr4,hypervisor_state_restored
/* Waking up from winkle */
2016-07-08 11:50:49 +05:30
BEGIN_MMU_FTR_SECTION
b no_segments
2016-07-27 13:19:01 +10:00
END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
2014-12-10 00:26:53 +05:30
/* Restore SLB from PACA */
ld r8,PACA_SLBSHADOWPTR(r13)
.rept SLB_NUM_BOLTED
li r3, SLBSHADOW_SAVEAREA
LDX_BE r5, r8, r3
addi r3, r3, 8
LDX_BE r6, r8, r3
andis. r7,r5,SLB_ESID_V@h
beq 1f
slbmte r6,r5
1: addi r8,r8,16
.endr
2016-07-08 11:50:49 +05:30
no_segments:
/* Restore per thread state */
2014-12-10 00:26:53 +05:30
ld r4,_SPURR(r1)
mtspr SPRN_SPURR,r4
ld r4,_PURR(r1)
mtspr SPRN_PURR,r4
ld r4,_DSCR(r1)
mtspr SPRN_DSCR,r4
ld r4,_WORT(r1)
mtspr SPRN_WORT,r4
2016-07-08 11:50:49 +05:30
/* Call cur_cpu_spec->cpu_restore() */
LOAD_REG_ADDR(r4, cur_cpu_spec)
ld r4,0(r4)
ld r12,CPU_SPEC_RESTORE(r4)
#ifdef PPC64_ELF_ABI_v1
ld r12,0(r12)
#endif
mtctr r12
bctrl
2014-12-10 00:26:53 +05:30
hypervisor_state_restored:
2014-12-10 00:26:52 +05:30
mtspr SPRN_SRR1,r16
2016-07-08 11:50:44 +05:30
mtlr r17
blr /* Return back to System Reset vector from where
2016-07-08 11:50:46 +05:30
pnv_restore_hyp_resource was invoked */
2014-02-26 05:38:43 +05:30
2014-12-10 00:26:52 +05:30
fastsleep_workaround_at_exit:
li r3,1
li r4,0
2016-07-08 16:37:11 +10:00
bl opal_rm_config_cpu_idle_state
2014-12-10 00:26:52 +05:30
b timebase_resync
powerpc/powernv: Return to cpu offline loop when finished in KVM guest
When a secondary hardware thread has finished running a KVM guest, we
currently put that thread into nap mode using a nap instruction in
the KVM code. This changes the code so that instead of doing a nap
instruction directly, we instead cause the call to power7_nap() that
put the thread into nap mode to return. This way the KVM code does not
need to know which low-power mode to put the thread into.
In the case of a secondary thread used to run a KVM guest, the thread
will be offline from the point of view of the host kernel, and the
relevant power7_nap() call is the one in pnv_smp_cpu_disable().
In this case we don't want to clear pending IPIs in the offline loop
in that function, since that might cause us to miss the wakeup for
the next time the thread needs to run a guest. To tell whether or
not to clear the interrupt, we use the SRR1 value returned from
power7_nap(), and check if it indicates an external interrupt. We
arrange that the return from power7_nap() when we have finished running
a guest returns 0, so pending interrupts don't get flushed in that
case.
Note that it is important a secondary thread that has finished
executing in the guest, or that didn't have a guest to run, should
not return to power7_nap's caller while the kvm_hstate.hwthread_req
flag in the PACA is non-zero, because the return from power7_nap
will reenable the MMU, and the MMU might still be in guest context.
In this situation we spin at low priority in real mode waiting for
hwthread_req to become zero.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-12-03 14:48:40 +11:00
/*
 * R3 here contains the value that will be returned to the caller
 * of power7_nap.
 */
2016-07-08 11:50:46 +05:30
_GLOBAL(pnv_wakeup_loss)
2011-01-24 18:42:41 +11:00
ld r1,PACAR1(r13)
2014-07-29 18:40:13 +05:30
BEGIN_FTR_SECTION
CHECK_HMI_INTERRUPT
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
2011-01-24 18:42:41 +11:00
REST_NVGPRS(r1)
REST_GPR(2, r1)
2014-12-03 14:48:40 +11:00
ld r6,_CCR(r1)
2011-01-24 18:42:41 +11:00
ld r4,_MSR(r1)
ld r5,_NIP(r1)
addi r1,r1,INT_FRAME_SIZE
2014-12-03 14:48:40 +11:00
mtcr r6
2011-01-24 18:42:41 +11:00
mtspr SPRN_SRR1,r4
mtspr SPRN_SRR0,r5
rfid
2014-12-03 14:48:40 +11:00
/*
 * R3 here contains the value that will be returned to the caller
 * of power7_nap.
 */
2016-07-08 11:50:46 +05:30
_GLOBAL(pnv_wakeup_noloss)
2011-12-05 19:47:26 +00:00
lbz r0,PACA_NAPSTATELOST(r13)
cmpwi r0,0
2016-07-08 11:50:46 +05:30
bne pnv_wakeup_loss
2014-07-29 18:40:13 +05:30
BEGIN_FTR_SECTION
CHECK_HMI_INTERRUPT
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
2011-01-24 18:42:41 +11:00
ld r1,PACAR1(r13)
powerpc/powernv: Restore non-volatile CRs after nap
Patches 7cba160ad "powernv/cpuidle: Redesign idle states management"
and 77b54e9f2 "powernv/powerpc: Add winkle support for offline cpus"
use non-volatile condition registers (cr2, cr3 and cr4) early in the system
reset interrupt handler (system_reset_pSeries()) before it has been determined
if state loss has occurred. If state loss has not occurred, control returns via
the power7_wakeup_noloss() path which does not restore those condition
registers, leaving them corrupted.
Fix this by restoring the condition registers in the power7_wakeup_noloss()
case.
This is apparent when running a KVM guest on hardware that does not
support winkle or sleep and the guest makes use of secondary threads. In
practice this means Power7 machines, though some early unreleased Power8
machines may also be susceptible.
The secondary CPUs are taken off line before the guest is started and
they call pnv_smp_cpu_kill_self(). This checks support for sleep
states (in this case there is no support) and power7_nap() is called.
When the CPU is woken, power7_nap() returns and because the CPU is
still off line, the main while loop executes again. The sleep states
support test is executed again, but because the tested values cannot
have changed, the compiler has optimized the test away and instead we
rely on the result of the first test, which has been left in cr3
and/or cr4. With the result overwritten, the wrong branch is taken and
power7_winkle() is called on a CPU that does not support it, leading
to it stalling.
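
The corruption and the fix can be modelled in a few lines of C. This is only a sketch under stated assumptions: the struct and function names are hypothetical stand-ins, the condition register is modelled as a plain 32-bit value with cr0 in the top four bits, and the "clobber" pattern is arbitrary.

```c
#include <stdint.h>

/* cr2, cr3 and cr4 occupy these bits when cr0 is the top nibble. */
#define CR_NONVOLATILE_MASK 0x00FFF000u

/* Hypothetical stand-in for the stack frame slot read via _CCR(r1). */
struct saved_frame {
    uint32_t ccr; /* CR value saved before the thread went to nap */
};

/* Models the early system reset handler using cr2-cr4 as scratch. */
static uint32_t early_wakeup_handler(uint32_t cr)
{
    return (cr & ~CR_NONVOLATILE_MASK) | 0x00123000u; /* arbitrary clobber */
}

/*
 * Models the fixed no-loss path: reload the saved CR and write it
 * back (the "ld r6,_CCR(r1)" / "mtcr r6" pair), discarding whatever
 * the handler left in the non-volatile fields.
 */
static uint32_t wakeup_noloss_restore(uint32_t clobbered_cr,
                                      const struct saved_frame *frame)
{
    (void)clobbered_cr; /* the whole CR is overwritten */
    return frame->ccr;
}
```

Without the restore step, the caller would resume with the handler's scratch values in cr3/cr4 and could take the wrong branch on the cached test result, which is the failure the commit describes.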
Fixes: 7cba160ad789 ("powernv/cpuidle: Redesign idle states management")
Fixes: 77b54e9f213f ("powernv/powerpc: Add winkle support for offline cpus")
[mpe: Massage change log a bit more]
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-05-01 16:50:34 +10:00
ld r6,_CCR(r1)
2011-01-24 18:42:41 +11:00
ld r4,_MSR(r1)
ld r5,_NIP(r1)
addi r1,r1,INT_FRAME_SIZE
2015-05-01 16:50:34 +10:00
mtcr r6
2011-01-24 18:42:41 +11:00
mtspr SPRN_SRR1,r4
mtspr SPRN_SRR0,r5
rfid