// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * Derived from arch/i386/kernel/irq.c
 *	Copyright (C) 1992 Linus Torvalds
 * Adapted from arch/i386 by Gary Thomas
 *	Copyright (C) 1995-1996 Gary Thomas (gdt@linuxppc.org)
 * Updated and modified by Cort Dougan <cort@fsmlabs.com>
 *	Copyright (C) 1996-2001 Cort Dougan
 * Adapted for Power Macintosh by Paul Mackerras
 *	Copyright (C) 1996 Paul Mackerras (paulus@cs.anu.edu.au)
 *
 * This file contains the code used by various IRQ handling routines:
 * asking for different IRQs should be done through these routines
 * instead of just grabbing them. Thus setups with different IRQ numbers
 * shouldn't result in any weird surprises, and installing new handlers
 * should be easier.
 *
 * The MPC8xx has an interrupt mask in the SIU. If a bit is set, the
 * interrupt is _enabled_. As expected, IRQ0 is bit 0 in the 32-bit
 * mask register (of which only 16 are defined), hence the weird shifting
 * and complement of the cached_irq_mask. I want to be able to stuff
 * this right into the SIU SMASK register.
 * Many of the prep/chrp functions are conditionally compiled on
 * CONFIG_PPC_8xx to reduce code space and undefined function references.
 */
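
/*
 * Illustrative sketch of the SMASK handling described above (hypothetical
 * names, not the actual 8xx code): with PowerPC MSB-0 bit numbering,
 * IRQ n maps to bit (31 - n) of the 32-bit register, so enabling IRQ n
 * in the cached mask and pushing it straight into SMASK might look like:
 *
 *	cached_irq_mask |= 1u << (31 - irq);
 *	out_be32(siu_smask, cached_irq_mask);
 */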

#undef DEBUG

#include <linux/export.h>
#include <linux/threads.h>
#include <linux/kernel_stat.h>
#include <linux/signal.h>
#include <linux/sched.h>
#include <linux/ptrace.h>
#include <linux/ioport.h>
#include <linux/interrupt.h>
#include <linux/timex.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/delay.h>
#include <linux/irq.h>
#include <linux/seq_file.h>
#include <linux/cpumask.h>
#include <linux/profile.h>
#include <linux/bitops.h>
#include <linux/list.h>
#include <linux/radix-tree.h>
#include <linux/mutex.h>
#include <linux/pci.h>
#include <linux/debugfs.h>
#include <linux/of.h>
#include <linux/of_irq.h>
#include <linux/vmalloc.h>
#include <linux/pgtable.h>

#include <linux/uaccess.h>
#include <asm/interrupt.h>
#include <asm/io.h>
#include <asm/irq.h>
#include <asm/cache.h>
#include <asm/prom.h>
#include <asm/ptrace.h>
#include <asm/machdep.h>
#include <asm/udbg.h>
#include <asm/smp.h>
#include <asm/livepatch.h>
#include <asm/asm-prototypes.h>
#include <asm/hw_irq.h>
#include <asm/softirq_stack.h>
#ifdef CONFIG_PPC64
#include <asm/paca.h>
#include <asm/firmware.h>
#include <asm/lv1call.h>
#include <asm/dbell.h>
#endif

#define CREATE_TRACE_POINTS
#include <asm/trace.h>
#include <asm/cpu_has_feature.h>

DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
EXPORT_PER_CPU_SYMBOL(irq_stat);

#ifdef CONFIG_PPC32
atomic_t ppc_n_lost_interrupts;

#ifdef CONFIG_TAU_INT
extern int tau_initialized;
u32 tau_interrupts(unsigned long cpu);
#endif
#endif /* CONFIG_PPC32 */

#ifdef CONFIG_PPC64

int distribute_irqs = 1;
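
/*
 * Summary of the 64-bit "lazy" (soft) masking scheme implemented below:
 * local_irq_disable() only sets a soft-mask flag in the paca. If an
 * interrupt arrives while soft-masked, the exception entry code clears
 * MSR[EE], records the event in paca->irq_happened and returns;
 * arch_local_irq_restore() replays the recorded events once the soft
 * mask is lifted.
 */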

static inline notrace unsigned long get_irq_happened(void)
{
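	/*
	 * GPR13 holds this CPU's paca pointer on ppc64, so the lbz below
	 * reads paca->irq_happened with a single byte load.
	 */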
	unsigned long happened;

	__asm__ __volatile__("lbz %0,%1(13)"
: " =r " ( happened ) : " i " ( offsetof ( struct paca_struct , irq_happened ) ) ) ;
2006-11-10 21:32:40 +00:00

	return happened;
}

void replay_soft_interrupts(void)
{
	struct pt_regs regs;

	/*
	 * Be careful here, calling these interrupt handlers can cause
	 * softirqs to be raised, which they may run when calling irq_exit,
	 * which will cause local_irq_enable() to be run, which can then
	 * recurse into this function. Don't keep any state across
	 * interrupt handler calls which may change underneath us.
	 *
	 * We use local_paca rather than get_paca() to avoid all the
	 * debug_smp_processor_id() business in this low level function.
	 */

	ppc_save_regs(&regs);
	regs.softe = IRQS_ENABLED;
	regs.msr |= MSR_EE;

again:
	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
		WARN_ON_ONCE(mfmsr() & MSR_EE);

	/*
	 * Force the delivery of pending soft-disabled interrupts on PS3.
	 * Any HV call will have this side effect.
	 */
	if (firmware_has_feature(FW_FEATURE_PS3_LV1)) {
		u64 tmp, tmp2;
		lv1_get_version_info(&tmp, &tmp2);
	}

	/*
	 * Check if a hypervisor Maintenance interrupt happened.
	 * This is a higher priority interrupt than the others, so
	 * replay it first.
	 */
	if (IS_ENABLED(CONFIG_PPC_BOOK3S) && (local_paca->irq_happened & PACA_IRQ_HMI)) {
		local_paca->irq_happened &= ~PACA_IRQ_HMI;
		regs.trap = INTERRUPT_HMI;
		handle_hmi_exception(&regs);
		if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
			hard_irq_disable();
	}
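
	/*
	 * A replayed handler may hard-enable interrupts (clearing
	 * PACA_IRQ_HARD_DIS), hence the hard_irq_disable() after each
	 * handler here: the replay loop must run with MSR[EE] clear.
	 */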

	if (local_paca->irq_happened & PACA_IRQ_DEC) {
		local_paca->irq_happened &= ~PACA_IRQ_DEC;
		regs.trap = INTERRUPT_DECREMENTER;
		timer_interrupt(&regs);
		if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
			hard_irq_disable();
	}

	if (local_paca->irq_happened & PACA_IRQ_EE) {
		local_paca->irq_happened &= ~PACA_IRQ_EE;
		regs.trap = INTERRUPT_EXTERNAL;
		do_IRQ(&regs);
		if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
			hard_irq_disable();
	}

	if (IS_ENABLED(CONFIG_PPC_DOORBELL) && (local_paca->irq_happened & PACA_IRQ_DBELL)) {
		local_paca->irq_happened &= ~PACA_IRQ_DBELL;
		regs.trap = INTERRUPT_DOORBELL;
		doorbell_exception(&regs);
		if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
			hard_irq_disable();
	}

	/* Book3E does not support soft-masking PMI interrupts */
	if (IS_ENABLED(CONFIG_PPC_BOOK3S) && (local_paca->irq_happened & PACA_IRQ_PMI)) {
		local_paca->irq_happened &= ~PACA_IRQ_PMI;
		regs.trap = INTERRUPT_PERFMON;
		performance_monitor_exception(&regs);
		if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
			hard_irq_disable();
	}

	if (local_paca->irq_happened & ~PACA_IRQ_HARD_DIS) {
		/*
		 * We are responding to the next interrupt, so interrupt-off
		 * latencies should be reset here.
		 */
		trace_hardirqs_on();
		trace_hardirqs_off();
		goto again;
	}
}

#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_KUAP)
static inline void replay_soft_interrupts_irqrestore(void)
{
	unsigned long kuap_state = get_kuap();

	/*
	 * Check if anything calls local_irq_enable/restore() when KUAP is
	 * disabled (user access enabled). We handle that case here by saving
	 * and re-locking AMR but we shouldn't get here in the first place,
	 * hence the warning.
	 */
	kuap_assert_locked();

	if (kuap_state != AMR_KUAP_BLOCKED)
		set_kuap(AMR_KUAP_BLOCKED);

	replay_soft_interrupts();

	if (kuap_state != AMR_KUAP_BLOCKED)
		set_kuap(kuap_state);
}
#else
#define replay_soft_interrupts_irqrestore() replay_soft_interrupts()
#endif

#ifdef CONFIG_CC_HAS_ASM_GOTO
notrace void arch_local_irq_restore(unsigned long mask)
{
	unsigned char irq_happened;

	/* Write the new soft-enabled value if it is a disable */
	if (mask) {
		irq_soft_mask_set(mask);
		return;
	}

	/*
	 * After the stb, interrupts are unmasked and there are no interrupts
	 * pending replay. The restart sequence makes this atomic with
	 * respect to soft-masked interrupts. If this was just a simple code
	 * sequence, a soft-masked interrupt could become pending right after
	 * the comparison and before the stb.
	 *
	 * This allows interrupts to be unmasked without hard disabling, and
	 * also without new hard interrupts coming in ahead of pending ones.
	 */
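	/*
	 * When the stb below executes, r9 is known to be zero (the
	 * cmpwi/bne just tested it), so storing r9 to irq_soft_mask
	 * doubles as writing IRQS_ENABLED (0).
	 */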
	asm_volatile_goto(
"1:					\n"
"		lbz	9,%0(13)	\n"
"		cmpwi	9,0		\n"
"		bne	%l[happened]	\n"
"		stb	9,%1(13)	\n"
"2:					\n"
		RESTART_TABLE(1b, 2b, 1b)
	: : "i" (offsetof(struct paca_struct, irq_happened)),
	    "i" (offsetof(struct paca_struct, irq_soft_mask))
	: "cr0", "r9"
	: happened);

	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
		WARN_ON_ONCE(!(mfmsr() & MSR_EE));

	return;

happened:
	irq_happened = get_irq_happened();
	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
		WARN_ON_ONCE(!irq_happened);
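
	/*
	 * If only PACA_IRQ_HARD_DIS is set, we were hard-disabled with no
	 * event to replay: re-enable at both the soft and hard level.
	 */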
	if (irq_happened == PACA_IRQ_HARD_DIS) {
		if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
			WARN_ON_ONCE(mfmsr() & MSR_EE);
		irq_soft_mask_set(IRQS_ENABLED);
		local_paca->irq_happened = 0;
		__hard_irq_enable();
		return;
	}

	/* Have interrupts to replay, need to hard disable first */
	if (!(irq_happened & PACA_IRQ_HARD_DIS)) {
		if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG)) {
			if (!(mfmsr() & MSR_EE)) {
				/*
				 * An interrupt could have come in and cleared
				 * MSR[EE] and set IRQ_HARD_DIS, so check
				 * IRQ_HARD_DIS again and warn if it is still
				 * clear.
				 */
				irq_happened = get_irq_happened();
				WARN_ON_ONCE(!(irq_happened & PACA_IRQ_HARD_DIS));
			}
		}
		__hard_irq_disable();
		local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
	} else {
		if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG)) {
			if (WARN_ON_ONCE(mfmsr() & MSR_EE))
				__hard_irq_disable();
		}
	}

	/*
	 * Disable preempt here, so that the below preempt_enable will
	 * perform resched if required (a replayed interrupt may set
	 * need_resched).
	 */
	preempt_disable();
	irq_soft_mask_set(IRQS_ALL_DISABLED);
	trace_hardirqs_off();

	replay_soft_interrupts_irqrestore();
	local_paca->irq_happened = 0;

	trace_hardirqs_on();
	irq_soft_mask_set(IRQS_ENABLED);
	__hard_irq_enable();
	preempt_enable();
}
#else
notrace void arch_local_irq_restore(unsigned long mask)
{
	unsigned char irq_happened;

	/* Write the new soft-enabled value */
	irq_soft_mask_set(mask);
	if (mask)
		return;

	/*
	 * From this point onward, we can take interrupts, preempt,
	 * etc... unless we got hard-disabled. We check if an event
	 * happened. If none happened, we know we can just return.
	 *
	 * We may have preempted before the check below, in which case
	 * we are checking the "new" CPU instead of the old one. This
	 * is only a problem if an event happened on the "old" CPU.
	 *
	 * External interrupt events will have caused interrupts to
	 * be hard-disabled, so there is no problem, we
	 * cannot have preempted.
	 */
irq_happened = get_irq_happened();
2018-06-03 22:24:32 +10:00
if (!irq_happened) {
2020-02-26 03:35:36 +10:00
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
WARN_ON_ONCE(!(mfmsr() & MSR_EE));
[POWERPC] Lazy interrupt disabling for 64-bit machines
2006-10-04 16:47:49 +10:00
return;
2018-06-03 22:24:32 +10:00
}
2006-11-10 21:32:40 +00:00
2020-02-26 03:35:36 +10:00
/* We need to hard disable to replay. */
2018-06-03 22:24:32 +10:00
if (!(irq_happened & PACA_IRQ_HARD_DIS)) {
2020-02-26 03:35:36 +10:00
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
WARN_ON_ONCE(!(mfmsr() & MSR_EE));
powerpc: Rework lazy-interrupt handling
2012-03-06 18:27:59 +11:00
__hard_irq_disable();
2020-11-07 11:43:36 +10:00
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
2018-06-03 22:24:32 +10:00
} else {
2012-05-10 16:12:38 +00:00
/*
* We should already be hard disabled here. We had bugs
* where that wasn't the case, so double check it and
* warn if we are wrong. Only do that when IRQ tracing
* is enabled as mfmsr() can be costly.
*/
2020-02-26 03:35:36 +10:00
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG)) {
if (WARN_ON_ONCE(mfmsr() & MSR_EE))
__hard_irq_disable();
}
if (irq_happened == PACA_IRQ_HARD_DIS) {
local_paca->irq_happened = 0;
__hard_irq_enable();
return;
}
2018-06-03 22:24:32 +10:00
}
2012-05-10 16:12:38 +00:00
2020-09-15 21:46:45 +10:00
/*
* Disable preempt here, so that the below preempt_enable will
* perform resched if required (a replayed interrupt may set
* need_resched).
*/
preempt_disable();
2017-12-20 09:25:53 +05:30
irq_soft_mask_set(IRQS_ALL_DISABLED);
2017-10-21 17:56:06 +10:00
trace_hardirqs_off();
2010-07-09 15:30:22 +10:00
powerpc/kuap: Restore AMR after replaying soft interrupts
Since de78a9c42a79 ("powerpc: Add a framework for Kernel Userspace
Access Protection"), user access helpers call user_{read|write}_access_{begin|end}
when user space access is allowed.
Commit 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU") made
the mentioned helpers program an AMR special register to allow such
access for a short period of time, most of the time AMR is expected to
block user memory access by the kernel.
Since the code accesses the user space memory, unsafe_get_user() calls
might_fault() which calls arch_local_irq_restore() if either
CONFIG_PROVE_LOCKING or CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
arch_local_irq_restore() then attempts to replay pending soft
interrupts as KUAP regions have hardware interrupts enabled.
If a pending interrupt happens to do user access (performance
interrupts do that), it enables access for a short period of time so
after returning from the replay, the user access state remains blocked
and if a user page fault happens - "Bug: Read fault blocked by AMR!"
appears and SIGSEGV is sent.
An example trace:
Bug: Read fault blocked by AMR!
WARNING: CPU: 0 PID: 1603 at /home/aik/p/kernel/arch/powerpc/include/asm/book3s/64/kup-radix.h:145
CPU: 0 PID: 1603 Comm: amr Not tainted 5.10.0-rc6_v5.10-rc6_a+fstn1 #24
NIP: c00000000009ece8 LR: c00000000009ece4 CTR: 0000000000000000
REGS: c00000000dc63560 TRAP: 0700 Not tainted (5.10.0-rc6_v5.10-rc6_a+fstn1)
MSR: 8000000000021033 <SF,ME,IR,DR,RI,LE> CR: 28002888 XER: 20040000
CFAR: c0000000001fa928 IRQMASK: 1
GPR00: c00000000009ece4 c00000000dc637f0 c000000002397600 000000000000001f
GPR04: c0000000020eb318 0000000000000000 c00000000dc63494 0000000000000027
GPR08: c00000007fe4de68 c00000000dfe9180 0000000000000000 0000000000000001
GPR12: 0000000000002000 c0000000030a0000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 bfffffffffffffff
GPR20: 0000000000000000 c0000000134a4020 c0000000019c2218 0000000000000fe0
GPR24: 0000000000000000 0000000000000000 c00000000d106200 0000000040000000
GPR28: 0000000000000000 0000000000000300 c00000000dc63910 c000000001946730
NIP __do_page_fault+0xb38/0xde0
LR __do_page_fault+0xb34/0xde0
Call Trace:
__do_page_fault+0xb34/0xde0 (unreliable)
handle_page_fault+0x10/0x2c
--- interrupt: 300 at strncpy_from_user+0x290/0x440
LR = strncpy_from_user+0x284/0x440
strncpy_from_user+0x2f0/0x440 (unreliable)
getname_flags+0x88/0x2c0
do_sys_openat2+0x2d4/0x5f0
do_sys_open+0xcc/0x140
system_call_exception+0x160/0x240
system_call_common+0xf0/0x27c
To fix it save/restore the AMR when replaying interrupts, and also
add a check if AMR was not blocked prior to replaying interrupts.
Originally found by syzkaller.
Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Use normal commit citation format and add full oops log to
change log, move kuap_check_amr() into the restore routine to
avoid warnings about unreconciled IRQ state]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210202091541.36499-1-aik@ozlabs.ru
2021-02-02 20:15:41 +11:00
replay_soft_interrupts_irqrestore();
2020-02-26 03:35:36 +10:00
local_paca->irq_happened = 0;
powerpc: Rework lazy-interrupt handling
2012-03-06 18:27:59 +11:00
2017-10-21 17:56:06 +10:00
trace_hardirqs_on();
2017-12-20 09:25:50 +05:30
irq_soft_mask_set(IRQS_ENABLED);
2007-05-10 22:22:45 -07:00
__hard_irq_enable();
2020-09-15 21:46:45 +10:00
preempt_enable();
[POWERPC] Lazy interrupt disabling for 64-bit machines
2006-10-04 16:47:49 +10:00
}
2021-06-18 01:51:09 +10:00
#endif
2010-10-07 14:08:55 +01:00
EXPORT_SYMBOL(arch_local_irq_restore);
powerpc: Rework lazy-interrupt handling
2012-03-06 18:27:59 +11:00
2012-07-10 18:36:40 +10:00
/*
* This is a helper to use when about to go into idle low-power
* when the latter has the side effect of re-enabling interrupts
* (such as calling H_CEDE under pHyp).
*
* You call this function with interrupts soft-disabled (this is
* already the case when ppc_md.power_save is called). The function
* will return whether to enter power save or just return.
*
* In the former case, it will have notified lockdep of interrupts
* being re-enabled and generally sanitized the lazy irq state,
* and in the latter case it will leave with interrupts hard
* disabled and marked as such, so the local_irq_enable() call
2014-06-06 14:38:33 -07:00
* in arch_cpu_idle() will properly re-enable everything.
2012-07-10 18:36:40 +10:00
*/
bool prep_irq_for_idle(void)
{
/*
* First we need to hard disable to ensure no interrupt
* occurs before we effectively enter the low power state
*/
2017-06-13 23:05:45 +10:00
__hard_irq_disable();
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
2012-07-10 18:36:40 +10:00
/*
* If anything happened while we were soft-disabled,
* we return now and do not enter the low power state.
*/
if (lazy_irq_pending())
return false;
/* Tell lockdep we are about to re-enable */
trace_hardirqs_on();
/*
* Mark interrupts as soft-enabled and clear the
* PACA_IRQ_HARD_DIS from the pending mask since we
* are about to hard enable as well as a side effect
* of entering the low power state.
*/
local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
2017-12-20 09:25:50 +05:30
irq_soft_mask_set(IRQS_ENABLED);
2012-07-10 18:36:40 +10:00
/* Tell the caller to enter the low power state */
return true;
}
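A hedged usage sketch of the calling convention described above;
example_power_save() and enter_low_power_state() are illustrative names,
not taken from this file:
/* Sketch: a ppc_md.power_save-style hook built on prep_irq_for_idle(). */
static void example_power_save(void) /* hypothetical */
{
if (!prep_irq_for_idle())
return; /* something was pending; stay out of low power */
enter_low_power_state(); /* hypothetical; re-enables interrupts, e.g. H_CEDE */
}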
2017-06-13 23:05:47 +10:00
#ifdef CONFIG_PPC_BOOK3S
2017-06-13 23:05:45 +10:00
/*
* This is for idle sequences that return with IRQs off, but the
* idle state itself wakes on interrupt. Tell the irq tracer that
* IRQs are enabled for the duration of idle so it does not get long
* off times. Must be paired with fini_irq_for_idle_irqsoff.
*/
bool prep_irq_for_idle_irqsoff(void)
{
WARN_ON(!irqs_disabled());
/*
* First we need to hard disable to ensure no interrupt
* occurs before we effectively enter the low power state
*/
__hard_irq_disable();
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
/*
* If anything happened while we were soft-disabled,
* we return now and do not enter the low power state.
*/
if (lazy_irq_pending())
return false;
/* Tell lockdep we are about to re-enable */
trace_hardirqs_on();
return true;
}
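A hedged sketch of the pairing required by the comment above;
enter_idle_irqsoff_state() is a hypothetical idle entry point, while
fini_irq_for_idle_irqsoff() is the counterpart the comment names:
/* Sketch: pair prep/fini around an irqs-off idle entry. */
if (prep_irq_for_idle_irqsoff()) {
enter_idle_irqsoff_state(); /* hypothetical; wakes with IRQs still off */
fini_irq_for_idle_irqsoff(); /* re-tell the irq tracer that IRQs are off */
}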
2017-06-13 23:05:47 +10:00
/*
* Take the SRR1 wakeup reason, index into this table to find the
* appropriate irq_happened bit.
2017-09-29 13:29:41 +10:00
*
* System reset exceptions taken in idle state also come through here,
* but they are NMI interrupts so do not need to wait for IRQs to be
* restored, and should be taken as early as practical. These are marked
* with 0xff in the table. The Power ISA specifies 0100b as the system
* reset interrupt reason.
2017-06-13 23:05:47 +10:00
*/
2017-09-29 13:29:41 +10:00
#define IRQ_SYSTEM_RESET 0xff
2017-06-13 23:05:47 +10:00
static const u8 srr1_to_lazyirq[0x10] = {
0, 0, 0,
PACA_IRQ_DBELL,
2017-09-29 13:29:41 +10:00
IRQ_SYSTEM_RESET,
2017-06-13 23:05:47 +10:00
PACA_IRQ_DBELL,
PACA_IRQ_DEC,
0,
PACA_IRQ_EE,
PACA_IRQ_EE,
PACA_IRQ_HMI,
0, 0, 0, 0, 0 };
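A brief worked example of the lookup, read straight off the table above:
/* Example: SRR1[42:45] = 0b0110 indexes entry 6 -> PACA_IRQ_DEC;
 * 0b0100 indexes entry 4 -> IRQ_SYSTEM_RESET, matching the system
 * reset wake reason noted in the comment above. */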
2017-11-05 23:33:55 +11:00
void replay_system_reset(void)
2017-09-29 13:29:41 +10:00
{
struct pt_regs regs;
ppc_save_regs(&regs);
regs.trap = 0x100;
get_paca()->in_nmi = 1;
system_reset_exception(&regs);
get_paca()->in_nmi = 0;
}
2017-11-05 23:33:55 +11:00
EXPORT_SYMBOL_GPL(replay_system_reset);
2017-09-29 13:29:41 +10:00
2017-06-13 23:05:47 +10:00
void irq_set_pending_from_srr1(unsigned long srr1)
{
unsigned int idx = (srr1 & SRR1_WAKEMASK_P8) >> 18;
2017-09-29 13:29:41 +10:00
u8 reason = srr1_to_lazyirq[idx];
/*
* Take the system reset now, which is immediately after registers
* are restored from idle. It's an NMI, so interrupts need not be
* re-enabled before it is taken.
*/
if (unlikely(reason == IRQ_SYSTEM_RESET)) {
replay_system_reset();
return;
}
2017-06-13 23:05:47 +10:00
2020-04-02 22:12:12 +10:00
if (reason == PACA_IRQ_DBELL) {
/*
* When doorbell triggers a system reset wakeup, the message
* is not cleared, so if the doorbell interrupt is replayed
* and the IPI handled, the doorbell interrupt would still
* fire when EE is enabled.
*
* To avoid taking the superfluous doorbell interrupt,
* execute a msgclr here before the interrupt is replayed.
*/
ppc_msgclr(PPC_DBELL_MSGTYPE);
}
2017-06-13 23:05:47 +10:00
/*
* The 0 index (SRR1[42:45] = b0000) must always evaluate to 0,
2017-09-29 13:29:41 +10:00
* so this can be called unconditionally with the SRR1 wake
* reason as returned by the idle code, which uses 0 to mean no
* interrupt.
*
* If a future CPU were to designate this as an interrupt reason,
* then a new index for no interrupt must be assigned.
2017-06-13 23:05:47 +10:00
*/
2017-09-29 13:29:41 +10:00
local_paca->irq_happened |= reason;
2017-06-13 23:05:47 +10:00
}
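A hedged sketch of the calling context; how SRR1 is captured here is an
assumption for illustration, not this file's wakeup path:
/* Sketch: on wakeup from idle, record the pending lazy interrupt. */
unsigned long srr1 = mfspr(SPRN_SRR1); /* illustrative capture */
irq_set_pending_from_srr1(srr1);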
#endif /* CONFIG_PPC_BOOK3S */
2016-07-08 16:37:07 +10:00
/*
* Force a replay of the external interrupt handler on this CPU.
*/
void force_external_irq_replay(void)
{
/*
* This must only be called with interrupts soft-disabled,
* the replay will happen when re-enabling.
*/
WARN_ON(!arch_irqs_disabled());
2018-03-21 12:22:28 +10:00
/*
* Interrupts must always be hard disabled before irq_happened is
* modified (to prevent lost update in case of interrupt between
* load and store).
*/
__hard_irq_disable();
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
2016-07-08 16:37:07 +10:00
/* Indicate in the PACA that we have an interrupt to replay */
local_paca->irq_happened |= PACA_IRQ_EE;
}
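A hedged usage sketch; the surrounding critical section is illustrative,
not taken from a real caller:
/* Sketch: queue an external-interrupt replay while soft-disabled. */
unsigned long flags;
local_irq_save(flags); /* soft-disable */
force_external_irq_replay(); /* marks PACA_IRQ_EE pending */
local_irq_restore(flags); /* the replay fires during this re-enable */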
2005-11-09 18:07:45 +11:00
#endif /* CONFIG_PPC64 */
2005-04-16 15:20:36 -07:00
2011-03-25 17:04:59 +01:00
int arch_show_interrupts(struct seq_file *p, int prec)
2010-01-31 20:33:18 +00:00
{
int j;
#if defined(CONFIG_PPC32) && defined(CONFIG_TAU_INT)
if (tau_initialized) {
seq_printf(p, "%*s: ", prec, "TAU");
for_each_online_cpu(j)
seq_printf(p, "%10u ", tau_interrupts(j));
seq_puts(p, "PowerPC Thermal Assist (cpu temp)\n");
}
#endif /* CONFIG_PPC32 && CONFIG_TAU_INT */
2010-01-31 20:34:06 +00:00
seq_printf(p, "%*s: ", prec, "LOC");
for_each_online_cpu(j)
2013-01-23 16:06:11 +08:00
seq_printf(p, "%10u ", per_cpu(irq_stat, j).timer_irqs_event);
seq_printf(p, "Local timer interrupts for timer event device\n");
2018-05-05 03:19:35 +10:00
seq_printf(p, "%*s: ", prec, "BCT");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(irq_stat, j).broadcast_irqs_event);
seq_printf(p, "Broadcast timer interrupts for timer event device\n");
2013-01-23 16:06:11 +08:00
seq_printf(p, "%*s: ", prec, "LOC");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(irq_stat, j).timer_irqs_others);
seq_printf(p, "Local timer interrupts for others\n");
2010-01-31 20:34:06 +00:00
2010-01-31 20:34:36 +00:00
seq_printf(p, "%*s: ", prec, "SPU");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(irq_stat, j).spurious_irqs);
seq_printf(p, "Spurious interrupts\n");
2013-06-04 14:21:17 +10:00
seq_printf(p, "%*s: ", prec, "PMI");
2010-01-31 20:34:06 +00:00
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(irq_stat, j).pmu_irqs);
seq_printf(p, "Performance monitoring interrupts\n");
seq_printf(p, "%*s: ", prec, "MCE");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(irq_stat, j).mce_exceptions);
seq_printf(p, "Machine check exceptions\n");
2020-06-23 15:57:50 +05:30
#ifdef CONFIG_PPC_BOOK3S_64
2014-07-29 18:40:01 +05:30
if (cpu_has_feature(CPU_FTR_HVMODE)) {
seq_printf(p, "%*s: ", prec, "HMI");
for_each_online_cpu(j)
2020-06-23 15:57:50 +05:30
seq_printf(p, "%10u ", paca_ptrs[j]->hmi_irqs);
2014-07-29 18:40:01 +05:30
seq_printf(p, "Hypervisor Maintenance Interrupts\n");
}
2020-06-23 15:57:50 +05:30
#endif
2014-07-29 18:40:01 +05:30
2017-08-01 22:00:53 +10:00
seq_printf(p, "%*s: ", prec, "NMI");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(irq_stat, j).sreset_irqs);
seq_printf(p, "System Reset interrupts\n");
2017-08-01 22:00:54 +10:00
#ifdef CONFIG_PPC_WATCHDOG
seq_printf(p, "%*s: ", prec, "WDG");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(irq_stat, j).soft_nmi_irqs);
seq_printf(p, "Watchdog soft-NMI interrupts\n");
#endif
2013-03-21 19:22:52 +00:00
#ifdef CONFIG_PPC_DOORBELL
if (cpu_has_feature(CPU_FTR_DBELL)) {
seq_printf(p, "%*s: ", prec, "DBL");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(irq_stat, j).doorbell_irqs);
seq_printf(p, "Doorbell interrupts\n");
}
#endif
2010-01-31 20:33:18 +00:00
return 0;
}
2010-01-31 20:34:06 +00:00
/*
* /proc/stat helpers
*/
u64 arch_irq_stat_cpu(unsigned int cpu)
{
2013-01-23 16:06:11 +08:00
u64 sum = per_cpu(irq_stat, cpu).timer_irqs_event;
2010-01-31 20:34:06 +00:00
2018-05-05 03:19:35 +10:00
sum += per_cpu(irq_stat, cpu).broadcast_irqs_event;
2010-01-31 20:34:06 +00:00
sum += per_cpu(irq_stat, cpu).pmu_irqs;
sum += per_cpu(irq_stat, cpu).mce_exceptions;
2010-01-31 20:34:36 +00:00
sum += per_cpu(irq_stat, cpu).spurious_irqs;
2013-01-23 16:06:11 +08:00
sum += per_cpu(irq_stat, cpu).timer_irqs_others;
2020-06-23 15:57:50 +05:30
#ifdef CONFIG_PPC_BOOK3S_64
sum += paca_ptrs[cpu]->hmi_irqs;
#endif
2017-08-01 22:00:53 +10:00
sum += per_cpu(irq_stat, cpu).sreset_irqs;
2017-08-01 22:00:54 +10:00
#ifdef CONFIG_PPC_WATCHDOG
sum += per_cpu(irq_stat, cpu).soft_nmi_irqs;
#endif
2013-03-21 19:22:52 +00:00
#ifdef CONFIG_PPC_DOORBELL
sum += per_cpu(irq_stat, cpu).doorbell_irqs;
#endif
2010-01-31 20:34:06 +00:00
return sum;
}
2009-04-22 15:31:37 +00:00
static inline void check_stack_overflow(void)
{
long sp;
2020-02-20 22:51:40 +11:00
if (!IS_ENABLED(CONFIG_DEBUG_STACKOVERFLOW))
return;
2020-02-20 22:51:39 +11:00
sp = current_stack_pointer & (THREAD_SIZE - 1);
2009-04-22 15:31:37 +00:00
/* check for stack overflow: is there less than 2KB free? */
2019-01-31 10:09:00 +00:00
if (unlikely(sp < 2048)) {
pr_err("do_IRQ: stack overflow: %ld\n", sp);
2009-04-22 15:31:37 +00:00
dump_stack();
}
}
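A quick worked example of the arithmetic above; the THREAD_SIZE value and
stack pointer are illustrative:
/* Example: with THREAD_SIZE = 0x4000 (16 KiB), an sp of
 * 0xc000000012347300 masks to 0x3300, i.e. about 12.8 KiB still
 * unused below the current frame (stacks grow down), so no warning;
 * a masked value under 2048 trips the overflow report. */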
2021-03-19 06:34:43 +00:00
static __always_inline void call_do_softirq(const void *sp)
{
/* Temporarily switch r1 to sp, call __do_softirq() then restore r1. */
asm volatile (
PPC_STLU " %%r1, %[offset](%[sp]) ;"
"mr %%r1, %[sp] ;"
"bl %[callee] ;"
PPC_LL " %%r1, 0(%%r1) ;"
: // Outputs
: // Inputs
[sp] "b" (sp), [offset] "i" (THREAD_SIZE - STACK_FRAME_OVERHEAD),
[callee] "i" (__do_softirq)
: // Clobbers
"lr", "xer", "ctr", "memory", "cr0", "cr1", "cr5", "cr6",
"cr7", "r0", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10",
"r11", "r12"
);
}
static __always_inline void call_do_irq(struct pt_regs *regs, void *sp)
{
register unsigned long r3 asm("r3") = (unsigned long)regs;
/* Temporarily switch r1 to sp, call __do_irq() then restore r1. */
asm volatile (
PPC_STLU " %%r1, %[offset](%[sp]) ;"
"mr %%r1, %[sp] ;"
"bl %[callee] ;"
PPC_LL " %%r1, 0(%%r1) ;"
: // Outputs
"+r" (r3)
: // Inputs
[sp] "b" (sp), [offset] "i" (THREAD_SIZE - STACK_FRAME_OVERHEAD),
[callee] "i" (__do_irq)
: // Clobbers
"lr", "xer", "ctr", "memory", "cr0", "cr1", "cr5", "cr6",
"cr7", "r0", "r4", "r5", "r6", "r7", "r8", "r9", "r10",
"r11", "r12"
);
}
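A short note on the register binding above (a reading of the code, not new
behavior):
/* Binding "regs" to r3 places it in the first argument register of the
 * PowerPC ELF ABI, so __do_irq() receives it directly across the
 * hand-written "bl"; the "+r"(r3) constraint tells the compiler the
 * value is consumed and may be clobbered by the asm. */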
2013-09-23 14:29:11 +10:00
void __do_irq(struct pt_regs *regs)
2005-04-16 15:20:36 -07:00
{
2006-07-03 21:36:01 +10:00
unsigned int irq;
2005-04-16 15:20:36 -07:00
2012-09-10 15:37:43 +00:00
trace_irq_entry(regs);
powerpc: Rework lazy-interrupt handling
2012-03-06 18:27:59 +11:00
/*
* Query the platform PIC for the interrupt & ack it.
*
* This will typically lower the interrupt line to the CPU
*/
2006-10-07 22:08:26 +10:00
irq = ppc_md.get_irq();
2005-04-16 15:20:36 -07:00
2013-09-23 14:29:11 +10:00
/* We can hard enable interrupts now to allow perf interrupts */
powerpc: Rework lazy-interrupt handling
2012-03-06 18:27:59 +11:00
may_hard_irq_enable();
/* And finally process it */
2016-09-06 21:53:24 +10:00
if (unlikely(!irq))
powerpc: Replace __get_cpu_var uses
This still has not been merged and now powerpc is the only arch that does
not have this change. Sorry about missing linuxppc-dev before.
V2->V2
- Fix up to work against 3.18-rc1
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.
Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.
__get_cpu_var() is defined as:
#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.
this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.
This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.
At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.
The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non-x86
arches as well in order to optimize per-cpu access by e.g. using a global
register that may be set to the per-cpu base.
Transformations done to __get_cpu_var()
1. Determine the address of the percpu instance of the current processor.
DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(&y);
2. Same as #1 but this time an array structure is involved.
DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(y);
3. Retrieve the content of the current processors instance of a per cpu
variable.
DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)
Converts to
int x = __this_cpu_read(y);
4. Retrieve the content of a percpu struct
DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);
Converts to
memcpy(&x, this_cpu_ptr(&y), sizeof(x));
5. Assignment to a per cpu variable
DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;
Converts to
__this_cpu_write(y, x);
6. Increment/Decrement etc of a per cpu variable
DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++
Converts to
__this_cpu_inc(y)
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
[mpe: Fix build errors caused by set/or_softirq_pending(), and rework
assignment in __set_breakpoint() to use memcpy().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-21 15:23:25 -05:00
__this_cpu_inc(irq_stat.spurious_irqs);
2014-02-23 21:40:08 +00:00
else
2017-06-15 16:20:46 +10:00
generic_handle_irq(irq);
2005-11-16 18:53:29 +11:00
2012-09-10 15:37:43 +00:00
trace_irq_exit(regs);
2013-09-23 14:29:11 +10:00
}
powerpc/interrupt: Fix OOPS by not calling do_IRQ() from timer_interrupt()
An interrupt handler shall not be called from another interrupt
handler otherwise this leads to problems like the following:
Kernel attempted to write user page (afd4fa84) - exploit attempt? (uid: 1000)
------------[ cut here ]------------
Bug: Write fault blocked by KUAP!
WARNING: CPU: 0 PID: 1617 at arch/powerpc/mm/fault.c:230 do_page_fault+0x484/0x720
Modules linked in:
CPU: 0 PID: 1617 Comm: sshd Tainted: G W 5.13.0-pmac-00010-g8393422eb77 #7
NIP: c001b77c LR: c001b77c CTR: 00000000
REGS: cb9e5bc0 TRAP: 0700 Tainted: G W (5.13.0-pmac-00010-g8393422eb77)
MSR: 00021032 <ME,IR,DR,RI> CR: 24942424 XER: 00000000
GPR00: c001b77c cb9e5c80 c1582c00 00000021 3ffffbff 085b0000 00000027 c8eb644c
GPR08: 00000023 00000000 00000000 00000000 24942424 0063f8c8 00000000 000186a0
GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 cb9e5e90
GPR24: 00000040 afd4fa96 00000040 02000000 c1fda6c0 afd4fa84 00000300 cb9e5cc0
NIP [c001b77c] do_page_fault+0x484/0x720
LR [c001b77c] do_page_fault+0x484/0x720
Call Trace:
[cb9e5c80] [c001b77c] do_page_fault+0x484/0x720 (unreliable)
[cb9e5cb0] [c000424c] DataAccess_virt+0xd4/0xe4
--- interrupt: 300 at __copy_tofrom_user+0x110/0x20c
NIP: c001f9b4 LR: c03250a0 CTR: 00000004
REGS: cb9e5cc0 TRAP: 0300 Tainted: G W (5.13.0-pmac-00010-g8393422eb77)
MSR: 00009032 <EE,ME,IR,DR,RI> CR: 48028468 XER: 20000000
DAR: afd4fa84 DSISR: 0a000000
GPR00: 20726f6f cb9e5d80 c1582c00 00000004 cb9e5e3a 00000016 afd4fa80 00000000
GPR08: 3835202d 72777872 2d78722d 00000004 28028464 0063f8c8 00000000 000186a0
GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 c07640c4 cb9e5e98 cb9e5e90
GPR24: 00000040 afd4fa96 00000040 cb9e5e0c 00000daa a0000000 cb9e5e98 afd4fa56
NIP [c001f9b4] __copy_tofrom_user+0x110/0x20c
LR [c03250a0] _copy_to_iter+0x144/0x990
--- interrupt: 300
[cb9e5d80] [c03e89c0] n_tty_read+0xa4/0x598 (unreliable)
[cb9e5df0] [c03e2a0c] tty_read+0xdc/0x2b4
[cb9e5e80] [c0156bf8] vfs_read+0x274/0x340
[cb9e5f00] [c01571ac] ksys_read+0x70/0x118
[cb9e5f30] [c0016048] ret_from_syscall+0x0/0x28
--- interrupt: c00 at 0xa7855c88
NIP: a7855c88 LR: a7855c5c CTR: 00000000
REGS: cb9e5f40 TRAP: 0c00 Tainted: G W (5.13.0-pmac-00010-g8393422eb77)
MSR: 0000d032 <EE,PR,ME,IR,DR,RI> CR: 2402446c XER: 00000000
GPR00: 00000003 afd4ec70 a72137d0 0000000b afd4ecac 00004000 0065a990 00000800
GPR08: 00000000 a7947930 00000000 00000004 c15831b0 0063f8c8 00000000 000186a0
GPR16: afd52dd4 afd52dd0 afd52dcc afd52dc8 0065a990 0065a9e0 00000001 0065fac0
GPR24: 00000000 00000089 00664050 00000000 00668e30 a720c8dc a7943ff4 0065f9b0
NIP [a7855c88] 0xa7855c88
LR [a7855c5c] 0xa7855c5c
--- interrupt: c00
Instruction dump:
3884aa88 38630178 48076861 807f0080 48042e45 2f830000 419e0148 3c80c079
3c60c076 38841be4 386301c0 4801f705 <0fe00000> 3860000b 4bfffe30 3c80c06b
---[ end trace fd69b91a8046c2e5 ]---
Here the problem is that by re-entering an exception handler,
kuap_save_and_lock() is called a second time with this time KUAP
access locked, leading to regs->kuap being overwritten hence
KUAP not being unlocked at exception exit as expected.
Do not call do_IRQ() from timer_interrupt() directly. Instead,
redefine do_IRQ() as a standard function named __do_IRQ(), and
call it from both do_IRQ() and timer_interrupt() handlers.
Fixes: 3a96570ffceb ("powerpc: convert interrupt handlers to use wrappers")
Cc: stable@vger.kernel.org # v5.12+
Reported-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c17d234f4927d39a1d7100864a8e1145323d33a0.1628611927.git.christophe.leroy@csgroup.eu
2021-08-10 16:13:16 +00:00
void __do_IRQ(struct pt_regs *regs)
2013-09-23 14:29:11 +10:00
{
struct pt_regs *old_regs = set_irq_regs(regs);
2019-01-12 09:55:53 +00:00
void *cursp, *irqsp, *sirqsp;
2013-09-23 14:29:11 +10:00
/* Switch to the irq stack to handle this */
2020-02-20 22:51:41 +11:00
cursp = (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
2019-01-12 09:55:53 +00:00
irqsp = hardirq_ctx[raw_smp_processor_id()];
sirqsp = softirq_ctx[raw_smp_processor_id()];
2013-09-23 14:29:11 +10:00
2019-12-09 06:19:08 +00:00
check_stack_overflow();
2013-09-23 14:29:11 +10:00
/* Already there? */
2019-01-12 09:55:53 +00:00
if (unlikely(cursp == irqsp || cursp == sirqsp)) {
2013-09-23 14:29:11 +10:00
__do_irq(regs);
set_irq_regs(old_regs);
return;
}
/* Switch stack and call */
2019-01-12 09:55:53 +00:00
call_do_irq(regs, irqsp);
2013-09-23 14:29:11 +10:00
IRQ: Maintain regs pointer globally rather than passing to IRQ handlers
Maintain a per-CPU global "struct pt_regs *" variable which can be used instead
of passing regs around manually through all ~1800 interrupt handlers in the
Linux kernel.
The regs pointer is used in few places, but it potentially costs both stack
space and code to pass it around. On the FRV arch, removing the regs parameter
from all the genirq function results in a 20% speed up of the IRQ exit path
(ie: from leaving timer_interrupt() to leaving do_IRQ()).
Where appropriate, an arch may override the generic storage facility and do
something different with the variable. On FRV, for instance, the address is
maintained in GR28 at all times inside the kernel as part of general exception
handling.
Having looked over the code, it appears that the parameter may be handed down
through up to twenty or so layers of functions. Consider a USB character
device attached to a USB hub, attached to a USB controller that posts its
interrupts through a cascaded auxiliary interrupt controller. A character
device driver may want to pass regs to the sysrq handler through the input
layer which adds another few layers of parameter passing.
I've built this code with allyesconfig for x86_64 and i386. I've runtested the
main part of the code on FRV and i386, though I can't test most of the drivers.
I've also done partial conversion for powerpc and MIPS - these at least compile
with minimal configurations.
This will affect all archs. Mostly the changes should be relatively easy.
Take do_IRQ(), store the regs pointer at the beginning, saving the old one:
struct pt_regs *old_regs = set_irq_regs(regs);
And put the old one back at the end:
set_irq_regs(old_regs);
Don't pass regs through to generic_handle_irq() or __do_IRQ().
In timer_interrupt(), this sort of change will be necessary:
- update_process_times(user_mode(regs));
- profile_tick(CPU_PROFILING, regs);
+ update_process_times(user_mode(get_irq_regs()));
+ profile_tick(CPU_PROFILING);
I'd like to move update_process_times()'s use of get_irq_regs() into itself,
except that i386, alone of the archs, uses something other than user_mode().
Some notes on the interrupt handling in the drivers:
(*) input_dev() is now gone entirely. The regs pointer is no longer stored in
the input_dev struct.
(*) finish_unlinks() in drivers/usb/host/ohci-q.c needs checking. It does
something different depending on whether it's been supplied with a regs
pointer or not.
(*) Various IRQ handler function pointers have been moved to type
irq_handler_t.
Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 1b16e7ac850969f38b375e511e3fa2f474a33867 commit)
2006-10-05 14:55:46 +01:00
set_irq_regs(old_regs);
2005-11-16 18:53:29 +11:00
}
2005-04-16 15:20:36 -07:00
powerpc/interrupt: Fix OOPS by not calling do_IRQ() from timer_interrupt()
2021-08-10 16:13:16 +00:00
DEFINE_INTERRUPT_HANDLER_ASYNC(do_IRQ)
{
__do_IRQ(regs);
}
2019-12-21 08:32:30 +00:00
static void *__init alloc_vm_stack(void)
{
2020-06-01 21:52:10 -07:00
return __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, THREADINFO_GFP,
NUMA_NO_NODE, (void *)_RET_IP_);
2019-12-21 08:32:30 +00:00
}
static void __init vmap_irqstack_init(void)
{
int i;
for_each_possible_cpu(i) {
softirq_ctx[i] = alloc_vm_stack();
hardirq_ctx[i] = alloc_vm_stack();
}
}
2005-04-16 15:20:36 -07:00
void __init init_IRQ(void)
{
2019-12-21 08:32:30 +00:00
if (IS_ENABLED(CONFIG_VMAP_STACK))
vmap_irqstack_init();
2007-07-10 03:31:44 +10:00
if (ppc_md.init_IRQ)
ppc_md.init_IRQ();
2005-04-16 15:20:36 -07:00
}
2008-04-30 03:49:55 -05:00
#if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
2019-01-31 10:09:00 +00:00
void *critirq_ctx[NR_CPUS] __read_mostly;
void *dbgirq_ctx[NR_CPUS] __read_mostly;
void *mcheckirq_ctx[NR_CPUS] __read_mostly;
2008-04-30 03:49:55 -05:00
#endif
2005-04-16 15:20:36 -07:00
2019-01-31 10:09:00 +00:00
void *softirq_ctx[NR_CPUS] __read_mostly;
void *hardirq_ctx[NR_CPUS] __read_mostly;
2005-04-16 15:20:36 -07:00
2013-09-05 15:49:45 +02:00
void do_softirq_own_stack(void)
powerpc: Implement accurate task and CPU time accounting
This implements accurate task and cpu time accounting for 64-bit
powerpc kernels. Instead of accounting a whole jiffy of time to a
task on a timer interrupt because that task happened to be running at
the time, we now account time in units of timebase ticks according to
the actual time spent by the task in user mode and kernel mode. We
also count the time spent processing hardware and software interrupts
accurately. This is conditional on CONFIG_VIRT_CPU_ACCOUNTING. If
that is not set, we do tick-based approximate accounting as before.
To get this accurate information, we read either the PURR (processor
utilization of resources register) on POWER5 machines, or the timebase
on other machines on
* each entry to the kernel from usermode
* each exit to usermode
* transitions between process context, hard irq context and soft irq
context in kernel mode
* context switches.
On POWER5 systems with shared-processor logical partitioning we also
read both the PURR and the timebase at each timer interrupt and
context switch in order to determine how much time has been taken by
the hypervisor to run other partitions ("steal" time). Unfortunately,
since we need values of the PURR on both threads at the same time to
accurately calculate the steal time, and since we can only calculate
steal time on a per-core basis, the apportioning of the steal time
between idle time (time which we ceded to the hypervisor in the idle
loop) and actual stolen time is somewhat approximate at the moment.
This is all based quite heavily on what s390 does, and it uses the
generic interfaces that were added by the s390 developers,
i.e. account_system_time(), account_user_time(), etc.
This patch doesn't add any new interfaces between the kernel and
userspace, and doesn't change the units in which time is reported to
userspace by things such as /proc/stat, /proc/<pid>/stat, getrusage(),
times(), etc. Internally the various task and cpu times are stored in
timebase units, but they are converted to USER_HZ units (1/100th of a
second) when reported to userspace. Some precision is therefore lost
but there should not be any accumulating error, since the internal
accumulation is at full precision.
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-24 10:06:59 +11:00
{
2019-01-12 09:55:53 +00:00
	call_do_softirq(softirq_ctx[smp_processor_id()]);
2006-02-24 10:06:59 +11:00
}
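The accounting commit message above stores times in timebase ticks and
converts to USER_HZ only when reporting to userspace. That conversion is a
single multiply-divide; a minimal sketch, assuming the kernel's
tb_ticks_per_sec variable (the helper name itself is invented):

#include <linux/math64.h>

/*
 * Illustrative helper, not the kernel's actual reporting path: scale
 * an accumulated timebase-tick count to USER_HZ (1/100 s) units.
 * The truncation here is the precision loss the commit message
 * mentions; the internal accumulators keep full timebase precision,
 * so the error does not accumulate.
 */
static inline u64 tb_to_user_hz(u64 tb_ticks)
{
	return div_u64(tb_ticks * USER_HZ, tb_ticks_per_sec);
}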
2005-04-16 15:20:36 -07:00
2007-06-04 14:47:04 +10:00
irq_hw_number_t virq_to_hw(unsigned int virq)
{
2012-02-14 14:06:51 -07:00
	struct irq_data *irq_data = irq_get_irq_data(virq);

	return WARN_ON(!irq_data) ? 0 : irq_data->hwirq;
2007-06-04 14:47:04 +10:00
}
EXPORT_SYMBOL_GPL(virq_to_hw);
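A trivial, hypothetical use from driver code that wants the raw hardware
number behind a Linux interrupt (the message text is invented):

/* Hypothetical caller: report the hw irq number behind a virq. */
static void example_report_hwirq(unsigned int virq)
{
	pr_info("virq %u is hardware irq %lu\n", virq, virq_to_hw(virq));
}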
2011-05-19 08:54:26 -05:00
#ifdef CONFIG_SMP
/* Pick one online CPU from @mask and return its hard (physical) id. */
int irq_choose_cpu(const struct cpumask *mask)
{
	int cpuid;
2012-05-17 15:11:45 +00:00
	if (cpumask_equal(mask, cpu_online_mask)) {
2011-05-19 08:54:26 -05:00
		static int irq_rover;
		static DEFINE_RAW_SPINLOCK(irq_rover_lock);
		unsigned long flags;

		/* Round-robin distribution... */
do_round_robin:
		raw_spin_lock_irqsave(&irq_rover_lock, flags);

		irq_rover = cpumask_next(irq_rover, cpu_online_mask);
		if (irq_rover >= nr_cpu_ids)
			irq_rover = cpumask_first(cpu_online_mask);

		cpuid = irq_rover;

		raw_spin_unlock_irqrestore(&irq_rover_lock, flags);
	} else {
		cpuid = cpumask_first_and(mask, cpu_online_mask);
		if (cpuid >= nr_cpu_ids)
			goto do_round_robin;
	}

	return get_hard_smp_processor_id(cpuid);
}
#else
int irq_choose_cpu(const struct cpumask *mask)
{
	return hard_smp_processor_id();
}
#endif
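irq_choose_cpu() returns a hard (physical) CPU id, i.e. a value an
interrupt controller register accepts directly. A hypothetical
.irq_set_affinity implementation built on it; pic_write_destination() is
an invented stand-in for a real PIC register write:

static int example_set_affinity(struct irq_data *d,
				const struct cpumask *mask, bool force)
{
	/* Reduce the requested affinity mask to one online hard CPU id. */
	int hwcpu = irq_choose_cpu(mask);

	/* Invented helper: program this hwirq's destination register. */
	pic_write_destination(irqd_to_hwirq(d), hwcpu);

	return IRQ_SET_MASK_OK;
}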
2006-07-03 21:36:01 +10:00
2006-02-24 10:06:59 +11:00
#ifdef CONFIG_PPC64
2005-04-16 15:20:36 -07:00
static int __init setup_noirqdistrib(char *str)
{
	distribute_irqs = 0;
	return 1;
}
__setup("noirqdistrib", setup_noirqdistrib);
2005-11-09 18:07:45 +11:00
#endif /* CONFIG_PPC64 */
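Because __setup() hooks the kernel command line, the flag is parsed long
before any driver can request an interrupt. A hypothetical boot entry
(paths invented for illustration):

	linux /boot/vmlinux root=/dev/sda2 noirqdistrib

With distribute_irqs cleared, the platform code that consults it (XICS on
pseries, for instance) keeps interrupts on their default server instead of
spreading them across the online CPUs.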