2006-12-04 00:42:59 +09:00
/*
 * This file is subject to the terms and conditions of the GNU General Public
 * License.  See the file "COPYING" in the main directory of this archive
 * for more details.
 *
 * Quick'n'dirty IP checksum ...
 *
 * Copyright (C) 1998, 1999 Ralf Baechle
 * Copyright (C) 1999 Silicon Graphics, Inc.
2007-10-23 12:43:25 +01:00
 * Copyright (C) 2007 Maciej W. Rozycki
2013-12-12 16:21:00 +00:00
 * Copyright (C) 2014 Imagination Technologies Ltd.
2006-12-04 00:42:59 +09:00
 */
2006-12-13 01:22:06 +09:00
#include <linux/errno.h>
2006-12-04 00:42:59 +09:00
#include <asm/asm.h>
2006-12-13 01:22:06 +09:00
#include <asm/asm-offsets.h>
2016-11-07 11:14:13 +00:00
#include <asm/export.h>
2006-12-04 00:42:59 +09:00
#include <asm/regdef.h>
#ifdef CONFIG_64BIT
2006-12-08 01:04:31 +09:00
/*
 * As we are sharing the code base with the mips32 tree (which uses the o32 ABI
 * register definitions), we need to redefine the register definitions from
 * the n64 ABI register naming to the o32 ABI register naming.
 */
#undef t0
#undef t1
#undef t2
#undef t3
#define t0 $8
#define t1 $9
#define t2 $10
#define t3 $11
#define t4 $12
#define t5 $13
#define t6 $14
#define t7 $15
2006-12-08 01:04:51 +09:00
#define USE_DOUBLE
2006-12-04 00:42:59 +09:00
#endif
2006-12-08 01:04:51 +09:00
#ifdef USE_DOUBLE
#define LOAD ld
2008-09-20 17:20:04 +02:00
#define LOAD32 lwu
2006-12-08 01:04:51 +09:00
#define ADD daddu
#define NBYTES 8
#else
#define LOAD lw
2008-09-20 17:20:04 +02:00
#define LOAD32 lw
2006-12-08 01:04:51 +09:00
#define ADD addu
#define NBYTES 4
#endif /* USE_DOUBLE */
#define UNIT(unit) ((unit)*NBYTES)
2006-12-04 00:42:59 +09:00
#define ADDC(sum,reg) \
2014-04-04 03:32:54 +01:00
	.set	push; \
	.set	noat; \
2006-12-08 01:04:51 +09:00
	ADD	sum, reg; \
2006-12-04 00:42:59 +09:00
	sltu	v1, sum, reg; \
2007-10-23 12:43:25 +01:00
	ADD	sum, v1; \
2014-04-04 03:32:54 +01:00
	.set	pop
2006-12-04 00:42:59 +09:00
2008-09-20 17:20:04 +02:00
#define ADDC32(sum,reg) \
2014-04-04 03:32:54 +01:00
	.set	push; \
	.set	noat; \
2008-09-20 17:20:04 +02:00
	addu	sum, reg; \
	sltu	v1, sum, reg; \
	addu	sum, v1; \
2014-04-04 03:32:54 +01:00
	.set	pop
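As a rough illustration, the ADDC and ADDC32 macros above can be modelled in C as an addition followed by an end-around carry. The helper names `addc32` and `addc64` are hypothetical and exist only for this sketch; the file itself implements this in assembly.

```c
#include <stdint.h>

/* Hypothetical C model of ADDC32(sum, reg): add, then fold the carry back in. */
static uint32_t addc32(uint32_t sum, uint32_t reg)
{
	sum += reg;			/* addu sum, reg */
	uint32_t carry = sum < reg;	/* sltu v1, sum, reg */
	return sum + carry;		/* addu sum, v1 */
}

/* Hypothetical model of ADDC(sum, reg) when USE_DOUBLE selects daddu. */
static uint64_t addc64(uint64_t sum, uint64_t reg)
{
	sum += reg;
	uint64_t carry = sum < reg;
	return sum + carry;
}
```

The comparison `sum < reg` after the addition is true exactly when the addition wrapped around, which is what `sltu` computes into `v1`.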
2008-09-20 17:20:04 +02:00
2006-12-08 01:04:51 +09:00
#define CSUM_BIGCHUNK1(src, offset, sum, _t0, _t1, _t2, _t3) \
	LOAD	_t0, (offset + UNIT(0))(src); \
	LOAD	_t1, (offset + UNIT(1))(src); \
2013-01-22 12:59:30 +01:00
	LOAD	_t2, (offset + UNIT(2))(src); \
	LOAD	_t3, (offset + UNIT(3))(src); \
MIPS: csum_partial: Improve instruction parallelism.
Computing the sum introduces a true data dependency. This patch removes some
true data dependencies, hence increases instruction-level parallelism.
This patch brings up to 50% csum performance gain on Loongson 3a.
One example of how this patch works is in CSUM_BIGCHUNK1:
// ** original **      vs    ** patch applied **
   ADDC(sum, t0)             ADDC(t0, t1)
   ADDC(sum, t1)             ADDC(t2, t3)
   ADDC(sum, t2)             ADDC(sum, t0)
   ADDC(sum, t3)             ADDC(sum, t2)
In the original implementation, each ADDC(sum, ...) depends on the sum
value updated by the previous ADDC (as a source operand).
With this patch applied, the first two ADDC operations are independent,
hence can be executed simultaneously if possible.
Another example is in the "copy and sum calculating chunk":
// ** original **      vs    ** patch applied **
   STORE(t0, UNIT(0) ...     STORE(t0, UNIT(0) ...
   ADDC(sum, t0)             ADDC(t0, t1)
   STORE(t1, UNIT(1) ...     STORE(t1, UNIT(1) ...
   ADDC(sum, t1)             ADDC(sum, t0)
   STORE(t2, UNIT(2) ...     STORE(t2, UNIT(2) ...
   ADDC(sum, t2)             ADDC(t2, t3)
   STORE(t3, UNIT(3) ...     STORE(t3, UNIT(3) ...
   ADDC(sum, t3)             ADDC(sum, t2)
With this patch applied, an ADDC and the ADDC after next are independent.
Signed-off-by: chenj <chenj@lemote.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/9608/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2015-03-27 01:07:24 +08:00
	ADDC(_t0, _t1); \
	ADDC(_t2, _t3); \
2006-12-04 00:42:59 +09:00
	ADDC(sum, _t0); \
2015-03-27 01:07:24 +08:00
	ADDC(sum, _t2)
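The reordering in CSUM_BIGCHUNK1 relies on end-around-carry addition giving the same result whichever way the operands are grouped. A minimal C sketch of the two orderings follows; `addc`, `chunk_serial` and `chunk_paired` are hypothetical names used only here, and the 32-bit word size is an assumption made for brevity.

```c
#include <stdint.h>

/* Hypothetical model of ADDC: add with end-around carry. */
static uint32_t addc(uint32_t sum, uint32_t reg)
{
	sum += reg;
	return sum + (sum < reg);
}

/* Original ordering: every step waits for the previous value of sum. */
static uint32_t chunk_serial(uint32_t sum, const uint32_t t[4])
{
	sum = addc(sum, t[0]);
	sum = addc(sum, t[1]);
	sum = addc(sum, t[2]);
	sum = addc(sum, t[3]);
	return sum;
}

/* Patched ordering: the first two additions do not depend on each other. */
static uint32_t chunk_paired(uint32_t sum, const uint32_t t[4])
{
	uint32_t a = addc(t[0], t[1]);	/* ADDC(_t0, _t1) */
	uint32_t b = addc(t[2], t[3]);	/* ADDC(_t2, _t3) */

	sum = addc(sum, a);		/* ADDC(sum, _t0) */
	sum = addc(sum, b);		/* ADDC(sum, _t2) */
	return sum;
}
```

Both functions perform four additions; the paired form only shortens the chain of additions that must wait on `sum`, which is where the commit message locates the gain.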
2006-12-08 01:04:51 +09:00
#ifdef USE_DOUBLE
#define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3) \
	CSUM_BIGCHUNK1(src, offset, sum, _t0, _t1, _t2, _t3)
#else
#define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3) \
	CSUM_BIGCHUNK1(src, offset, sum, _t0, _t1, _t2, _t3); \
	CSUM_BIGCHUNK1(src, offset + 0x10, sum, _t0, _t1, _t2, _t3)
#endif
2006-12-04 00:42:59 +09:00
/*
 * a0: source address
 * a1: length of the area to checksum
 * a2: partial checksum
 */
#define src a0
#define sum v0
	.text
	.set	noreorder
	.align	5
LEAF(csum_partial)
2016-11-07 11:14:13 +00:00
EXPORT_SYMBOL(csum_partial)
2006-12-04 00:42:59 +09:00
	move	sum, zero
2006-12-08 01:04:31 +09:00
	move	t7, zero
2006-12-04 00:42:59 +09:00
	sltiu	t8, a1, 0x8
2008-01-29 10:14:59 +00:00
	bnez	t8, .Lsmall_csumcpy		/* < 8 bytes to copy */
2006-12-08 01:04:31 +09:00
	 move	t2, a1
2006-12-04 00:42:59 +09:00
2006-12-08 01:04:45 +09:00
	andi	t7, src, 0x1			/* odd buffer? */
2006-12-04 00:42:59 +09:00
2008-01-29 10:14:59 +00:00
.Lhword_align:
	beqz	t7, .Lword_align
2006-12-04 00:42:59 +09:00
	 andi	t8, src, 0x2
2006-12-08 01:04:31 +09:00
	lbu	t0, (src)
2006-12-04 00:42:59 +09:00
	LONG_SUBU	a1, a1, 0x1
#ifdef __MIPSEL__
2006-12-08 01:04:31 +09:00
	sll	t0, t0, 8
2006-12-04 00:42:59 +09:00
#endif
2006-12-08 01:04:31 +09:00
	ADDC(sum, t0)
2006-12-04 00:42:59 +09:00
	PTR_ADDU	src, src, 0x1
	andi	t8, src, 0x2
2008-01-29 10:14:59 +00:00
.Lword_align:
	beqz	t8, .Ldword_align
2006-12-04 00:42:59 +09:00
	 sltiu	t8, a1, 56
2006-12-08 01:04:31 +09:00
	lhu	t0, (src)
2006-12-04 00:42:59 +09:00
	LONG_SUBU	a1, a1, 0x2
2006-12-08 01:04:31 +09:00
	ADDC(sum, t0)
2006-12-04 00:42:59 +09:00
	sltiu	t8, a1, 56
	PTR_ADDU	src, src, 0x2
2008-01-29 10:14:59 +00:00
.Ldword_align:
	bnez	t8, .Ldo_end_words
2006-12-04 00:42:59 +09:00
	 move	t8, a1
	andi	t8, src, 0x4
2008-01-29 10:14:59 +00:00
	beqz	t8, .Lqword_align
2006-12-04 00:42:59 +09:00
	 andi	t8, src, 0x8
2008-09-20 17:20:04 +02:00
	LOAD32	t0, 0x00(src)
2006-12-04 00:42:59 +09:00
	LONG_SUBU	a1, a1, 0x4
2006-12-08 01:04:31 +09:00
	ADDC(sum, t0)
2006-12-04 00:42:59 +09:00
	PTR_ADDU	src, src, 0x4
	andi	t8, src, 0x8
2008-01-29 10:14:59 +00:00
.Lqword_align:
	beqz	t8, .Loword_align
2006-12-04 00:42:59 +09:00
	 andi	t8, src, 0x10
2006-12-08 01:04:51 +09:00
#ifdef USE_DOUBLE
	ld	t0, 0x00(src)
	LONG_SUBU	a1, a1, 0x8
	ADDC(sum, t0)
#else
2006-12-08 01:04:31 +09:00
	lw	t0, 0x00(src)
	lw	t1, 0x04(src)
2006-12-04 00:42:59 +09:00
	LONG_SUBU	a1, a1, 0x8
2006-12-08 01:04:31 +09:00
	ADDC(sum, t0)
	ADDC(sum, t1)
2006-12-08 01:04:51 +09:00
#endif
2006-12-04 00:42:59 +09:00
	PTR_ADDU	src, src, 0x8
	andi	t8, src, 0x10
2008-01-29 10:14:59 +00:00
.Loword_align:
	beqz	t8, .Lbegin_movement
2006-12-04 00:42:59 +09:00
	 LONG_SRL	t8, a1, 0x7
2006-12-08 01:04:51 +09:00
#ifdef USE_DOUBLE
	ld	t0, 0x00(src)
	ld	t1, 0x08(src)
2006-12-08 01:04:31 +09:00
	ADDC(sum, t0)
	ADDC(sum, t1)
2006-12-08 01:04:51 +09:00
#else
	CSUM_BIGCHUNK1(src, 0x00, sum, t0, t1, t3, t4)
#endif
2006-12-04 00:42:59 +09:00
	LONG_SUBU	a1, a1, 0x10
	PTR_ADDU	src, src, 0x10
	LONG_SRL	t8, a1, 0x7
2008-01-29 10:14:59 +00:00
.Lbegin_movement:
2006-12-04 00:42:59 +09:00
	beqz	t8, 1f
2006-12-08 01:04:31 +09:00
	 andi	t2, a1, 0x40
2006-12-04 00:42:59 +09:00
2008-01-29 10:14:59 +00:00
.Lmove_128bytes:
2006-12-08 01:04:31 +09:00
	CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
	CSUM_BIGCHUNK(src, 0x20, sum, t0, t1, t3, t4)
	CSUM_BIGCHUNK(src, 0x40, sum, t0, t1, t3, t4)
	CSUM_BIGCHUNK(src, 0x60, sum, t0, t1, t3, t4)
2006-12-04 00:42:59 +09:00
	LONG_SUBU	t8, t8, 0x01
2007-10-23 12:43:25 +01:00
	.set	reorder				/* DADDI_WAR */
	PTR_ADDU	src, src, 0x80
2008-01-29 10:14:59 +00:00
	bnez	t8, .Lmove_128bytes
2007-10-23 12:43:25 +01:00
	.set	noreorder
2006-12-04 00:42:59 +09:00
1:
2006-12-08 01:04:31 +09:00
	beqz	t2, 1f
	 andi	t2, a1, 0x20
2006-12-04 00:42:59 +09:00
2008-01-29 10:14:59 +00:00
.Lmove_64bytes:
2006-12-08 01:04:31 +09:00
	CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
	CSUM_BIGCHUNK(src, 0x20, sum, t0, t1, t3, t4)
2006-12-04 00:42:59 +09:00
	PTR_ADDU	src, src, 0x40
1:
2008-01-29 10:14:59 +00:00
	beqz	t2, .Ldo_end_words
2006-12-04 00:42:59 +09:00
	 andi	t8, a1, 0x1c
2008-01-29 10:14:59 +00:00
.Lmove_32bytes:
2006-12-08 01:04:31 +09:00
	CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
2006-12-04 00:42:59 +09:00
	andi	t8, a1, 0x1c
	PTR_ADDU	src, src, 0x20
2008-01-29 10:14:59 +00:00
.Ldo_end_words:
	beqz	t8, .Lsmall_csumcpy
2006-12-08 01:04:45 +09:00
	 andi	t2, a1, 0x3
	LONG_SRL	t8, t8, 0x2
2006-12-04 00:42:59 +09:00
2008-01-29 10:14:59 +00:00
.Lend_words:
2008-09-20 17:20:04 +02:00
	LOAD32	t0, (src)
2006-12-04 00:42:59 +09:00
	LONG_SUBU	t8, t8, 0x1
2006-12-08 01:04:31 +09:00
	ADDC(sum, t0)
2007-10-23 12:43:25 +01:00
	.set	reorder				/* DADDI_WAR */
	PTR_ADDU	src, src, 0x4
2008-01-29 10:14:59 +00:00
	bnez	t8, .Lend_words
2007-10-23 12:43:25 +01:00
	.set	noreorder
2006-12-04 00:42:59 +09:00
2006-12-08 01:04:45 +09:00
/* unknown src alignment and < 8 bytes to go */
2008-01-29 10:14:59 +00:00
.Lsmall_csumcpy:
2006-12-08 01:04:45 +09:00
	move	a1, t2
2006-12-04 00:42:59 +09:00
2006-12-08 01:04:45 +09:00
	andi	t0, a1, 4
	beqz	t0, 1f
	 andi	t0, a1, 2
2006-12-04 00:42:59 +09:00
2006-12-08 01:04:45 +09:00
	/* Still a full word to go */
	ulw	t1, (src)
	PTR_ADDIU	src, 4
2008-09-20 17:20:04 +02:00
#ifdef USE_DOUBLE
	dsll	t1, t1, 32			/* clear lower 32bit */
#endif
2006-12-08 01:04:45 +09:00
	ADDC(sum, t1)
1:	move	t1, zero
	beqz	t0, 1f
	 andi	t0, a1, 1
	/* Still a halfword to go */
	ulhu	t1, (src)
	PTR_ADDIU	src, 2
1:	beqz	t0, 1f
	 sll	t1, t1, 16
	lbu	t2, (src)
	 nop
#ifdef __MIPSEB__
	sll	t2, t2, 8
#endif
	or	t1, t2
1:	ADDC(sum, t1)
2006-12-04 00:42:59 +09:00
2006-12-08 01:04:45 +09:00
/* fold checksum */
2006-12-08 01:04:51 +09:00
#ifdef USE_DOUBLE
	dsll32	v1, sum, 0
	daddu	sum, v1
	sltu	v1, sum, v1
	dsra32	sum, sum, 0
	addu	sum, v1
#endif
2006-12-08 01:04:45 +09:00
/* odd buffer alignment? */
mips: Add MIPS Release 5 support
There are five MIPS32/64 architecture releases currently available:
from 1 to 6 except fourth one, which was intentionally skipped.
Three of them can be called major: 1st, 2nd and 6th, that not only
have some system level alterations, but also introduced significant
core/ISA level updates. The rest of the MIPS architecture releases are
minor.
Even though they don't have as many ISA/system/core level changes
as the major ones with respect to the previous releases, they still
provide a set of updates (I'd say they were intended to be the
intermediate releases before a major one) that might be useful for the
kernel and user-level code, when activated by the kernel or compiler.
In particular the following features were introduced or ended up being
available at/after MIPS32/64 Release 5 architecture:
+ the last release of the misaligned memory access instructions,
+ virtualisation - VZ ASE - is optional component of the arch,
+ SIMD - MSA ASE - is optional component of the arch,
+ DSP ASE is optional component of the arch,
+ CP0.Status.FR=1 for CP1.FIR.F64=1 (pure 64-bit FPU general registers)
must be available if FPU is implemented,
+ CP1.FIR.Has2008 support is required so CP1.FCSR.{ABS2008,NAN2008} bits
are available.
+ UFR/UNFR aliases to access CP0.Status.FR from user-space by means of
ctc1/cfc1 instructions (enabled by CP0.Config5.UFR),
+ CP0.Config5.LLB=1 and the eretnc instruction are implemented to avoid
accidentally clearing the LL-bit when returning from an interrupt,
exception, or error trap,
+ XPA feature together with extended versions of CPx registers is
introduced, which needs to have mfhc0/mthc0 instructions available.
So due to these changes GNU GCC provides extended instruction set
support for MIPS32/64 Release 5 by default, like eretnc/mfhc0/mthc0. Even
though the architecture alteration isn't that big, it is still worth
taking into account in the kernel software. Finally, we can't deny that
some optimizations/limitations might be found in the future and implemented
at some level in the kernel or compiler. In this case having support for
even intermediate MIPS architecture releases would be more than
useful.
So most of the changes provided by this commit can be split into
either compile-time or runtime config related ones. The compile-time related
changes are caused by adding the new CONFIG_CPU_MIPS32_R5/CONFIG_CPU_MIPSR5
configs and concern the code activating MIPSR2 or MIPSR6 already
implemented features (like eretnc/LLbit, mthc0/mfhc0). In addition
CPU_HAS_MSA can now be freely enabled for MIPS32/64 release 5 based
platforms as this is done for CPU_MIPS32_R6 CPUs. The runtime changes
concern the features which are handled with respect to the MIPS ISA
revision detected at run-time by means of CP0.Config.{AT,AR} bits. Alas
these fields can only be used to detect the r1, r2 or r6 releases.
But since we know which CPUs in fact support the R5 arch, we can manually
set the MIPS_CPU_ISA_M32R5/MIPS_CPU_ISA_M64R5 bit of c->isa_level and then
use cpu_has_mips32r5/cpu_has_mips64r5 where it's appropriate.
Since XPA/EVA provide too complex alterations and to have them used with
MIPS32 Release 2 charged kernels (for compatibility with current platform
configs) they are left to be set up as separate kernel configs.
Co-developed-by: Alexey Malahov <Alexey.Malahov@baikalelectronics.ru>
Signed-off-by: Alexey Malahov <Alexey.Malahov@baikalelectronics.ru>
Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: devicetree@vger.kernel.org
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-05-21 17:07:14 +03:00
#if defined(CONFIG_CPU_MIPSR2) || defined(CONFIG_CPU_MIPSR5) || \
    defined(CONFIG_CPU_LOONGSON64)
2014-08-15 16:56:58 +08:00
	.set	push
	.set	arch=mips32r2
2008-10-11 16:18:53 +01:00
	wsbh	v1, sum
	movn	sum, v1, t7
2014-08-15 16:56:58 +08:00
	.set	pop
2008-10-11 16:18:53 +01:00
#else
	beqz	t7, 1f			/* odd buffer alignment? */
	 lui	v1, 0x00ff
	addu	v1, 0x00ff
	and	t0, sum, v1
	sll	t0, t0, 8
2006-12-08 01:04:45 +09:00
	srl	sum, sum, 8
2008-10-11 16:18:53 +01:00
	and	sum, sum, v1
	or	sum, sum, t0
2006-12-08 01:04:45 +09:00
1:
2008-10-11 16:18:53 +01:00
#endif
2006-12-08 01:04:45 +09:00
	.set	reorder
2013-01-22 12:59:30 +01:00
	/* Add the passed partial csum. */
2008-09-20 17:20:04 +02:00
	ADDC32(sum, a2)
2006-12-04 00:42:59 +09:00
	jr	ra
2006-12-08 01:04:45 +09:00
	.set	noreorder
2006-12-04 00:42:59 +09:00
	END(csum_partial)
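The tail of csum_partial does two things before adding the caller's partial checksum: under USE_DOUBLE it folds the 64-bit accumulator to 32 bits, and when the buffer started at an odd address (t7 non-zero) it swaps the bytes within each halfword. A minimal C sketch of both steps follows; the function names are hypothetical and not taken from the kernel.

```c
#include <stdint.h>

/* Hypothetical model of the USE_DOUBLE fold: high half + low half + carry. */
static uint32_t fold64_to_32(uint64_t sum)
{
	uint64_t v1 = sum << 32;		/* dsll32 v1, sum, 0 */

	sum += v1;				/* daddu  sum, v1 */
	uint32_t carry = sum < v1;		/* sltu   v1, sum, v1 */
	return (uint32_t)(sum >> 32) + carry;	/* dsra32 sum, sum, 0; addu sum, v1 */
}

/* Hypothetical model of the odd-alignment fixup: swap bytes in each halfword. */
static uint32_t swap_bytes_in_halfwords(uint32_t sum)
{
	const uint32_t mask = 0x00ff00ffu;	/* lui v1, 0x00ff; addu v1, 0x00ff */

	return ((sum & mask) << 8) | ((sum >> 8) & mask);
}
```

The second helper mirrors the fallback path taken when `wsbh` is not available; on the R2/R5/Loongson64 path the source uses `wsbh` followed by a conditional move on `t7` instead of a branch.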
2006-12-13 01:22:06 +09:00
/*
 * checksum and copy routines based on memcpy.S
 *
2020-07-19 17:37:15 -04:00
 *	csum_partial_copy_nocheck(src, dst, len)
 *	__csum_partial_copy_kernel(src, dst, len)
2006-12-13 01:22:06 +09:00
 *
2013-01-22 12:59:30 +01:00
 * See "Spec" in memcpy.S for details.  Unlike __copy_user, all
2006-12-13 01:22:06 +09:00
 * functions in this file use the standard calling convention.
 */
#define src a0
#define dst a1
#define len a2
#define sum v0
#define odd t8
/*
2020-07-19 17:37:15 -04:00
 * All exception handlers simply return 0.
2006-12-13 01:22:06 +09:00
 */
2014-01-16 17:02:13 +00:00
/* Instruction type */
#define LD_INSN 1
#define ST_INSN 2
2014-01-17 10:48:46 +00:00
#define LEGACY_MODE 1
#define EVA_MODE 2
#define USEROP 1
#define KERNELOP 2
2014-01-16 17:02:13 +00:00
/*
 * Wrapper to add an entry in the exception table
 * in case the insn causes a memory exception.
 * Arguments:
 * insn    : Load/store instruction
 * type    : Instruction type
 * reg     : Register
 * addr    : Address
 * handler : Exception handler
 */
2020-07-19 17:37:15 -04:00
#define EXC(insn, type, reg, addr) \
2014-01-17 10:48:46 +00:00
	.if \mode == LEGACY_MODE; \
9:		insn reg, addr; \
		.section __ex_table,"a"; \
2020-07-19 17:37:15 -04:00
		PTR	9b, .L_exc; \
2014-01-17 10:48:46 +00:00
		.previous; \
2014-01-17 11:36:16 +00:00
	/* This is enabled in EVA mode */ \
	.else; \
		/* If loading from user or storing to user */ \
		.if ((\from == USEROP) && (type == LD_INSN)) || \
		    ((\to == USEROP) && (type == ST_INSN)); \
9:			__BUILD_EVA_INSN(insn##e, reg, addr); \
			.section __ex_table,"a"; \
2020-07-19 17:37:15 -04:00
			PTR	9b, .L_exc; \
2014-01-17 11:36:16 +00:00
			.previous; \
		.else; \
			/* EVA without exception */ \
			insn reg, addr; \
		.endif; \
2014-01-17 10:48:46 +00:00
	.endif
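Each use of EXC records a pair in the `__ex_table` section: the address of the guarded instruction (local label `9`) and the address of the handler (`.L_exc`). How the kernel consults that table is not shown in this file, so the following C sketch is purely illustrative: `struct ex_entry`, `find_fixup` and the linear search are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: address of a guarded insn and where to resume on a fault. */
struct ex_entry {
	uintptr_t insn;
	uintptr_t fixup;
};

/* Linear search; returns the fixup address, or 0 when pc is not guarded. */
static uintptr_t find_fixup(const struct ex_entry *tbl, size_t n, uintptr_t pc)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].insn == pc)
			return tbl[i].fixup;
	return 0;
}
```

In this model every guarded load or store in the macro body would share one fixup address, matching the source's single `.L_exc` handler.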
2006-12-13 01:22:06 +09:00
2014-01-16 17:02:13 +00:00
#undef LOAD
2006-12-13 01:22:06 +09:00
#ifdef USE_DOUBLE
2014-01-16 17:02:13 +00:00
#define LOADK ld /* No exception */
2020-07-19 17:37:15 -04:00
#define LOAD(reg, addr) EXC(ld, LD_INSN, reg, addr)
#define LOADBU(reg, addr) EXC(lbu, LD_INSN, reg, addr)
#define LOADL(reg, addr) EXC(ldl, LD_INSN, reg, addr)
#define LOADR(reg, addr) EXC(ldr, LD_INSN, reg, addr)
#define STOREB(reg, addr) EXC(sb, ST_INSN, reg, addr)
#define STOREL(reg, addr) EXC(sdl, ST_INSN, reg, addr)
#define STORER(reg, addr) EXC(sdr, ST_INSN, reg, addr)
#define STORE(reg, addr) EXC(sd, ST_INSN, reg, addr)
2006-12-13 01:22:06 +09:00
#define ADD daddu
#define SUB dsubu
#define SRL dsrl
#define SLL dsll
#define SLLV dsllv
#define SRLV dsrlv
#define NBYTES 8
#define LOG_NBYTES 3
#else
2014-01-16 17:02:13 +00:00
#define LOADK lw /* No exception */
2020-07-19 17:37:15 -04:00
#define LOAD(reg, addr) EXC(lw, LD_INSN, reg, addr)
#define LOADBU(reg, addr) EXC(lbu, LD_INSN, reg, addr)
#define LOADL(reg, addr) EXC(lwl, LD_INSN, reg, addr)
#define LOADR(reg, addr) EXC(lwr, LD_INSN, reg, addr)
#define STOREB(reg, addr) EXC(sb, ST_INSN, reg, addr)
#define STOREL(reg, addr) EXC(swl, ST_INSN, reg, addr)
#define STORER(reg, addr) EXC(swr, ST_INSN, reg, addr)
#define STORE(reg, addr) EXC(sw, ST_INSN, reg, addr)
2006-12-13 01:22:06 +09:00
#define ADD addu
#define SUB subu
#define SRL srl
#define SLL sll
#define SLLV sllv
#define SRLV srlv
#define NBYTES 4
#define LOG_NBYTES 2
#endif /* USE_DOUBLE */
#ifdef CONFIG_CPU_LITTLE_ENDIAN
#define LDFIRST LOADR
2013-01-22 12:59:30 +01:00
#define LDREST LOADL
2006-12-13 01:22:06 +09:00
#define STFIRST STORER
2013-01-22 12:59:30 +01:00
#define STREST STOREL
2006-12-13 01:22:06 +09:00
#define SHIFT_DISCARD SLLV
#define SHIFT_DISCARD_REVERT SRLV
#else
#define LDFIRST LOADL
2013-01-22 12:59:30 +01:00
#define LDREST LOADR
2006-12-13 01:22:06 +09:00
#define STFIRST STOREL
2013-01-22 12:59:30 +01:00
#define STREST STORER
2006-12-13 01:22:06 +09:00
#define SHIFT_DISCARD SRLV
#define SHIFT_DISCARD_REVERT SLLV
#endif
#define FIRST(unit) ((unit)*NBYTES)
#define REST(unit) (FIRST(unit)+NBYTES-1)
#define ADDRMASK (NBYTES-1)
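The offset helpers above are plain arithmetic on NBYTES. A small C sketch makes the values concrete; it assumes the USE_DOUBLE case (NBYTES of 8), and `misalignment` is a hypothetical helper that mirrors the `and t1, dst, ADDRMASK` style of alignment test used later in the file.

```c
#include <stdint.h>

#define NBYTES		8	/* assumption: the USE_DOUBLE configuration */
#define UNIT(unit)	((unit) * NBYTES)
#define FIRST(unit)	((unit) * NBYTES)
#define REST(unit)	(FIRST(unit) + NBYTES - 1)
#define ADDRMASK	(NBYTES - 1)

/* Non-zero when addr is not aligned to an NBYTES boundary. */
static unsigned int misalignment(uintptr_t addr)
{
	return (unsigned int)(addr & ADDRMASK);
}
```

FIRST(unit) is the offset of the first byte of a unit and REST(unit) the offset of its last byte, which is the pair of addresses the left/right partial load and store instructions are given.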
2007-10-23 12:43:25 +01:00
#ifndef CONFIG_CPU_DADDI_WORKAROUNDS
2006-12-13 01:22:06 +09:00
	.set	noat
2007-10-23 12:43:25 +01:00
#else
	.set	at=v1
#endif
2006-12-13 01:22:06 +09:00
2020-07-19 17:37:15 -04:00
	.macro __BUILD_CSUM_PARTIAL_COPY_USER mode, from, to
2014-01-17 10:48:46 +00:00
2020-07-19 17:37:15 -04:00
	li	sum, -1
2006-12-13 01:22:06 +09:00
	move	odd, zero
	/*
	 * Note: dst & src may be unaligned, len may be 0
	 * Temps
	 */
	/*
	 * The "issue break"s below are very approximate.
	 * Issue delays for dcache fills will perturb the schedule, as will
	 * load queue full replay traps, etc.
	 *
	 * If len < NBYTES use byte operations.
	 */
	sltu	t2, len, NBYTES
	and	t1, dst, ADDRMASK
2014-01-17 10:48:46 +00:00
	bnez	t2, .Lcopy_bytes_checklen\@
2006-12-13 01:22:06 +09:00
	 and	t0, src, ADDRMASK
	andi	odd, dst, 0x1			/* odd buffer? */
2014-01-17 10:48:46 +00:00
	bnez	t1, .Ldst_unaligned\@
2006-12-13 01:22:06 +09:00
	 nop
2014-01-17 10:48:46 +00:00
	bnez	t0, .Lsrc_unaligned_dst_aligned\@
2006-12-13 01:22:06 +09:00
	/*
	 * use delay slot for fall-through
	 * src and dst are aligned; need to compute rem
	 */
2014-01-17 10:48:46 +00:00
.Lboth_aligned\@:
2013-01-22 12:59:30 +01:00
	 SRL	t0, len, LOG_NBYTES+3		# +3 for 8 units/iter
2014-01-17 10:48:46 +00:00
	beqz	t0, .Lcleanup_both_aligned\@	# len < 8*NBYTES
2006-12-13 01:22:06 +09:00
	 nop
	SUB	len, 8*NBYTES			# subtract here for bgez loop
	.align	4
1:
2020-07-19 17:37:15 -04:00
	LOAD(t0, UNIT(0)(src))
	LOAD(t1, UNIT(1)(src))
	LOAD(t2, UNIT(2)(src))
	LOAD(t3, UNIT(3)(src))
	LOAD(t4, UNIT(4)(src))
	LOAD(t5, UNIT(5)(src))
	LOAD(t6, UNIT(6)(src))
	LOAD(t7, UNIT(7)(src))
2006-12-13 01:22:06 +09:00
	SUB	len, len, 8*NBYTES
	ADD	src, src, 8*NBYTES
2020-07-19 17:37:15 -04:00
	STORE(t0, UNIT(0)(dst))
2015-03-27 01:07:24 +08:00
	ADDC(t0, t1)
2020-07-19 17:37:15 -04:00
	STORE(t1, UNIT(1)(dst))
2015-03-27 01:07:24 +08:00
	ADDC(sum, t0)
2020-07-19 17:37:15 -04:00
	STORE(t2, UNIT(2)(dst))
2015-03-27 01:07:24 +08:00
	ADDC(t2, t3)
2020-07-19 17:37:15 -04:00
	STORE(t3, UNIT(3)(dst))
2015-03-27 01:07:24 +08:00
	ADDC(sum, t2)
2020-07-19 17:37:15 -04:00
	STORE(t4, UNIT(4)(dst))
2015-03-27 01:07:24 +08:00
ADDC(t4, t5)
2020-07-19 17:37:15 -04:00
STORE(t5, UNIT(5)(dst))
2015-03-27 01:07:24 +08:00
ADDC(sum, t4)
2020-07-19 17:37:15 -04:00
STORE(t6, UNIT(6)(dst))
2015-03-27 01:07:24 +08:00
ADDC(t6, t7)
2020-07-19 17:37:15 -04:00
STORE(t7, UNIT(7)(dst))
2015-03-27 01:07:24 +08:00
ADDC(sum, t6)
2007-10-23 12:43:25 +01:00
.set reorder /* DADDI_WAR */
ADD dst, dst, 8*NBYTES
2006-12-13 01:22:06 +09:00
bgez len, 1b
2007-10-23 12:43:25 +01:00
.set noreorder
2006-12-13 01:22:06 +09:00
ADD len, 8*NBYTES # revert len (see above)
/*
 * len == the number of bytes left to copy < 8*NBYTES
 */
2014-01-17 10:48:46 +00:00
.Lcleanup_both_aligned\@:
2006-12-13 01:22:06 +09:00
#define rem t7
2014-01-17 10:48:46 +00:00
beqz len, .Ldone\@
2006-12-13 01:22:06 +09:00
sltu t0, len, 4*NBYTES
2014-01-17 10:48:46 +00:00
bnez t0, .Lless_than_4units\@
2006-12-13 01:22:06 +09:00
and rem, len, (NBYTES-1) # rem = len % NBYTES
/*
 * len >= 4*NBYTES
 */
2020-07-19 17:37:15 -04:00
LOAD(t0, UNIT(0)(src))
LOAD(t1, UNIT(1)(src))
LOAD(t2, UNIT(2)(src))
LOAD(t3, UNIT(3)(src))
2006-12-13 01:22:06 +09:00
SUB len, len, 4*NBYTES
ADD src, src, 4*NBYTES
2020-07-19 17:37:15 -04:00
STORE(t0, UNIT(0)(dst))
2015-03-27 01:07:24 +08:00
ADDC(t0, t1)
2020-07-19 17:37:15 -04:00
STORE(t1, UNIT(1)(dst))
2015-03-27 01:07:24 +08:00
ADDC(sum, t0)
2020-07-19 17:37:15 -04:00
STORE(t2, UNIT(2)(dst))
2015-03-27 01:07:24 +08:00
ADDC(t2, t3)
2020-07-19 17:37:15 -04:00
STORE(t3, UNIT(3)(dst))
2015-03-27 01:07:24 +08:00
ADDC(sum, t2)
2007-10-23 12:43:25 +01:00
.set reorder /* DADDI_WAR */
ADD dst, dst, 4*NBYTES
2014-01-17 10:48:46 +00:00
beqz len, .Ldone\@
2007-10-23 12:43:25 +01:00
.set noreorder
2014-01-17 10:48:46 +00:00
.Lless_than_4units\@:
2006-12-13 01:22:06 +09:00
/*
 * rem = len % NBYTES
 */
2014-01-17 10:48:46 +00:00
beq rem, len, .Lcopy_bytes\@
2006-12-13 01:22:06 +09:00
nop
1:
2020-07-19 17:37:15 -04:00
LOAD(t0, 0(src))
2006-12-13 01:22:06 +09:00
ADD src, src, NBYTES
SUB len, len, NBYTES
2020-07-19 17:37:15 -04:00
STORE(t0, 0(dst))
2006-12-13 01:22:06 +09:00
ADDC(sum, t0)
2007-10-23 12:43:25 +01:00
.set reorder /* DADDI_WAR */
ADD dst, dst, NBYTES
2006-12-13 01:22:06 +09:00
bne rem, len, 1b
2007-10-23 12:43:25 +01:00
.set noreorder
2006-12-13 01:22:06 +09:00
/*
 * src and dst are aligned, need to copy rem bytes (rem < NBYTES)
 * A loop would do only a byte at a time with possible branch
2013-01-22 12:59:30 +01:00
 * mispredicts. Can't do an explicit LOAD dst,mask,or,STORE
2006-12-13 01:22:06 +09:00
 * because can't assume read-access to dst. Instead, use
 * STREST dst, which doesn't require read access to dst.
 *
 * This code should perform better than a simple loop on modern,
 * wide-issue mips processors because the code has fewer branches and
 * more instruction-level parallelism.
 */
#define bits t2
2014-01-17 10:48:46 +00:00
beqz len, .Ldone\@
2006-12-13 01:22:06 +09:00
ADD t1, dst, len # t1 is just past last byte of dst
li bits, 8*NBYTES
SLL rem, len, 3 # rem = number of bits to keep
2020-07-19 17:37:15 -04:00
LOAD(t0, 0(src))
2013-01-22 12:59:30 +01:00
SUB bits, bits, rem # bits = number of bits to discard
2006-12-13 01:22:06 +09:00
SHIFT_DISCARD t0, t0, bits
2020-07-19 17:37:15 -04:00
STREST(t0, -1(t1))
2006-12-13 01:22:06 +09:00
SHIFT_DISCARD_REVERT t0, t0, bits
.set reorder
ADDC(sum, t0)
2014-01-17 10:48:46 +00:00
b .Ldone\@
2006-12-13 01:22:06 +09:00
.set noreorder
2014-01-17 10:48:46 +00:00
.Ldst_unaligned\@:
2006-12-13 01:22:06 +09:00
/*
 * dst is unaligned
 * t0 = src & ADDRMASK
 * t1 = dst & ADDRMASK; T1 > 0
 * len >= NBYTES
 *
 * Copy enough bytes to align dst
 * Set match = (src and dst have same alignment)
 */
#define match rem
2020-07-19 17:37:15 -04:00
LDFIRST(t3, FIRST(0)(src))
2006-12-13 01:22:06 +09:00
ADD t2, zero, NBYTES
2020-07-19 17:37:15 -04:00
LDREST(t3, REST(0)(src))
2006-12-13 01:22:06 +09:00
SUB t2, t2, t1 # t2 = number of bytes copied
xor match, t0, t1
2020-07-19 17:37:15 -04:00
STFIRST(t3, FIRST(0)(dst))
2006-12-13 01:22:06 +09:00
SLL t4, t1, 3 # t4 = number of bits to discard
SHIFT_DISCARD t3, t3, t4
/* no SHIFT_DISCARD_REVERT to handle odd buffer properly */
ADDC(sum, t3)
2014-01-17 10:48:46 +00:00
beq len, t2, .Ldone\@
2006-12-13 01:22:06 +09:00
SUB len, len, t2
ADD dst, dst, t2
2014-01-17 10:48:46 +00:00
beqz match, .Lboth_aligned\@
2006-12-13 01:22:06 +09:00
ADD src, src, t2
2014-01-17 10:48:46 +00:00
.Lsrc_unaligned_dst_aligned\@:
2013-01-22 12:59:30 +01:00
SRL t0, len, LOG_NBYTES+2 # +2 for 4 units/iter
2014-01-17 10:48:46 +00:00
beqz t0, .Lcleanup_src_unaligned\@
2013-01-22 12:59:30 +01:00
and rem, len, (4*NBYTES-1) # rem = len % 4*NBYTES
2006-12-13 01:22:06 +09:00
1:
/*
 * Avoid consecutive LD*'s to the same register since some mips
 * implementations can't issue them in the same cycle.
 * It's OK to load FIRST(N+1) before REST(N) because the two addresses
 * are to the same unit (unless src is aligned, but it's not).
 */
2020-07-19 17:37:15 -04:00
LDFIRST(t0, FIRST(0)(src))
LDFIRST(t1, FIRST(1)(src))
2013-01-22 12:59:30 +01:00
SUB len, len, 4*NBYTES
2020-07-19 17:37:15 -04:00
LDREST(t0, REST(0)(src))
LDREST(t1, REST(1)(src))
LDFIRST(t2, FIRST(2)(src))
LDFIRST(t3, FIRST(3)(src))
LDREST(t2, REST(2)(src))
LDREST(t3, REST(3)(src))
2006-12-13 01:22:06 +09:00
ADD src, src, 4*NBYTES
#ifdef CONFIG_CPU_SB1
nop # improves slotting
#endif
2020-07-19 17:37:15 -04:00
STORE(t0, UNIT(0)(dst))
2015-03-27 01:07:24 +08:00
ADDC(t0, t1)
2020-07-19 17:37:15 -04:00
STORE(t1, UNIT(1)(dst))
2015-03-27 01:07:24 +08:00
ADDC(sum, t0)
2020-07-19 17:37:15 -04:00
STORE(t2, UNIT(2)(dst))
2015-03-27 01:07:24 +08:00
ADDC(t2, t3)
2020-07-19 17:37:15 -04:00
STORE(t3, UNIT(3)(dst))
2015-03-27 01:07:24 +08:00
ADDC(sum, t2)
2007-10-23 12:43:25 +01:00
.set reorder /* DADDI_WAR */
ADD dst, dst, 4*NBYTES
2006-12-13 01:22:06 +09:00
bne len, rem, 1b
2007-10-23 12:43:25 +01:00
.set noreorder
2006-12-13 01:22:06 +09:00
2014-01-17 10:48:46 +00:00
.Lcleanup_src_unaligned\@:
beqz len, .Ldone\@
2006-12-13 01:22:06 +09:00
and rem, len, NBYTES-1 # rem = len % NBYTES
2014-01-17 10:48:46 +00:00
beq rem, len, .Lcopy_bytes\@
2006-12-13 01:22:06 +09:00
nop
1:
2020-07-19 17:37:15 -04:00
LDFIRST(t0, FIRST(0)(src))
LDREST(t0, REST(0)(src))
2006-12-13 01:22:06 +09:00
ADD src, src, NBYTES
SUB len, len, NBYTES
2020-07-19 17:37:15 -04:00
STORE(t0, 0(dst))
2006-12-13 01:22:06 +09:00
ADDC(sum, t0)
2007-10-23 12:43:25 +01:00
.set reorder /* DADDI_WAR */
ADD dst, dst, NBYTES
2006-12-13 01:22:06 +09:00
bne len, rem, 1b
2007-10-23 12:43:25 +01:00
.set noreorder
2006-12-13 01:22:06 +09:00
2014-01-17 10:48:46 +00:00
.Lcopy_bytes_checklen\@:
beqz len, .Ldone\@
2006-12-13 01:22:06 +09:00
nop
2014-01-17 10:48:46 +00:00
.Lcopy_bytes\@:
2006-12-13 01:22:06 +09:00
/* 0 < len < NBYTES */
#ifdef CONFIG_CPU_LITTLE_ENDIAN
#define SHIFT_START 0
#define SHIFT_INC 8
#else
#define SHIFT_START 8*(NBYTES-1)
#define SHIFT_INC -8
#endif
move t2, zero # partial word
2013-01-22 12:59:30 +01:00
li t3, SHIFT_START # shift
2006-12-13 01:22:06 +09:00
#define COPY_BYTE(N) \
2020-07-19 17:37:15 -04:00
LOADBU(t0, N(src)); \
2006-12-13 01:22:06 +09:00
SUB len, len, 1; \
2020-07-19 17:37:15 -04:00
STOREB(t0, N(dst)); \
2006-12-13 01:22:06 +09:00
SLLV t0, t0, t3; \
addu t3, SHIFT_INC; \
2014-01-17 10:48:46 +00:00
beqz len, .Lcopy_bytes_done\@; \
2006-12-13 01:22:06 +09:00
or t2, t0
COPY_BYTE(0)
COPY_BYTE(1)
#ifdef USE_DOUBLE
COPY_BYTE(2)
COPY_BYTE(3)
COPY_BYTE(4)
COPY_BYTE(5)
#endif
2020-07-19 17:37:15 -04:00
LOADBU(t0, NBYTES-2(src))
2006-12-13 01:22:06 +09:00
SUB len, len, 1
2020-07-19 17:37:15 -04:00
STOREB(t0, NBYTES-2(dst))
2006-12-13 01:22:06 +09:00
SLLV t0, t0, t3
or t2, t0
2014-01-17 10:48:46 +00:00
.Lcopy_bytes_done\@:
2006-12-13 01:22:06 +09:00
ADDC(sum, t2)
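The byte-at-a-time tail above can be summarised by a small C model. This is a sketch only: it uses a loop where the assembly is unrolled through COPY_BYTE, fixes NBYTES at 8 (the USE_DOUBLE case) as an assumption, and the function name `copy_tail_bytes` and its `little_endian` flag are hypothetical stand-ins for the CONFIG_CPU_LITTLE_ENDIAN choice of SHIFT_START and SHIFT_INC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBYTES 8	/* assumption: USE_DOUBLE build, 8-byte units */

/*
 * Copy len bytes (0 < len < NBYTES) from src to dst and return the partial
 * word that the assembly accumulates in t2 before the final ADDC(sum, t2).
 */
uint64_t copy_tail_bytes(uint8_t *dst, const uint8_t *src, size_t len,
			 int little_endian)
{
	int shift = little_endian ? 0 : 8 * (NBYTES - 1);	/* SHIFT_START */
	int inc = little_endian ? 8 : -8;			/* SHIFT_INC */
	uint64_t partial = 0;					/* t2 */

	for (size_t i = 0; i < len; i++) {
		uint64_t byte = src[i];		/* LOADBU */
		dst[i] = (uint8_t)byte;		/* STOREB */
		partial |= byte << shift;	/* SLLV, then or */
		shift += inc;			/* addu t3, SHIFT_INC */
	}
	return partial;
}
```

Each byte lands in the lane it would occupy had the whole unit been loaded at once, which is why the partial word can be added to the running sum with a single ADDC.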
2014-01-17 10:48:46 +00:00
.Ldone\@:
2006-12-13 01:22:06 +09:00
/* fold checksum */
2014-04-04 03:32:54 +01:00
.set push
.set noat
2006-12-13 01:22:06 +09:00
#ifdef USE_DOUBLE
dsll32 v1, sum, 0
daddu sum, v1
sltu v1, sum, v1
dsra32 sum, sum, 0
addu sum, v1
#endif
mips: Add MIPS Release 5 support

There are five MIPS32/64 architecture releases currently available:
releases 1 to 6, except the fourth, which was intentionally skipped.
Three of them can be called major: the 1st, 2nd and 6th, which not only
have some system-level alterations, but also introduced significant
core/ISA-level updates. The rest of the MIPS architecture releases are
minor.

Even though the minor releases don't have as many ISA/system/core-level
changes as the major ones with respect to the previous releases, they still
provide a set of updates (I'd say they were intended to be the
intermediate releases before a major one) that might be useful for the
kernel and user-level code, when activated by the kernel or compiler.
In particular, the following features were introduced or ended up being
available at/after the MIPS32/64 Release 5 architecture:
+ the last release of the misaligned memory access instructions,
+ virtualisation - VZ ASE - is an optional component of the arch,
+ SIMD - MSA ASE - is an optional component of the arch,
+ DSP ASE is an optional component of the arch,
+ CP0.Status.FR=1 for CP1.FIR.F64=1 (pure 64-bit FPU general registers)
  must be available if an FPU is implemented,
+ CP1.FIR.Has2008 support is required, so the CP1.FCSR.{ABS2008,NAN2008}
  bits are available,
+ UFR/UNFR aliases to access CP0.Status.FR from user space by means of
  ctc1/cfc1 instructions (enabled by CP0.Config5.UFR),
+ CP0.Config5.LLB=1 and the eretnc instruction are implemented to avoid
  accidentally clearing the LL-bit when returning from an interrupt,
  exception, or error trap,
+ the XPA feature together with extended versions of CPx registers is
  introduced, which needs to have mfhc0/mthc0 instructions available.

Because of these changes, GCC by default provides extended instruction set
support for MIPS32/64 Release 5, such as eretnc/mfhc0/mthc0. Even
though the architecture alteration isn't that big, it is still worth
taking into account in the kernel software. Finally, we can't deny that
some optimizations/limitations might be found in the future and implemented
at some level in the kernel or compiler. In that case, having support for
even the intermediate MIPS architecture releases would be more than
useful.

Most of the changes provided by this commit can be split into
compile-time and runtime config related ones. The compile-time
changes are caused by adding the new CONFIG_CPU_MIPS32_R5/CONFIG_CPU_MIPSR5
configs and concern the code activating already implemented MIPSR2 or
MIPSR6 features (like eretnc/LLbit, mthc0/mfhc0). In addition,
CPU_HAS_MSA can now be freely enabled for MIPS32/64 Release 5 based
platforms, as is done for CPU_MIPS32_R6 CPUs. The runtime changes
concern the features which are handled with respect to the MIPS ISA
revision detected at run-time by means of the CP0.Config.{AT,AR} bits. Alas,
these fields can only be used to detect the r1, r2, or r6 releases.
But since we know which CPUs in fact support the R5 arch, we can manually
set the MIPS_CPU_ISA_M32R5/MIPS_CPU_ISA_M64R5 bit of c->isa_level and then
use cpu_has_mips32r5/cpu_has_mips64r5 where appropriate.

Since XPA/EVA introduce rather complex alterations, and so that they can
still be used with MIPS32 Release 2 based kernels (for compatibility with
current platform configs), they are left to be set up as separate kernel
configs.
Co-developed-by: Alexey Malahov <Alexey.Malahov@baikalelectronics.ru>
Signed-off-by: Alexey Malahov <Alexey.Malahov@baikalelectronics.ru>
Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: devicetree@vger.kernel.org
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-05-21 17:07:14 +03:00
#if defined(CONFIG_CPU_MIPSR2) || defined(CONFIG_CPU_MIPSR5) || \
defined(CONFIG_CPU_LOONGSON64)
2014-08-15 16:56:58 +08:00
.set push
.set arch=mips32r2
2008-10-11 16:18:53 +01:00
wsbh v1, sum
movn sum, v1, odd
2014-08-15 16:56:58 +08:00
.set pop
2008-10-11 16:18:53 +01:00
#else
beqz odd, 1f /* odd buffer alignment? */
lui v1, 0x00ff
addu v1, 0x00ff
and t0, sum, v1
sll t0, t0, 8
2006-12-13 01:22:06 +09:00
srl sum, sum, 8
2008-10-11 16:18:53 +01:00
and sum, sum, v1
or sum, sum, t0
2006-12-13 01:22:06 +09:00
1:
2008-10-11 16:18:53 +01:00
#endif
2014-04-04 03:32:54 +01:00
.set pop
2006-12-13 01:22:06 +09:00
.set reorder
jr ra
.set noreorder
2020-07-19 17:37:15 -04:00
.endm
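The tail of the macro (the USE_DOUBLE fold and the odd-alignment byte swap) can be read against the following C model. It is a sketch under stated assumptions, not the kernel's code: the function names `fold64` and `swap_if_odd` are hypothetical, the 64-bit sum is taken as an unsigned value, and the result is truncated to 32 bits where the hardware would keep a sign-extended register.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the USE_DOUBLE fold: add the two 32-bit halves, end-around carry. */
uint32_t fold64(uint64_t sum)
{
	uint64_t v1 = sum << 32;		/* dsll32 v1, sum, 0 */
	sum += v1;				/* daddu sum, v1 */
	uint32_t carry = sum < v1;		/* sltu v1, sum, v1 */
	return (uint32_t)(sum >> 32) + carry;	/* dsra32 sum, sum, 0; addu sum, v1 */
}

/* Model of the #else path: swap the bytes inside each 16-bit half when odd. */
uint32_t swap_if_odd(uint32_t sum, int odd)
{
	const uint32_t mask = 0x00ff00ff;	/* lui v1, 0x00ff; addu v1, 0x00ff */

	if (!odd)
		return sum;			/* beqz odd, 1f */
	uint32_t t0 = (sum & mask) << 8;	/* and t0, sum, v1; sll t0, t0, 8 */
	sum = (sum >> 8) & mask;		/* srl sum, sum, 8; and sum, sum, v1 */
	return sum | t0;			/* or sum, sum, t0 */
}
```

On the R2/R5/Loongson64 path the same swap is done by wsbh, and movn keeps the swapped value only when odd is non-zero, so `swap_if_odd` models both branches of the #if.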
2006-12-13 01:22:06 +09:00
2020-07-19 17:37:15 -04:00
.set noreorder
.L_exc:
2006-12-13 01:22:06 +09:00
jr ra
2020-07-19 17:37:15 -04:00
li v0, 0
2014-01-17 10:48:46 +00:00
2020-07-19 17:37:15 -04:00
FEXPORT(__csum_partial_copy_nocheck)
EXPORT_SYMBOL(__csum_partial_copy_nocheck)
2014-01-17 11:36:16 +00:00
#ifndef CONFIG_EVA
2014-01-17 10:48:46 +00:00
FEXPORT(__csum_partial_copy_to_user)
2016-11-07 11:14:13 +00:00
EXPORT_SYMBOL(__csum_partial_copy_to_user)
2014-01-17 10:48:46 +00:00
FEXPORT(__csum_partial_copy_from_user)
2016-11-07 11:14:13 +00:00
EXPORT_SYMBOL(__csum_partial_copy_from_user)
2014-01-17 11:36:16 +00:00
#endif
2020-07-19 17:37:15 -04:00
__BUILD_CSUM_PARTIAL_COPY_USER LEGACY_MODE USEROP USEROP
2014-01-17 11:36:16 +00:00
#ifdef CONFIG_EVA
LEAF(__csum_partial_copy_to_user)
2020-07-19 17:37:15 -04:00
__BUILD_CSUM_PARTIAL_COPY_USER EVA_MODE KERNELOP USEROP
2014-01-17 11:36:16 +00:00
END(__csum_partial_copy_to_user)
LEAF(__csum_partial_copy_from_user)
2020-07-19 17:37:15 -04:00
__BUILD_CSUM_PARTIAL_COPY_USER EVA_MODE USEROP KERNELOP
2014-01-17 11:36:16 +00:00
END(__csum_partial_copy_from_user)
#endif