2006-12-03 18:42:59 +03:00
/*
 * This file is subject to the terms and conditions of the GNU General Public
 * License.  See the file "COPYING" in the main directory of this archive
 * for more details.
 *
 * Quick'n'dirty IP checksum ...
 *
 * Copyright (C) 1998, 1999 Ralf Baechle
 * Copyright (C) 1999 Silicon Graphics, Inc.
2007-10-23 15:43:25 +04:00
 * Copyright (C) 2007  Maciej W. Rozycki
2013-12-12 20:21:00 +04:00
 * Copyright (C) 2014 Imagination Technologies Ltd.
2006-12-03 18:42:59 +03:00
 */
2006-12-12 19:22:06 +03:00
#include <linux/errno.h>
2006-12-03 18:42:59 +03:00
#include <asm/asm.h>
2006-12-12 19:22:06 +03:00
#include <asm/asm-offsets.h>
2006-12-03 18:42:59 +03:00
#include <asm/regdef.h>
#ifdef CONFIG_64BIT
2006-12-07 19:04:31 +03:00
/*
 * As we are sharing the code base with the mips32 tree (which uses the o32 ABI
 * register definitions), we need to redefine the register definitions from
 * the n64 ABI register naming to the o32 ABI register naming.
 */
#undef t0
#undef t1
#undef t2
#undef t3
#define t0	$8
#define t1	$9
#define t2	$10
#define t3	$11
#define t4	$12
#define t5	$13
#define t6	$14
#define t7	$15
2006-12-07 19:04:51 +03:00
#define USE_DOUBLE
2006-12-03 18:42:59 +03:00
#endif
2006-12-07 19:04:51 +03:00
#ifdef USE_DOUBLE
#define LOAD   ld
2008-09-20 19:20:04 +04:00
#define LOAD32 lwu
2006-12-07 19:04:51 +03:00
#define ADD    daddu
#define NBYTES 8
#else
#define LOAD   lw
2008-09-20 19:20:04 +04:00
#define LOAD32 lw
2006-12-07 19:04:51 +03:00
#define ADD    addu
#define NBYTES 4
#endif /* USE_DOUBLE */
#define UNIT(unit)  ((unit)*NBYTES)
2006-12-03 18:42:59 +03:00
#define ADDC(sum,reg)						\
2014-04-04 06:32:54 +04:00
	.set	push;						\
	.set	noat;						\
2006-12-07 19:04:51 +03:00
	ADD	sum, reg;					\
2006-12-03 18:42:59 +03:00
	sltu	v1, sum, reg;					\
2007-10-23 15:43:25 +04:00
	ADD	sum, v1;					\
2014-04-04 06:32:54 +04:00
	.set	pop
2006-12-03 18:42:59 +03:00
2008-09-20 19:20:04 +04:00
#define ADDC32(sum,reg)						\
2014-04-04 06:32:54 +04:00
	.set	push;						\
	.set	noat;						\
2008-09-20 19:20:04 +04:00
	addu	sum, reg;					\
	sltu	v1, sum, reg;					\
	addu	sum, v1;					\
2014-04-04 06:32:54 +04:00
	.set	pop
2008-09-20 19:20:04 +04:00
2006-12-07 19:04:51 +03:00
#define CSUM_BIGCHUNK1(src, offset, sum, _t0, _t1, _t2, _t3)	\
	LOAD	_t0, (offset + UNIT(0))(src);			\
	LOAD	_t1, (offset + UNIT(1))(src);			\
2013-01-22 15:59:30 +04:00
	LOAD	_t2, (offset + UNIT(2))(src);			\
	LOAD	_t3, (offset + UNIT(3))(src);			\
MIPS: csum_partial: Improve instruction parallelism.
Computing the sum introduces a true data dependency. This patch removes some
true data dependencies and hence increases instruction-level parallelism.
This patch brings up to a 50% csum performance gain on Loongson 3a.
One example of how this patch works is in CSUM_BIGCHUNK1:
// ** original **        vs     ** patch applied **
    ADDC(sum, t0)               ADDC(t0, t1)
    ADDC(sum, t1)               ADDC(t2, t3)
    ADDC(sum, t2)               ADDC(sum, t0)
    ADDC(sum, t3)               ADDC(sum, t2)
In the original implementation, each ADDC(sum, ...) depends on the sum
value updated by the previous ADDC (as a source operand).
With this patch applied, the first two ADDC operations are independent,
hence can be executed simultaneously if possible.
Another example is in the "copy and sum calculating chunk":
// ** original **        vs     ** patch applied **
    STORE(t0, UNIT(0) ...       STORE(t0, UNIT(0) ...
    ADDC(sum, t0)               ADDC(t0, t1)
    STORE(t1, UNIT(1) ...       STORE(t1, UNIT(1) ...
    ADDC(sum, t1)               ADDC(sum, t0)
    STORE(t2, UNIT(2) ...       STORE(t2, UNIT(2) ...
    ADDC(sum, t2)               ADDC(t2, t3)
    STORE(t3, UNIT(3) ...       STORE(t3, UNIT(3) ...
    ADDC(sum, t3)               ADDC(sum, t2)
With this patch applied, ADDC and the **next next** ADDC are independent.
Signed-off-by: chenj <chenj@lemote.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/9608/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2015-03-26 20:07:24 +03:00
	ADDC(_t0, _t1);						\
	ADDC(_t2, _t3);						\
2006-12-03 18:42:59 +03:00
	ADDC(sum, _t0);						\
2015-03-26 20:07:24 +03:00
	ADDC(sum, _t2)
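
The reordering in CSUM_BIGCHUNK1 is safe because end-around-carry addition is associative and commutative, so pairing operands first cannot change the final sum. The following is a minimal C sketch of that property, assuming 32-bit words (the non-USE_DOUBLE case). The names `addc`, `sum_serial`, and `sum_paired` are hypothetical and used only for illustration; they do not appear in the kernel source.

```c
#include <stdint.h>

/* Model of ADDC(sum, reg): add, detect the carry the way the asm does
 * with sltu (wrapped result < addend), then add the carry back in. */
uint32_t addc(uint32_t sum, uint32_t reg)
{
	sum += reg;
	return sum + (sum < reg);
}

/* Original order: every step depends on the previous value of sum. */
uint32_t sum_serial(uint32_t sum, const uint32_t w[4])
{
	sum = addc(sum, w[0]);
	sum = addc(sum, w[1]);
	sum = addc(sum, w[2]);
	sum = addc(sum, w[3]);
	return sum;
}

/* Patched order: the first two additions do not depend on each other,
 * so a superscalar core may issue them in parallel. */
uint32_t sum_paired(uint32_t sum, const uint32_t w[4])
{
	uint32_t t0 = addc(w[0], w[1]);
	uint32_t t2 = addc(w[2], w[3]);

	sum = addc(sum, t0);
	sum = addc(sum, t2);
	return sum;
}
```

Both functions return the same value for any four input words; only the dependency chain differs (four dependent steps versus three).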
2006-12-07 19:04:51 +03:00
#ifdef USE_DOUBLE
#define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3)	\
	CSUM_BIGCHUNK1(src, offset, sum, _t0, _t1, _t2, _t3)
#else
#define CSUM_BIGCHUNK(src, offset, sum, _t0, _t1, _t2, _t3)	\
	CSUM_BIGCHUNK1(src, offset, sum, _t0, _t1, _t2, _t3);	\
	CSUM_BIGCHUNK1(src, offset + 0x10, sum, _t0, _t1, _t2, _t3)
#endif
2006-12-03 18:42:59 +03:00
/*
 * a0: source address
 * a1: length of the area to checksum
 * a2: partial checksum
 */
#define src a0
#define sum v0
	.text
	.set	noreorder
	.align	5
LEAF(csum_partial)
	move	sum, zero
2006-12-07 19:04:31 +03:00
	move	t7, zero
2006-12-03 18:42:59 +03:00
	sltiu	t8, a1, 0x8
2008-01-29 13:14:59 +03:00
	bnez	t8, .Lsmall_csumcpy		/* < 8 bytes to copy */
2006-12-07 19:04:31 +03:00
	 move	t2, a1
2006-12-03 18:42:59 +03:00
2006-12-07 19:04:45 +03:00
	andi	t7, src, 0x1			/* odd buffer? */
2006-12-03 18:42:59 +03:00
2008-01-29 13:14:59 +03:00
.Lhword_align:
	beqz	t7, .Lword_align
2006-12-03 18:42:59 +03:00
	 andi	t8, src, 0x2
2006-12-07 19:04:31 +03:00
	lbu	t0, (src)
2006-12-03 18:42:59 +03:00
	LONG_SUBU	a1, a1, 0x1
#ifdef __MIPSEL__
2006-12-07 19:04:31 +03:00
	sll	t0, t0, 8
2006-12-03 18:42:59 +03:00
#endif
2006-12-07 19:04:31 +03:00
	ADDC(sum, t0)
2006-12-03 18:42:59 +03:00
	PTR_ADDU	src, src, 0x1
	andi	t8, src, 0x2
2008-01-29 13:14:59 +03:00
.Lword_align:
	beqz	t8, .Ldword_align
2006-12-03 18:42:59 +03:00
	 sltiu	t8, a1, 56
2006-12-07 19:04:31 +03:00
	lhu	t0, (src)
2006-12-03 18:42:59 +03:00
	LONG_SUBU	a1, a1, 0x2
2006-12-07 19:04:31 +03:00
	ADDC(sum, t0)
2006-12-03 18:42:59 +03:00
	sltiu	t8, a1, 56
	PTR_ADDU	src, src, 0x2
2008-01-29 13:14:59 +03:00
.Ldword_align:
	bnez	t8, .Ldo_end_words
2006-12-03 18:42:59 +03:00
	 move	t8, a1
	andi	t8, src, 0x4
2008-01-29 13:14:59 +03:00
	beqz	t8, .Lqword_align
2006-12-03 18:42:59 +03:00
	 andi	t8, src, 0x8
2008-09-20 19:20:04 +04:00
	LOAD32	t0, 0x00(src)
2006-12-03 18:42:59 +03:00
	LONG_SUBU	a1, a1, 0x4
2006-12-07 19:04:31 +03:00
	ADDC(sum, t0)
2006-12-03 18:42:59 +03:00
	PTR_ADDU	src, src, 0x4
	andi	t8, src, 0x8
2008-01-29 13:14:59 +03:00
.Lqword_align:
	beqz	t8, .Loword_align
2006-12-03 18:42:59 +03:00
	 andi	t8, src, 0x10
2006-12-07 19:04:51 +03:00
#ifdef USE_DOUBLE
	ld	t0, 0x00(src)
	LONG_SUBU	a1, a1, 0x8
	ADDC(sum, t0)
#else
2006-12-07 19:04:31 +03:00
	lw	t0, 0x00(src)
	lw	t1, 0x04(src)
2006-12-03 18:42:59 +03:00
	LONG_SUBU	a1, a1, 0x8
2006-12-07 19:04:31 +03:00
	ADDC(sum, t0)
	ADDC(sum, t1)
2006-12-07 19:04:51 +03:00
#endif
2006-12-03 18:42:59 +03:00
	PTR_ADDU	src, src, 0x8
	andi	t8, src, 0x10
2008-01-29 13:14:59 +03:00
.Loword_align:
	beqz	t8, .Lbegin_movement
2006-12-03 18:42:59 +03:00
	 LONG_SRL	t8, a1, 0x7
2006-12-07 19:04:51 +03:00
#ifdef USE_DOUBLE
	ld	t0, 0x00(src)
	ld	t1, 0x08(src)
2006-12-07 19:04:31 +03:00
	ADDC(sum, t0)
	ADDC(sum, t1)
2006-12-07 19:04:51 +03:00
#else
	CSUM_BIGCHUNK1(src, 0x00, sum, t0, t1, t3, t4)
#endif
2006-12-03 18:42:59 +03:00
	LONG_SUBU	a1, a1, 0x10
	PTR_ADDU	src, src, 0x10
	LONG_SRL	t8, a1, 0x7
2008-01-29 13:14:59 +03:00
.Lbegin_movement:
2006-12-03 18:42:59 +03:00
	beqz	t8, 1f
2006-12-07 19:04:31 +03:00
	 andi	t2, a1, 0x40
2006-12-03 18:42:59 +03:00
2008-01-29 13:14:59 +03:00
.Lmove_128bytes:
2006-12-07 19:04:31 +03:00
	CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
	CSUM_BIGCHUNK(src, 0x20, sum, t0, t1, t3, t4)
	CSUM_BIGCHUNK(src, 0x40, sum, t0, t1, t3, t4)
	CSUM_BIGCHUNK(src, 0x60, sum, t0, t1, t3, t4)
2006-12-03 18:42:59 +03:00
	LONG_SUBU	t8, t8, 0x01
2007-10-23 15:43:25 +04:00
	.set	reorder				/* DADDI_WAR */
	PTR_ADDU	src, src, 0x80
2008-01-29 13:14:59 +03:00
	bnez	t8, .Lmove_128bytes
2007-10-23 15:43:25 +04:00
	.set	noreorder
2006-12-03 18:42:59 +03:00
1:
2006-12-07 19:04:31 +03:00
	beqz	t2, 1f
	 andi	t2, a1, 0x20
2006-12-03 18:42:59 +03:00
2008-01-29 13:14:59 +03:00
.Lmove_64bytes:
2006-12-07 19:04:31 +03:00
	CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
	CSUM_BIGCHUNK(src, 0x20, sum, t0, t1, t3, t4)
2006-12-03 18:42:59 +03:00
	PTR_ADDU	src, src, 0x40
1:
2008-01-29 13:14:59 +03:00
	beqz	t2, .Ldo_end_words
2006-12-03 18:42:59 +03:00
	 andi	t8, a1, 0x1c
2008-01-29 13:14:59 +03:00
.Lmove_32bytes:
2006-12-07 19:04:31 +03:00
	CSUM_BIGCHUNK(src, 0x00, sum, t0, t1, t3, t4)
2006-12-03 18:42:59 +03:00
	andi	t8, a1, 0x1c
	PTR_ADDU	src, src, 0x20
2008-01-29 13:14:59 +03:00
.Ldo_end_words:
	beqz	t8, .Lsmall_csumcpy
2006-12-07 19:04:45 +03:00
	 andi	t2, a1, 0x3
	LONG_SRL	t8, t8, 0x2
2006-12-03 18:42:59 +03:00
2008-01-29 13:14:59 +03:00
.Lend_words:
2008-09-20 19:20:04 +04:00
	LOAD32	t0, (src)
2006-12-03 18:42:59 +03:00
	LONG_SUBU	t8, t8, 0x1
2006-12-07 19:04:31 +03:00
	ADDC(sum, t0)
2007-10-23 15:43:25 +04:00
	.set	reorder				/* DADDI_WAR */
	PTR_ADDU	src, src, 0x4
2008-01-29 13:14:59 +03:00
	bnez	t8, .Lend_words
2007-10-23 15:43:25 +04:00
	.set	noreorder
2006-12-03 18:42:59 +03:00
2006-12-07 19:04:45 +03:00
/* unknown src alignment and < 8 bytes to go */
2008-01-29 13:14:59 +03:00
.Lsmall_csumcpy:
2006-12-07 19:04:45 +03:00
	move	a1, t2
2006-12-03 18:42:59 +03:00
2006-12-07 19:04:45 +03:00
	andi	t0, a1, 4
	beqz	t0, 1f
	 andi	t0, a1, 2
2006-12-03 18:42:59 +03:00
2006-12-07 19:04:45 +03:00
	/* Still a full word to go */
	ulw	t1, (src)
	PTR_ADDIU	src, 4
2008-09-20 19:20:04 +04:00
#ifdef USE_DOUBLE
	dsll	t1, t1, 32			/* clear lower 32bit */
#endif
2006-12-07 19:04:45 +03:00
	ADDC(sum, t1)
1:	move	t1, zero
	beqz	t0, 1f
	 andi	t0, a1, 1
	/* Still a halfword to go */
	ulhu	t1, (src)
	PTR_ADDIU	src, 2
1:	beqz	t0, 1f
	 sll	t1, t1, 16
	lbu	t2, (src)
	 nop
#ifdef __MIPSEB__
	sll	t2, t2, 8
#endif
	or	t1, t2
1:	ADDC(sum, t1)
2006-12-03 18:42:59 +03:00
2006-12-07 19:04:45 +03:00
/* fold checksum */
2006-12-07 19:04:51 +03:00
#ifdef USE_DOUBLE
	dsll32	v1, sum, 0
	daddu	sum, v1
	sltu	v1, sum, v1
	dsra32	sum, sum, 0
	addu	sum, v1
#endif
2006-12-07 19:04:45 +03:00
	/* odd buffer alignment? */
2014-08-15 12:56:58 +04:00
#if defined(CONFIG_CPU_MIPSR2) || defined(CONFIG_CPU_LOONGSON3)
	.set	push
	.set	arch=mips32r2
2008-10-11 19:18:53 +04:00
	wsbh	v1, sum
	movn	sum, v1, t7
2014-08-15 12:56:58 +04:00
	.set	pop
2008-10-11 19:18:53 +04:00
#else
	beqz	t7, 1f			/* odd buffer alignment? */
	 lui	v1, 0x00ff
	addu	v1, 0x00ff
	and	t0, sum, v1
	sll	t0, t0, 8
2006-12-07 19:04:45 +03:00
	srl	sum, sum, 8
2008-10-11 19:18:53 +04:00
	and	sum, sum, v1
	or	sum, sum, t0
2006-12-07 19:04:45 +03:00
1:
2008-10-11 19:18:53 +04:00
#endif
2006-12-07 19:04:45 +03:00
	.set	reorder
2013-01-22 15:59:30 +04:00
	/* Add the passed partial csum. */
2008-09-20 19:20:04 +04:00
	ADDC32(sum, a2)
2006-12-03 18:42:59 +03:00
	jr	ra
2006-12-07 19:04:45 +03:00
	.set	noreorder
2006-12-03 18:42:59 +03:00
	END(csum_partial)
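
The last two steps of csum_partial before the passed-in partial checksum is added, the "fold checksum" step and the odd-alignment fix-up, can be modelled in C. This is a rough sketch rather than the kernel's own code: `fold64` and `swap_bytes_in_halfwords` are hypothetical names, and the first function only applies to the USE_DOUBLE build, where the accumulator is 64 bits wide.

```c
#include <stdint.h>

/* Model of the USE_DOUBLE "fold checksum" sequence: add the two 32-bit
 * halves of the 64-bit accumulator and wrap the carry around. */
uint32_t fold64(uint64_t sum)
{
	uint32_t lo = (uint32_t)sum;
	uint32_t hi = (uint32_t)(sum >> 32);
	uint32_t r = hi + lo;

	return r + (r < lo);	/* r < lo exactly when hi + lo overflowed */
}

/* Model of wsbh and of the mask-and-shift fallback path: swap the two
 * bytes inside each 16-bit half of the word. */
uint32_t swap_bytes_in_halfwords(uint32_t x)
{
	return ((x & 0x00ff00ffu) << 8) | ((x >> 8) & 0x00ff00ffu);
}
```

In the assembly the swap is applied only when t7 is non-zero, that is, when the buffer started at an odd address and every byte was therefore accumulated in the opposite lane of its halfword.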
2006-12-12 19:22:06 +03:00
/*
 * checksum and copy routines based on memcpy.S
 *
 *	csum_partial_copy_nocheck(src, dst, len, sum)
2013-12-12 20:21:00 +04:00
 *	__csum_partial_copy_kernel(src, dst, len, sum, errp)
2006-12-12 19:22:06 +03:00
 *
2013-01-22 15:59:30 +04:00
 * See "Spec" in memcpy.S for details.	Unlike __copy_user, all
2006-12-12 19:22:06 +03:00
 * functions in this file use the standard calling convention.
 */
#define src a0
#define dst a1
#define len a2
#define psum a3
#define sum v0
#define odd t8
#define errptr t9
/*
 * The exception handler for loads requires that:
 *  1- AT contain the address of the byte just past the end of the source
 *     of the copy,
 *  2- src_entry <= src < AT, and
 *  3- (dst - src) == (dst_entry - src_entry),
 * The _entry suffix denotes values when __copy_user was called.
 *
 * (1) is set up by __csum_partial_copy_from_user and maintained by
 *	not writing AT in __csum_partial_copy
 * (2) is met by incrementing src by the number of bytes copied
 * (3) is met by not doing loads between a pair of increments of dst and src
 *
 * The exception handlers for stores store -EFAULT to errptr and return.
 * These handlers do not need to overwrite any data.
 */
2014-01-16 21:02:13 +04:00
/* Instruction type */
#define LD_INSN 1
#define ST_INSN 2
2014-01-17 14:48:46 +04:00
#define LEGACY_MODE 1
#define EVA_MODE    2
#define USEROP   1
#define KERNELOP 2
2014-01-16 21:02:13 +04:00
/*
 * Wrapper to add an entry in the exception table
 * in case the insn causes a memory exception.
 * Arguments:
 * insn    : Load/store instruction
 * type    : Instruction type
 * reg     : Register
 * addr    : Address
 * handler : Exception handler
 */
#define EXC(insn, type, reg, addr, handler)	\
2014-01-17 14:48:46 +04:00
	.if \mode == LEGACY_MODE;		\
9:		insn reg, addr;			\
		.section __ex_table,"a";	\
		PTR	9b, handler;		\
		.previous;			\
2014-01-17 15:36:16 +04:00
	/* This is enabled in EVA mode */	\
	.else;					\
		/* If loading from user or storing to user */	\
		.if ((\from == USEROP) && (type == LD_INSN)) || \
		    ((\to == USEROP) && (type == ST_INSN));	\
9:			__BUILD_EVA_INSN(insn##e, reg, addr);	\
			.section __ex_table,"a";		\
			PTR	9b, handler;			\
			.previous;				\
		.else;						\
			/* EVA without exception */		\
			insn reg, addr;				\
		.endif;						\
2014-01-17 14:48:46 +04:00
	.endif
2006-12-12 19:22:06 +03:00
2014-01-16 21:02:13 +04:00
#undef LOAD
2006-12-12 19:22:06 +03:00
#ifdef USE_DOUBLE
2014-01-16 21:02:13 +04:00
#define LOADK	ld /* No exception */
#define LOAD(reg, addr, handler)	EXC(ld, LD_INSN, reg, addr, handler)
#define LOADBU(reg, addr, handler)	EXC(lbu, LD_INSN, reg, addr, handler)
#define LOADL(reg, addr, handler)	EXC(ldl, LD_INSN, reg, addr, handler)
#define LOADR(reg, addr, handler)	EXC(ldr, LD_INSN, reg, addr, handler)
#define STOREB(reg, addr, handler)	EXC(sb, ST_INSN, reg, addr, handler)
#define STOREL(reg, addr, handler)	EXC(sdl, ST_INSN, reg, addr, handler)
#define STORER(reg, addr, handler)	EXC(sdr, ST_INSN, reg, addr, handler)
#define STORE(reg, addr, handler)	EXC(sd, ST_INSN, reg, addr, handler)
2006-12-12 19:22:06 +03:00
#define ADD    daddu
#define SUB    dsubu
#define SRL    dsrl
#define SLL    dsll
#define SLLV   dsllv
#define SRLV   dsrlv
#define NBYTES 8
#define LOG_NBYTES 3
#else
2014-01-16 21:02:13 +04:00
#define LOADK	lw /* No exception */
#define LOAD(reg, addr, handler)	EXC(lw, LD_INSN, reg, addr, handler)
#define LOADBU(reg, addr, handler)	EXC(lbu, LD_INSN, reg, addr, handler)
#define LOADL(reg, addr, handler)	EXC(lwl, LD_INSN, reg, addr, handler)
#define LOADR(reg, addr, handler)	EXC(lwr, LD_INSN, reg, addr, handler)
#define STOREB(reg, addr, handler)	EXC(sb, ST_INSN, reg, addr, handler)
#define STOREL(reg, addr, handler)	EXC(swl, ST_INSN, reg, addr, handler)
#define STORER(reg, addr, handler)	EXC(swr, ST_INSN, reg, addr, handler)
#define STORE(reg, addr, handler)	EXC(sw, ST_INSN, reg, addr, handler)
2006-12-12 19:22:06 +03:00
#define ADD    addu
#define SUB    subu
#define SRL    srl
#define SLL    sll
#define SLLV   sllv
#define SRLV   srlv
#define NBYTES 4
#define LOG_NBYTES 2
#endif /* USE_DOUBLE */
#ifdef CONFIG_CPU_LITTLE_ENDIAN
#define LDFIRST LOADR
2013-01-22 15:59:30 +04:00
#define LDREST	LOADL
2006-12-12 19:22:06 +03:00
#define STFIRST STORER
2013-01-22 15:59:30 +04:00
#define STREST	STOREL
2006-12-12 19:22:06 +03:00
#define SHIFT_DISCARD SLLV
#define SHIFT_DISCARD_REVERT SRLV
#else
#define LDFIRST LOADL
2013-01-22 15:59:30 +04:00
#define LDREST	LOADR
2006-12-12 19:22:06 +03:00
#define STFIRST STOREL
2013-01-22 15:59:30 +04:00
#define STREST	STORER
2006-12-12 19:22:06 +03:00
#define SHIFT_DISCARD SRLV
#define SHIFT_DISCARD_REVERT SLLV
#endif
#define FIRST(unit) ((unit)*NBYTES)
#define REST(unit)  (FIRST(unit)+NBYTES-1)
#define ADDRMASK (NBYTES-1)
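
As a rough illustration of how these three macros are used, the C sketch below mirrors the alignment test that the copy macro performs with `and t1, dst, ADDRMASK` and shows the byte offsets that FIRST and REST produce. It assumes NBYTES is 4 (the non-USE_DOUBLE build), and `is_unit_aligned` is a hypothetical helper name that does not exist in the kernel source.

```c
#include <stdint.h>

#define NBYTES      4			/* assumption: 8 under USE_DOUBLE */
#define FIRST(unit) ((unit)*NBYTES)	/* offset of the unit's first byte */
#define REST(unit)  (FIRST(unit)+NBYTES-1) /* offset of its last byte */
#define ADDRMASK    (NBYTES-1)

/* Non-zero when addr is a multiple of NBYTES, i.e. when the low address
 * bits selected by ADDRMASK are all clear. */
int is_unit_aligned(uintptr_t addr)
{
	return (addr & ADDRMASK) == 0;
}
```

With NBYTES equal to 4, FIRST(2) is 8 and REST(2) is 11, the first and last byte offsets of the third unit.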
2007-10-23 15:43:25 +04:00
#ifndef CONFIG_CPU_DADDI_WORKAROUNDS
2006-12-12 19:22:06 +03:00
	.set	noat
2007-10-23 15:43:25 +04:00
#else
	.set	at=v1
#endif
2006-12-12 19:22:06 +03:00
2014-01-17 14:48:46 +04:00
	.macro __BUILD_CSUM_PARTIAL_COPY_USER mode, from, to, __nocheck
2006-12-12 19:22:06 +03:00
	PTR_ADDU	AT, src, len	/* See (1) above. */
2014-01-17 14:48:46 +04:00
	/* initialize __nocheck if this is the first time we execute this
	 * macro
	 */
2006-12-12 19:22:06 +03:00
#ifdef CONFIG_64BIT
	move	errptr, a4
#else
	lw	errptr, 16(sp)
#endif
2014-01-17 14:48:46 +04:00
	.if \__nocheck == 1
	FEXPORT(csum_partial_copy_nocheck)
	.endif
2006-12-12 19:22:06 +03:00
	move	sum, zero
	move	odd, zero
	/*
	 * Note: dst & src may be unaligned, len may be 0
	 * Temps
	 */
	/*
	 * The "issue break"s below are very approximate.
	 * Issue delays for dcache fills will perturb the schedule, as will
	 * load queue full replay traps, etc.
	 *
	 * If len < NBYTES use byte operations.
	 */
	sltu	t2, len, NBYTES
	and	t1, dst, ADDRMASK
2014-01-17 14:48:46 +04:00
	bnez	t2, .Lcopy_bytes_checklen\@
2006-12-12 19:22:06 +03:00
	 and	t0, src, ADDRMASK
	andi	odd, dst, 0x1			/* odd buffer? */
2014-01-17 14:48:46 +04:00
	bnez	t1, .Ldst_unaligned\@
2006-12-12 19:22:06 +03:00
	 nop
2014-01-17 14:48:46 +04:00
	bnez	t0, .Lsrc_unaligned_dst_aligned\@
2006-12-12 19:22:06 +03:00
	/*
	 * use delay slot for fall-through
	 * src and dst are aligned; need to compute rem
	 */
2014-01-17 14:48:46 +04:00
.Lboth_aligned\@:
2013-01-22 15:59:30 +04:00
	 SRL	t0, len, LOG_NBYTES+3	 # +3 for 8 units/iter
2014-01-17 14:48:46 +04:00
	beqz	t0, .Lcleanup_both_aligned\@ # len < 8*NBYTES
2006-12-12 19:22:06 +03:00
	 nop
	SUB	len, 8*NBYTES		# subtract here for bgez loop
	.align	4
1:
2014-01-17 14:48:46 +04:00
	LOAD(t0, UNIT(0)(src), .Ll_exc\@)
	LOAD(t1, UNIT(1)(src), .Ll_exc_copy\@)
	LOAD(t2, UNIT(2)(src), .Ll_exc_copy\@)
	LOAD(t3, UNIT(3)(src), .Ll_exc_copy\@)
	LOAD(t4, UNIT(4)(src), .Ll_exc_copy\@)
	LOAD(t5, UNIT(5)(src), .Ll_exc_copy\@)
	LOAD(t6, UNIT(6)(src), .Ll_exc_copy\@)
	LOAD(t7, UNIT(7)(src), .Ll_exc_copy\@)
2006-12-12 19:22:06 +03:00
	SUB	len, len, 8*NBYTES
	ADD	src, src, 8*NBYTES
2014-01-17 14:48:46 +04:00
	STORE(t0, UNIT(0)(dst),	.Ls_exc\@)
2015-03-26 20:07:24 +03:00
	ADDC(t0, t1)
2014-01-17 14:48:46 +04:00
	STORE(t1, UNIT(1)(dst),	.Ls_exc\@)
2015-03-26 20:07:24 +03:00
	ADDC(sum, t0)
2014-01-17 14:48:46 +04:00
	STORE(t2, UNIT(2)(dst),	.Ls_exc\@)
2015-03-26 20:07:24 +03:00
	ADDC(t2, t3)
2014-01-17 14:48:46 +04:00
	STORE(t3, UNIT(3)(dst),	.Ls_exc\@)
2015-03-26 20:07:24 +03:00
	ADDC(sum, t2)
2014-01-17 14:48:46 +04:00
	STORE(t4, UNIT(4)(dst),	.Ls_exc\@)
2015-03-26 20:07:24 +03:00
	ADDC(t4, t5)
2014-01-17 14:48:46 +04:00
	STORE(t5, UNIT(5)(dst),	.Ls_exc\@)
MIPS: csum_partial: Improve instruction parallelism.
Computing sum introduces true data dependency. This patch removes some
true data depdendencies, hence increases instruction level parallelism.
This patch brings up to 50% csum performance gain on Loongson 3a.
One example about how this patch works is in CSUM_BIGCHUNK1:
// ** original ** vs ** patch applied **
ADDC(sum, t0) ADDC(t0, t1)
ADDC(sum, t1) ADDC(t2, t3)
ADDC(sum, t2) ADDC(sum, t0)
ADDC(sum, t3) ADDC(sum, t2)
In the original implementation, each ADDC(sum, ...) depends on the sum
value updated by previous ADDC(as source operand).
With this patch applied, the first two ADDC operations are independent,
hence can be executed simultaneously if possible.
Another example is in the "copy and sum calculating chunk":
// ** original ** vs ** patch applied **
STORE(t0, UNIT(0) ... STORE(t0, UNIT(0) ...
ADDC(sum, t0) ADDC(t0, t1)
STORE(t1, UNIT(1) ... STORE(t1, UNIT(1) ...
ADDC(sum, t1) ADDC(sum, t0)
STORE(t2, UNIT(2) ... STORE(t2, UNIT(2) ...
ADDC(sum, t2) ADDC(t2, t3)
STORE(t3, UNIT(3) ... STORE(t3, UNIT(3) ...
ADDC(sum, t3) ADDC(sum, t2)
With this patch applied, ADDC and the **next next** ADDC are independent.
Signed-off-by: chenj <chenj@lemote.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/9608/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
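The reordering described in the commit message above can be sketched in C. This is only an illustration, not the kernel code: it assumes ADDC(sum, reg) behaves as a 32-bit add followed by adding the carry-out back in (the macro is defined outside this section), and the names `addc`, `csum_serial`, `csum_paired` and `addc_example` are hypothetical. Because end-around-carry addition is associative and commutative, both orderings should produce the same sum, while the paired form lets the first two adds proceed independently.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of ADDC(sum, reg): add, then fold the carry-out back in. */
uint32_t addc(uint32_t sum, uint32_t reg)
{
    sum += reg;
    return sum + (sum < reg);   /* (sum < reg) is 1 exactly when the add wrapped */
}

/* "original" ordering: each add depends on the sum from the previous add. */
uint32_t csum_serial(uint32_t sum, const uint32_t w[4])
{
    sum = addc(sum, w[0]);
    sum = addc(sum, w[1]);
    sum = addc(sum, w[2]);
    return addc(sum, w[3]);
}

/* "patch applied" ordering: the first two adds do not touch sum,
 * so they have no dependency on each other or on the running sum. */
uint32_t csum_paired(uint32_t sum, const uint32_t w[4])
{
    uint32_t t0 = addc(w[0], w[1]);
    uint32_t t2 = addc(w[2], w[3]);
    sum = addc(sum, t0);
    return addc(sum, t2);
}

/* Small usage example: both orderings agree on the same input. */
void addc_example(void)
{
    const uint32_t w[4] = { 0xFFFFFFFFu, 0x00000001u, 0x80000000u, 0x80000001u };
    assert(csum_serial(0x12345678u, w) == csum_paired(0x12345678u, w));
}
```

Whether the paired form is actually faster depends on the core; the commit message reports the gain only for Loongson 3a.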
2015-03-26 20:07:24 +03:00
ADDC(sum, t4)
2014-01-17 14:48:46 +04:00
STORE(t6, UNIT(6)(dst), .Ls_exc\@)
2015-03-26 20:07:24 +03:00
ADDC(t6, t7)
2014-01-17 14:48:46 +04:00
STORE(t7, UNIT(7)(dst), .Ls_exc\@)
2015-03-26 20:07:24 +03:00
ADDC(sum, t6)
2007-10-23 15:43:25 +04:00
.set reorder /* DADDI_WAR */
ADD dst, dst, 8*NBYTES
2006-12-12 19:22:06 +03:00
bgez len, 1b
2007-10-23 15:43:25 +04:00
.set noreorder
2006-12-12 19:22:06 +03:00
ADD len, 8*NBYTES # revert len (see above)
/*
 * len == the number of bytes left to copy < 8*NBYTES
 */
2014-01-17 14:48:46 +04:00
.Lcleanup_both_aligned\@:
2006-12-12 19:22:06 +03:00
#define rem t7
2014-01-17 14:48:46 +04:00
beqz len, .Ldone\@
2006-12-12 19:22:06 +03:00
sltu t0, len, 4*NBYTES
2014-01-17 14:48:46 +04:00
bnez t0, .Lless_than_4units\@
2006-12-12 19:22:06 +03:00
and rem, len, (NBYTES-1) # rem = len % NBYTES
/*
 * len >= 4*NBYTES
 */
2014-01-17 14:48:46 +04:00
LOAD(t0, UNIT(0)(src), .Ll_exc\@)
LOAD(t1, UNIT(1)(src), .Ll_exc_copy\@)
LOAD(t2, UNIT(2)(src), .Ll_exc_copy\@)
LOAD(t3, UNIT(3)(src), .Ll_exc_copy\@)
2006-12-12 19:22:06 +03:00
SUB len, len, 4*NBYTES
ADD src, src, 4*NBYTES
2014-01-17 14:48:46 +04:00
STORE(t0, UNIT(0)(dst), .Ls_exc\@)
2015-03-26 20:07:24 +03:00
ADDC(t0, t1)
2014-01-17 14:48:46 +04:00
STORE(t1, UNIT(1)(dst), .Ls_exc\@)
2015-03-26 20:07:24 +03:00
ADDC(sum, t0)
2014-01-17 14:48:46 +04:00
STORE(t2, UNIT(2)(dst), .Ls_exc\@)
2015-03-26 20:07:24 +03:00
ADDC(t2, t3)
2014-01-17 14:48:46 +04:00
STORE(t3, UNIT(3)(dst), .Ls_exc\@)
2015-03-26 20:07:24 +03:00
ADDC(sum, t2)
2007-10-23 15:43:25 +04:00
.set reorder /* DADDI_WAR */
ADD dst, dst, 4*NBYTES
2014-01-17 14:48:46 +04:00
beqz len, .Ldone\@
2007-10-23 15:43:25 +04:00
.set noreorder
2014-01-17 14:48:46 +04:00
.Lless_than_4units\@:
2006-12-12 19:22:06 +03:00
/*
 * rem = len % NBYTES
 */
2014-01-17 14:48:46 +04:00
beq rem, len, .Lcopy_bytes\@
2006-12-12 19:22:06 +03:00
nop
1:
2014-01-17 14:48:46 +04:00
LOAD(t0, 0(src), .Ll_exc\@)
2006-12-12 19:22:06 +03:00
ADD src, src, NBYTES
SUB len, len, NBYTES
2014-01-17 14:48:46 +04:00
STORE(t0, 0(dst), .Ls_exc\@)
2006-12-12 19:22:06 +03:00
ADDC(sum, t0)
2007-10-23 15:43:25 +04:00
.set reorder /* DADDI_WAR */
ADD dst, dst, NBYTES
2006-12-12 19:22:06 +03:00
bne rem, len, 1b
2007-10-23 15:43:25 +04:00
.set noreorder
2006-12-12 19:22:06 +03:00
/*
 * src and dst are aligned, need to copy rem bytes (rem < NBYTES)
 * A loop would do only a byte at a time with possible branch
2013-01-22 15:59:30 +04:00
 * mispredicts. Can't do an explicit LOAD dst,mask,or,STORE
2006-12-12 19:22:06 +03:00
 * because can't assume read-access to dst. Instead, use
 * STREST dst, which doesn't require read access to dst.
 *
 * This code should perform better than a simple loop on modern,
 * wide-issue mips processors because the code has fewer branches and
 * more instruction-level parallelism.
 */
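The branch-free masking described in the comment above can be sketched in C. This is a hedged illustration only: it assumes a little-endian layout with NBYTES == 4, where the bytes to keep are the low-order ones and the discard step is a left shift followed by a right shift to revert it. The direction of SHIFT_DISCARD is chosen by macros defined outside this section, and the names `keep_low_bytes` and `keep_low_bytes_example` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low `len` bytes of a 32-bit word and zero the rest,
 * without a per-byte loop. Requires 0 < len < 4 so that the shift
 * count stays below the word width (the assembly guards len == 0
 * with a branch before reaching this point). */
uint32_t keep_low_bytes(uint32_t word, unsigned len)
{
    assert(len > 0 && len < 4);
    unsigned bits = 32 - 8 * len;             /* number of bits to discard */
    return (uint32_t)(word << bits) >> bits;  /* discard, then revert the shift */
}

/* Small usage example: the top byte of the word is dropped. */
void keep_low_bytes_example(void)
{
    assert(keep_low_bytes(0x11223344u, 3) == 0x00223344u);
}
```

The masked value is what would be added to the checksum; the store of the partial word itself is a separate step in the assembly and is not modeled here.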
#define bits t2
2014-01-17 14:48:46 +04:00
beqz len, .Ldone\@
2006-12-12 19:22:06 +03:00
ADD t1, dst, len # t1 is just past last byte of dst
li bits, 8*NBYTES
SLL rem, len, 3 # rem = number of bits to keep
2014-01-17 14:48:46 +04:00
LOAD(t0, 0(src), .Ll_exc\@)
2013-01-22 15:59:30 +04:00
SUB bits, bits, rem # bits = number of bits to discard
2006-12-12 19:22:06 +03:00
SHIFT_DISCARD t0, t0, bits
2014-01-17 14:48:46 +04:00
STREST(t0, -1(t1), .Ls_exc\@)
2006-12-12 19:22:06 +03:00
SHIFT_DISCARD_REVERT t0, t0, bits
.set reorder
ADDC(sum, t0)
2014-01-17 14:48:46 +04:00
b .Ldone\@
2006-12-12 19:22:06 +03:00
.set noreorder
2014-01-17 14:48:46 +04:00
.Ldst_unaligned\@:
2006-12-12 19:22:06 +03:00
/*
 * dst is unaligned
 * t0 = src & ADDRMASK
 * t1 = dst & ADDRMASK; T1 > 0
 * len >= NBYTES
 *
 * Copy enough bytes to align dst
 * Set match = (src and dst have same alignment)
 */
#define match rem
2014-01-17 14:48:46 +04:00
LDFIRST(t3, FIRST(0)(src), .Ll_exc\@)
2006-12-12 19:22:06 +03:00
ADD t2, zero, NBYTES
2014-01-17 14:48:46 +04:00
LDREST(t3, REST(0)(src), .Ll_exc_copy\@)
2006-12-12 19:22:06 +03:00
SUB t2, t2, t1 # t2 = number of bytes copied
xor match, t0, t1
2014-01-17 14:48:46 +04:00
STFIRST(t3, FIRST(0)(dst), .Ls_exc\@)
2006-12-12 19:22:06 +03:00
SLL t4, t1, 3 # t4 = number of bits to discard
SHIFT_DISCARD t3, t3, t4
/* no SHIFT_DISCARD_REVERT to handle odd buffer properly */
ADDC(sum, t3)
2014-01-17 14:48:46 +04:00
beq len, t2, .Ldone\@
2006-12-12 19:22:06 +03:00
SUB len, len, t2
ADD dst, dst, t2
2014-01-17 14:48:46 +04:00
beqz match, .Lboth_aligned\@
2006-12-12 19:22:06 +03:00
ADD src, src, t2
2014-01-17 14:48:46 +04:00
.Lsrc_unaligned_dst_aligned\@:
2013-01-22 15:59:30 +04:00
SRL t0, len, LOG_NBYTES+2 # +2 for 4 units/iter
2014-01-17 14:48:46 +04:00
beqz t0, .Lcleanup_src_unaligned\@
2013-01-22 15:59:30 +04:00
and rem, len, (4*NBYTES-1) # rem = len % 4*NBYTES
2006-12-12 19:22:06 +03:00
1:
/*
 * Avoid consecutive LD*'s to the same register since some mips
 * implementations can't issue them in the same cycle.
 * It's OK to load FIRST(N+1) before REST(N) because the two addresses
 * are to the same unit (unless src is aligned, but it's not).
 */
2014-01-17 14:48:46 +04:00
LDFIRST(t0, FIRST(0)(src), .Ll_exc\@)
LDFIRST(t1, FIRST(1)(src), .Ll_exc_copy\@)
2013-01-22 15:59:30 +04:00
SUB len, len, 4*NBYTES
2014-01-17 14:48:46 +04:00
LDREST(t0, REST(0)(src), .Ll_exc_copy\@)
LDREST(t1, REST(1)(src), .Ll_exc_copy\@)
LDFIRST(t2, FIRST(2)(src), .Ll_exc_copy\@)
LDFIRST(t3, FIRST(3)(src), .Ll_exc_copy\@)
LDREST(t2, REST(2)(src), .Ll_exc_copy\@)
LDREST(t3, REST(3)(src), .Ll_exc_copy\@)
2006-12-12 19:22:06 +03:00
ADD src, src, 4*NBYTES
#ifdef CONFIG_CPU_SB1
nop # improves slotting
#endif
2014-01-17 14:48:46 +04:00
STORE(t0, UNIT(0)(dst), .Ls_exc\@)
2015-03-26 20:07:24 +03:00
ADDC(t0, t1)
2014-01-17 14:48:46 +04:00
STORE(t1, UNIT(1)(dst), .Ls_exc\@)
2015-03-26 20:07:24 +03:00
ADDC(sum, t0)
2014-01-17 14:48:46 +04:00
STORE(t2, UNIT(2)(dst), .Ls_exc\@)
2015-03-26 20:07:24 +03:00
ADDC(t2, t3)
2014-01-17 14:48:46 +04:00
STORE(t3, UNIT(3)(dst), .Ls_exc\@)
2015-03-26 20:07:24 +03:00
ADDC(sum, t2)
2007-10-23 15:43:25 +04:00
.set reorder /* DADDI_WAR */
ADD dst, dst, 4*NBYTES
2006-12-12 19:22:06 +03:00
bne len, rem, 1b
2007-10-23 15:43:25 +04:00
.set noreorder
2006-12-12 19:22:06 +03:00
2014-01-17 14:48:46 +04:00
.Lcleanup_src_unaligned\@:
beqz len, .Ldone\@
2006-12-12 19:22:06 +03:00
and rem, len, NBYTES-1 # rem = len % NBYTES
2014-01-17 14:48:46 +04:00
beq rem, len, .Lcopy_bytes\@
2006-12-12 19:22:06 +03:00
nop
1:
2014-01-17 14:48:46 +04:00
LDFIRST(t0, FIRST(0)(src), .Ll_exc\@)
LDREST(t0, REST(0)(src), .Ll_exc_copy\@)
2006-12-12 19:22:06 +03:00
ADD src, src, NBYTES
SUB len, len, NBYTES
2014-01-17 14:48:46 +04:00
STORE(t0, 0(dst), .Ls_exc\@)
2006-12-12 19:22:06 +03:00
ADDC(sum, t0)
2007-10-23 15:43:25 +04:00
.set reorder /* DADDI_WAR */
ADD dst, dst, NBYTES
2006-12-12 19:22:06 +03:00
bne len, rem, 1b
2007-10-23 15:43:25 +04:00
.set noreorder
2006-12-12 19:22:06 +03:00
2014-01-17 14:48:46 +04:00
.Lcopy_bytes_checklen\@:
beqz len, .Ldone\@
2006-12-12 19:22:06 +03:00
nop
2014-01-17 14:48:46 +04:00
.Lcopy_bytes\@:
2006-12-12 19:22:06 +03:00
/* 0 < len < NBYTES */
#ifdef CONFIG_CPU_LITTLE_ENDIAN
#define SHIFT_START 0
#define SHIFT_INC 8
#else
#define SHIFT_START 8*(NBYTES-1)
#define SHIFT_INC -8
#endif
move t2, zero # partial word
2013-01-22 15:59:30 +04:00
li t3, SHIFT_START # shift
2008-01-29 13:14:59 +03:00
/* use .Ll_exc_copy here to return correct sum on fault */
2006-12-12 19:22:06 +03:00
#define COPY_BYTE(N) \
2014-01-17 14:48:46 +04:00
LOADBU(t0, N(src), .Ll_exc_copy\@); \
2006-12-12 19:22:06 +03:00
SUB len, len, 1; \
2014-01-17 14:48:46 +04:00
STOREB(t0, N(dst), .Ls_exc\@); \
2006-12-12 19:22:06 +03:00
SLLV t0, t0, t3; \
addu t3, SHIFT_INC; \
2014-01-17 14:48:46 +04:00
beqz len, .Lcopy_bytes_done\@; \
2006-12-12 19:22:06 +03:00
or t2, t0
COPY_BYTE(0)
COPY_BYTE(1)
#ifdef USE_DOUBLE
COPY_BYTE(2)
COPY_BYTE(3)
COPY_BYTE(4)
COPY_BYTE(5)
#endif
2014-01-17 14:48:46 +04:00
LOADBU(t0, NBYTES-2(src), .Ll_exc_copy\@)
2006-12-12 19:22:06 +03:00
SUB len, len, 1
2014-01-17 14:48:46 +04:00
STOREB(t0, NBYTES-2(dst), .Ls_exc\@)
2006-12-12 19:22:06 +03:00
SLLV t0, t0, t3
or t2, t0
2014-01-17 14:48:46 +04:00
.Lcopy_bytes_done\@:
2006-12-12 19:22:06 +03:00
ADDC(sum, t2)
2014-01-17 14:48:46 +04:00
.Ldone\@:
2006-12-12 19:22:06 +03:00
/* fold checksum */
2014-04-04 06:32:54 +04:00
.set push
.set noat
2006-12-12 19:22:06 +03:00
#ifdef USE_DOUBLE
dsll32 v1, sum, 0
daddu sum, v1
sltu v1, sum, v1
dsra32 sum, sum, 0
addu sum, v1
#endif
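The 64-bit fold above can be modeled in C as follows. This is a sketch under the assumption that the five instructions add the low half of the accumulator into the high half, capture the carry-out, and add it back into the 32-bit result; the names `fold64_to_32` and `fold64_example` are hypothetical and not part of this file.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 64-bit end-around-carry accumulator down to 32 bits. */
uint32_t fold64_to_32(uint64_t sum)
{
    uint64_t v1 = sum << 32;          /* dsll32 v1, sum, 0: low half moved to the top */
    sum += v1;                        /* daddu: top half now holds hi + lo (mod 2^32) */
    uint32_t carry = sum < v1;        /* sltu: 1 if the 64-bit add wrapped */
    uint32_t hi = (uint32_t)(sum >> 32); /* dsra32: take the top half */
    return hi + carry;                /* addu: add the carry back in */
}

/* Small usage example: 1 + 0xFFFFFFFF wraps around to 1. */
void fold64_example(void)
{
    assert(fold64_to_32(0x00000001FFFFFFFFull) == 1u);
}
```

The final add cannot wrap a second time: when the carry is set, the top half is at most 0xFFFFFFFE.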
2014-08-15 12:56:58 +04:00
#if defined(CONFIG_CPU_MIPSR2) || defined(CONFIG_CPU_LOONGSON3)
.set push
.set arch=mips32r2
2008-10-11 19:18:53 +04:00
wsbh v1, sum
movn sum, v1, odd
2014-08-15 12:56:58 +04:00
.set pop
2008-10-11 19:18:53 +04:00
#else
beqz odd, 1f /* odd buffer alignment? */
lui v1, 0x00ff
addu v1, 0x00ff
and t0, sum, v1
sll t0, t0, 8
2006-12-12 19:22:06 +03:00
srl sum, sum, 8
2008-10-11 19:18:53 +04:00
and sum, sum, v1
or sum, sum, t0
2006-12-12 19:22:06 +03:00
1:
2008-10-11 19:18:53 +04:00
#endif
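Both branches of the conditional above swap the two bytes inside each 16-bit half of the sum when the buffer started at an odd address. A minimal C sketch of the mask-and-shift variant, with the hypothetical names `swap_bytes_in_halfwords` and `fixup_odd` (the `odd` flag is set earlier in the file, outside this section):

```c
#include <assert.h>
#include <stdint.h>

/* Swap the bytes within each halfword: 0xAABBCCDD becomes 0xBBAADDCC. */
uint32_t swap_bytes_in_halfwords(uint32_t sum)
{
    const uint32_t mask = 0x00ff00ffu;          /* built by lui/addu in the assembly */
    return ((sum & mask) << 8) | ((sum >> 8) & mask);
}

/* Apply the swap only for an odd-aligned buffer, like the movn/beqz pair. */
uint32_t fixup_odd(uint32_t sum, int odd)
{
    return odd ? swap_bytes_in_halfwords(sum) : sum;
}

/* Small usage example. */
void fixup_odd_example(void)
{
    assert(fixup_odd(0x11223344u, 1) == 0x22114433u);
}
```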
2014-04-04 06:32:54 +04:00
.set pop
2006-12-12 19:22:06 +03:00
.set reorder
2008-09-20 19:20:04 +04:00
ADDC32(sum, psum)
2006-12-12 19:22:06 +03:00
jr ra
.set noreorder
2014-01-17 14:48:46 +04:00
.Ll_exc_copy\@:
2006-12-12 19:22:06 +03:00
/*
 * Copy bytes from src until faulting load address (or until a
 * lb faults)
 *
 * When reached by a faulting LDFIRST/LDREST, THREAD_BUADDR($28)
 * may be more than a byte beyond the last address.
 * Hence, the lb below may get an exception.
 *
 * Assumes src < THREAD_BUADDR($28)
 */
2014-01-16 21:02:13 +04:00
LOADK t0, TI_TASK($28)
2006-12-12 19:22:06 +03:00
li t2, SHIFT_START
2014-01-16 21:02:13 +04:00
LOADK t0, THREAD_BUADDR(t0)
2006-12-12 19:22:06 +03:00
1:
2014-01-17 14:48:46 +04:00
LOADBU(t1, 0(src), .Ll_exc\@)
2006-12-12 19:22:06 +03:00
ADD src, src, 1
sb t1, 0(dst) # can't fault -- we're copy_from_user
SLLV t1, t1, t2
addu t2, SHIFT_INC
ADDC(sum, t1)
2007-10-23 15:43:25 +04:00
.set reorder /* DADDI_WAR */
ADD dst, dst, 1
2006-12-12 19:22:06 +03:00
bne src, t0, 1b
2007-10-23 15:43:25 +04:00
.set noreorder
2014-01-17 14:48:46 +04:00
.Ll_exc\@:
2014-01-16 21:02:13 +04:00
LOADK t0, TI_TASK($28)
2006-12-12 19:22:06 +03:00
nop
2014-01-16 21:02:13 +04:00
LOADK t0, THREAD_BUADDR(t0) # t0 is just past last good address
2006-12-12 19:22:06 +03:00
nop
SUB len, AT, t0 # len number of uncopied bytes
/*
 * Here's where we rely on src and dst being incremented in tandem,
 * See (3) above.
 * dst += (fault addr - src) to put dst at first byte to clear
 */
ADD dst, t0 # compute start address in a1
SUB dst, src
/*
 * Clear len bytes starting at dst. Can't call __bzero because it
 * might modify len. An inefficient loop for these rare times...
 */
2007-10-23 15:43:25 +04:00
.set reorder /* DADDI_WAR */
SUB src, len, 1
2014-01-17 14:48:46 +04:00
beqz len, .Ldone\@
2007-10-23 15:43:25 +04:00
.set noreorder
2006-12-12 19:22:06 +03:00
1: sb zero, 0(dst)
ADD dst, dst, 1
2007-10-23 15:43:25 +04:00
.set push
.set noat
#ifndef CONFIG_CPU_DADDI_WORKAROUNDS
2006-12-12 19:22:06 +03:00
bnez src, 1b
SUB src, src, 1
2007-10-23 15:43:25 +04:00
#else
li v1, 1
bnez src, 1b
SUB src, src, v1
#endif
2006-12-12 19:22:06 +03:00
li v1, -EFAULT
2014-01-17 14:48:46 +04:00
b .Ldone\@
2006-12-12 19:22:06 +03:00
sw v1, (errptr)
2014-01-17 14:48:46 +04:00
.Ls_exc\@:
2006-12-12 19:22:06 +03:00
li v0, -1 /* invalid checksum */
li v1, -EFAULT
jr ra
sw v1, (errptr)
2007-10-23 15:43:25 +04:00
.set pop
2014-01-17 14:48:46 +04:00
.endm
LEAF(__csum_partial_copy_kernel)
2014-01-17 15:36:16 +04:00
#ifndef CONFIG_EVA
2014-01-17 14:48:46 +04:00
FEXPORT(__csum_partial_copy_to_user)
FEXPORT(__csum_partial_copy_from_user)
2014-01-17 15:36:16 +04:00
#endif
2014-01-17 14:48:46 +04:00
__BUILD_CSUM_PARTIAL_COPY_USER LEGACY_MODE USEROP USEROP 1
END(__csum_partial_copy_kernel)
2014-01-17 15:36:16 +04:00
#ifdef CONFIG_EVA
LEAF(__csum_partial_copy_to_user)
__BUILD_CSUM_PARTIAL_COPY_USER EVA_MODE KERNELOP USEROP 0
END(__csum_partial_copy_to_user)
LEAF(__csum_partial_copy_from_user)
__BUILD_CSUM_PARTIAL_COPY_USER EVA_MODE USEROP KERNELOP 0
END(__csum_partial_copy_from_user)
#endif