crypto: poly1305 - Add a SSE2 SIMD variant for x86_64
Implements an x86_64 assembler driver for the Poly1305 authenticator. This
single block variant holds the 130-bit integer in 5 32-bit words, but uses
SSE to do two multiplications/additions in parallel.
When calling updates with small blocks, the overhead for kernel_fpu_begin/
kernel_fpu_end() negates the performance gain. We therefore use the
poly1305-generic fallback for small updates.
For large messages, throughput increases by ~5-10% compared to
poly1305-generic:
testing speed of poly1305 (poly1305-generic)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 4080026 opers/sec, 391682496 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 6221094 opers/sec, 597225024 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9609750 opers/sec, 922536057 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1459379 opers/sec, 420301267 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2115179 opers/sec, 609171609 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3729874 opers/sec, 1074203856 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 593000 opers/sec, 626208000 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1081536 opers/sec, 1142102332 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 302077 opers/sec, 628320576 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 554384 opers/sec, 1153120176 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 278715 opers/sec, 1150536345 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 140202 opers/sec, 1153022070 bytes/sec
testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3790063 opers/sec, 363846076 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5913378 opers/sec, 567684355 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9352574 opers/sec, 897847104 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1362145 opers/sec, 392297990 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2007075 opers/sec, 578037628 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3709811 opers/sec, 1068425798 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 566272 opers/sec, 597984182 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1111657 opers/sec, 1173910108 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 288857 opers/sec, 600823808 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 590746 opers/sec, 1228751888 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 301825 opers/sec, 1245936902 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 153075 opers/sec, 1258896201 bytes/sec
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 20:14:06 +03:00
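The limb layout described in the commit message above can be sketched in C. This is only an illustrative model, not part of the kernel source: the helper names `load_le32` and `block_to_limbs` are hypothetical, and the shifts and masks are taken from the comments in the assembly (for example `(m[3-6] >> 2) & 0x3ffffff`). Each 16-byte block is split into five 26-bit limbs, and bit 24 of the top limb carries the 2^128 padding bit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian 32-bit word from a possibly unaligned address. */
uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Hypothetical helper: split one full 16-byte block into five 26-bit
 * limbs. The overlapping loads at byte offsets 0, 3, 6, 9 and 12 mirror
 * the movd loads in the assembly; the top limb gets the 2^128 pad bit.
 */
void block_to_limbs(const uint8_t m[16], uint32_t limb[5])
{
	assert(m != NULL && limb != NULL);

	limb[0] = load_le32(m + 0) & 0x3ffffff;
	limb[1] = (load_le32(m + 3) >> 2) & 0x3ffffff;
	limb[2] = (load_le32(m + 6) >> 4) & 0x3ffffff;
	limb[3] = (load_le32(m + 9) >> 6) & 0x3ffffff;
	limb[4] = (load_le32(m + 12) >> 8) | (1u << 24);
}
```

Keeping each limb at 26 bits leaves headroom in a 64-bit lane, so several 32x32-bit products can be summed before any carry has to be propagated.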
/*
 * Poly1305 authenticator algorithm, RFC7539, x64 SSE2 functions
 *
 * Copyright (C) 2015 Martin Willi
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 */
#include <linux/linkage.h>
.data
.align 16
ANMASK:	.octa 0x0000000003ffffff0000000003ffffff
crypto: poly1305 - Add a two block SSE2 variant for x86_64
Extends the x86_64 SSE2 Poly1305 authenticator by a function processing two
consecutive Poly1305 blocks in parallel using a derived key r^2. Loop
unrolling can be more effectively mapped to SSE instructions, further
increasing throughput.
For large messages, throughput increases by ~45-65% compared to single
block SSE2:
testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3790063 opers/sec, 363846076 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5913378 opers/sec, 567684355 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9352574 opers/sec, 897847104 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1362145 opers/sec, 392297990 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2007075 opers/sec, 578037628 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3709811 opers/sec, 1068425798 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 566272 opers/sec, 597984182 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1111657 opers/sec, 1173910108 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 288857 opers/sec, 600823808 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 590746 opers/sec, 1228751888 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 301825 opers/sec, 1245936902 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 153075 opers/sec, 1258896201 bytes/sec
testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3809514 opers/sec, 365713411 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5973423 opers/sec, 573448627 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9446779 opers/sec, 906890803 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1364814 opers/sec, 393066691 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2045780 opers/sec, 589184697 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3711946 opers/sec, 1069040592 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 573686 opers/sec, 605812732 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1647802 opers/sec, 1740079440 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 292970 opers/sec, 609378224 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 943229 opers/sec, 1961916528 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 494623 opers/sec, 2041804569 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 254045 opers/sec, 2089271014 bytes/sec
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 20:14:07 +03:00
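The identity behind the two-block variant, h = (h + m1) * r^2 + m2 * r, can be checked with a small C sketch. To stay within 64-bit arithmetic it uses a toy modulus, 2^31 - 1, instead of the real 2^130 - 5; that modulus and the names `TOY_P`, `two_steps_sequential` and `two_blocks_combined` are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy prime modulus (assumption); real Poly1305 works modulo 2^130 - 5. */
#define TOY_P 2147483647ULL

/* Two ordinary Horner steps: h = (h + m) * r, applied to m1 and then m2. */
uint64_t two_steps_sequential(uint64_t h, uint64_t m1, uint64_t m2, uint64_t r)
{
	assert(h < TOY_P && m1 < TOY_P && m2 < TOY_P && r < TOY_P);

	h = ((h + m1) % TOY_P) * r % TOY_P;
	h = ((h + m2) % TOY_P) * r % TOY_P;
	return h;
}

/*
 * The same result from two independent products, which is what allows
 * both blocks to be multiplied side by side in the two SSE lanes.
 */
uint64_t two_blocks_combined(uint64_t h, uint64_t m1, uint64_t m2, uint64_t r)
{
	uint64_t r2, a, b;

	assert(h < TOY_P && m1 < TOY_P && m2 < TOY_P && r < TOY_P);

	r2 = r * r % TOY_P;			/* derived key r^2 */
	a = ((h + m1) % TOY_P) * r2 % TOY_P;	/* lane with the older block */
	b = m2 * r % TOY_P;			/* lane with the newer block */
	return (a + b) % TOY_P;
}
```

Expanding ((h + m1) * r + m2) * r gives (h + m1) * r^2 + m2 * r, so the two functions agree for every input below the modulus.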
ORMASK:	.octa 0x00000000010000000000000001000000
.text
#define h0 0x00(%rdi)
#define h1 0x04(%rdi)
#define h2 0x08(%rdi)
#define h3 0x0c(%rdi)
#define h4 0x10(%rdi)
#define r0 0x00(%rdx)
#define r1 0x04(%rdx)
#define r2 0x08(%rdx)
#define r3 0x0c(%rdx)
#define r4 0x10(%rdx)
#define s1 0x00(%rsp)
#define s2 0x04(%rsp)
#define s3 0x08(%rsp)
#define s4 0x0c(%rsp)
#define m %rsi
#define h01 %xmm0
#define h23 %xmm1
#define h44 %xmm2
#define t1 %xmm3
#define t2 %xmm4
#define t3 %xmm5
#define t4 %xmm6
#define mask %xmm7
#define d0 %r8
#define d1 %r9
#define d2 %r10
#define d3 %r11
#define d4 %r12
ENTRY(poly1305_block_sse2)
	# %rdi: Accumulator h[5]
	# %rsi: 16 byte input block m
	# %rdx: Poly1305 key r[5]
	# %rcx: Block count
	# This single block variant tries to improve performance by doing two
	# multiplications in parallel using SSE instructions. There is quite
	# some quadword packing involved, hence the speedup is marginal.
	push		%rbx
	push		%r12
	sub		$0x10,%rsp
	# s1..s4 = r1..r4 * 5
	mov		r1,%eax
	lea		(%eax,%eax,4),%eax
	mov		%eax,s1
	mov		r2,%eax
	lea		(%eax,%eax,4),%eax
	mov		%eax,s2
	mov		r3,%eax
	lea		(%eax,%eax,4),%eax
	mov		%eax,s3
	mov		r4,%eax
	lea		(%eax,%eax,4),%eax
	mov		%eax,s4
	movdqa		ANMASK(%rip),mask
.Ldoblock:
	# h01 = [0, h1, 0, h0]
	# h23 = [0, h3, 0, h2]
	# h44 = [0, h4, 0, h4]
	movd		h0,h01
	movd		h1,t1
	movd		h2,h23
	movd		h3,t2
	movd		h4,h44
	punpcklqdq	t1,h01
	punpcklqdq	t2,h23
	punpcklqdq	h44,h44
	# h01 += [ (m[3-6] >> 2) & 0x3ffffff, m[0-3] & 0x3ffffff ]
	movd		0x00(m),t1
	movd		0x03(m),t2
	psrld		$2,t2
	punpcklqdq	t2,t1
	pand		mask,t1
	paddd		t1,h01
	# h23 += [ (m[9-12] >> 6) & 0x3ffffff, (m[6-9] >> 4) & 0x3ffffff ]
	movd		0x06(m),t1
	movd		0x09(m),t2
	psrld		$4,t1
	psrld		$6,t2
	punpcklqdq	t2,t1
	pand		mask,t1
	paddd		t1,h23
	# h44 += [ (m[12-15] >> 8) | (1 << 24), (m[12-15] >> 8) | (1 << 24) ]
	mov		0x0c(m),%eax
	shr		$8,%eax
	or		$0x01000000,%eax
	movd		%eax,t1
	pshufd		$0xc4,t1,t1
	paddd		t1,h44
	# t1[0] = h0 * r0 + h2 * s3
	# t1[1] = h1 * s4 + h3 * s2
	movd		r0,t1
	movd		s4,t2
	punpcklqdq	t2,t1
	pmuludq		h01,t1
	movd		s3,t2
	movd		s2,t3
	punpcklqdq	t3,t2
	pmuludq		h23,t2
	paddq		t2,t1
	# t2[0] = h0 * r1 + h2 * s4
	# t2[1] = h1 * r0 + h3 * s3
	movd		r1,t2
	movd		r0,t3
	punpcklqdq	t3,t2
	pmuludq		h01,t2
	movd		s4,t3
	movd		s3,t4
	punpcklqdq	t4,t3
	pmuludq		h23,t3
	paddq		t3,t2
	# t3[0] = h4 * s1
	# t3[1] = h4 * s2
	movd		s1,t3
	movd		s2,t4
	punpcklqdq	t4,t3
	pmuludq		h44,t3
	# d0 = t1[0] + t1[1] + t3[0]
	# d1 = t2[0] + t2[1] + t3[1]
	movdqa		t1,t4
	punpcklqdq	t2,t4
	punpckhqdq	t2,t1
	paddq		t4,t1
	paddq		t3,t1
	movq		t1,d0
	psrldq		$8,t1
	movq		t1,d1
	# t1[0] = h0 * r2 + h2 * r0
	# t1[1] = h1 * r1 + h3 * s4
	movd		r2,t1
	movd		r1,t2
	punpcklqdq	t2,t1
	pmuludq		h01,t1
	movd		r0,t2
	movd		s4,t3
	punpcklqdq	t3,t2
	pmuludq		h23,t2
	paddq		t2,t1
	# t2[0] = h0 * r3 + h2 * r1
	# t2[1] = h1 * r2 + h3 * r0
	movd		r3,t2
	movd		r2,t3
	punpcklqdq	t3,t2
	pmuludq		h01,t2
	movd		r1,t3
	movd		r0,t4
	punpcklqdq	t4,t3
	pmuludq		h23,t3
	paddq		t3,t2
	# t3[0] = h4 * s3
	# t3[1] = h4 * s4
	movd		s3,t3
	movd		s4,t4
	punpcklqdq	t4,t3
	pmuludq		h44,t3
	# d2 = t1[0] + t1[1] + t3[0]
	# d3 = t2[0] + t2[1] + t3[1]
	movdqa		t1,t4
	punpcklqdq	t2,t4
	punpckhqdq	t2,t1
	paddq		t4,t1
	paddq		t3,t1
	movq		t1,d2
	psrldq		$8,t1
	movq		t1,d3
	# t1[0] = h0 * r4 + h2 * r2
	# t1[1] = h1 * r3 + h3 * r1
	movd		r4,t1
	movd		r3,t2
	punpcklqdq	t2,t1
	pmuludq		h01,t1
	movd		r2,t2
	movd		r1,t3
	punpcklqdq	t3,t2
	pmuludq		h23,t2
	paddq		t2,t1
	# t3[0] = h4 * r0
	movd		r0,t3
	pmuludq		h44,t3
	# d4 = t1[0] + t1[1] + t3[0]
	movdqa		t1,t4
	psrldq		$8,t4
	paddq		t4,t1
	paddq		t3,t1
	movq		t1,d4
	# d1 += d0 >> 26
	mov		d0,%rax
	shr		$26,%rax
	add		%rax,d1
	# h0 = d0 & 0x3ffffff
	mov		d0,%rbx
	and		$0x3ffffff,%ebx
	# d2 += d1 >> 26
	mov		d1,%rax
	shr		$26,%rax
	add		%rax,d2
	# h1 = d1 & 0x3ffffff
	mov		d1,%rax
	and		$0x3ffffff,%eax
	mov		%eax,h1
	# d3 += d2 >> 26
	mov		d2,%rax
	shr		$26,%rax
	add		%rax,d3
	# h2 = d2 & 0x3ffffff
	mov		d2,%rax
	and		$0x3ffffff,%eax
	mov		%eax,h2
	# d4 += d3 >> 26
	mov		d3,%rax
	shr		$26,%rax
	add		%rax,d4
	# h3 = d3 & 0x3ffffff
	mov		d3,%rax
	and		$0x3ffffff,%eax
	mov		%eax,h3
	# h0 += (d4 >> 26) * 5
	mov		d4,%rax
	shr		$26,%rax
	lea		(%eax,%eax,4),%eax
	add		%eax,%ebx
	# h4 = d4 & 0x3ffffff
	mov		d4,%rax
	and		$0x3ffffff,%eax
	mov		%eax,h4
	# h1 += h0 >> 26
	mov		%ebx,%eax
	shr		$26,%eax
	add		%eax,h1
	# h0 = h0 & 0x3ffffff
	andl		$0x3ffffff,%ebx
	mov		%ebx,h0
	add		$0x10,m
	dec		%rcx
	jnz		.Ldoblock
	add		$0x10,%rsp
	pop		%r12
	pop		%rbx
	ret
ENDPROC(poly1305_block_sse2)
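The carry propagation at the end of the loop above may be easier to follow as a C model. This is a sketch that transcribes the comments (`d1 += d0 >> 26`, `h0 += (d4 >> 26) * 5`, and so on); the function name `carry_reduce` and the array-based interface are hypothetical and do not appear in the kernel source. The multiplication by 5 works because 2^130 is congruent to 5 modulo 2^130 - 5, so anything carried out of the top limb is folded back into h0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of the partial reduction: turn five 64-bit column
 * sums d[0..4] into five limbs h[0..4]. As in the assembly, h[1] may
 * end up marginally above 26 bits after the final carry from h[0].
 */
void carry_reduce(const uint64_t d_in[5], uint32_t h[5])
{
	uint64_t d[5];
	uint32_t h0;

	assert(d_in != NULL && h != NULL);
	memcpy(d, d_in, sizeof(d));

	d[1] += d[0] >> 26;
	h0 = (uint32_t)d[0] & 0x3ffffff;
	d[2] += d[1] >> 26;
	h[1] = (uint32_t)d[1] & 0x3ffffff;
	d[3] += d[2] >> 26;
	h[2] = (uint32_t)d[2] & 0x3ffffff;
	d[4] += d[3] >> 26;
	h[3] = (uint32_t)d[3] & 0x3ffffff;

	/* Wrap the overflow of the top limb around: 2^130 = 5 (mod p). */
	h0 += (uint32_t)(d[4] >> 26) * 5;
	h[4] = (uint32_t)d[4] & 0x3ffffff;

	h[1] += h0 >> 26;
	h[0] = h0 & 0x3ffffff;
}
```

The reduction is deliberately lazy: it only keeps the limbs small enough for the next round of multiplications, and the full reduction modulo 2^130 - 5 is left to the finalization step outside this file.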
#define u0 0x00(%r8)
#define u1 0x04(%r8)
#define u2 0x08(%r8)
#define u3 0x0c(%r8)
#define u4 0x10(%r8)
#define hc0 %xmm0
#define hc1 %xmm1
#define hc2 %xmm2
#define hc3 %xmm5
#define hc4 %xmm6
#define ru0 %xmm7
#define ru1 %xmm8
#define ru2 %xmm9
#define ru3 %xmm10
#define ru4 %xmm11
#define sv1 %xmm12
#define sv2 %xmm13
#define sv3 %xmm14
#define sv4 %xmm15
#undef d0
#define d0 %r13
ENTRY(poly1305_2block_sse2)
	# %rdi: Accumulator h[5]
	# %rsi: 16 byte input block m
	# %rdx: Poly1305 key r[5]
	# %rcx: Doubleblock count
	# %r8:  Poly1305 derived key r^2 u[5]
	# This two-block variant further improves performance by using loop
	# unrolled block processing. This is more straightforward and does
	# less byte shuffling, but requires a second Poly1305 key r^2:
	# h = (h + m) * r    =>    h = (h + m1) * r^2 + m2 * r
	push		%rbx
	push		%r12
	push		%r13
	# combine r0,u0
	movd		u0,ru0
	movd		r0,t1
	punpcklqdq	t1,ru0
	# combine r1,u1 and s1=r1*5, v1=u1*5
	movd		u1,ru1
	movd		r1,t1
	punpcklqdq	t1,ru1
	movdqa		ru1,sv1
	pslld		$2,sv1
	paddd		ru1,sv1
	# combine r2,u2 and s2=r2*5, v2=u2*5
	movd		u2,ru2
	movd		r2,t1
	punpcklqdq	t1,ru2
	movdqa		ru2,sv2
	pslld		$2,sv2
	paddd		ru2,sv2
	# combine r3,u3 and s3=r3*5, v3=u3*5
	movd		u3,ru3
	movd		r3,t1
	punpcklqdq	t1,ru3
	movdqa		ru3,sv3
	pslld		$2,sv3
	paddd		ru3,sv3
	# combine r4,u4 and s4=r4*5, v4=u4*5
	movd		u4,ru4
	movd		r4,t1
	punpcklqdq	t1,ru4
	movdqa		ru4,sv4
	pslld		$2,sv4
	paddd		ru4,sv4
.Ldoblock2:
	# hc0 = [ m[16-19] & 0x3ffffff, h0 + m[0-3] & 0x3ffffff ]
	movd		0x00(m),hc0
	movd		0x10(m),t1
	punpcklqdq	t1,hc0
	pand		ANMASK(%rip),hc0
	movd		h0,t1
	paddd		t1,hc0
	# hc1 = [ (m[19-22] >> 2) & 0x3ffffff, h1 + (m[3-6] >> 2) & 0x3ffffff ]
	movd		0x03(m),hc1
	movd		0x13(m),t1
	punpcklqdq	t1,hc1
	psrld		$2,hc1
	pand		ANMASK(%rip),hc1
	movd		h1,t1
	paddd		t1,hc1
	# hc2 = [ (m[22-25] >> 4) & 0x3ffffff, h2 + (m[6-9] >> 4) & 0x3ffffff ]
	movd		0x06(m),hc2
	movd		0x16(m),t1
	punpcklqdq	t1,hc2
	psrld		$4,hc2
	pand		ANMASK(%rip),hc2
	movd		h2,t1
	paddd		t1,hc2
	# hc3 = [ (m[25-28] >> 6) & 0x3ffffff, h3 + (m[9-12] >> 6) & 0x3ffffff ]
	movd		0x09(m),hc3
	movd		0x19(m),t1
	punpcklqdq	t1,hc3
	psrld		$6,hc3
	pand		ANMASK(%rip),hc3
	movd		h3,t1
	paddd		t1,hc3
	# hc4 = [ (m[28-31] >> 8) | (1<<24), h4 + (m[12-15] >> 8) | (1<<24) ]
	movd		0x0c(m),hc4
	movd		0x1c(m),t1
	punpcklqdq	t1,hc4
	psrld		$8,hc4
	por		ORMASK(%rip),hc4
	movd		h4,t1
	paddd		t1,hc4
	# t1 = [ hc0[1] * r0, hc0[0] * u0 ]
	movdqa		ru0,t1
	pmuludq		hc0,t1
	# t1 += [ hc1[1] * s4, hc1[0] * v4 ]
	movdqa		sv4,t2
	pmuludq		hc1,t2
	paddq		t2,t1
	# t1 += [ hc2[1] * s3, hc2[0] * v3 ]
	movdqa		sv3,t2
	pmuludq		hc2,t2
	paddq		t2,t1
	# t1 += [ hc3[1] * s2, hc3[0] * v2 ]
	movdqa		sv2,t2
	pmuludq		hc3,t2
	paddq		t2,t1
	# t1 += [ hc4[1] * s1, hc4[0] * v1 ]
	movdqa		sv1,t2
	pmuludq		hc4,t2
	paddq		t2,t1
	# d0 = t1[0] + t1[1]
	movdqa		t1,t2
	psrldq		$8,t2
	paddq		t2,t1
	movq		t1,d0
	# t1 = [ hc0[1] * r1, hc0[0] * u1 ]
	movdqa		ru1,t1
	pmuludq		hc0,t1
	# t1 += [ hc1[1] * r0, hc1[0] * u0 ]
	movdqa		ru0,t2
	pmuludq		hc1,t2
	paddq		t2,t1
	# t1 += [ hc2[1] * s4, hc2[0] * v4 ]
	movdqa		sv4,t2
	pmuludq		hc2,t2
	paddq		t2,t1
	# t1 += [ hc3[1] * s3, hc3[0] * v3 ]
	movdqa		sv3,t2
	pmuludq		hc3,t2
	paddq		t2,t1
	# t1 += [ hc4[1] * s2, hc4[0] * v2 ]
	movdqa		sv2,t2
	pmuludq		hc4,t2
	paddq		t2,t1
	# d1 = t1[0] + t1[1]
	movdqa		t1,t2
	psrldq		$8,t2
	paddq		t2,t1
	movq		t1,d1
	# t1 = [ hc0[1] * r2, hc0[0] * u2 ]
	movdqa		ru2,t1
	pmuludq		hc0,t1
	# t1 += [ hc1[1] * r1, hc1[0] * u1 ]
	movdqa		ru1,t2
	pmuludq		hc1,t2
	paddq		t2,t1
	# t1 += [ hc2[1] * r0, hc2[0] * u0 ]
	movdqa		ru0,t2
	pmuludq		hc2,t2
	paddq		t2,t1
	# t1 += [ hc3[1] * s4, hc3[0] * v4 ]
	movdqa		sv4,t2
	pmuludq		hc3,t2
	paddq		t2,t1
	# t1 += [ hc4[1] * s3, hc4[0] * v3 ]
	movdqa		sv3,t2
	pmuludq		hc4,t2
	paddq		t2,t1
	# d2 = t1[0] + t1[1]
	movdqa		t1,t2
	psrldq		$8,t2
	paddq		t2,t1
	movq		t1,d2
	# t1 = [ hc0[1] * r3, hc0[0] * u3 ]
	movdqa		ru3,t1
	pmuludq		hc0,t1
	# t1 += [ hc1[1] * r2, hc1[0] * u2 ]
	movdqa		ru2,t2
	pmuludq		hc1,t2
	paddq		t2,t1
	# t1 += [ hc2[1] * r1, hc2[0] * u1 ]
	movdqa		ru1,t2
	pmuludq		hc2,t2
	paddq		t2,t1
	# t1 += [ hc3[1] * r0, hc3[0] * u0 ]
	movdqa		ru0,t2
	pmuludq		hc3,t2
	paddq		t2,t1
	# t1 += [ hc4[1] * s4, hc4[0] * v4 ]
	movdqa		sv4,t2
	pmuludq		hc4,t2
	paddq		t2,t1
	# d3 = t1[0] + t1[1]
	movdqa		t1,t2
	psrldq		$8,t2
	paddq		t2,t1
	movq		t1,d3
	# t1 = [ hc0[1] * r4, hc0[0] * u4 ]
	movdqa		ru4,t1
	pmuludq		hc0,t1
	# t1 += [ hc1[1] * r3, hc1[0] * u3 ]
	movdqa		ru3,t2
	pmuludq		hc1,t2
	paddq		t2,t1
	# t1 += [ hc2[1] * r2, hc2[0] * u2 ]
	movdqa		ru2,t2
	pmuludq		hc2,t2
	paddq		t2,t1
	# t1 += [ hc3[1] * r1, hc3[0] * u1 ]
	movdqa		ru1,t2
	pmuludq		hc3,t2
	paddq		t2,t1
	# t1 += [ hc4[1] * r0, hc4[0] * u0 ]
	movdqa		ru0,t2
	pmuludq		hc4,t2
	paddq		t2,t1
	# d4 = t1[0] + t1[1]
	movdqa		t1,t2
	psrldq		$8,t2
	paddq		t2,t1
	movq		t1,d4
	# d1 += d0 >> 26
	mov		d0,%rax
	shr		$26,%rax
	add		%rax,d1
	# h0 = d0 & 0x3ffffff
	mov		d0,%rbx
	and		$0x3ffffff,%ebx
	# d2 += d1 >> 26
	mov		d1,%rax
	shr		$26,%rax
	add		%rax,d2
	# h1 = d1 & 0x3ffffff
	mov		d1,%rax
	and		$0x3ffffff,%eax
	mov		%eax,h1
	# d3 += d2 >> 26
	mov		d2,%rax
	shr		$26,%rax
	add		%rax,d3
	# h2 = d2 & 0x3ffffff
	mov		d2,%rax
	and		$0x3ffffff,%eax
	mov		%eax,h2
	# d4 += d3 >> 26
	mov		d3,%rax
	shr		$26,%rax
	add		%rax,d4
	# h3 = d3 & 0x3ffffff
	mov		d3,%rax
	and		$0x3ffffff,%eax
	mov		%eax,h3
	# h0 += (d4 >> 26) * 5
	mov		d4,%rax
	shr		$26,%rax
	lea		(%eax,%eax,4),%eax
	add		%eax,%ebx
	# h4 = d4 & 0x3ffffff
	mov		d4,%rax
	and		$0x3ffffff,%eax
	mov		%eax,h4
	# h1 += h0 >> 26
	mov		%ebx,%eax
	shr		$26,%eax
	add		%eax,h1
	# h0 = h0 & 0x3ffffff
	andl		$0x3ffffff,%ebx
	mov		%ebx,h0
	add		$0x20,m
	dec		%rcx
	jnz		.Ldoblock2
	pop		%r13
	pop		%r12
	pop		%rbx
	ret
ENDPROC(poly1305_2block_sse2)