crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due to the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 20:14:01 +03:00
/*
 * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions
 *
 * Copyright (C) 2015 Martin Willi
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 */

#include <linux/linkage.h>

.data
.align 16

ROT8:	.octa 0x0e0d0c0f0a09080b0605040702010003
ROT16:	.octa 0x0d0c0f0e09080b0a0504070601000302
crypto: chacha20 - Add a four block SSSE3 variant for x86_64
Extends the x86_64 SSSE3 ChaCha20 implementation by a function processing
four ChaCha20 blocks in parallel. This avoids the word shuffling needed
in the single block variant, further increasing throughput.
For large messages, throughput increases by ~110% compared to single block
SSSE3:
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 42249230 operations in 10 seconds (675987680 bytes)
test 1 (256 bit key, 64 byte blocks): 46441641 operations in 10 seconds (2972265024 bytes)
test 2 (256 bit key, 256 byte blocks): 33028112 operations in 10 seconds (8455196672 bytes)
test 3 (256 bit key, 1024 byte blocks): 11568759 operations in 10 seconds (11846409216 bytes)
test 4 (256 bit key, 8192 byte blocks): 1448761 operations in 10 seconds (11868250112 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 20:14:02 +03:00
CTRINC:	.octa 0x00000003000000020000000100000000
.text

ENTRY(chacha20_block_xor_ssse3)
	# %rdi: Input state matrix, s
	# %rsi: 1 data block output, o
	# %rdx: 1 data block input, i
	# This function encrypts one ChaCha20 block by loading the state matrix
	# in four SSE registers. It performs matrix operations on four words in
	# parallel, but requires shuffling to rearrange the words after each
	# round. 8/16-bit word rotation is done with the slightly better
	# performing SSSE3 byte shuffling, 7/12-bit word rotation uses
	# traditional shift+OR.
	# x0..3 = s0..3
	movdqa		0x00(%rdi),%xmm0
	movdqa		0x10(%rdi),%xmm1
	movdqa		0x20(%rdi),%xmm2
	movdqa		0x30(%rdi),%xmm3
	movdqa		%xmm0,%xmm8
	movdqa		%xmm1,%xmm9
	movdqa		%xmm2,%xmm10
	movdqa		%xmm3,%xmm11

	movdqa		ROT8(%rip),%xmm4
	movdqa		ROT16(%rip),%xmm5

	mov		$10,%ecx

.Ldoubleround:

	# x0 += x1, x3 = rotl32(x3 ^ x0, 16)
	paddd		%xmm1,%xmm0
	pxor		%xmm0,%xmm3
	pshufb		%xmm5,%xmm3

	# x2 += x3, x1 = rotl32(x1 ^ x2, 12)
	paddd		%xmm3,%xmm2
	pxor		%xmm2,%xmm1
	movdqa		%xmm1,%xmm6
	pslld		$12,%xmm6
	psrld		$20,%xmm1
	por		%xmm6,%xmm1

	# x0 += x1, x3 = rotl32(x3 ^ x0, 8)
	paddd		%xmm1,%xmm0
	pxor		%xmm0,%xmm3
	pshufb		%xmm4,%xmm3

	# x2 += x3, x1 = rotl32(x1 ^ x2, 7)
	paddd		%xmm3,%xmm2
	pxor		%xmm2,%xmm1
	movdqa		%xmm1,%xmm7
	pslld		$7,%xmm7
	psrld		$25,%xmm1
	por		%xmm7,%xmm1

	# x1 = shuffle32(x1, MASK(0, 3, 2, 1))
	pshufd		$0x39,%xmm1,%xmm1
	# x2 = shuffle32(x2, MASK(1, 0, 3, 2))
	pshufd		$0x4e,%xmm2,%xmm2
	# x3 = shuffle32(x3, MASK(2, 1, 0, 3))
	pshufd		$0x93,%xmm3,%xmm3

	# x0 += x1, x3 = rotl32(x3 ^ x0, 16)
	paddd		%xmm1,%xmm0
	pxor		%xmm0,%xmm3
	pshufb		%xmm5,%xmm3

	# x2 += x3, x1 = rotl32(x1 ^ x2, 12)
	paddd		%xmm3,%xmm2
	pxor		%xmm2,%xmm1
	movdqa		%xmm1,%xmm6
	pslld		$12,%xmm6
	psrld		$20,%xmm1
	por		%xmm6,%xmm1

	# x0 += x1, x3 = rotl32(x3 ^ x0, 8)
	paddd		%xmm1,%xmm0
	pxor		%xmm0,%xmm3
	pshufb		%xmm4,%xmm3

	# x2 += x3, x1 = rotl32(x1 ^ x2, 7)
	paddd		%xmm3,%xmm2
	pxor		%xmm2,%xmm1
	movdqa		%xmm1,%xmm7
	pslld		$7,%xmm7
	psrld		$25,%xmm1
	por		%xmm7,%xmm1

	# x1 = shuffle32(x1, MASK(2, 1, 0, 3))
	pshufd		$0x93,%xmm1,%xmm1
	# x2 = shuffle32(x2, MASK(1, 0, 3, 2))
	pshufd		$0x4e,%xmm2,%xmm2
	# x3 = shuffle32(x3, MASK(0, 3, 2, 1))
	pshufd		$0x39,%xmm3,%xmm3

	dec		%ecx
	jnz		.Ldoubleround

	# o0 = i0 ^ (x0 + s0)
	movdqu		0x00(%rdx),%xmm4
	paddd		%xmm8,%xmm0
	pxor		%xmm4,%xmm0
	movdqu		%xmm0,0x00(%rsi)
	# o1 = i1 ^ (x1 + s1)
	movdqu		0x10(%rdx),%xmm5
	paddd		%xmm9,%xmm1
	pxor		%xmm5,%xmm1
	movdqu		%xmm1,0x10(%rsi)
	# o2 = i2 ^ (x2 + s2)
	movdqu		0x20(%rdx),%xmm6
	paddd		%xmm10,%xmm2
	pxor		%xmm6,%xmm2
	movdqu		%xmm2,0x20(%rsi)
	# o3 = i3 ^ (x3 + s3)
	movdqu		0x30(%rdx),%xmm7
	paddd		%xmm11,%xmm3
	pxor		%xmm7,%xmm3
	movdqu		%xmm3,0x30(%rsi)

	ret
ENDPROC(chacha20_block_xor_ssse3)
ENTRY(chacha20_4block_xor_ssse3)
	# %rdi: Input state matrix, s
	# %rsi: 4 data blocks output, o
	# %rdx: 4 data blocks input, i
	# This function encrypts four consecutive ChaCha20 blocks by loading
	# the state matrix in SSE registers four times. As we need some scratch
	# registers, we save the first four registers on the stack. The
	# algorithm performs each operation on the corresponding word of each
	# state matrix, hence requires no word shuffling. For final XORing step
	# we transpose the matrix by interleaving 32- and then 64-bit words,
	# which allows us to do XOR in SSE registers. 8/16-bit word rotation is
	# done with the slightly better performing SSSE3 byte shuffling,
	# 7/12-bit word rotation uses traditional shift+OR.
	mov		%rsp,%r11
	sub		$0x80,%rsp
	and		$~63,%rsp
	# x0..15[0-3] = s0..3[0..3]
	movq		0x00(%rdi),%xmm1
	pshufd		$0x00,%xmm1,%xmm0
	pshufd		$0x55,%xmm1,%xmm1
	movq		0x08(%rdi),%xmm3
	pshufd		$0x00,%xmm3,%xmm2
	pshufd		$0x55,%xmm3,%xmm3
	movq		0x10(%rdi),%xmm5
	pshufd		$0x00,%xmm5,%xmm4
	pshufd		$0x55,%xmm5,%xmm5
	movq		0x18(%rdi),%xmm7
	pshufd		$0x00,%xmm7,%xmm6
	pshufd		$0x55,%xmm7,%xmm7
	movq		0x20(%rdi),%xmm9
	pshufd		$0x00,%xmm9,%xmm8
	pshufd		$0x55,%xmm9,%xmm9
	movq		0x28(%rdi),%xmm11
	pshufd		$0x00,%xmm11,%xmm10
	pshufd		$0x55,%xmm11,%xmm11
	movq		0x30(%rdi),%xmm13
	pshufd		$0x00,%xmm13,%xmm12
	pshufd		$0x55,%xmm13,%xmm13
	movq		0x38(%rdi),%xmm15
	pshufd		$0x00,%xmm15,%xmm14
	pshufd		$0x55,%xmm15,%xmm15

	# x0..3 on stack
	movdqa		%xmm0,0x00(%rsp)
	movdqa		%xmm1,0x10(%rsp)
	movdqa		%xmm2,0x20(%rsp)
	movdqa		%xmm3,0x30(%rsp)

	movdqa		CTRINC(%rip),%xmm1
	movdqa		ROT8(%rip),%xmm2
	movdqa		ROT16(%rip),%xmm3

	# x12 += counter values 0-3
	paddd		%xmm1,%xmm12

	mov		$10,%ecx
.Ldoubleround4:

	# x0 += x4, x12 = rotl32(x12 ^ x0, 16)
	movdqa		0x00(%rsp),%xmm0
	paddd		%xmm4,%xmm0
	movdqa		%xmm0,0x00(%rsp)
	pxor		%xmm0,%xmm12
	pshufb		%xmm3,%xmm12
	# x1 += x5, x13 = rotl32(x13 ^ x1, 16)
	movdqa		0x10(%rsp),%xmm0
	paddd		%xmm5,%xmm0
	movdqa		%xmm0,0x10(%rsp)
	pxor		%xmm0,%xmm13
	pshufb		%xmm3,%xmm13
	# x2 += x6, x14 = rotl32(x14 ^ x2, 16)
	movdqa		0x20(%rsp),%xmm0
	paddd		%xmm6,%xmm0
	movdqa		%xmm0,0x20(%rsp)
	pxor		%xmm0,%xmm14
	pshufb		%xmm3,%xmm14
	# x3 += x7, x15 = rotl32(x15 ^ x3, 16)
	movdqa		0x30(%rsp),%xmm0
	paddd		%xmm7,%xmm0
	movdqa		%xmm0,0x30(%rsp)
	pxor		%xmm0,%xmm15
	pshufb		%xmm3,%xmm15

	# x8 += x12, x4 = rotl32(x4 ^ x8, 12)
	paddd		%xmm12,%xmm8
	pxor		%xmm8,%xmm4
	movdqa		%xmm4,%xmm0
	pslld		$12,%xmm0
	psrld		$20,%xmm4
	por		%xmm0,%xmm4
	# x9 += x13, x5 = rotl32(x5 ^ x9, 12)
	paddd		%xmm13,%xmm9
	pxor		%xmm9,%xmm5
	movdqa		%xmm5,%xmm0
	pslld		$12,%xmm0
	psrld		$20,%xmm5
	por		%xmm0,%xmm5
	# x10 += x14, x6 = rotl32(x6 ^ x10, 12)
	paddd		%xmm14,%xmm10
	pxor		%xmm10,%xmm6
	movdqa		%xmm6,%xmm0
	pslld		$12,%xmm0
	psrld		$20,%xmm6
	por		%xmm0,%xmm6
	# x11 += x15, x7 = rotl32(x7 ^ x11, 12)
	paddd		%xmm15,%xmm11
	pxor		%xmm11,%xmm7
	movdqa		%xmm7,%xmm0
	pslld		$12,%xmm0
	psrld		$20,%xmm7
	por		%xmm0,%xmm7

	# x0 += x4, x12 = rotl32(x12 ^ x0, 8)
	movdqa		0x00(%rsp),%xmm0
	paddd		%xmm4,%xmm0
	movdqa		%xmm0,0x00(%rsp)
	pxor		%xmm0,%xmm12
	pshufb		%xmm2,%xmm12
	# x1 += x5, x13 = rotl32(x13 ^ x1, 8)
	movdqa		0x10(%rsp),%xmm0
	paddd		%xmm5,%xmm0
	movdqa		%xmm0,0x10(%rsp)
	pxor		%xmm0,%xmm13
	pshufb		%xmm2,%xmm13
	# x2 += x6, x14 = rotl32(x14 ^ x2, 8)
	movdqa		0x20(%rsp),%xmm0
	paddd		%xmm6,%xmm0
	movdqa		%xmm0,0x20(%rsp)
	pxor		%xmm0,%xmm14
	pshufb		%xmm2,%xmm14
	# x3 += x7, x15 = rotl32(x15 ^ x3, 8)
	movdqa		0x30(%rsp),%xmm0
	paddd		%xmm7,%xmm0
	movdqa		%xmm0,0x30(%rsp)
	pxor		%xmm0,%xmm15
	pshufb		%xmm2,%xmm15

	# x8 += x12, x4 = rotl32(x4 ^ x8, 7)
	paddd		%xmm12,%xmm8
	pxor		%xmm8,%xmm4
	movdqa		%xmm4,%xmm0
	pslld		$7,%xmm0
	psrld		$25,%xmm4
	por		%xmm0,%xmm4
	# x9 += x13, x5 = rotl32(x5 ^ x9, 7)
	paddd		%xmm13,%xmm9
	pxor		%xmm9,%xmm5
	movdqa		%xmm5,%xmm0
	pslld		$7,%xmm0
	psrld		$25,%xmm5
	por		%xmm0,%xmm5
	# x10 += x14, x6 = rotl32(x6 ^ x10, 7)
	paddd		%xmm14,%xmm10
	pxor		%xmm10,%xmm6
	movdqa		%xmm6,%xmm0
	pslld		$7,%xmm0
	psrld		$25,%xmm6
	por		%xmm0,%xmm6
	# x11 += x15, x7 = rotl32(x7 ^ x11, 7)
	paddd		%xmm15,%xmm11
	pxor		%xmm11,%xmm7
	movdqa		%xmm7,%xmm0
	pslld		$7,%xmm0
	psrld		$25,%xmm7
	por		%xmm0,%xmm7

	# x0 += x5, x15 = rotl32(x15 ^ x0, 16)
	movdqa		0x00(%rsp),%xmm0
	paddd		%xmm5,%xmm0
	movdqa		%xmm0,0x00(%rsp)
	pxor		%xmm0,%xmm15
	pshufb		%xmm3,%xmm15
	# x1 += x6, x12 = rotl32(x12 ^ x1, 16)
	movdqa		0x10(%rsp),%xmm0
	paddd		%xmm6,%xmm0
	movdqa		%xmm0,0x10(%rsp)
	pxor		%xmm0,%xmm12
	pshufb		%xmm3,%xmm12
	# x2 += x7, x13 = rotl32(x13 ^ x2, 16)
	movdqa		0x20(%rsp),%xmm0
	paddd		%xmm7,%xmm0
	movdqa		%xmm0,0x20(%rsp)
	pxor		%xmm0,%xmm13
	pshufb		%xmm3,%xmm13
	# x3 += x4, x14 = rotl32(x14 ^ x3, 16)
	movdqa		0x30(%rsp),%xmm0
	paddd		%xmm4,%xmm0
	movdqa		%xmm0,0x30(%rsp)
	pxor		%xmm0,%xmm14
	pshufb		%xmm3,%xmm14

	# x10 += x15, x5 = rotl32(x5 ^ x10, 12)
	paddd		%xmm15,%xmm10
	pxor		%xmm10,%xmm5
	movdqa		%xmm5,%xmm0
	pslld		$12,%xmm0
	psrld		$20,%xmm5
	por		%xmm0,%xmm5
	# x11 += x12, x6 = rotl32(x6 ^ x11, 12)
	paddd		%xmm12,%xmm11
	pxor		%xmm11,%xmm6
	movdqa		%xmm6,%xmm0
	pslld		$12,%xmm0
	psrld		$20,%xmm6
	por		%xmm0,%xmm6
	# x8 += x13, x7 = rotl32(x7 ^ x8, 12)
	paddd		%xmm13,%xmm8
	pxor		%xmm8,%xmm7
	movdqa		%xmm7,%xmm0
	pslld		$12,%xmm0
	psrld		$20,%xmm7
	por		%xmm0,%xmm7
	# x9 += x14, x4 = rotl32(x4 ^ x9, 12)
	paddd		%xmm14,%xmm9
	pxor		%xmm9,%xmm4
	movdqa		%xmm4,%xmm0
	pslld		$12,%xmm0
	psrld		$20,%xmm4
	por		%xmm0,%xmm4

	# x0 += x5, x15 = rotl32(x15 ^ x0, 8)
	movdqa		0x00(%rsp),%xmm0
	paddd		%xmm5,%xmm0
	movdqa		%xmm0,0x00(%rsp)
	pxor		%xmm0,%xmm15
	pshufb		%xmm2,%xmm15
	# x1 += x6, x12 = rotl32(x12 ^ x1, 8)
	movdqa		0x10(%rsp),%xmm0
	paddd		%xmm6,%xmm0
	movdqa		%xmm0,0x10(%rsp)
	pxor		%xmm0,%xmm12
	pshufb		%xmm2,%xmm12
	# x2 += x7, x13 = rotl32(x13 ^ x2, 8)
	movdqa		0x20(%rsp),%xmm0
	paddd		%xmm7,%xmm0
	movdqa		%xmm0,0x20(%rsp)
	pxor		%xmm0,%xmm13
	pshufb		%xmm2,%xmm13
	# x3 += x4, x14 = rotl32(x14 ^ x3, 8)
	movdqa		0x30(%rsp),%xmm0
	paddd		%xmm4,%xmm0
	movdqa		%xmm0,0x30(%rsp)
	pxor		%xmm0,%xmm14
	pshufb		%xmm2,%xmm14

	# x10 += x15, x5 = rotl32(x5 ^ x10, 7)
	paddd		%xmm15,%xmm10
	pxor		%xmm10,%xmm5
	movdqa		%xmm5,%xmm0
	pslld		$7,%xmm0
	psrld		$25,%xmm5
	por		%xmm0,%xmm5
	# x11 += x12, x6 = rotl32(x6 ^ x11, 7)
	paddd		%xmm12,%xmm11
	pxor		%xmm11,%xmm6
	movdqa		%xmm6,%xmm0
	pslld		$7,%xmm0
	psrld		$25,%xmm6
	por		%xmm0,%xmm6
	# x8 += x13, x7 = rotl32(x7 ^ x8, 7)
	paddd		%xmm13,%xmm8
	pxor		%xmm8,%xmm7
	movdqa		%xmm7,%xmm0
	pslld		$7,%xmm0
	psrld		$25,%xmm7
	por		%xmm0,%xmm7
	# x9 += x14, x4 = rotl32(x4 ^ x9, 7)
	paddd		%xmm14,%xmm9
	pxor		%xmm9,%xmm4
	movdqa		%xmm4,%xmm0
	pslld		$7,%xmm0
	psrld		$25,%xmm4
	por		%xmm0,%xmm4

	dec		%ecx
	jnz		.Ldoubleround4
	# x0[0-3] += s0[0]
	# x1[0-3] += s0[1]
	movq		0x00(%rdi),%xmm3
	pshufd		$0x00,%xmm3,%xmm2
	pshufd		$0x55,%xmm3,%xmm3
	paddd		0x00(%rsp),%xmm2
	movdqa		%xmm2,0x00(%rsp)
	paddd		0x10(%rsp),%xmm3
	movdqa		%xmm3,0x10(%rsp)
	# x2[0-3] += s0[2]
	# x3[0-3] += s0[3]
	movq		0x08(%rdi),%xmm3
	pshufd		$0x00,%xmm3,%xmm2
	pshufd		$0x55,%xmm3,%xmm3
	paddd		0x20(%rsp),%xmm2
	movdqa		%xmm2,0x20(%rsp)
	paddd		0x30(%rsp),%xmm3
	movdqa		%xmm3,0x30(%rsp)

	# x4[0-3] += s1[0]
	# x5[0-3] += s1[1]
	movq		0x10(%rdi),%xmm3
	pshufd		$0x00,%xmm3,%xmm2
	pshufd		$0x55,%xmm3,%xmm3
	paddd		%xmm2,%xmm4
	paddd		%xmm3,%xmm5
	# x6[0-3] += s1[2]
	# x7[0-3] += s1[3]
	movq		0x18(%rdi),%xmm3
	pshufd		$0x00,%xmm3,%xmm2
	pshufd		$0x55,%xmm3,%xmm3
	paddd		%xmm2,%xmm6
	paddd		%xmm3,%xmm7

	# x8[0-3] += s2[0]
	# x9[0-3] += s2[1]
	movq		0x20(%rdi),%xmm3
	pshufd		$0x00,%xmm3,%xmm2
	pshufd		$0x55,%xmm3,%xmm3
	paddd		%xmm2,%xmm8
	paddd		%xmm3,%xmm9
	# x10[0-3] += s2[2]
	# x11[0-3] += s2[3]
	movq		0x28(%rdi),%xmm3
	pshufd		$0x00,%xmm3,%xmm2
	pshufd		$0x55,%xmm3,%xmm3
	paddd		%xmm2,%xmm10
	paddd		%xmm3,%xmm11

	# x12[0-3] += s3[0]
	# x13[0-3] += s3[1]
	movq		0x30(%rdi),%xmm3
	pshufd		$0x00,%xmm3,%xmm2
	pshufd		$0x55,%xmm3,%xmm3
	paddd		%xmm2,%xmm12
	paddd		%xmm3,%xmm13
	# x14[0-3] += s3[2]
	# x15[0-3] += s3[3]
	movq		0x38(%rdi),%xmm3
	pshufd		$0x00,%xmm3,%xmm2
	pshufd		$0x55,%xmm3,%xmm3
	paddd		%xmm2,%xmm14
	paddd		%xmm3,%xmm15

	# x12 += counter values 0-3
	paddd		%xmm1,%xmm12
	# interleave 32-bit words in state n, n+1
	movdqa		0x00(%rsp),%xmm0
	movdqa		0x10(%rsp),%xmm1
	movdqa		%xmm0,%xmm2
	punpckldq	%xmm1,%xmm2
	punpckhdq	%xmm1,%xmm0
	movdqa		%xmm2,0x00(%rsp)
	movdqa		%xmm0,0x10(%rsp)
	movdqa		0x20(%rsp),%xmm0
	movdqa		0x30(%rsp),%xmm1
	movdqa		%xmm0,%xmm2
	punpckldq	%xmm1,%xmm2
	punpckhdq	%xmm1,%xmm0
	movdqa		%xmm2,0x20(%rsp)
	movdqa		%xmm0,0x30(%rsp)
	movdqa		%xmm4,%xmm0
	punpckldq	%xmm5,%xmm4
	punpckhdq	%xmm5,%xmm0
	movdqa		%xmm0,%xmm5
	movdqa		%xmm6,%xmm0
	punpckldq	%xmm7,%xmm6
	punpckhdq	%xmm7,%xmm0
	movdqa		%xmm0,%xmm7
	movdqa		%xmm8,%xmm0
	punpckldq	%xmm9,%xmm8
	punpckhdq	%xmm9,%xmm0
	movdqa		%xmm0,%xmm9
	movdqa		%xmm10,%xmm0
	punpckldq	%xmm11,%xmm10
	punpckhdq	%xmm11,%xmm0
	movdqa		%xmm0,%xmm11
	movdqa		%xmm12,%xmm0
	punpckldq	%xmm13,%xmm12
	punpckhdq	%xmm13,%xmm0
	movdqa		%xmm0,%xmm13
	movdqa		%xmm14,%xmm0
	punpckldq	%xmm15,%xmm14
	punpckhdq	%xmm15,%xmm0
	movdqa		%xmm0,%xmm15
	# interleave 64-bit words in state n, n+2
	movdqa		0x00(%rsp),%xmm0
	movdqa		0x20(%rsp),%xmm1
	movdqa		%xmm0,%xmm2
	punpcklqdq	%xmm1,%xmm2
	punpckhqdq	%xmm1,%xmm0
	movdqa		%xmm2,0x00(%rsp)
	movdqa		%xmm0,0x20(%rsp)
	movdqa		0x10(%rsp),%xmm0
	movdqa		0x30(%rsp),%xmm1
	movdqa		%xmm0,%xmm2
	punpcklqdq	%xmm1,%xmm2
	punpckhqdq	%xmm1,%xmm0
	movdqa		%xmm2,0x10(%rsp)
	movdqa		%xmm0,0x30(%rsp)
	movdqa		%xmm4,%xmm0
	punpcklqdq	%xmm6,%xmm4
	punpckhqdq	%xmm6,%xmm0
	movdqa		%xmm0,%xmm6
	movdqa		%xmm5,%xmm0
	punpcklqdq	%xmm7,%xmm5
	punpckhqdq	%xmm7,%xmm0
	movdqa		%xmm0,%xmm7
	movdqa		%xmm8,%xmm0
	punpcklqdq	%xmm10,%xmm8
	punpckhqdq	%xmm10,%xmm0
	movdqa		%xmm0,%xmm10
	movdqa		%xmm9,%xmm0
	punpcklqdq	%xmm11,%xmm9
	punpckhqdq	%xmm11,%xmm0
	movdqa		%xmm0,%xmm11
	movdqa		%xmm12,%xmm0
	punpcklqdq	%xmm14,%xmm12
	punpckhqdq	%xmm14,%xmm0
	movdqa		%xmm0,%xmm14
	movdqa		%xmm13,%xmm0
	punpcklqdq	%xmm15,%xmm13
	punpckhqdq	%xmm15,%xmm0
	movdqa		%xmm0,%xmm15
	# xor with corresponding input, write to output
	movdqa		0x00(%rsp),%xmm0
	movdqu		0x00(%rdx),%xmm1
	pxor		%xmm1,%xmm0
	movdqu		%xmm0,0x00(%rsi)
	movdqa		0x10(%rsp),%xmm0
	movdqu		0x80(%rdx),%xmm1
	pxor		%xmm1,%xmm0
	movdqu		%xmm0,0x80(%rsi)
	movdqa		0x20(%rsp),%xmm0
	movdqu		0x40(%rdx),%xmm1
	pxor		%xmm1,%xmm0
	movdqu		%xmm0,0x40(%rsi)
	movdqa		0x30(%rsp),%xmm0
	movdqu		0xc0(%rdx),%xmm1
	pxor		%xmm1,%xmm0
	movdqu		%xmm0,0xc0(%rsi)
	movdqu		0x10(%rdx),%xmm1
	pxor		%xmm1,%xmm4
	movdqu		%xmm4,0x10(%rsi)
	movdqu		0x90(%rdx),%xmm1
	pxor		%xmm1,%xmm5
	movdqu		%xmm5,0x90(%rsi)
	movdqu		0x50(%rdx),%xmm1
	pxor		%xmm1,%xmm6
	movdqu		%xmm6,0x50(%rsi)
	movdqu		0xd0(%rdx),%xmm1
	pxor		%xmm1,%xmm7
	movdqu		%xmm7,0xd0(%rsi)
	movdqu		0x20(%rdx),%xmm1
	pxor		%xmm1,%xmm8
	movdqu		%xmm8,0x20(%rsi)
	movdqu		0xa0(%rdx),%xmm1
	pxor		%xmm1,%xmm9
	movdqu		%xmm9,0xa0(%rsi)
	movdqu		0x60(%rdx),%xmm1
	pxor		%xmm1,%xmm10
	movdqu		%xmm10,0x60(%rsi)
	movdqu		0xe0(%rdx),%xmm1
	pxor		%xmm1,%xmm11
	movdqu		%xmm11,0xe0(%rsi)
	movdqu		0x30(%rdx),%xmm1
	pxor		%xmm1,%xmm12
	movdqu		%xmm12,0x30(%rsi)
	movdqu		0xb0(%rdx),%xmm1
	pxor		%xmm1,%xmm13
	movdqu		%xmm13,0xb0(%rsi)
	movdqu		0x70(%rdx),%xmm1
	pxor		%xmm1,%xmm14
	movdqu		%xmm14,0x70(%rsi)
	movdqu		0xf0(%rdx),%xmm1
	pxor		%xmm1,%xmm15
	movdqu		%xmm15,0xf0(%rsi)
	mov		%r11,%rsp
	ret
ENDPROC(chacha20_4block_xor_ssse3)