/*
 * ChaCha/XChaCha NEON helper functions
 *
 * Copyright (C) 2016 Linaro, Ltd. <ard.biesheuvel@linaro.org>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * Based on:
 * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSE3 functions
 *
 * Copyright (C) 2015 Martin Willi
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 */

/*
 * NEON doesn't have a rotate instruction.  The alternatives are, more or less:
 *
 * (a)  vshl.u32 + vsri.u32		(needs temporary register)
 * (b)  vshl.u32 + vshr.u32 + vorr	(needs temporary register)
 * (c)  vrev32.16			(16-bit rotations only)
 * (d)  vtbl.8 + vtbl.8		(multiple of 8 bits rotations only,
 *					 needs index vector)
 *
 * ChaCha has 16, 12, 8, and 7-bit rotations.  For the 12 and 7-bit rotations,
 * the only choices are (a) and (b).  We use (a) since it takes two-thirds the
 * cycles of (b) on both Cortex-A7 and Cortex-A53.
 *
 * For the 16-bit rotation, we use vrev32.16 since it's consistently fastest
 * and doesn't need a temporary register.
 *
 * For the 8-bit rotation, we use vtbl.8 + vtbl.8.  On Cortex-A7, this sequence
 * is twice as fast as (a), even when doing (a) on multiple registers
 * simultaneously to eliminate the stall between vshl and vsri.  Also, it
 * parallelizes better when temporary registers are scarce.
 *
 * A disadvantage is that on Cortex-A53, the vtbl sequence is the same speed as
 * (a), so the need to load the rotation table actually makes the vtbl method
 * slightly slower overall on that CPU (~1.3% slower ChaCha20).  Still, it
 * seems to be a good compromise to get a more significant speed boost on some
 * CPUs, e.g. ~4.8% faster ChaCha20 on Cortex-A7.
 */
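/*
 * As an illustrative sketch (mirroring the sequences used below): rotating
 * each 32-bit lane of q1 left by 12 with method (a), where q4 is the
 * temporary holding x1 ^ x2:
 *
 *	vshl.u32	q1, q4, #12	// q1  = q4 << 12
 *	vsri.u32	q1, q4, #20	// q1 |= q4 >> 20
 *
 * and rotating q3 left by 8 with method (d), where d10 holds the
 * .Lrol8_table index vector:
 *
 *	vtbl.8		d6, {d6}, d10
 *	vtbl.8		d7, {d7}, d10
 */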
#include <linux/linkage.h>
#include <asm/cache.h>

	.text
	.fpu		neon
	.align		5

/*
 * chacha_permute - permute one block
 *
 * Permute one 64-byte block where the state matrix is stored in the four NEON
 * registers q0-q3.  It performs matrix operations on four words in parallel,
 * but requires shuffling to rearrange the words after each round.
 *
 * The round count is given in r3.
 *
 * Clobbers: r3, ip, q4-q5
 */
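/*
 * The state matrix is held one row per register: q0 = x0..x3, q1 = x4..x7,
 * q2 = x8..x11, q3 = x12..x15.  The vext.8 rotations below shift the words
 * of q1-q3 between the two halves of each double round, so that the second
 * set of quarter-rounds operates on the diagonals of the matrix.
 */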
chacha_permute:

	adr		ip, .Lrol8_table
	vld1.8		{d10}, [ip, :64]

.Ldoubleround:
	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
	vadd.i32	q0, q0, q1
	veor		q3, q3, q0
	vrev32.16	q3, q3
	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
	vadd.i32	q2, q2, q3
	veor		q4, q1, q2
	vshl.u32	q1, q4, #12
	vsri.u32	q1, q4, #20
	// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
	vadd.i32	q0, q0, q1
	veor		q3, q3, q0
	vtbl.8		d6, {d6}, d10
	vtbl.8		d7, {d7}, d10
	// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
	vadd.i32	q2, q2, q3
	veor		q4, q1, q2
	vshl.u32	q1, q4, #7
	vsri.u32	q1, q4, #25
	// x1 = shuffle32(x1, MASK(0, 3, 2, 1))
	vext.8		q1, q1, q1, #4
	// x2 = shuffle32(x2, MASK(1, 0, 3, 2))
	vext.8		q2, q2, q2, #8
	// x3 = shuffle32(x3, MASK(2, 1, 0, 3))
	vext.8		q3, q3, q3, #12
	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
	vadd.i32	q0, q0, q1
	veor		q3, q3, q0
	vrev32.16	q3, q3
	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
	vadd.i32	q2, q2, q3
	veor		q4, q1, q2
	vshl.u32	q1, q4, #12
	vsri.u32	q1, q4, #20
	// x0 += x1, x3 = rotl32(x3 ^ x0, 8)
	vadd.i32	q0, q0, q1
	veor		q3, q3, q0
	vtbl.8		d6, {d6}, d10
	vtbl.8		d7, {d7}, d10
	// x2 += x3, x1 = rotl32(x1 ^ x2, 7)
	vadd.i32	q2, q2, q3
	veor		q4, q1, q2
	vshl.u32	q1, q4, #7
	vsri.u32	q1, q4, #25
	// x1 = shuffle32(x1, MASK(2, 1, 0, 3))
	vext.8		q1, q1, q1, #12
	// x2 = shuffle32(x2, MASK(1, 0, 3, 2))
	vext.8		q2, q2, q2, #8
	// x3 = shuffle32(x3, MASK(0, 3, 2, 1))
	vext.8		q3, q3, q3, #4

	subs		r3, r3, #2
	bne		.Ldoubleround
	bx		lr
ENDPROC(chacha_permute)

ENTRY(chacha_block_xor_neon)
	// r0: Input state matrix, s
	// r1: 1 data block output, o
	// r2: 1 data block input, i
	// r3: nrounds
	push		{lr}

	// x0..3 = s0..3
	add		ip, r0, #0x20
	vld1.32		{q0-q1}, [r0]
	vld1.32		{q2-q3}, [ip]

	vmov		q8, q0
	vmov		q9, q1
	vmov		q10, q2
	vmov		q11, q3

	bl		chacha_permute

	add		ip, r2, #0x20
	vld1.8		{q4-q5}, [r2]
	vld1.8		{q6-q7}, [ip]

	// o0 = i0 ^ (x0 + s0)
	vadd.i32	q0, q0, q8
	veor		q0, q0, q4
	// o1 = i1 ^ (x1 + s1)
	vadd.i32	q1, q1, q9
	veor		q1, q1, q5
	// o2 = i2 ^ (x2 + s2)
	vadd.i32	q2, q2, q10
	veor		q2, q2, q6
	// o3 = i3 ^ (x3 + s3)
	vadd.i32	q3, q3, q11
	veor		q3, q3, q7

	add		ip, r1, #0x20
	vst1.8		{q0-q1}, [r1]
	vst1.8		{q2-q3}, [ip]

	pop		{pc}
ENDPROC(chacha_block_xor_neon)

ENTRY(hchacha_block_neon)
	// r0: Input state matrix, s
	// r1: output (8 32-bit words)
	// r2: nrounds
	push		{lr}

	vld1.32		{q0-q1}, [r0]!
	vld1.32		{q2-q3}, [r0]

	mov		r3, r2
	bl		chacha_permute

	vst1.32		{q0}, [r1]!
	vst1.32		{q3}, [r1]

	pop		{pc}
ENDPROC(hchacha_block_neon)

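	// .Lctrinc: block counter increments for the four interleaved blocks.
	// .Lrol8_table: vtbl indexes that rotate each 32-bit little-endian
	// word left by 8 bits.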
	.align		4
.Lctrinc:	.word	0, 1, 2, 3
.Lrol8_table:	.byte	3, 0, 1, 2, 7, 4, 5, 6

	.align		5
ENTRY(chacha_4block_xor_neon)
	push		{r4, lr}
	mov		r4, sp			// preserve the stack pointer
	sub		ip, sp, #0x20		// allocate a 32 byte buffer
	bic		ip, ip, #0x1f		// aligned to 32 bytes
	mov		sp, ip

	// r0: Input state matrix, s
	// r1: 4 data blocks output, o
	// r2: 4 data blocks input, i
	// r3: nrounds
	//
	// This function encrypts four consecutive ChaCha blocks by loading
	// the state matrix in NEON registers four times.  The algorithm
	// performs each operation on the corresponding word of each state
	// matrix, hence requires no word shuffling.  The words are
	// re-interleaved before the final addition of the original state
	// and the XORing step.
	//

	// x0..15[0-3] = s0..15[0-3]
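	// With this interleaving, q0 holds word 0 of all four blocks, q1
	// holds word 1, and so on up to q15 holding word 15, so each
	// quarter-round instruction below advances all four blocks at once.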
	add		ip, r0, #0x20
	vld1.32		{q0-q1}, [r0]
	vld1.32		{q2-q3}, [ip]

	adr		lr, .Lctrinc
	vdup.32		q15, d7[1]
	vdup.32		q14, d7[0]
	vld1.32		{q4}, [lr, :128]
	vdup.32		q13, d6[1]
	vdup.32		q12, d6[0]
	vdup.32		q11, d5[1]
	vdup.32		q10, d5[0]
	vadd.u32	q12, q12, q4		// x12 += counter values 0-3
	vdup.32		q9, d4[1]
	vdup.32		q8, d4[0]
	vdup.32		q7, d3[1]
	vdup.32		q6, d3[0]
	vdup.32		q5, d2[1]
	vdup.32		q4, d2[0]
	vdup.32		q3, d1[1]
	vdup.32		q2, d1[0]
	vdup.32		q1, d0[1]
	vdup.32		q0, d0[0]

	adr		ip, .Lrol8_table
	b		1f

.Ldoubleround4:
	vld1.32		{q8-q9}, [sp, :256]
1:
	// x0 += x4, x12 = rotl32(x12 ^ x0, 16)
	// x1 += x5, x13 = rotl32(x13 ^ x1, 16)
	// x2 += x6, x14 = rotl32(x14 ^ x2, 16)
	// x3 += x7, x15 = rotl32(x15 ^ x3, 16)
	vadd.i32	q0, q0, q4
	vadd.i32	q1, q1, q5
	vadd.i32	q2, q2, q6
	vadd.i32	q3, q3, q7
	veor		q12, q12, q0
	veor		q13, q13, q1
	veor		q14, q14, q2
	veor		q15, q15, q3
	vrev32.16	q12, q12
	vrev32.16	q13, q13
	vrev32.16	q14, q14
	vrev32.16	q15, q15
	// x8 += x12, x4 = rotl32(x4 ^ x8, 12)
	// x9 += x13, x5 = rotl32(x5 ^ x9, 12)
	// x10 += x14, x6 = rotl32(x6 ^ x10, 12)
	// x11 += x15, x7 = rotl32(x7 ^ x11, 12)
	vadd.i32	q8, q8, q12
	vadd.i32	q9, q9, q13
	vadd.i32	q10, q10, q14
	vadd.i32	q11, q11, q15
	vst1.32		{q8-q9}, [sp, :256]
	veor		q8, q4, q8
	veor		q9, q5, q9
	vshl.u32	q4, q8, #12
	vshl.u32	q5, q9, #12
	vsri.u32	q4, q8, #20
	vsri.u32	q5, q9, #20
	veor		q8, q6, q10
	veor		q9, q7, q11
	vshl.u32	q6, q8, #12
	vshl.u32	q7, q9, #12
	vsri.u32	q6, q8, #20
	vsri.u32	q7, q9, #20
	// x0 += x4, x12 = rotl32(x12 ^ x0, 8)
	// x1 += x5, x13 = rotl32(x13 ^ x1, 8)
	// x2 += x6, x14 = rotl32(x14 ^ x2, 8)
	// x3 += x7, x15 = rotl32(x15 ^ x3, 8)
	vld1.8		{d16}, [ip, :64]
	vadd.i32	q0, q0, q4
	vadd.i32	q1, q1, q5
	vadd.i32	q2, q2, q6
	vadd.i32	q3, q3, q7
	veor		q12, q12, q0
	veor		q13, q13, q1
	veor		q14, q14, q2
	veor		q15, q15, q3
	vtbl.8		d24, {d24}, d16
	vtbl.8		d25, {d25}, d16
	vtbl.8		d26, {d26}, d16
	vtbl.8		d27, {d27}, d16
	vtbl.8		d28, {d28}, d16
	vtbl.8		d29, {d29}, d16
	vtbl.8		d30, {d30}, d16
	vtbl.8		d31, {d31}, d16
	vld1.32		{q8-q9}, [sp, :256]
	// x8 += x12, x4 = rotl32(x4 ^ x8, 7)
	// x9 += x13, x5 = rotl32(x5 ^ x9, 7)
	// x10 += x14, x6 = rotl32(x6 ^ x10, 7)
	// x11 += x15, x7 = rotl32(x7 ^ x11, 7)
	vadd.i32	q8, q8, q12
	vadd.i32	q9, q9, q13
	vadd.i32	q10, q10, q14
	vadd.i32	q11, q11, q15
	vst1.32		{q8-q9}, [sp, :256]
	veor		q8, q4, q8
	veor		q9, q5, q9
	vshl.u32	q4, q8, #7
	vshl.u32	q5, q9, #7
	vsri.u32	q4, q8, #25
	vsri.u32	q5, q9, #25
	veor		q8, q6, q10
	veor		q9, q7, q11
	vshl.u32	q6, q8, #7
	vshl.u32	q7, q9, #7
	vsri.u32	q6, q8, #25
	vsri.u32	q7, q9, #25
	vld1.32		{q8-q9}, [sp, :256]
	// x0 += x5, x15 = rotl32(x15 ^ x0, 16)
	// x1 += x6, x12 = rotl32(x12 ^ x1, 16)
	// x2 += x7, x13 = rotl32(x13 ^ x2, 16)
	// x3 += x4, x14 = rotl32(x14 ^ x3, 16)
	vadd.i32	q0, q0, q5
	vadd.i32	q1, q1, q6
	vadd.i32	q2, q2, q7
	vadd.i32	q3, q3, q4
	veor		q15, q15, q0
	veor		q12, q12, q1
	veor		q13, q13, q2
	veor		q14, q14, q3
	vrev32.16	q15, q15
	vrev32.16	q12, q12
	vrev32.16	q13, q13
	vrev32.16	q14, q14
	// x10 += x15, x5 = rotl32(x5 ^ x10, 12)
	// x11 += x12, x6 = rotl32(x6 ^ x11, 12)
	// x8 += x13, x7 = rotl32(x7 ^ x8, 12)
	// x9 += x14, x4 = rotl32(x4 ^ x9, 12)
	vadd.i32	q10, q10, q15
	vadd.i32	q11, q11, q12
	vadd.i32	q8, q8, q13
	vadd.i32	q9, q9, q14
	vst1.32		{q8-q9}, [sp, :256]
	veor		q8, q7, q8
	veor		q9, q4, q9
	vshl.u32	q7, q8, #12
	vshl.u32	q4, q9, #12
	vsri.u32	q7, q8, #20
	vsri.u32	q4, q9, #20
	veor		q8, q5, q10
	veor		q9, q6, q11
	vshl.u32	q5, q8, #12
	vshl.u32	q6, q9, #12
	vsri.u32	q5, q8, #20
	vsri.u32	q6, q9, #20
	// x0 += x5, x15 = rotl32(x15 ^ x0, 8)
	// x1 += x6, x12 = rotl32(x12 ^ x1, 8)
	// x2 += x7, x13 = rotl32(x13 ^ x2, 8)
	// x3 += x4, x14 = rotl32(x14 ^ x3, 8)
	vld1.8		{d16}, [ip, :64]
	vadd.i32	q0, q0, q5
	vadd.i32	q1, q1, q6
	vadd.i32	q2, q2, q7
	vadd.i32	q3, q3, q4
	veor		q15, q15, q0
	veor		q12, q12, q1
	veor		q13, q13, q2
	veor		q14, q14, q3
	vtbl.8		d30, {d30}, d16
	vtbl.8		d31, {d31}, d16
	vtbl.8		d24, {d24}, d16
	vtbl.8		d25, {d25}, d16
	vtbl.8		d26, {d26}, d16
	vtbl.8		d27, {d27}, d16
	vtbl.8		d28, {d28}, d16
	vtbl.8		d29, {d29}, d16
	vld1.32		{q8-q9}, [sp, :256]
	// x10 += x15, x5 = rotl32(x5 ^ x10, 7)
	// x11 += x12, x6 = rotl32(x6 ^ x11, 7)
	// x8 += x13, x7 = rotl32(x7 ^ x8, 7)
	// x9 += x14, x4 = rotl32(x4 ^ x9, 7)
	vadd.i32	q10, q10, q15
	vadd.i32	q11, q11, q12
	vadd.i32	q8, q8, q13
	vadd.i32	q9, q9, q14
	vst1.32		{q8-q9}, [sp, :256]
	veor		q8, q7, q8
	veor		q9, q4, q9
	vshl.u32	q7, q8, #7
	vshl.u32	q4, q9, #7
	vsri.u32	q7, q8, #25
	vsri.u32	q4, q9, #25
	veor		q8, q5, q10
	veor		q9, q6, q11
	vshl.u32	q5, q8, #7
	vshl.u32	q6, q9, #7
	vsri.u32	q5, q8, #25
	vsri.u32	q6, q9, #25

	subs		r3, r3, #2
	bne		.Ldoubleround4

	// x0..7[0-3] are in q0-q7, x10..15[0-3] are in q10-q15.
	// x8..9[0-3] are on the stack.

	// Re-interleave the words in the first two rows of each block (x0..7).
	// Also add the counter values 0-3 to x12[0-3].
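	// The vzip.32 and vswp pairs below perform 4x4 transposes: they turn
	// four registers that each hold one word of all four blocks back into
	// four registers that each hold four consecutive words of one block.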
	vld1.32		{q8}, [lr, :128]	// load counter values 0-3
	vzip.32		q0, q1			// => (0 1 0 1) (0 1 0 1)
	vzip.32		q2, q3			// => (2 3 2 3) (2 3 2 3)
	vzip.32		q4, q5			// => (4 5 4 5) (4 5 4 5)
	vzip.32		q6, q7			// => (6 7 6 7) (6 7 6 7)
	vadd.u32	q12, q12, q8		// x12 += counter values 0-3
	vswp		d1, d4
	vswp		d3, d6
	vld1.32		{q8-q9}, [r0]!		// load s0..7
	vswp		d9, d12
	vswp		d11, d14
	// Swap q1 and q4 so that we'll free up consecutive registers (q0-q1)
	// after XORing the first 32 bytes.
	vswp		q1, q4

	// First two rows of each block are (q0 q1) (q2 q6) (q4 q5) (q3 q7)
	// x0..3[0-3] += s0..3[0-3]	(add orig state to 1st row of each block)
	vadd.u32	q0, q0, q8
	vadd.u32	q2, q2, q8
	vadd.u32	q4, q4, q8
	vadd.u32	q3, q3, q8
	// x4..7[0-3] += s4..7[0-3]	(add orig state to 2nd row of each block)
	vadd.u32	q1, q1, q9
	vadd.u32	q6, q6, q9
	vadd.u32	q5, q5, q9
	vadd.u32	q7, q7, q9

	// XOR first 32 bytes using keystream from first two rows of first block
	vld1.8		{q8-q9}, [r2]!
	veor		q8, q8, q0
	veor		q9, q9, q1
	vst1.8		{q8-q9}, [r1]!
	// Re-interleave the words in the last two rows of each block (x8..15).
	vld1.32		{q8-q9}, [sp, :256]
	mov		sp, r4			// restore original stack pointer
	ldr		r4, [r4, #8]		// load number of bytes
	vzip.32		q12, q13	// => (12 13 12 13) (12 13 12 13)
	vzip.32		q14, q15	// => (14 15 14 15) (14 15 14 15)
	vzip.32		q8, q9		// => (8 9 8 9) (8 9 8 9)
	vzip.32		q10, q11	// => (10 11 10 11) (10 11 10 11)
	vld1.32		{q0-q1}, [r0]	// load s8..15
	vswp		d25, d28
	vswp		d27, d30
	vswp		d17, d20
	vswp		d19, d22

	// Last two rows of each block are (q8 q12) (q10 q14) (q9 q13) (q11 q15)
	// x8..11[0-3] += s8..11[0-3]	(add orig state to 3rd row of each block)
	vadd.u32	q8, q8, q0
	vadd.u32	q10, q10, q0
	vadd.u32	q9, q9, q0
	vadd.u32	q11, q11, q0
	// x12..15[0-3] += s12..15[0-3]	(add orig state to 4th row of each block)
	vadd.u32	q12, q12, q1
	vadd.u32	q14, q14, q1
	vadd.u32	q13, q13, q1
	vadd.u32	q15, q15, q1

	// XOR the rest of the data with the keystream
	vld1.8		{q0-q1}, [r2]!
	subs		r4, r4, #96
	veor		q0, q0, q8
	veor		q1, q1, q12
	ble		.Lle96
	vst1.8		{q0-q1}, [r1]!
	vld1.8		{q0-q1}, [r2]!
	subs		r4, r4, #32
	veor		q0, q0, q2
	veor		q1, q1, q6
	ble		.Lle128
	vst1.8		{q0-q1}, [r1]!
	vld1.8		{q0-q1}, [r2]!
	subs		r4, r4, #32
	veor		q0, q0, q10
	veor		q1, q1, q14
	ble		.Lle160
	vst1.8		{q0-q1}, [r1]!
	vld1.8		{q0-q1}, [r2]!
	subs		r4, r4, #32
	veor		q0, q0, q4
	veor		q1, q1, q5
	ble		.Lle192
	vst1.8		{q0-q1}, [r1]!
	vld1.8		{q0-q1}, [r2]!
	subs		r4, r4, #32
	veor		q0, q0, q9
	veor		q1, q1, q13
	ble		.Lle224
	vst1.8		{q0-q1}, [r1]!
	vld1.8		{q0-q1}, [r2]!
	subs		r4, r4, #32
	veor		q0, q0, q3
	veor		q1, q1, q7
	blt		.Llt256
.Lout:
	vst1.8		{q0-q1}, [r1]!
	vld1.8		{q0-q1}, [r2]
	veor		q0, q0, q11
	veor		q1, q1, q15
	vst1.8		{q0-q1}, [r1]
	pop		{r4, pc}

.Lle192:
	vmov		q4, q9
	vmov		q5, q13
.Lle160:
	// nothing to do
.Lfinalblock:
	// Process the final block if processing less than 4 full blocks.
	// Entered with 32 bytes of ChaCha cipherstream in q4-q5, and the
	// previous 32 byte output block that still needs to be written at
	// [r1] in q0-q1.
	beq		.Lfullblock
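	// r4 = nbytes - 256 is negative here, i.e. 32 + r4 tail bytes remain.
	// Rewind r2 so that the final 32-byte load ends exactly at the end of
	// the input, rotate the keystream in q4-q5 left by (32 + r4) bytes
	// using a window into .Lpermute so that it lines up with those tail
	// bytes, and emit an overlapping store whose leading bytes are then
	// rewritten correctly by the store of q0-q1 at [r1].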
.Lpartialblock:
	adr		lr, .Lpermute + 32
	add		r2, r2, r4
	add		lr, lr, r4
	add		r4, r4, r1

	vld1.8		{q2-q3}, [lr]
	vld1.8		{q6-q7}, [r2]

	add		r4, r4, #32

	vtbl.8		d4, {q4-q5}, d4
	vtbl.8		d5, {q4-q5}, d5
	vtbl.8		d6, {q4-q5}, d6
	vtbl.8		d7, {q4-q5}, d7

	veor		q6, q6, q2
	veor		q7, q7, q3

	vst1.8		{q6-q7}, [r4]	// overlapping stores
	vst1.8		{q0-q1}, [r1]
	pop		{r4, pc}

.Lfullblock:
	vmov		q11, q4
	vmov		q15, q5
	b		.Lout
.Lle96:
	vmov		q4, q2
	vmov		q5, q6
	b		.Lfinalblock
.Lle128:
	vmov		q4, q10
	vmov		q5, q14
	b		.Lfinalblock
.Lle224:
	vmov		q4, q3
	vmov		q5, q7
	b		.Lfinalblock
.Llt256:
	vmov		q4, q11
	vmov		q5, q15
	b		.Lpartialblock
ENDPROC(chacha_4block_xor_neon)

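	// Two consecutive copies of the byte indexes 0x00-0x1f: a 32-byte
	// vtbl window starting N bytes into this table is the permutation
	// that rotates a pair of q registers left by N bytes.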
	.align		L1_CACHE_SHIFT
.Lpermute:
	.byte		0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
	.byte		0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
	.byte		0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
	.byte		0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f
	.byte		0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
	.byte		0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
	.byte		0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17
	.byte		0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f