2019-05-27 08:55:01 +02:00
// SPDX-License-Identifier: GPL-2.0-or-later
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
/*
2018-12-04 22:20:03 -08:00
* x64 SIMD accelerated ChaCha and XChaCha stream ciphers ,
* including ChaCha20 ( RFC7539 )
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
*
* Copyright ( C ) 2015 Martin Willi
*/
# include <crypto/algapi.h>
2019-11-08 13:22:08 +01:00
# include <crypto/internal/chacha.h>
2019-03-12 22:12:48 -07:00
# include <crypto/internal/simd.h>
2016-12-09 14:33:51 +00:00
# include <crypto/internal/skcipher.h>
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
# include <linux/kernel.h>
# include <linux/module.h>
2020-08-19 21:58:20 +10:00
# include <linux/sizes.h>
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
# include <asm/simd.h>
2018-12-04 22:20:03 -08:00
asmlinkage void chacha_block_xor_ssse3 ( u32 * state , u8 * dst , const u8 * src ,
unsigned int len , int nrounds ) ;
asmlinkage void chacha_4block_xor_ssse3 ( u32 * state , u8 * dst , const u8 * src ,
unsigned int len , int nrounds ) ;
asmlinkage void hchacha_block_ssse3 ( const u32 * state , u32 * out , int nrounds ) ;
2019-11-08 13:22:10 +01:00
2018-12-04 22:20:03 -08:00
asmlinkage void chacha_2block_xor_avx2 ( u32 * state , u8 * dst , const u8 * src ,
unsigned int len , int nrounds ) ;
asmlinkage void chacha_4block_xor_avx2 ( u32 * state , u8 * dst , const u8 * src ,
unsigned int len , int nrounds ) ;
asmlinkage void chacha_8block_xor_avx2 ( u32 * state , u8 * dst , const u8 * src ,
unsigned int len , int nrounds ) ;
2019-11-08 13:22:10 +01:00
2018-12-04 22:20:03 -08:00
asmlinkage void chacha_2block_xor_avx512vl ( u32 * state , u8 * dst , const u8 * src ,
unsigned int len , int nrounds ) ;
asmlinkage void chacha_4block_xor_avx512vl ( u32 * state , u8 * dst , const u8 * src ,
unsigned int len , int nrounds ) ;
asmlinkage void chacha_8block_xor_avx512vl ( u32 * state , u8 * dst , const u8 * src ,
unsigned int len , int nrounds ) ;
2019-11-08 13:22:10 +01:00
static __ro_after_init DEFINE_STATIC_KEY_FALSE ( chacha_use_simd ) ;
static __ro_after_init DEFINE_STATIC_KEY_FALSE ( chacha_use_avx2 ) ;
static __ro_after_init DEFINE_STATIC_KEY_FALSE ( chacha_use_avx512vl ) ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
2018-12-04 22:20:03 -08:00
static unsigned int chacha_advance ( unsigned int len , unsigned int maxblocks )
2018-11-11 10:36:28 +01:00
{
2018-11-16 17:26:21 -08:00
len = min ( len , maxblocks * CHACHA_BLOCK_SIZE ) ;
return round_up ( len , CHACHA_BLOCK_SIZE ) / CHACHA_BLOCK_SIZE ;
2018-11-11 10:36:28 +01:00
}
2018-12-04 22:20:03 -08:00
static void chacha_dosimd ( u32 * state , u8 * dst , const u8 * src ,
unsigned int bytes , int nrounds )
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
{
2019-11-08 13:22:10 +01:00
if ( IS_ENABLED ( CONFIG_AS_AVX512 ) & &
static_branch_likely ( & chacha_use_avx512vl ) ) {
2018-11-20 17:30:48 +01:00
while ( bytes > = CHACHA_BLOCK_SIZE * 8 ) {
2018-12-04 22:20:03 -08:00
chacha_8block_xor_avx512vl ( state , dst , src , bytes ,
nrounds ) ;
2018-11-20 17:30:48 +01:00
bytes - = CHACHA_BLOCK_SIZE * 8 ;
src + = CHACHA_BLOCK_SIZE * 8 ;
dst + = CHACHA_BLOCK_SIZE * 8 ;
state [ 12 ] + = 8 ;
}
if ( bytes > CHACHA_BLOCK_SIZE * 4 ) {
2018-12-04 22:20:03 -08:00
chacha_8block_xor_avx512vl ( state , dst , src , bytes ,
nrounds ) ;
state [ 12 ] + = chacha_advance ( bytes , 8 ) ;
2018-11-20 17:30:48 +01:00
return ;
}
2018-11-20 17:30:50 +01:00
if ( bytes > CHACHA_BLOCK_SIZE * 2 ) {
2018-12-04 22:20:03 -08:00
chacha_4block_xor_avx512vl ( state , dst , src , bytes ,
nrounds ) ;
state [ 12 ] + = chacha_advance ( bytes , 4 ) ;
2018-11-20 17:30:50 +01:00
return ;
}
2018-11-20 17:30:49 +01:00
if ( bytes ) {
2018-12-04 22:20:03 -08:00
chacha_2block_xor_avx512vl ( state , dst , src , bytes ,
nrounds ) ;
state [ 12 ] + = chacha_advance ( bytes , 2 ) ;
2018-11-20 17:30:49 +01:00
return ;
}
2018-11-20 17:30:48 +01:00
}
2019-11-08 13:22:10 +01:00
2020-03-26 14:26:00 -06:00
if ( static_branch_likely ( & chacha_use_avx2 ) ) {
2018-11-16 17:26:21 -08:00
while ( bytes > = CHACHA_BLOCK_SIZE * 8 ) {
2018-12-04 22:20:03 -08:00
chacha_8block_xor_avx2 ( state , dst , src , bytes , nrounds ) ;
2018-11-16 17:26:21 -08:00
bytes - = CHACHA_BLOCK_SIZE * 8 ;
src + = CHACHA_BLOCK_SIZE * 8 ;
dst + = CHACHA_BLOCK_SIZE * 8 ;
crypto: chacha20 - Add an eight block AVX2 variant for x86_64
Extends the x86_64 ChaCha20 implementation by a function processing eight
ChaCha20 blocks in parallel using AVX2.
For large messages, throughput increases by ~55-70% compared to four block
SSSE3:
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 42249230 operations in 10 seconds (675987680 bytes)
test 1 (256 bit key, 64 byte blocks): 46441641 operations in 10 seconds (2972265024 bytes)
test 2 (256 bit key, 256 byte blocks): 33028112 operations in 10 seconds (8455196672 bytes)
test 3 (256 bit key, 1024 byte blocks): 11568759 operations in 10 seconds (11846409216 bytes)
test 4 (256 bit key, 8192 byte blocks): 1448761 operations in 10 seconds (11868250112 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 41999675 operations in 10 seconds (671994800 bytes)
test 1 (256 bit key, 64 byte blocks): 45805908 operations in 10 seconds (2931578112 bytes)
test 2 (256 bit key, 256 byte blocks): 32814947 operations in 10 seconds (8400626432 bytes)
test 3 (256 bit key, 1024 byte blocks): 19777167 operations in 10 seconds (20251819008 bytes)
test 4 (256 bit key, 8192 byte blocks): 2279321 operations in 10 seconds (18672197632 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:03 +02:00
state [ 12 ] + = 8 ;
}
2018-11-16 17:26:21 -08:00
if ( bytes > CHACHA_BLOCK_SIZE * 4 ) {
2018-12-04 22:20:03 -08:00
chacha_8block_xor_avx2 ( state , dst , src , bytes , nrounds ) ;
state [ 12 ] + = chacha_advance ( bytes , 8 ) ;
2018-11-11 10:36:28 +01:00
return ;
}
2018-11-16 17:26:21 -08:00
if ( bytes > CHACHA_BLOCK_SIZE * 2 ) {
2018-12-04 22:20:03 -08:00
chacha_4block_xor_avx2 ( state , dst , src , bytes , nrounds ) ;
state [ 12 ] + = chacha_advance ( bytes , 4 ) ;
2018-11-11 10:36:30 +01:00
return ;
}
2018-11-16 17:26:21 -08:00
if ( bytes > CHACHA_BLOCK_SIZE ) {
2018-12-04 22:20:03 -08:00
chacha_2block_xor_avx2 ( state , dst , src , bytes , nrounds ) ;
state [ 12 ] + = chacha_advance ( bytes , 2 ) ;
2018-11-11 10:36:29 +01:00
return ;
}
crypto: chacha20 - Add an eight block AVX2 variant for x86_64
Extends the x86_64 ChaCha20 implementation by a function processing eight
ChaCha20 blocks in parallel using AVX2.
For large messages, throughput increases by ~55-70% compared to four block
SSSE3:
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 42249230 operations in 10 seconds (675987680 bytes)
test 1 (256 bit key, 64 byte blocks): 46441641 operations in 10 seconds (2972265024 bytes)
test 2 (256 bit key, 256 byte blocks): 33028112 operations in 10 seconds (8455196672 bytes)
test 3 (256 bit key, 1024 byte blocks): 11568759 operations in 10 seconds (11846409216 bytes)
test 4 (256 bit key, 8192 byte blocks): 1448761 operations in 10 seconds (11868250112 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 41999675 operations in 10 seconds (671994800 bytes)
test 1 (256 bit key, 64 byte blocks): 45805908 operations in 10 seconds (2931578112 bytes)
test 2 (256 bit key, 256 byte blocks): 32814947 operations in 10 seconds (8400626432 bytes)
test 3 (256 bit key, 1024 byte blocks): 19777167 operations in 10 seconds (20251819008 bytes)
test 4 (256 bit key, 8192 byte blocks): 2279321 operations in 10 seconds (18672197632 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:03 +02:00
}
2019-11-08 13:22:10 +01:00
2018-11-16 17:26:21 -08:00
while ( bytes > = CHACHA_BLOCK_SIZE * 4 ) {
2018-12-04 22:20:03 -08:00
chacha_4block_xor_ssse3 ( state , dst , src , bytes , nrounds ) ;
2018-11-16 17:26:21 -08:00
bytes - = CHACHA_BLOCK_SIZE * 4 ;
src + = CHACHA_BLOCK_SIZE * 4 ;
dst + = CHACHA_BLOCK_SIZE * 4 ;
crypto: chacha20 - Add a four block SSSE3 variant for x86_64
Extends the x86_64 SSSE3 ChaCha20 implementation by a function processing
four ChaCha20 blocks in parallel. This avoids the word shuffling needed
in the single block variant, further increasing throughput.
For large messages, throughput increases by ~110% compared to single block
SSSE3:
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 42249230 operations in 10 seconds (675987680 bytes)
test 1 (256 bit key, 64 byte blocks): 46441641 operations in 10 seconds (2972265024 bytes)
test 2 (256 bit key, 256 byte blocks): 33028112 operations in 10 seconds (8455196672 bytes)
test 3 (256 bit key, 1024 byte blocks): 11568759 operations in 10 seconds (11846409216 bytes)
test 4 (256 bit key, 8192 byte blocks): 1448761 operations in 10 seconds (11868250112 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:02 +02:00
state [ 12 ] + = 4 ;
}
2018-11-16 17:26:21 -08:00
if ( bytes > CHACHA_BLOCK_SIZE ) {
2018-12-04 22:20:03 -08:00
chacha_4block_xor_ssse3 ( state , dst , src , bytes , nrounds ) ;
state [ 12 ] + = chacha_advance ( bytes , 4 ) ;
2018-11-11 10:36:28 +01:00
return ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
}
if ( bytes ) {
2018-12-04 22:20:03 -08:00
chacha_block_xor_ssse3 ( state , dst , src , bytes , nrounds ) ;
2018-11-11 10:36:28 +01:00
state [ 12 ] + + ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
}
}
2019-11-08 13:22:10 +01:00
void hchacha_block_arch ( const u32 * state , u32 * stream , int nrounds )
{
if ( ! static_branch_likely ( & chacha_use_simd ) | | ! crypto_simd_usable ( ) ) {
hchacha_block_generic ( state , stream , nrounds ) ;
} else {
kernel_fpu_begin ( ) ;
hchacha_block_ssse3 ( state , stream , nrounds ) ;
kernel_fpu_end ( ) ;
}
}
EXPORT_SYMBOL ( hchacha_block_arch ) ;
void chacha_init_arch ( u32 * state , const u32 * key , const u8 * iv )
{
chacha_init_generic ( state , key , iv ) ;
}
EXPORT_SYMBOL ( chacha_init_arch ) ;
void chacha_crypt_arch ( u32 * state , u8 * dst , const u8 * src , unsigned int bytes ,
int nrounds )
{
if ( ! static_branch_likely ( & chacha_use_simd ) | | ! crypto_simd_usable ( ) | |
bytes < = CHACHA_BLOCK_SIZE )
return chacha_crypt_generic ( state , dst , src , bytes , nrounds ) ;
crypto: arch/lib - limit simd usage to 4k chunks
The initial Zinc patchset, after some mailing list discussion, contained
code to ensure that kernel_fpu_enable would not be kept on for more than
a 4k chunk, since it disables preemption. The choice of 4k isn't totally
scientific, but it's not a bad guess either, and it's what's used in
both the x86 poly1305, blake2s, and nhpoly1305 code already (in the form
of PAGE_SIZE, which this commit corrects to be explicitly 4k for the
former two).
Ard did some back of the envelope calculations and found that
at 5 cycles/byte (overestimate) on a 1ghz processor (pretty slow), 4k
means we have a maximum preemption disabling of 20us, which Sebastian
confirmed was probably a good limit.
Unfortunately the chunking appears to have been left out of the final
patchset that added the glue code. So, this commit adds it back in.
Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
Fixes: d7d7b8535662 ("crypto: x86/poly1305 - wire up faster implementations for kernel")
Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: ed0356eda153 ("crypto: blake2s - x86_64 SIMD implementation")
Cc: Eric Biggers <ebiggers@google.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-04-22 17:18:53 -06:00
do {
unsigned int todo = min_t ( unsigned int , bytes , SZ_4K ) ;
kernel_fpu_begin ( ) ;
chacha_dosimd ( state , dst , src , todo , nrounds ) ;
kernel_fpu_end ( ) ;
bytes - = todo ;
src + = todo ;
dst + = todo ;
} while ( bytes ) ;
2019-11-08 13:22:10 +01:00
}
EXPORT_SYMBOL ( chacha_crypt_arch ) ;
2019-11-08 13:22:09 +01:00
static int chacha_simd_stream_xor ( struct skcipher_request * req ,
2019-06-02 22:47:14 -07:00
const struct chacha_ctx * ctx , const u8 * iv )
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
{
2020-07-08 12:11:18 +03:00
u32 state [ CHACHA_STATE_WORDS ] __aligned ( 8 ) ;
2019-11-08 13:22:09 +01:00
struct skcipher_walk walk ;
int err ;
err = skcipher_walk_virt ( & walk , req , false ) ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
2019-11-08 13:22:09 +01:00
chacha_init_generic ( state , ctx - > key , iv ) ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
2019-11-08 13:22:09 +01:00
while ( walk . nbytes > 0 ) {
unsigned int nbytes = walk . nbytes ;
2018-11-11 10:36:28 +01:00
2019-11-08 13:22:09 +01:00
if ( nbytes < walk . total )
nbytes = round_down ( nbytes , walk . stride ) ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
2019-11-08 13:22:10 +01:00
if ( ! static_branch_likely ( & chacha_use_simd ) | |
! crypto_simd_usable ( ) ) {
2019-11-08 13:22:09 +01:00
chacha_crypt_generic ( state , walk . dst . virt . addr ,
walk . src . virt . addr , nbytes ,
ctx - > nrounds ) ;
} else {
2018-12-04 22:20:05 -08:00
kernel_fpu_begin ( ) ;
2019-11-08 13:22:09 +01:00
chacha_dosimd ( state , walk . dst . virt . addr ,
walk . src . virt . addr , nbytes ,
ctx - > nrounds ) ;
kernel_fpu_end ( ) ;
2018-12-04 22:20:05 -08:00
}
2019-11-08 13:22:09 +01:00
err = skcipher_walk_done ( & walk , walk . nbytes - nbytes ) ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
}
2018-12-04 22:20:02 -08:00
return err ;
}
2018-12-04 22:20:03 -08:00
static int chacha_simd ( struct skcipher_request * req )
2018-12-04 22:20:02 -08:00
{
struct crypto_skcipher * tfm = crypto_skcipher_reqtfm ( req ) ;
struct chacha_ctx * ctx = crypto_skcipher_ctx ( tfm ) ;
2019-11-08 13:22:09 +01:00
return chacha_simd_stream_xor ( req , ctx , req - > iv ) ;
2018-12-04 22:20:02 -08:00
}
2018-12-04 22:20:03 -08:00
static int xchacha_simd ( struct skcipher_request * req )
2018-12-04 22:20:02 -08:00
{
struct crypto_skcipher * tfm = crypto_skcipher_reqtfm ( req ) ;
struct chacha_ctx * ctx = crypto_skcipher_ctx ( tfm ) ;
2020-07-08 12:11:18 +03:00
u32 state [ CHACHA_STATE_WORDS ] __aligned ( 8 ) ;
2019-11-08 13:22:09 +01:00
struct chacha_ctx subctx ;
2018-12-04 22:20:02 -08:00
u8 real_iv [ 16 ] ;
2018-12-15 12:40:17 -08:00
2019-11-08 13:22:09 +01:00
chacha_init_generic ( state , ctx - > key , req - > iv ) ;
if ( req - > cryptlen > CHACHA_BLOCK_SIZE & & crypto_simd_usable ( ) ) {
kernel_fpu_begin ( ) ;
hchacha_block_ssse3 ( state , subctx . key , ctx - > nrounds ) ;
kernel_fpu_end ( ) ;
} else {
hchacha_block_generic ( state , subctx . key , ctx - > nrounds ) ;
}
2018-12-04 22:20:03 -08:00
subctx . nrounds = ctx - > nrounds ;
2018-12-04 22:20:02 -08:00
memcpy ( & real_iv [ 0 ] , req - > iv + 24 , 8 ) ;
memcpy ( & real_iv [ 8 ] , req - > iv + 16 , 8 ) ;
2019-11-08 13:22:09 +01:00
return chacha_simd_stream_xor ( req , & subctx , real_iv ) ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
}
2018-12-04 22:20:02 -08:00
static struct skcipher_alg algs [ ] = {
{
. base . cra_name = " chacha20 " ,
. base . cra_driver_name = " chacha20-simd " ,
. base . cra_priority = 300 ,
. base . cra_blocksize = 1 ,
. base . cra_ctxsize = sizeof ( struct chacha_ctx ) ,
. base . cra_module = THIS_MODULE ,
. min_keysize = CHACHA_KEY_SIZE ,
. max_keysize = CHACHA_KEY_SIZE ,
. ivsize = CHACHA_IV_SIZE ,
. chunksize = CHACHA_BLOCK_SIZE ,
2019-11-08 13:22:09 +01:00
. setkey = chacha20_setkey ,
2018-12-04 22:20:03 -08:00
. encrypt = chacha_simd ,
. decrypt = chacha_simd ,
2018-12-04 22:20:02 -08:00
} , {
. base . cra_name = " xchacha20 " ,
. base . cra_driver_name = " xchacha20-simd " ,
. base . cra_priority = 300 ,
. base . cra_blocksize = 1 ,
. base . cra_ctxsize = sizeof ( struct chacha_ctx ) ,
. base . cra_module = THIS_MODULE ,
. min_keysize = CHACHA_KEY_SIZE ,
. max_keysize = CHACHA_KEY_SIZE ,
. ivsize = XCHACHA_IV_SIZE ,
. chunksize = CHACHA_BLOCK_SIZE ,
2019-11-08 13:22:09 +01:00
. setkey = chacha20_setkey ,
2018-12-04 22:20:03 -08:00
. encrypt = xchacha_simd ,
. decrypt = xchacha_simd ,
2018-12-04 22:20:04 -08:00
} , {
. base . cra_name = " xchacha12 " ,
. base . cra_driver_name = " xchacha12-simd " ,
. base . cra_priority = 300 ,
. base . cra_blocksize = 1 ,
. base . cra_ctxsize = sizeof ( struct chacha_ctx ) ,
. base . cra_module = THIS_MODULE ,
. min_keysize = CHACHA_KEY_SIZE ,
. max_keysize = CHACHA_KEY_SIZE ,
. ivsize = XCHACHA_IV_SIZE ,
. chunksize = CHACHA_BLOCK_SIZE ,
2019-11-08 13:22:09 +01:00
. setkey = chacha12_setkey ,
2018-12-04 22:20:04 -08:00
. encrypt = xchacha_simd ,
. decrypt = xchacha_simd ,
2018-12-04 22:20:02 -08:00
} ,
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
} ;
2018-12-04 22:20:03 -08:00
static int __init chacha_simd_mod_init ( void )
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
{
2015-12-07 10:39:41 +01:00
if ( ! boot_cpu_has ( X86_FEATURE_SSSE3 ) )
2019-11-08 13:22:10 +01:00
return 0 ;
static_branch_enable ( & chacha_use_simd ) ;
2020-03-26 14:26:00 -06:00
if ( boot_cpu_has ( X86_FEATURE_AVX ) & &
2019-11-08 13:22:10 +01:00
boot_cpu_has ( X86_FEATURE_AVX2 ) & &
cpu_has_xfeatures ( XFEATURE_MASK_SSE | XFEATURE_MASK_YMM , NULL ) ) {
static_branch_enable ( & chacha_use_avx2 ) ;
if ( IS_ENABLED ( CONFIG_AS_AVX512 ) & &
boot_cpu_has ( X86_FEATURE_AVX512VL ) & &
boot_cpu_has ( X86_FEATURE_AVX512BW ) ) /* kmovq */
static_branch_enable ( & chacha_use_avx512vl ) ;
}
2019-11-25 11:31:12 +01:00
return IS_REACHABLE ( CONFIG_CRYPTO_SKCIPHER ) ?
crypto_register_skciphers ( algs , ARRAY_SIZE ( algs ) ) : 0 ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
}
2018-12-04 22:20:03 -08:00
static void __exit chacha_simd_mod_fini ( void )
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
{
2019-11-25 11:31:12 +01:00
if ( IS_REACHABLE ( CONFIG_CRYPTO_SKCIPHER ) & & boot_cpu_has ( X86_FEATURE_SSSE3 ) )
2019-11-17 23:21:58 -08:00
crypto_unregister_skciphers ( algs , ARRAY_SIZE ( algs ) ) ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
}
2018-12-04 22:20:03 -08:00
module_init ( chacha_simd_mod_init ) ;
module_exit ( chacha_simd_mod_fini ) ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
MODULE_LICENSE ( " GPL " ) ;
MODULE_AUTHOR ( " Martin Willi <martin@strongswan.org> " ) ;
2018-12-04 22:20:03 -08:00
MODULE_DESCRIPTION ( " ChaCha and XChaCha stream ciphers (x64 SIMD accelerated) " ) ;
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.
For large messages, throughput increases by ~65% compared to
chacha20-generic:
testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
Benchmark results from a Core i5-4670T.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2015-07-16 19:14:01 +02:00
MODULE_ALIAS_CRYPTO ( " chacha20 " ) ;
MODULE_ALIAS_CRYPTO ( " chacha20-simd " ) ;
2018-12-04 22:20:02 -08:00
MODULE_ALIAS_CRYPTO ( " xchacha20 " ) ;
MODULE_ALIAS_CRYPTO ( " xchacha20-simd " ) ;
2018-12-04 22:20:04 -08:00
MODULE_ALIAS_CRYPTO ( " xchacha12 " ) ;
MODULE_ALIAS_CRYPTO ( " xchacha12-simd " ) ;