crypto: aria-avx - add AES-NI/AVX/x86_64/GFNI assembler implementation of aria cipher
The implementation is based on the 32-bit implementation of ARIA.
The aria-avx processing steps are similar to those of camellia-avx:
1. Byteslice (16-way)
2. Add round key
3. S-box
4. Diffusion layer
Except for the s-box, all steps are the same as in the aria-generic
implementation. The s-box step is very similar to the camellia and
sm4 implementations.
There are two implementations of the s-box step.
One uses AES-NI and an affine transformation, as in Camellia, sm4,
and others. The other uses GFNI. The GFNI implementation is faster
than the AES-NI one, so it is used when the running CPU supports GFNI.
ARIA has four s-boxes, two of which are the same as AES's s-boxes.
The first s-box is computed with aesenclast followed by an inverted
shift_row. Nothing more is needed, because the first s-box is
identical to the AES encryption s-box.
The second s-box (the inverse of s1) is computed with aesdeclast
followed by an inverted shift_row. Again nothing more is needed,
because it is identical to the AES decryption s-box.
The third s-box uses aesenclast followed by an affine transformation
that combines the AES inverse affine and ARIA S2.
The last s-box uses aesdeclast followed by an affine transformation
that combines X2 and the AES forward affine.
The optimized third and last s-box logic and the GFNI s-box logic
were implemented by Jussi Kivilinna.
The aria-generic implementation is based on a 32-bit implementation,
not an 8-bit one. The aria-avx diffusion layer is based on the
aria-generic implementation, because the 8-bit form is not suited to
a parallel implementation while the 32-bit form is.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2022-09-16 15:57:35 +03:00
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
 * ARIA Cipher 16-way parallel algorithm (AVX)
 *
 * Copyright (c) 2022 Taehee Yoo <ap420073@gmail.com>
 *
 */

#include <linux/linkage.h>
#include <linux/cfi_types.h>
#include <asm/asm-offsets.h>
#include <asm/frame.h>

/* register macros */
#define CTX %rdi

#define BV8(a0, a1, a2, a3, a4, a5, a6, a7)		\
	( (((a0) & 1) << 0) |				\
	  (((a1) & 1) << 1) |				\
	  (((a2) & 1) << 2) |				\
	  (((a3) & 1) << 3) |				\
	  (((a4) & 1) << 4) |				\
	  (((a5) & 1) << 5) |				\
	  (((a6) & 1) << 6) |				\
	  (((a7) & 1) << 7) )

#define BM8X8(l0, l1, l2, l3, l4, l5, l6, l7)		\
	( ((l7) << (0 * 8)) |				\
	  ((l6) << (1 * 8)) |				\
	  ((l5) << (2 * 8)) |				\
	  ((l4) << (3 * 8)) |				\
	  ((l3) << (4 * 8)) |				\
	  ((l2) << (5 * 8)) |				\
	  ((l1) << (6 * 8)) |				\
	  ((l0) << (7 * 8)) )
#define inc_le128(x, minus_one, tmp)			\
	vpcmpeqq minus_one, x, tmp;			\
	vpsubq minus_one, x, x;				\
	vpslldq $8, tmp, tmp;				\
	vpsubq tmp, x, x;

#define filter_8bit(x, lo_t, hi_t, mask4bit, tmp0)	\
	vpand x, mask4bit, tmp0;			\
	vpandn x, mask4bit, x;				\
	vpsrld $4, x, x;				\
							\
	vpshufb tmp0, lo_t, tmp0;			\
	vpshufb x, hi_t, x;				\
	vpxor tmp0, x, x;

#define transpose_4x4(x0, x1, x2, x3, t1, t2)		\
	vpunpckhdq x1, x0, t2;				\
	vpunpckldq x1, x0, x0;				\
							\
	vpunpckldq x3, x2, t1;				\
	vpunpckhdq x3, x2, x2;				\
							\
	vpunpckhqdq t1, x0, x1;				\
	vpunpcklqdq t1, x0, x0;				\
							\
	vpunpckhqdq x2, t2, x3;				\
	vpunpcklqdq x2, t2, x2;

#define byteslice_16x16b(a0, b0, c0, d0,		\
			 a1, b1, c1, d1,		\
			 a2, b2, c2, d2,		\
			 a3, b3, c3, d3,		\
			 st0, st1)			\
	vmovdqu d2, st0;				\
	vmovdqu d3, st1;				\
	transpose_4x4(a0, a1, a2, a3, d2, d3);		\
	transpose_4x4(b0, b1, b2, b3, d2, d3);		\
	vmovdqu st0, d2;				\
	vmovdqu st1, d3;				\
							\
	vmovdqu a0, st0;				\
	vmovdqu a1, st1;				\
	transpose_4x4(c0, c1, c2, c3, a0, a1);		\
	transpose_4x4(d0, d1, d2, d3, a0, a1);		\
							\
	vmovdqu .Lshufb_16x16b(%rip), a0;		\
	vmovdqu st1, a1;				\
	vpshufb a0, a2, a2;				\
	vpshufb a0, a3, a3;				\
	vpshufb a0, b0, b0;				\
	vpshufb a0, b1, b1;				\
	vpshufb a0, b2, b2;				\
	vpshufb a0, b3, b3;				\
	vpshufb a0, a1, a1;				\
	vpshufb a0, c0, c0;				\
	vpshufb a0, c1, c1;				\
	vpshufb a0, c2, c2;				\
	vpshufb a0, c3, c3;				\
	vpshufb a0, d0, d0;				\
	vpshufb a0, d1, d1;				\
	vpshufb a0, d2, d2;				\
	vpshufb a0, d3, d3;				\
	vmovdqu d3, st1;				\
	vmovdqu st0, d3;				\
	vpshufb a0, d3, a0;				\
	vmovdqu d2, st0;				\
							\
	transpose_4x4(a0, b0, c0, d0, d2, d3);		\
	transpose_4x4(a1, b1, c1, d1, d2, d3);		\
	vmovdqu st0, d2;				\
	vmovdqu st1, d3;				\
							\
	vmovdqu b0, st0;				\
	vmovdqu b1, st1;				\
	transpose_4x4(a2, b2, c2, d2, b0, b1);		\
	transpose_4x4(a3, b3, c3, d3, b0, b1);		\
	vmovdqu st0, b0;				\
	vmovdqu st1, b1;				\
	/* does not adjust output bytes inside vectors */

#define debyteslice_16x16b(a0, b0, c0, d0,		\
			   a1, b1, c1, d1,		\
			   a2, b2, c2, d2,		\
			   a3, b3, c3, d3,		\
			   st0, st1)			\
	vmovdqu d2, st0;				\
	vmovdqu d3, st1;				\
	transpose_4x4(a0, a1, a2, a3, d2, d3);		\
	transpose_4x4(b0, b1, b2, b3, d2, d3);		\
	vmovdqu st0, d2;				\
	vmovdqu st1, d3;				\
							\
	vmovdqu a0, st0;				\
	vmovdqu a1, st1;				\
	transpose_4x4(c0, c1, c2, c3, a0, a1);		\
	transpose_4x4(d0, d1, d2, d3, a0, a1);		\
							\
	vmovdqu .Lshufb_16x16b(%rip), a0;		\
	vmovdqu st1, a1;				\
	vpshufb a0, a2, a2;				\
	vpshufb a0, a3, a3;				\
	vpshufb a0, b0, b0;				\
	vpshufb a0, b1, b1;				\
	vpshufb a0, b2, b2;				\
	vpshufb a0, b3, b3;				\
	vpshufb a0, a1, a1;				\
	vpshufb a0, c0, c0;				\
	vpshufb a0, c1, c1;				\
	vpshufb a0, c2, c2;				\
	vpshufb a0, c3, c3;				\
	vpshufb a0, d0, d0;				\
	vpshufb a0, d1, d1;				\
	vpshufb a0, d2, d2;				\
	vpshufb a0, d3, d3;				\
	vmovdqu d3, st1;				\
	vmovdqu st0, d3;				\
	vpshufb a0, d3, a0;				\
	vmovdqu d2, st0;				\
							\
	transpose_4x4(c0, d0, a0, b0, d2, d3);		\
	transpose_4x4(c1, d1, a1, b1, d2, d3);		\
	vmovdqu st0, d2;				\
	vmovdqu st1, d3;				\
							\
	vmovdqu b0, st0;				\
	vmovdqu b1, st1;				\
	transpose_4x4(c2, d2, a2, b2, b0, b1);		\
	transpose_4x4(c3, d3, a3, b3, b0, b1);		\
	vmovdqu st0, b0;				\
	vmovdqu st1, b1;				\
	/* does not adjust output bytes inside vectors */
/* load blocks to registers and apply pre-whitening */
#define inpack16_pre(x0, x1, x2, x3,			\
		     x4, x5, x6, x7,			\
		     y0, y1, y2, y3,			\
		     y4, y5, y6, y7,			\
		     rio)				\
	vmovdqu (0 * 16)(rio), x0;			\
	vmovdqu (1 * 16)(rio), x1;			\
	vmovdqu (2 * 16)(rio), x2;			\
	vmovdqu (3 * 16)(rio), x3;			\
	vmovdqu (4 * 16)(rio), x4;			\
	vmovdqu (5 * 16)(rio), x5;			\
	vmovdqu (6 * 16)(rio), x6;			\
	vmovdqu (7 * 16)(rio), x7;			\
	vmovdqu (8 * 16)(rio), y0;			\
	vmovdqu (9 * 16)(rio), y1;			\
	vmovdqu (10 * 16)(rio), y2;			\
	vmovdqu (11 * 16)(rio), y3;			\
	vmovdqu (12 * 16)(rio), y4;			\
	vmovdqu (13 * 16)(rio), y5;			\
	vmovdqu (14 * 16)(rio), y6;			\
	vmovdqu (15 * 16)(rio), y7;
/* byteslice pre-whitened blocks and store to temporary memory */
#define inpack16_post(x0, x1, x2, x3,			\
		      x4, x5, x6, x7,			\
		      y0, y1, y2, y3,			\
		      y4, y5, y6, y7,			\
		      mem_ab, mem_cd)			\
	byteslice_16x16b(x0, x1, x2, x3,		\
			 x4, x5, x6, x7,		\
			 y0, y1, y2, y3,		\
			 y4, y5, y6, y7,		\
			 (mem_ab), (mem_cd));		\
							\
	vmovdqu x0, 0 * 16(mem_ab);			\
	vmovdqu x1, 1 * 16(mem_ab);			\
	vmovdqu x2, 2 * 16(mem_ab);			\
	vmovdqu x3, 3 * 16(mem_ab);			\
	vmovdqu x4, 4 * 16(mem_ab);			\
	vmovdqu x5, 5 * 16(mem_ab);			\
	vmovdqu x6, 6 * 16(mem_ab);			\
	vmovdqu x7, 7 * 16(mem_ab);			\
	vmovdqu y0, 0 * 16(mem_cd);			\
	vmovdqu y1, 1 * 16(mem_cd);			\
	vmovdqu y2, 2 * 16(mem_cd);			\
	vmovdqu y3, 3 * 16(mem_cd);			\
	vmovdqu y4, 4 * 16(mem_cd);			\
	vmovdqu y5, 5 * 16(mem_cd);			\
	vmovdqu y6, 6 * 16(mem_cd);			\
	vmovdqu y7, 7 * 16(mem_cd);
#define write_output(x0, x1, x2, x3,			\
		     x4, x5, x6, x7,			\
		     y0, y1, y2, y3,			\
		     y4, y5, y6, y7,			\
		     mem)				\
	vmovdqu x0, 0 * 16(mem);			\
	vmovdqu x1, 1 * 16(mem);			\
	vmovdqu x2, 2 * 16(mem);			\
	vmovdqu x3, 3 * 16(mem);			\
	vmovdqu x4, 4 * 16(mem);			\
	vmovdqu x5, 5 * 16(mem);			\
	vmovdqu x6, 6 * 16(mem);			\
	vmovdqu x7, 7 * 16(mem);			\
	vmovdqu y0, 8 * 16(mem);			\
	vmovdqu y1, 9 * 16(mem);			\
	vmovdqu y2, 10 * 16(mem);			\
	vmovdqu y3, 11 * 16(mem);			\
	vmovdqu y4, 12 * 16(mem);			\
	vmovdqu y5, 13 * 16(mem);			\
	vmovdqu y6, 14 * 16(mem);			\
	vmovdqu y7, 15 * 16(mem);

#define aria_store_state_8way(x0, x1, x2, x3,		\
			      x4, x5, x6, x7,		\
			      mem_tmp, idx)		\
	vmovdqu x0, ((idx + 0) * 16)(mem_tmp);		\
	vmovdqu x1, ((idx + 1) * 16)(mem_tmp);		\
	vmovdqu x2, ((idx + 2) * 16)(mem_tmp);		\
	vmovdqu x3, ((idx + 3) * 16)(mem_tmp);		\
	vmovdqu x4, ((idx + 4) * 16)(mem_tmp);		\
	vmovdqu x5, ((idx + 5) * 16)(mem_tmp);		\
	vmovdqu x6, ((idx + 6) * 16)(mem_tmp);		\
	vmovdqu x7, ((idx + 7) * 16)(mem_tmp);

#define aria_load_state_8way(x0, x1, x2, x3,		\
			     x4, x5, x6, x7,		\
			     mem_tmp, idx)		\
	vmovdqu ((idx + 0) * 16)(mem_tmp), x0;		\
	vmovdqu ((idx + 1) * 16)(mem_tmp), x1;		\
	vmovdqu ((idx + 2) * 16)(mem_tmp), x2;		\
	vmovdqu ((idx + 3) * 16)(mem_tmp), x3;		\
	vmovdqu ((idx + 4) * 16)(mem_tmp), x4;		\
	vmovdqu ((idx + 5) * 16)(mem_tmp), x5;		\
	vmovdqu ((idx + 6) * 16)(mem_tmp), x6;		\
	vmovdqu ((idx + 7) * 16)(mem_tmp), x7;
#define aria_ark_8way(x0, x1, x2, x3,			\
		      x4, x5, x6, x7,			\
		      t0, t1, t2, rk,			\
		      idx, round)			\
	/* AddRoundKey */				\
	vbroadcastss ((round * 16) + idx + 0)(rk), t0;	\
	vpsrld $24, t0, t2;				\
	vpshufb t1, t2, t2;				\
	vpxor t2, x0, x0;				\
	vpsrld $16, t0, t2;				\
	vpshufb t1, t2, t2;				\
	vpxor t2, x1, x1;				\
	vpsrld $8, t0, t2;				\
	vpshufb t1, t2, t2;				\
	vpxor t2, x2, x2;				\
	vpshufb t1, t0, t2;				\
	vpxor t2, x3, x3;				\
	vbroadcastss ((round * 16) + idx + 4)(rk), t0;	\
	vpsrld $24, t0, t2;				\
	vpshufb t1, t2, t2;				\
	vpxor t2, x4, x4;				\
	vpsrld $16, t0, t2;				\
	vpshufb t1, t2, t2;				\
	vpxor t2, x5, x5;				\
	vpsrld $8, t0, t2;				\
	vpshufb t1, t2, t2;				\
	vpxor t2, x6, x6;				\
	vpshufb t1, t0, t2;				\
	vpxor t2, x7, x7;
#ifdef CONFIG_AS_GFNI
#define aria_sbox_8way_gfni(x0, x1, x2, x3,		\
			    x4, x5, x6, x7,		\
			    t0, t1, t2, t3,		\
			    t4, t5, t6, t7)		\
	vmovdqa .Ltf_s2_bitmatrix(%rip), t0;		\
	vmovdqa .Ltf_inv_bitmatrix(%rip), t1;		\
	vmovdqa .Ltf_id_bitmatrix(%rip), t2;		\
	vmovdqa .Ltf_aff_bitmatrix(%rip), t3;		\
	vmovdqa .Ltf_x2_bitmatrix(%rip), t4;		\
	vgf2p8affineinvqb $(tf_s2_const), t0, x1, x1;	\
	vgf2p8affineinvqb $(tf_s2_const), t0, x5, x5;	\
	vgf2p8affineqb $(tf_inv_const), t1, x2, x2;	\
	vgf2p8affineqb $(tf_inv_const), t1, x6, x6;	\
	vgf2p8affineinvqb $0, t2, x2, x2;		\
	vgf2p8affineinvqb $0, t2, x6, x6;		\
	vgf2p8affineinvqb $(tf_aff_const), t3, x0, x0;	\
	vgf2p8affineinvqb $(tf_aff_const), t3, x4, x4;	\
	vgf2p8affineqb $(tf_x2_const), t4, x3, x3;	\
	vgf2p8affineqb $(tf_x2_const), t4, x7, x7;	\
	vgf2p8affineinvqb $0, t2, x3, x3;		\
	vgf2p8affineinvqb $0, t2, x7, x7

#endif /* CONFIG_AS_GFNI */
#define aria_sbox_8way(x0, x1, x2, x3,			\
		       x4, x5, x6, x7,			\
		       t0, t1, t2, t3,			\
		       t4, t5, t6, t7)			\
	vmovdqa .Linv_shift_row(%rip), t0;		\
	vmovdqa .Lshift_row(%rip), t1;			\
	vbroadcastss .L0f0f0f0f(%rip), t6;		\
	vmovdqa .Ltf_lo__inv_aff__and__s2(%rip), t2;	\
	vmovdqa .Ltf_hi__inv_aff__and__s2(%rip), t3;	\
	vmovdqa .Ltf_lo__x2__and__fwd_aff(%rip), t4;	\
	vmovdqa .Ltf_hi__x2__and__fwd_aff(%rip), t5;	\
							\
	vaesenclast t7, x0, x0;				\
	vaesenclast t7, x4, x4;				\
	vaesenclast t7, x1, x1;				\
	vaesenclast t7, x5, x5;				\
	vaesdeclast t7, x2, x2;				\
	vaesdeclast t7, x6, x6;				\
							\
	/* AES inverse shift rows */			\
	vpshufb t0, x0, x0;				\
	vpshufb t0, x4, x4;				\
	vpshufb t0, x1, x1;				\
	vpshufb t0, x5, x5;				\
	vpshufb t1, x3, x3;				\
	vpshufb t1, x7, x7;				\
	vpshufb t1, x2, x2;				\
	vpshufb t1, x6, x6;				\
							\
	/* affine transformation for S2 */		\
	filter_8bit(x1, t2, t3, t6, t0);		\
	/* affine transformation for S2 */		\
	filter_8bit(x5, t2, t3, t6, t0);		\
							\
	/* affine transformation for X2 */		\
	filter_8bit(x3, t4, t5, t6, t0);		\
	/* affine transformation for X2 */		\
	filter_8bit(x7, t4, t5, t6, t0);		\
	vaesdeclast t7, x3, x3;				\
	vaesdeclast t7, x7, x7;
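
/* The s-box step above relies on ARIA's S1 being exactly the AES SubBytes
 * s-box, so aesenclast with an all-zero round key leaves only ShiftRows to
 * undo.  A minimal Python sketch of that identity (illustrative only, not
 * kernel code; gf_mul and aes_sbox are hypothetical helper names):
 */

```python
def gf_mul(a, b):
    # Multiply in GF(2^8) with the AES reduction polynomial 0x11b.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11b
        b >>= 1
    return r

def aes_sbox(x):
    # Multiplicative inverse in GF(2^8) (0 maps to 0), then the AES
    # affine transform: XOR of the input rotated left by 0..4, plus 0x63.
    inv = 0 if x == 0 else next(i for i in range(256) if gf_mul(x, i) == 1)
    res = 0x63
    for shift in range(5):
        res ^= ((inv << shift) | (inv >> (8 - shift))) & 0xff
    return res

# Known AES s-box entries; ARIA's S1 table matches byte for byte,
# which is why aesenclast (zero key) + inverse ShiftRows computes S1.
print(hex(aes_sbox(0x00)), hex(aes_sbox(0x01)), hex(aes_sbox(0x53)))
```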
#define aria_diff_m(x0, x1, x2, x3,			\
		    t0, t1, t2, t3)			\
	/* T = rotr32(X, 8); */				\
	/* X ^= T */					\
	vpxor x0, x3, t0;				\
	vpxor x1, x0, t1;				\
	vpxor x2, x1, t2;				\
	vpxor x3, x2, t3;				\
	/* X = T ^ rotr(X, 16); */			\
	vpxor t2, x0, x0;				\
	vpxor x1, t3, t3;				\
	vpxor t0, x2, x2;				\
	vpxor t1, x3, x1;				\
	vmovdqu t3, x3;
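
/* The XOR network in aria_diff_m can be sanity-checked lane-wise.  A plain
 * Python sketch (an assumption-laden model, not kernel code) mirroring the
 * vpxor sequence instruction for instruction:
 */

```python
def aria_diff_m(x0, x1, x2, x3):
    # T = rotr32(X, 8); X ^= T  -- each variable is one byte-sliced lane.
    t0 = x3 ^ x0
    t1 = x0 ^ x1
    t2 = x1 ^ x2
    t3 = x2 ^ x3
    # X = T ^ rotr32(X, 16)
    x0 ^= t2        # x0' = x0 ^ x1 ^ x2
    t3 ^= x1        # t3  = x1 ^ x2 ^ x3
    x2 ^= t0        # x2' = x0 ^ x2 ^ x3
    x1 = x3 ^ t1    # x1' = x0 ^ x1 ^ x3
    x3 = t3         # x3' = x1 ^ x2 ^ x3
    return x0, x1, x2, x3

# Each output lane is the XOR of three of the four input lanes, and the
# transform is an involution (applying it twice restores the input).
once = aria_diff_m(0x11, 0x22, 0x44, 0x88)
twice = aria_diff_m(*once)
print(once, twice)
```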
#define aria_diff_word(x0, x1, x2, x3,			\
		       x4, x5, x6, x7,			\
		       y0, y1, y2, y3,			\
		       y4, y5, y6, y7)			\
	/* t1 ^= t2; */					\
	vpxor y0, x4, x4;				\
	vpxor y1, x5, x5;				\
	vpxor y2, x6, x6;				\
	vpxor y3, x7, x7;				\
							\
	/* t2 ^= t3; */					\
	vpxor y4, y0, y0;				\
	vpxor y5, y1, y1;				\
	vpxor y6, y2, y2;				\
	vpxor y7, y3, y3;				\
							\
	/* t0 ^= t1; */					\
	vpxor x4, x0, x0;				\
	vpxor x5, x1, x1;				\
	vpxor x6, x2, x2;				\
	vpxor x7, x3, x3;				\
							\
	/* t3 ^= t1; */					\
	vpxor x4, y4, y4;				\
	vpxor x5, y5, y5;				\
	vpxor x6, y6, y6;				\
	vpxor x7, y7, y7;				\
							\
	/* t2 ^= t0; */					\
	vpxor x0, y0, y0;				\
	vpxor x1, y1, y1;				\
	vpxor x2, y2, y2;				\
	vpxor x3, y3, y3;				\
							\
	/* t1 ^= t2; */					\
	vpxor y0, x4, x4;				\
	vpxor y1, x5, x5;				\
	vpxor y2, x6, x6;				\
	vpxor y3, x7, x7;
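
/* Reduced to one variable per 32-bit word group (t0..t3 per the comments),
 * aria_diff_word is the six-XOR sequence below.  A Python sketch, offered
 * as a model of the dataflow rather than kernel code:
 */

```python
def aria_diff_word(t0, t1, t2, t3):
    # Mirrors the comment sequence in the macro, one XOR per step.
    t1 ^= t2
    t2 ^= t3
    t0 ^= t1
    t3 ^= t1
    t2 ^= t0
    t1 ^= t2
    return t0, t1, t2, t3

# Net effect: t0' = T0^T1^T2, t1' = T0^T2^T3, t2' = T0^T1^T3,
# t3' = T1^T2^T3 -- every output drops exactly one input word.
print(aria_diff_word(1, 2, 4, 8))
```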
#define aria_fe(x0, x1, x2, x3,				\
		x4, x5, x6, x7,				\
		y0, y1, y2, y3,				\
		y4, y5, y6, y7,				\
		mem_tmp, rk, round)			\
	vpxor y7, y7, y7;				\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7,	\
		      y0, y7, y2, rk, 8, round);	\
							\
	aria_sbox_8way(x2, x3, x0, x1, x6, x7, x4, x5,	\
		       y0, y1, y2, y3, y4, y5, y6, y7);	\
							\
	aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3);	\
	aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3);	\
	aria_store_state_8way(x0, x1, x2, x3,		\
			      x4, x5, x6, x7,		\
			      mem_tmp, 8);		\
							\
	aria_load_state_8way(x0, x1, x2, x3,		\
			     x4, x5, x6, x7,		\
			     mem_tmp, 0);		\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7,	\
		      y0, y7, y2, rk, 0, round);	\
							\
	aria_sbox_8way(x2, x3, x0, x1, x6, x7, x4, x5,	\
		       y0, y1, y2, y3, y4, y5, y6, y7);	\
							\
	aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3);	\
	aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3);	\
	aria_store_state_8way(x0, x1, x2, x3,		\
			      x4, x5, x6, x7,		\
			      mem_tmp, 0);		\
	aria_load_state_8way(y0, y1, y2, y3,		\
			     y4, y5, y6, y7,		\
			     mem_tmp, 8);		\
	aria_diff_word(x0, x1, x2, x3,			\
		       x4, x5, x6, x7,			\
		       y0, y1, y2, y3,			\
		       y4, y5, y6, y7);			\
	/* aria_diff_byte()				\
	 * T3 = ABCD -> BADC				\
	 * T3 = y4, y5, y6, y7 -> y5, y4, y7, y6	\
	 * T0 = ABCD -> CDAB				\
	 * T0 = x0, x1, x2, x3 -> x2, x3, x0, x1	\
	 * T1 = ABCD -> DCBA				\
	 * T1 = x4, x5, x6, x7 -> x7, x6, x5, x4	\
	 */						\
	aria_diff_word(x2, x3, x0, x1,			\
		       x7, x6, x5, x4,			\
		       y0, y1, y2, y3,			\
		       y5, y4, y7, y6);			\
	aria_store_state_8way(x3, x2, x1, x0,		\
			      x6, x7, x4, x5,		\
			      mem_tmp, 0);
#define aria_fo(x0, x1, x2, x3,				\
		x4, x5, x6, x7,				\
		y0, y1, y2, y3,				\
		y4, y5, y6, y7,				\
		mem_tmp, rk, round)			\
	vpxor y7, y7, y7;				\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7,	\
		      y0, y7, y2, rk, 8, round);	\
							\
	aria_sbox_8way(x0, x1, x2, x3, x4, x5, x6, x7,	\
		       y0, y1, y2, y3, y4, y5, y6, y7);	\
							\
	aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3);	\
	aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3);	\
	aria_store_state_8way(x0, x1, x2, x3,		\
			      x4, x5, x6, x7,		\
			      mem_tmp, 8);		\
							\
	aria_load_state_8way(x0, x1, x2, x3,		\
			     x4, x5, x6, x7,		\
			     mem_tmp, 0);		\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7,	\
		      y0, y7, y2, rk, 0, round);	\
							\
	aria_sbox_8way(x0, x1, x2, x3, x4, x5, x6, x7,	\
		       y0, y1, y2, y3, y4, y5, y6, y7);	\
							\
	aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3);	\
	aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3);	\
	aria_store_state_8way(x0, x1, x2, x3,		\
			      x4, x5, x6, x7,		\
			      mem_tmp, 0);		\
	aria_load_state_8way(y0, y1, y2, y3,		\
			     y4, y5, y6, y7,		\
			     mem_tmp, 8);		\
	aria_diff_word(x0, x1, x2, x3,			\
		       x4, x5, x6, x7,			\
		       y0, y1, y2, y3,			\
		       y4, y5, y6, y7);			\
	/* aria_diff_byte()				\
	 * T1 = ABCD -> BADC				\
	 * T1 = x4, x5, x6, x7 -> x5, x4, x7, x6	\
	 * T2 = ABCD -> CDAB				\
	 * T2 = y0, y1, y2, y3 -> y2, y3, y0, y1	\
	 * T3 = ABCD -> DCBA				\
	 * T3 = y4, y5, y6, y7 -> y7, y6, y5, y4	\
	 */						\
	aria_diff_word(x0, x1, x2, x3,			\
		       x5, x4, x7, x6,			\
		       y2, y3, y0, y1,			\
		       y7, y6, y5, y4);			\
	aria_store_state_8way(x3, x2, x1, x0,		\
			      x6, x7, x4, x5,		\
			      mem_tmp, 0);
#define aria_ff(x0, x1, x2, x3,				\
		x4, x5, x6, x7,				\
		y0, y1, y2, y3,				\
		y4, y5, y6, y7,				\
		mem_tmp, rk, round, last_round)		\
	vpxor y7, y7, y7;				\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7,	\
		      y0, y7, y2, rk, 8, round);	\
							\
	aria_sbox_8way(x2, x3, x0, x1, x6, x7, x4, x5,	\
		       y0, y1, y2, y3, y4, y5, y6, y7);	\
							\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7,	\
		      y0, y7, y2, rk, 8, last_round);	\
							\
	aria_store_state_8way(x0, x1, x2, x3,		\
			      x4, x5, x6, x7,		\
			      mem_tmp, 8);		\
							\
	aria_load_state_8way(x0, x1, x2, x3,		\
			     x4, x5, x6, x7,		\
			     mem_tmp, 0);		\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7,	\
		      y0, y7, y2, rk, 0, round);	\
							\
	aria_sbox_8way(x2, x3, x0, x1, x6, x7, x4, x5,	\
		       y0, y1, y2, y3, y4, y5, y6, y7);	\
							\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7,	\
		      y0, y7, y2, rk, 0, last_round);	\
							\
	aria_load_state_8way(y0, y1, y2, y3,		\
			     y4, y5, y6, y7,		\
			     mem_tmp, 8);

#ifdef CONFIG_AS_GFNI
#define aria_fe_gfni(x0, x1, x2, x3,			\
		     x4, x5, x6, x7,			\
		     y0, y1, y2, y3,			\
		     y4, y5, y6, y7,			\
		     mem_tmp, rk, round)		\
	vpxor y7, y7, y7;				\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
		      y0, y7, y2, rk, 8, round);	\
						\
	aria_sbox_8way_gfni(x2, x3, x0, x1,	\
			    x6, x7, x4, x5,	\
			    y0, y1, y2, y3,	\
			    y4, y5, y6, y7);	\
						\
	aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3);	\
	aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3);	\
	aria_store_state_8way(x0, x1, x2, x3,	\
			      x4, x5, x6, x7,	\
			      mem_tmp, 8);	\
						\
	aria_load_state_8way(x0, x1, x2, x3,	\
			     x4, x5, x6, x7,	\
			     mem_tmp, 0);	\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
		      y0, y7, y2, rk, 0, round);	\
						\
	aria_sbox_8way_gfni(x2, x3, x0, x1,	\
			    x6, x7, x4, x5,	\
			    y0, y1, y2, y3,	\
			    y4, y5, y6, y7);	\
						\
	aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3);	\
	aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3);	\
	aria_store_state_8way(x0, x1, x2, x3,	\
			      x4, x5, x6, x7,	\
			      mem_tmp, 0);	\
	aria_load_state_8way(y0, y1, y2, y3,	\
			     y4, y5, y6, y7,	\
			     mem_tmp, 8);	\
	aria_diff_word(x0, x1, x2, x3,		\
		       x4, x5, x6, x7,		\
		       y0, y1, y2, y3,		\
		       y4, y5, y6, y7);		\
	/* aria_diff_byte()			\
	 * T3 = ABCD -> BADC			\
	 * T3 = y4, y5, y6, y7 -> y5, y4, y7, y6 \
	 * T0 = ABCD -> CDAB			\
	 * T0 = x0, x1, x2, x3 -> x2, x3, x0, x1 \
	 * T1 = ABCD -> DCBA			\
	 * T1 = x4, x5, x6, x7 -> x7, x6, x5, x4 \
	 */					\
	aria_diff_word(x2, x3, x0, x1,		\
		       x7, x6, x5, x4,		\
		       y0, y1, y2, y3,		\
		       y5, y4, y7, y6);		\
	aria_store_state_8way(x3, x2, x1, x0,	\
			      x6, x7, x4, x5,	\
			      mem_tmp, 0);

#define aria_fo_gfni(x0, x1, x2, x3,		\
		     x4, x5, x6, x7,		\
		     y0, y1, y2, y3,		\
		     y4, y5, y6, y7,		\
		     mem_tmp, rk, round)	\
	vpxor y7, y7, y7;			\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
		      y0, y7, y2, rk, 8, round);	\
						\
	aria_sbox_8way_gfni(x0, x1, x2, x3,	\
			    x4, x5, x6, x7,	\
			    y0, y1, y2, y3,	\
			    y4, y5, y6, y7);	\
						\
	aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3);	\
	aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3);	\
	aria_store_state_8way(x0, x1, x2, x3,	\
			      x4, x5, x6, x7,	\
			      mem_tmp, 8);	\
						\
	aria_load_state_8way(x0, x1, x2, x3,	\
			     x4, x5, x6, x7,	\
			     mem_tmp, 0);	\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
		      y0, y7, y2, rk, 0, round);	\
						\
	aria_sbox_8way_gfni(x0, x1, x2, x3,	\
			    x4, x5, x6, x7,	\
			    y0, y1, y2, y3,	\
			    y4, y5, y6, y7);	\
						\
	aria_diff_m(x0, x1, x2, x3, y0, y1, y2, y3);	\
	aria_diff_m(x4, x5, x6, x7, y0, y1, y2, y3);	\
	aria_store_state_8way(x0, x1, x2, x3,	\
			      x4, x5, x6, x7,	\
			      mem_tmp, 0);	\
	aria_load_state_8way(y0, y1, y2, y3,	\
			     y4, y5, y6, y7,	\
			     mem_tmp, 8);	\
	aria_diff_word(x0, x1, x2, x3,		\
		       x4, x5, x6, x7,		\
		       y0, y1, y2, y3,		\
		       y4, y5, y6, y7);		\
	/* aria_diff_byte()			\
	 * T1 = ABCD -> BADC			\
	 * T1 = x4, x5, x6, x7 -> x5, x4, x7, x6 \
	 * T2 = ABCD -> CDAB			\
	 * T2 = y0, y1, y2, y3 -> y2, y3, y0, y1 \
	 * T3 = ABCD -> DCBA			\
	 * T3 = y4, y5, y6, y7 -> y7, y6, y5, y4 \
	 */					\
	aria_diff_word(x0, x1, x2, x3,		\
		       x5, x4, x7, x6,		\
		       y2, y3, y0, y1,		\
		       y7, y6, y5, y4);		\
	aria_store_state_8way(x3, x2, x1, x0,	\
			      x6, x7, x4, x5,	\
			      mem_tmp, 0);

#define aria_ff_gfni(x0, x1, x2, x3,		\
		     x4, x5, x6, x7,		\
		     y0, y1, y2, y3,		\
		     y4, y5, y6, y7,		\
		     mem_tmp, rk, round, last_round) \
	vpxor y7, y7, y7;			\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
		      y0, y7, y2, rk, 8, round);	\
						\
	aria_sbox_8way_gfni(x2, x3, x0, x1,	\
			    x6, x7, x4, x5,	\
			    y0, y1, y2, y3,	\
			    y4, y5, y6, y7);	\
						\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
		      y0, y7, y2, rk, 8, last_round);	\
						\
	aria_store_state_8way(x0, x1, x2, x3,	\
			      x4, x5, x6, x7,	\
			      mem_tmp, 8);	\
						\
	aria_load_state_8way(x0, x1, x2, x3,	\
			     x4, x5, x6, x7,	\
			     mem_tmp, 0);	\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
		      y0, y7, y2, rk, 0, round);	\
						\
	aria_sbox_8way_gfni(x2, x3, x0, x1,	\
			    x6, x7, x4, x5,	\
			    y0, y1, y2, y3,	\
			    y4, y5, y6, y7);	\
						\
	aria_ark_8way(x0, x1, x2, x3, x4, x5, x6, x7, \
		      y0, y7, y2, rk, 0, last_round);	\
						\
	aria_load_state_8way(y0, y1, y2, y3,	\
			     y4, y5, y6, y7,	\
			     mem_tmp, 8);

#endif /* CONFIG_AS_GFNI */
/* NB: section is mergeable, all elements must be aligned 16-byte blocks */
.section	.rodata.cst16, "aM", @progbits, 16
.align 16

#define SHUFB_BYTES(idx) \
	0 + (idx), 4 + (idx), 8 + (idx), 12 + (idx)

.Lshufb_16x16b:
	.byte SHUFB_BYTES(0), SHUFB_BYTES(1), SHUFB_BYTES(2), SHUFB_BYTES(3);
/* For isolating SubBytes from AESENCLAST, inverse shift row */
.Linv_shift_row:
	.byte 0x00, 0x0d, 0x0a, 0x07, 0x04, 0x01, 0x0e, 0x0b
	.byte 0x08, 0x05, 0x02, 0x0f, 0x0c, 0x09, 0x06, 0x03
.Lshift_row:
	.byte 0x00, 0x05, 0x0a, 0x0f, 0x04, 0x09, 0x0e, 0x03
	.byte 0x08, 0x0d, 0x02, 0x07, 0x0c, 0x01, 0x06, 0x0b
/* For CTR-mode IV byteswap */
.Lbswap128_mask:
	.byte 0x0f, 0x0e, 0x0d, 0x0c, 0x0b, 0x0a, 0x09, 0x08
	.byte 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00
/* AES inverse affine and S2 combined:
 *      1 1 0 0 0 0 0 1     x0     0
 *      0 1 0 0 1 0 0 0     x1     0
 *      1 1 0 0 1 1 1 1     x2     0
 *      0 1 1 0 1 0 0 1     x3     1
 *      0 1 0 0 1 1 0 0  *  x4  +  0
 *      0 1 0 1 1 0 0 0     x5     0
 *      0 0 0 0 0 1 0 1     x6     0
 *      1 1 1 0 0 1 1 1     x7     1
 */
.Ltf_lo__inv_aff__and__s2:
	.octa 0x92172DA81A9FA520B2370D883ABF8500
.Ltf_hi__inv_aff__and__s2:
	.octa 0x2B15FFC1AF917B45E6D8320C625CB688
/* X2 and AES forward affine combined:
 *      1 0 1 1 0 0 0 1     x0     0
 *      0 1 1 1 1 0 1 1     x1     0
 *      0 0 0 1 1 0 1 0     x2     1
 *      0 1 0 0 0 1 0 0     x3     0
 *      0 0 1 1 1 0 1 1  *  x4  +  0
 *      0 1 0 0 1 0 0 0     x5     0
 *      1 1 0 1 0 0 1 1     x6     0
 *      0 1 0 0 1 0 1 0     x7     0
 */
.Ltf_lo__x2__and__fwd_aff:
	.octa 0xEFAE0544FCBD1657B8F95213ABEA4100
.Ltf_hi__x2__and__fwd_aff:
	.octa 0x3F893781E95FE1576CDA64D2BA0CB204

#ifdef CONFIG_AS_GFNI
/* AES affine: */
#define tf_aff_const	BV8(1, 1, 0, 0, 0, 1, 1, 0)

.Ltf_aff_bitmatrix:
	.quad BM8X8(BV8(1, 0, 0, 0, 1, 1, 1, 1),
		    BV8(1, 1, 0, 0, 0, 1, 1, 1),
		    BV8(1, 1, 1, 0, 0, 0, 1, 1),
		    BV8(1, 1, 1, 1, 0, 0, 0, 1),
		    BV8(1, 1, 1, 1, 1, 0, 0, 0),
		    BV8(0, 1, 1, 1, 1, 1, 0, 0),
		    BV8(0, 0, 1, 1, 1, 1, 1, 0),
		    BV8(0, 0, 0, 1, 1, 1, 1, 1))
	.quad BM8X8(BV8(1, 0, 0, 0, 1, 1, 1, 1),
		    BV8(1, 1, 0, 0, 0, 1, 1, 1),
		    BV8(1, 1, 1, 0, 0, 0, 1, 1),
		    BV8(1, 1, 1, 1, 0, 0, 0, 1),
		    BV8(1, 1, 1, 1, 1, 0, 0, 0),
		    BV8(0, 1, 1, 1, 1, 1, 0, 0),
		    BV8(0, 0, 1, 1, 1, 1, 1, 0),
		    BV8(0, 0, 0, 1, 1, 1, 1, 1))
/* AES inverse affine: */
#define tf_inv_const	BV8(1, 0, 1, 0, 0, 0, 0, 0)

.Ltf_inv_bitmatrix:
	.quad BM8X8(BV8(0, 0, 1, 0, 0, 1, 0, 1),
		    BV8(1, 0, 0, 1, 0, 0, 1, 0),
		    BV8(0, 1, 0, 0, 1, 0, 0, 1),
		    BV8(1, 0, 1, 0, 0, 1, 0, 0),
		    BV8(0, 1, 0, 1, 0, 0, 1, 0),
		    BV8(0, 0, 1, 0, 1, 0, 0, 1),
		    BV8(1, 0, 0, 1, 0, 1, 0, 0),
		    BV8(0, 1, 0, 0, 1, 0, 1, 0))
	.quad BM8X8(BV8(0, 0, 1, 0, 0, 1, 0, 1),
		    BV8(1, 0, 0, 1, 0, 0, 1, 0),
		    BV8(0, 1, 0, 0, 1, 0, 0, 1),
		    BV8(1, 0, 1, 0, 0, 1, 0, 0),
		    BV8(0, 1, 0, 1, 0, 0, 1, 0),
		    BV8(0, 0, 1, 0, 1, 0, 0, 1),
		    BV8(1, 0, 0, 1, 0, 1, 0, 0),
		    BV8(0, 1, 0, 0, 1, 0, 1, 0))
/* S2: */
#define tf_s2_const	BV8(0, 1, 0, 0, 0, 1, 1, 1)

.Ltf_s2_bitmatrix:
	.quad BM8X8(BV8(0, 1, 0, 1, 0, 1, 1, 1),
		    BV8(0, 0, 1, 1, 1, 1, 1, 1),
		    BV8(1, 1, 1, 0, 1, 1, 0, 1),
		    BV8(1, 1, 0, 0, 0, 0, 1, 1),
		    BV8(0, 1, 0, 0, 0, 0, 1, 1),
		    BV8(1, 1, 0, 0, 1, 1, 1, 0),
		    BV8(0, 1, 1, 0, 0, 0, 1, 1),
		    BV8(1, 1, 1, 1, 0, 1, 1, 0))
	.quad BM8X8(BV8(0, 1, 0, 1, 0, 1, 1, 1),
		    BV8(0, 0, 1, 1, 1, 1, 1, 1),
		    BV8(1, 1, 1, 0, 1, 1, 0, 1),
		    BV8(1, 1, 0, 0, 0, 0, 1, 1),
		    BV8(0, 1, 0, 0, 0, 0, 1, 1),
		    BV8(1, 1, 0, 0, 1, 1, 1, 0),
		    BV8(0, 1, 1, 0, 0, 0, 1, 1),
		    BV8(1, 1, 1, 1, 0, 1, 1, 0))
/* X2: */
#define tf_x2_const	BV8(0, 0, 1, 1, 0, 1, 0, 0)

.Ltf_x2_bitmatrix:
	.quad BM8X8(BV8(0, 0, 0, 1, 1, 0, 0, 0),
		    BV8(0, 0, 1, 0, 0, 1, 1, 0),
		    BV8(0, 0, 0, 0, 1, 0, 1, 0),
		    BV8(1, 1, 1, 0, 0, 0, 1, 1),
		    BV8(1, 1, 1, 0, 1, 1, 0, 0),
		    BV8(0, 1, 1, 0, 1, 0, 1, 1),
		    BV8(1, 0, 1, 1, 1, 1, 0, 1),
		    BV8(1, 0, 0, 1, 0, 0, 1, 1))
	.quad BM8X8(BV8(0, 0, 0, 1, 1, 0, 0, 0),
		    BV8(0, 0, 1, 0, 0, 1, 1, 0),
		    BV8(0, 0, 0, 0, 1, 0, 1, 0),
		    BV8(1, 1, 1, 0, 0, 0, 1, 1),
		    BV8(1, 1, 1, 0, 1, 1, 0, 0),
		    BV8(0, 1, 1, 0, 1, 0, 1, 1),
		    BV8(1, 0, 1, 1, 1, 1, 0, 1),
		    BV8(1, 0, 0, 1, 0, 0, 1, 1))
/* Identity matrix: */
.Ltf_id_bitmatrix:
	.quad BM8X8(BV8(1, 0, 0, 0, 0, 0, 0, 0),
		    BV8(0, 1, 0, 0, 0, 0, 0, 0),
		    BV8(0, 0, 1, 0, 0, 0, 0, 0),
		    BV8(0, 0, 0, 1, 0, 0, 0, 0),
		    BV8(0, 0, 0, 0, 1, 0, 0, 0),
		    BV8(0, 0, 0, 0, 0, 1, 0, 0),
		    BV8(0, 0, 0, 0, 0, 0, 1, 0),
		    BV8(0, 0, 0, 0, 0, 0, 0, 1))
	.quad BM8X8(BV8(1, 0, 0, 0, 0, 0, 0, 0),
		    BV8(0, 1, 0, 0, 0, 0, 0, 0),
		    BV8(0, 0, 1, 0, 0, 0, 0, 0),
		    BV8(0, 0, 0, 1, 0, 0, 0, 0),
		    BV8(0, 0, 0, 0, 1, 0, 0, 0),
		    BV8(0, 0, 0, 0, 0, 1, 0, 0),
		    BV8(0, 0, 0, 0, 0, 0, 1, 0),
		    BV8(0, 0, 0, 0, 0, 0, 0, 1))
#endif /* CONFIG_AS_GFNI */
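/* Illustration (not part of the assembly): each .Ltf_*_bitmatrix above feeds
 * the GFNI gf2p8affineqb instruction, which applies an 8x8 bit-matrix to each
 * byte over GF(2) and XORs in a constant: output bit i is the parity of
 * (row_i AND x), XORed with constant bit i. The Python sketch below shows
 * that operation with a simplified row/bit ordering; the BV8()/BM8X8() macros
 * pack the same data in the instruction's own layout. */

```python
# Sketch of the per-byte GF(2) affine transform computed by gf2p8affineqb:
# output bit i = parity(row[i] AND x) XOR constant bit i.
# Bit/row ordering here is simplified for illustration.

def gf2_affine(x, rows, const):
    """Apply an 8x8 bit-matrix to byte x over GF(2), then XOR a constant."""
    out = 0
    for i, row in enumerate(rows):
        bit = bin(row & x).count("1") & 1   # parity of the selected bits
        out |= bit << i
    return out ^ const

# Identity matrix (like .Ltf_id_bitmatrix): one diagonal bit per row.
identity = [1 << i for i in range(8)]

assert gf2_affine(0xA5, identity, 0x00) == 0xA5   # identity maps x to x
assert gf2_affine(0x00, identity, 0x63) == 0x63   # zero input yields the constant
```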
/* 4-bit mask */
.section	.rodata.cst4.L0f0f0f0f, "aM", @progbits, 4
.align 4
.L0f0f0f0f:
.long 0x0f0f0f0f
.text
SYM_FUNC_START_LOCAL(__aria_aesni_avx_crypt_16way)
	/* input:
	 *	%r9: rk
	 *	%rsi: dst
	 *	%rdx: src
	 *	%xmm0..%xmm15: 16 byte-sliced blocks
	 */

	FRAME_BEGIN

	movq %rsi, %rax;
	leaq 8 * 16(%rax), %r8;

	inpack16_post(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		      %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		      %xmm15, %rax, %r8);
	aria_fo(%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15,
		%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		%rax, %r9, 0);
	aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		%xmm15, %rax, %r9, 1);
	aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
		%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		%rax, %r9, 2);
	aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		%xmm15, %rax, %r9, 3);
	aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
		%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		%rax, %r9, 4);
	aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		%xmm15, %rax, %r9, 5);
	aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
		%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		%rax, %r9, 6);
	aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		%xmm15, %rax, %r9, 7);
	aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
		%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		%rax, %r9, 8);
	aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		%xmm15, %rax, %r9, 9);
	aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
		%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		%rax, %r9, 10);
	cmpl $12, ARIA_CTX_rounds(CTX);
	jne .Laria_192;
	aria_ff(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		%xmm15, %rax, %r9, 11, 12);
	jmp .Laria_end;
.Laria_192:
	aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		%xmm15, %rax, %r9, 11);
	aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
		%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		%rax, %r9, 12);
	cmpl $14, ARIA_CTX_rounds(CTX);
	jne .Laria_256;
	aria_ff(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		%xmm15, %rax, %r9, 13, 14);
	jmp .Laria_end;
.Laria_256:
	aria_fe(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		%xmm15, %rax, %r9, 13);
	aria_fo(%xmm9, %xmm8, %xmm11, %xmm10, %xmm12, %xmm13, %xmm14, %xmm15,
		%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		%rax, %r9, 14);
	aria_ff(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		%xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		%xmm15, %rax, %r9, 15, 16);
.Laria_end:
	debyteslice_16x16b(%xmm8, %xmm12, %xmm1, %xmm4,
			   %xmm9, %xmm13, %xmm0, %xmm5,
			   %xmm10, %xmm14, %xmm3, %xmm6,
			   %xmm11, %xmm15, %xmm2, %xmm7,
			   (%rax), (%r8));

	FRAME_END
	RET;
SYM_FUNC_END(__aria_aesni_avx_crypt_16way)
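/* Illustration (not part of the assembly): the cmpl/jne chain above selects
 * the tail of the round schedule from ARIA_CTX_rounds. Per RFC 5794, ARIA
 * runs 12, 14 or 16 rounds for 128-, 192- and 256-bit keys; rounds 1..10 are
 * always executed, then the tail ends with the aria_ff final-round pair. The
 * hedged Python sketch below models that dispatch; the macro names are taken
 * from this file but the function itself is illustrative only. */

```python
# Sketch of the round-count dispatch done above with cmpl $12 / cmpl $14:
# model the tail schedule after the 10 common rounds as (macro, round...) tuples.

def aria_round_schedule(rounds):
    """Return the tail of the schedule after the 10 common rounds."""
    assert rounds in (12, 14, 16)   # 128/192/256-bit keys (RFC 5794)
    tail = []
    if rounds == 12:
        tail.append(("aria_ff", 11, 12))              # final odd/even pair
    else:
        tail += [("aria_fe", 11), ("aria_fo", 12)]    # .Laria_192 path
        if rounds == 14:
            tail.append(("aria_ff", 13, 14))
        else:                                         # .Laria_256 path
            tail += [("aria_fe", 13), ("aria_fo", 14), ("aria_ff", 15, 16)]
    return tail

assert aria_round_schedule(12) == [("aria_ff", 11, 12)]
assert aria_round_schedule(16)[-1] == ("aria_ff", 15, 16)
```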
SYM_TYPED_FUNC_START(aria_aesni_avx_encrypt_16way)
	/* input:
	 *	%rdi: ctx, CTX
	 *	%rsi: dst
	 *	%rdx: src
	 */

	FRAME_BEGIN
	leaq ARIA_CTX_enc_key(CTX), %r9;
	inpack16_pre(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %rdx);

	call __aria_aesni_avx_crypt_16way;

	write_output(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax);

	FRAME_END
	RET;
SYM_FUNC_END(aria_aesni_avx_encrypt_16way)
SYM_TYPED_FUNC_START(aria_aesni_avx_decrypt_16way)
	/* input:
	 *	%rdi: ctx, CTX
	 *	%rsi: dst
	 *	%rdx: src
	 */

	FRAME_BEGIN
	leaq ARIA_CTX_dec_key(CTX), %r9;
	inpack16_pre(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %rdx);

	call __aria_aesni_avx_crypt_16way;

	write_output(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax);

	FRAME_END
	RET;
SYM_FUNC_END(aria_aesni_avx_decrypt_16way)
SYM_FUNC_START_LOCAL(__aria_aesni_avx_ctr_gen_keystream_16way)
	/* input:
	 *      %rdi: ctx
	 *      %rsi: dst
	 *      %rdx: src
	 *      %rcx: keystream
	 *      %r8: iv (big endian, 128bit)
	 */

	FRAME_BEGIN
	/* load IV and byteswap */
	vmovdqu (%r8), %xmm8;
	vmovdqa .Lbswap128_mask (%rip), %xmm1;
	vpshufb %xmm1, %xmm8, %xmm3; /* be => le */

	vpcmpeqd %xmm0, %xmm0, %xmm0;
	vpsrldq $8, %xmm0, %xmm0; /* low: -1, high: 0 */

	/* construct IVs */
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm9;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm10;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm11;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm12;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm13;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm14;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm15;
	vmovdqu %xmm8, (0 * 16)(%rcx);
	vmovdqu %xmm9, (1 * 16)(%rcx);
	vmovdqu %xmm10, (2 * 16)(%rcx);
	vmovdqu %xmm11, (3 * 16)(%rcx);
	vmovdqu %xmm12, (4 * 16)(%rcx);
	vmovdqu %xmm13, (5 * 16)(%rcx);
	vmovdqu %xmm14, (6 * 16)(%rcx);
	vmovdqu %xmm15, (7 * 16)(%rcx);

	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm8;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm9;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm10;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm11;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm12;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm13;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm14;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm15;
	inc_le128(%xmm3, %xmm0, %xmm5); /* +1 */
	vpshufb %xmm1, %xmm3, %xmm4;
	vmovdqu %xmm4, (%r8);

	vmovdqu (0 * 16)(%rcx), %xmm0;
	vmovdqu (1 * 16)(%rcx), %xmm1;
	vmovdqu (2 * 16)(%rcx), %xmm2;
	vmovdqu (3 * 16)(%rcx), %xmm3;
	vmovdqu (4 * 16)(%rcx), %xmm4;
	vmovdqu (5 * 16)(%rcx), %xmm5;
	vmovdqu (6 * 16)(%rcx), %xmm6;
	vmovdqu (7 * 16)(%rcx), %xmm7;

	FRAME_END
	RET;
SYM_FUNC_END(__aria_aesni_avx_ctr_gen_keystream_16way)
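The net effect of the byteswap/inc_le128/byteswap sequence above is a big-endian 128-bit counter: the IV is swapped to little-endian once, incremented sixteen times, and each value is swapped back before being stored. A minimal Python sketch of that net behavior (hypothetical helper name, assuming the big-endian IV layout stated in the comment):

```python
def ctr_iv_blocks(iv_be: bytes, nblocks: int):
    """Treat the 128-bit big-endian IV as an integer and emit
    iv, iv+1, ... (mod 2^128), each re-encoded big-endian."""
    ctr = int.from_bytes(iv_be, "big")
    return [((ctr + i) % (1 << 128)).to_bytes(16, "big")
            for i in range(nblocks)]

blocks = ctr_iv_blocks(b"\x00" * 15 + b"\xff", 3)
assert blocks[1] == b"\x00" * 14 + b"\x01\x00"  # 0xff + 1 carries
```

The assembly avoids the integer round-trip by using vpshufb for the endian swap and a carry-aware 2x64-bit subtract-of-minus-one (inc_le128) for the increment.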
SYM_TYPED_FUNC_START(aria_aesni_avx_ctr_crypt_16way)
	/* input:
	 *      %rdi: ctx
	 *      %rsi: dst
	 *      %rdx: src
	 *      %rcx: keystream
	 *      %r8: iv (big endian, 128bit)
	 */
	FRAME_BEGIN

	call __aria_aesni_avx_ctr_gen_keystream_16way;

	leaq (%rsi), %r10;
	leaq (%rdx), %r11;
	leaq (%rcx), %rsi;
	leaq (%rcx), %rdx;
	leaq ARIA_CTX_enc_key(CTX), %r9;
	call __aria_aesni_avx_crypt_16way;

	vpxor (0 * 16)(%r11), %xmm1, %xmm1;
	vpxor (1 * 16)(%r11), %xmm0, %xmm0;
	vpxor (2 * 16)(%r11), %xmm3, %xmm3;
	vpxor (3 * 16)(%r11), %xmm2, %xmm2;
	vpxor (4 * 16)(%r11), %xmm4, %xmm4;
	vpxor (5 * 16)(%r11), %xmm5, %xmm5;
	vpxor (6 * 16)(%r11), %xmm6, %xmm6;
	vpxor (7 * 16)(%r11), %xmm7, %xmm7;
	vpxor (8 * 16)(%r11), %xmm8, %xmm8;
	vpxor (9 * 16)(%r11), %xmm9, %xmm9;
	vpxor (10 * 16)(%r11), %xmm10, %xmm10;
	vpxor (11 * 16)(%r11), %xmm11, %xmm11;
	vpxor (12 * 16)(%r11), %xmm12, %xmm12;
	vpxor (13 * 16)(%r11), %xmm13, %xmm13;
	vpxor (14 * 16)(%r11), %xmm14, %xmm14;
	vpxor (15 * 16)(%r11), %xmm15, %xmm15;
	write_output(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %r10);

	FRAME_END
	RET;
SYM_FUNC_END(aria_aesni_avx_ctr_crypt_16way)
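The vpxor chain above is plain CTR combining: the encrypted counter blocks are xored into the source data (the same code therefore encrypts and decrypts). A minimal sketch with a hypothetical helper name:

```python
def xor_blocks(keystream: bytes, data: bytes) -> bytes:
    """CTR mode: output = E(counter blocks) XOR input, as the vpxor
    chain does 16 bytes per xmm register."""
    return bytes(k ^ d for k, d in zip(keystream, data))

assert xor_blocks(b"\x0f\xf0", b"\xff\x00") == b"\xf0\xf0"
```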
#ifdef CONFIG_AS_GFNI
SYM_FUNC_START_LOCAL(__aria_aesni_avx_gfni_crypt_16way)
	/* input:
	 *      %r9: rk
	 *      %rsi: dst
	 *      %rdx: src
	 *      %xmm0..%xmm15: 16 byte-sliced blocks
	 */

	FRAME_BEGIN

	movq %rsi, %rax;
	leaq 8 * 16(%rax), %r8;

	inpack16_post(%xmm0, %xmm1, %xmm2, %xmm3,
		      %xmm4, %xmm5, %xmm6, %xmm7,
		      %xmm8, %xmm9, %xmm10, %xmm11,
		      %xmm12, %xmm13, %xmm14,
		      %xmm15, %rax, %r8);
	aria_fo_gfni(%xmm8, %xmm9, %xmm10, %xmm11,
		     %xmm12, %xmm13, %xmm14, %xmm15,
		     %xmm0, %xmm1, %xmm2, %xmm3,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %rax, %r9, 0);
	aria_fe_gfni(%xmm1, %xmm0, %xmm3, %xmm2,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11,
		     %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax, %r9, 1);
	aria_fo_gfni(%xmm9, %xmm8, %xmm11, %xmm10,
		     %xmm12, %xmm13, %xmm14, %xmm15,
		     %xmm0, %xmm1, %xmm2, %xmm3,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %rax, %r9, 2);
	aria_fe_gfni(%xmm1, %xmm0, %xmm3, %xmm2,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11,
		     %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax, %r9, 3);
	aria_fo_gfni(%xmm9, %xmm8, %xmm11, %xmm10,
		     %xmm12, %xmm13, %xmm14, %xmm15,
		     %xmm0, %xmm1, %xmm2, %xmm3,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %rax, %r9, 4);
	aria_fe_gfni(%xmm1, %xmm0, %xmm3, %xmm2,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11,
		     %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax, %r9, 5);
	aria_fo_gfni(%xmm9, %xmm8, %xmm11, %xmm10,
		     %xmm12, %xmm13, %xmm14, %xmm15,
		     %xmm0, %xmm1, %xmm2, %xmm3,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %rax, %r9, 6);
	aria_fe_gfni(%xmm1, %xmm0, %xmm3, %xmm2,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11,
		     %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax, %r9, 7);
	aria_fo_gfni(%xmm9, %xmm8, %xmm11, %xmm10,
		     %xmm12, %xmm13, %xmm14, %xmm15,
		     %xmm0, %xmm1, %xmm2, %xmm3,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %rax, %r9, 8);
	aria_fe_gfni(%xmm1, %xmm0, %xmm3, %xmm2,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11,
		     %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax, %r9, 9);
	aria_fo_gfni(%xmm9, %xmm8, %xmm11, %xmm10,
		     %xmm12, %xmm13, %xmm14, %xmm15,
		     %xmm0, %xmm1, %xmm2, %xmm3,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %rax, %r9, 10);
	cmpl $12, ARIA_CTX_rounds(CTX);
	jne .Laria_gfni_192;
	aria_ff_gfni(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax, %r9, 11, 12);
	jmp .Laria_gfni_end;
.Laria_gfni_192:
	aria_fe_gfni(%xmm1, %xmm0, %xmm3, %xmm2,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11,
		     %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax, %r9, 11);
	aria_fo_gfni(%xmm9, %xmm8, %xmm11, %xmm10,
		     %xmm12, %xmm13, %xmm14, %xmm15,
		     %xmm0, %xmm1, %xmm2, %xmm3,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %rax, %r9, 12);
	cmpl $14, ARIA_CTX_rounds(CTX);
	jne .Laria_gfni_256;
	aria_ff_gfni(%xmm1, %xmm0, %xmm3, %xmm2,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11,
		     %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax, %r9, 13, 14);
	jmp .Laria_gfni_end;
.Laria_gfni_256:
	aria_fe_gfni(%xmm1, %xmm0, %xmm3, %xmm2,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11,
		     %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax, %r9, 13);
	aria_fo_gfni(%xmm9, %xmm8, %xmm11, %xmm10,
		     %xmm12, %xmm13, %xmm14, %xmm15,
		     %xmm0, %xmm1, %xmm2, %xmm3,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %rax, %r9, 14);
	aria_ff_gfni(%xmm1, %xmm0, %xmm3, %xmm2,
		     %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11,
		     %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax, %r9, 15, 16);
.Laria_gfni_end:
	debyteslice_16x16b(%xmm8, %xmm12, %xmm1, %xmm4,
			   %xmm9, %xmm13, %xmm0, %xmm5,
			   %xmm10, %xmm14, %xmm3, %xmm6,
			   %xmm11, %xmm15, %xmm2, %xmm7,
			   (%rax), (%r8));

	FRAME_END
	RET;
SYM_FUNC_END(__aria_aesni_avx_gfni_crypt_16way)
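The branchy dispatch above selects between 12, 14, and 16 rounds (ARIA-128/192/256): aria_fo_gfni and aria_fe_gfni alternate, and a final aria_ff_gfni round consumes two round keys. A sketch of that schedule (hypothetical helper, not kernel code):

```python
def aria_round_schedule(rounds: int):
    """Alternating fo/fe rounds, ending with a final ff round that
    consumes round keys rk[rounds-1] and rk[rounds]."""
    assert rounds in (12, 14, 16)
    ops = [("fo" if r % 2 == 0 else "fe", r) for r in range(rounds - 1)]
    ops.append(("ff", rounds - 1))  # ff takes key indices rounds-1, rounds
    return ops

sched = aria_round_schedule(12)
assert sched[0] == ("fo", 0)
assert sched[-1] == ("ff", 11)   # matches the 11, 12 key pair above
```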
SYM_TYPED_FUNC_START(aria_aesni_avx_gfni_encrypt_16way)
	/* input:
	 *      %rdi: ctx, CTX
	 *      %rsi: dst
	 *      %rdx: src
	 */
	FRAME_BEGIN
	leaq ARIA_CTX_enc_key(CTX), %r9;
	inpack16_pre(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %rdx);

	call __aria_aesni_avx_gfni_crypt_16way;

	write_output(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax);

	FRAME_END
	RET;
SYM_FUNC_END(aria_aesni_avx_gfni_encrypt_16way)
SYM_TYPED_FUNC_START(aria_aesni_avx_gfni_decrypt_16way)
	/* input:
	 *      %rdi: ctx, CTX
	 *      %rsi: dst
	 *      %rdx: src
	 */
	FRAME_BEGIN
	leaq ARIA_CTX_dec_key(CTX), %r9;
	inpack16_pre(%xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %rdx);

	call __aria_aesni_avx_gfni_crypt_16way;

	write_output(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %rax);

	FRAME_END
	RET;
SYM_FUNC_END(aria_aesni_avx_gfni_decrypt_16way)
SYM_TYPED_FUNC_START(aria_aesni_avx_gfni_ctr_crypt_16way)
	/* input:
	 *	%rdi: ctx
	 *	%rsi: dst
	 *	%rdx: src
	 *	%rcx: keystream
	 *	%r8: iv (big endian, 128bit)
	 */
	FRAME_BEGIN

	call __aria_aesni_avx_ctr_gen_keystream_16way

	/* Save dst/src, then point the crypt routine at the keystream
	 * buffer as both its source and destination. */
	leaq (%rsi), %r10;
	leaq (%rdx), %r11;
	leaq (%rcx), %rsi;
	leaq (%rcx), %rdx;
	leaq ARIA_CTX_enc_key(CTX), %r9;
	call __aria_aesni_avx_gfni_crypt_16way;

	/* XOR the encrypted counter blocks with the source to produce dst. */
	vpxor (0 * 16)(%r11), %xmm1, %xmm1;
	vpxor (1 * 16)(%r11), %xmm0, %xmm0;
	vpxor (2 * 16)(%r11), %xmm3, %xmm3;
	vpxor (3 * 16)(%r11), %xmm2, %xmm2;
	vpxor (4 * 16)(%r11), %xmm4, %xmm4;
	vpxor (5 * 16)(%r11), %xmm5, %xmm5;
	vpxor (6 * 16)(%r11), %xmm6, %xmm6;
	vpxor (7 * 16)(%r11), %xmm7, %xmm7;
	vpxor (8 * 16)(%r11), %xmm8, %xmm8;
	vpxor (9 * 16)(%r11), %xmm9, %xmm9;
	vpxor (10 * 16)(%r11), %xmm10, %xmm10;
	vpxor (11 * 16)(%r11), %xmm11, %xmm11;
	vpxor (12 * 16)(%r11), %xmm12, %xmm12;
	vpxor (13 * 16)(%r11), %xmm13, %xmm13;
	vpxor (14 * 16)(%r11), %xmm14, %xmm14;
	vpxor (15 * 16)(%r11), %xmm15, %xmm15;

	write_output(%xmm1, %xmm0, %xmm3, %xmm2, %xmm4, %xmm5, %xmm6, %xmm7,
		     %xmm8, %xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14,
		     %xmm15, %r10);

	FRAME_END
	RET;
SYM_FUNC_END(aria_aesni_avx_gfni_ctr_crypt_16way)
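/*
 * The CTR flow above (generate keystream counters from the big-endian
 * 128-bit IV, encrypt them, then XOR with the source blocks) can be
 * sketched in portable C. The block cipher below is a hypothetical
 * stand-in, not ARIA, and the function names are illustrative; only the
 * CTR wiring mirrors the assembly:
 */

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the 16-way ARIA block encryption; a real
 * implementation would run the full round function over the key schedule. */
static void toy_block_encrypt(uint8_t out[16], const uint8_t in[16])
{
	for (int i = 0; i < 16; i++)
		out[i] = in[i] ^ 0x5a;
}

/* Big-endian 128-bit counter increment, matching the "iv (big endian,
 * 128bit)" convention in the function's input comment. */
static void ctr128_inc(uint8_t ctr[16])
{
	for (int i = 15; i >= 0; i--)
		if (++ctr[i] != 0)
			break;
}

/* CTR mode: keystream block = E(counter); dst = src ^ keystream. */
static void ctr_crypt(uint8_t *dst, const uint8_t *src, size_t nblocks,
		      uint8_t ctr[16])
{
	uint8_t ks[16];

	for (size_t b = 0; b < nblocks; b++) {
		toy_block_encrypt(ks, ctr);
		ctr128_inc(ctr);
		for (int i = 0; i < 16; i++)
			dst[b * 16 + i] = src[b * 16 + i] ^ ks[i];
	}
}
```

/* Because CTR only XORs a keystream into the data, the same routine both
 * encrypts and decrypts, which is why the assembly uses the encryption key
 * schedule (ARIA_CTX_enc_key) in both directions. */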
#endif /* CONFIG_AS_GFNI */