/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
 * Twofish Cipher 3-way parallel algorithm (x86_64)
 *
 * Copyright (C) 2011 Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
 */

#include <linux/linkage.h>

.file "twofish-x86_64-asm-3way.S"
.text
/* structure of crypto context */
#define s0	0
#define s1	1024
#define s2	2048
#define s3	3072
#define w	4096
#define k	4128
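
The byte offsets above are consistent with a context made of four 256-entry tables of 32-bit words, followed by 8 whitening words and 32 round-key words. That layout is an assumption here, since the C definition of the context is not part of this file, and `TwofishCtx` below is a hypothetical stand-in. A minimal Python sketch that recomputes the offsets under that assumption:

```python
import ctypes


class TwofishCtx(ctypes.Structure):
    # Hypothetical mirror of the crypto context (assumed layout):
    # four S-box tables of 256 u32 each, 8 whitening words, 32 round keys.
    _fields_ = [
        ("s", (ctypes.c_uint32 * 256) * 4),
        ("w", ctypes.c_uint32 * 8),
        ("k", ctypes.c_uint32 * 32),
    ]


# Size in bytes of one 256-entry table of 32-bit words.
S_STRIDE = ctypes.sizeof(ctypes.c_uint32 * 256)

OFFSETS = {
    "s0": TwofishCtx.s.offset + 0 * S_STRIDE,
    "s1": TwofishCtx.s.offset + 1 * S_STRIDE,
    "s2": TwofishCtx.s.offset + 2 * S_STRIDE,
    "s3": TwofishCtx.s.offset + 3 * S_STRIDE,
    "w": TwofishCtx.w.offset,
    "k": TwofishCtx.k.offset,
}

for name, offset in OFFSETS.items():
    print(f"#define {name}\t{offset}")
```

If the assumed layout is right, the printed values match the `#define` lines above; a mismatch would suggest the context structure has changed and the offsets need updating.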

/**********************************************************************
  3-way twofish
 **********************************************************************/
#define CTX %rdi
#define RIO %rdx

#define RAB0 %rax
#define RAB1 %rbx
#define RAB2 %rcx

#define RAB0d %eax
#define RAB1d %ebx
#define RAB2d %ecx

#define RAB0bh %ah
#define RAB1bh %bh
#define RAB2bh %ch

#define RAB0bl %al
#define RAB1bl %bl
#define RAB2bl %cl

crypto: x86/twofish-3way - Fix %rbp usage
Using %rbp as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.
In twofish-3way, we can't simply replace %rbp with another register
because there are none available. Instead, we use the stack to hold the
values that %rbp, %r11, and %r12 were holding previously. Each of these
values represents the half of the output from the previous Feistel round
that is being passed on unchanged to the following round. They are only
used once per round, when they are exchanged with %rax, %rbx, and %rcx.
As a result, we free up 3 registers (one per block) and can reassign
them so that %rbp is not used, and additionally %r14 and %r15 are not
used so they do not need to be saved/restored.
There may be a small overhead caused by replacing 'xchg REG, REG' with
the needed sequence 'mov MEM, REG; mov REG, MEM; mov REG, REG' once per
round. But, counterintuitively, when I tested "ctr-twofish-3way" on a
Haswell processor, the new version was actually about 2% faster.
(Perhaps 'xchg' is not as well optimized as plain moves.)
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-12-18 16:40:26 -08:00

#define CD0 0x0(%rsp)
#define CD1 0x8(%rsp)
#define CD2 0x10(%rsp)

# used only before/after all rounds
#define RCD0 %r8
#define RCD1 %r9
#define RCD2 %r10

# used only during rounds
#define RX0 %r8
#define RX1 %r9
#define RX2 %r10

#define RX0d %r8d
#define RX1d %r9d
#define RX2d %r10d

#define RY0 %r11
#define RY1 %r12
#define RY2 %r13

#define RY0d %r11d
#define RY1d %r12d
#define RY2d %r13d

#define RT0 %rdx
#define RT1 %rsi

#define RT0d %edx
#define RT1d %esi

#define RT1bl %sil

#define do16bit_ror(rot, op1, op2, T0, T1, tmp1, tmp2, ab, dst) \
	movzbl ab ## bl,		tmp2 ## d; \
	movzbl ab ## bh,		tmp1 ## d; \
	rorq $(rot),			ab; \
	op1##l T0(CTX, tmp2, 4),	dst ## d; \
	op2##l T1(CTX, tmp1, 4),	dst ## d;

#define swap_ab_with_cd(ab, cd, tmp)	\
	movq cd, tmp;			\
	movq ab, cd;			\
	movq tmp, ab;
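
The macro above exchanges a register with a memory operand through a scratch register, using three plain moves instead of a single exchange instruction. As a rough illustration only, the Python sketch below models registers and stack slots as dictionaries; the dictionary model and the concrete values are assumptions made for the example, not part of the assembly.

```python
def swap_ab_with_cd(regs, stack, ab, cd, tmp="RT0"):
    """Exchange register `ab` with stack slot `cd` using three plain moves."""
    regs[tmp] = stack[cd]   # movq cd, tmp   (MEM -> REG)
    stack[cd] = regs[ab]    # movq ab, cd    (REG -> MEM)
    regs[ab] = regs[tmp]    # movq tmp, ab   (REG -> REG)


# Illustrative values only.
regs = {"RAB0": 0x1111, "RT0": 0}
stack = {"CD0": 0x2222}
swap_ab_with_cd(regs, stack, "RAB0", "CD0")
print(hex(regs["RAB0"]), hex(stack["CD0"]))
```

Note that the scratch register is clobbered by the swap, which is why the macro takes it as an explicit argument.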

/*
 * Combined G1 & G2 function. Reordered with help of rotates to have moves
 * at beginning.
 */
#define g1g2_3(ab, cd, Tx0, Tx1, Tx2, Tx3, Ty0, Ty1, Ty2, Ty3, x, y) \
	/* G1,1 && G2,1 */ \
	do16bit_ror(32, mov, xor, Tx0, Tx1, RT0, x ## 0, ab ## 0, x ## 0); \
	do16bit_ror(48, mov, xor, Ty1, Ty2, RT0, y ## 0, ab ## 0, y ## 0); \
	\
	do16bit_ror(32, mov, xor, Tx0, Tx1, RT0, x ## 1, ab ## 1, x ## 1); \
	do16bit_ror(48, mov, xor, Ty1, Ty2, RT0, y ## 1, ab ## 1, y ## 1); \
	\
	do16bit_ror(32, mov, xor, Tx0, Tx1, RT0, x ## 2, ab ## 2, x ## 2); \
	do16bit_ror(48, mov, xor, Ty1, Ty2, RT0, y ## 2, ab ## 2, y ## 2); \
	\
	/* G1,2 && G2,2 */ \
	do16bit_ror(32, xor, xor, Tx2, Tx3, RT0, RT1, ab ## 0, x ## 0); \
	do16bit_ror(16, xor, xor, Ty3, Ty0, RT0, RT1, ab ## 0, y ## 0); \
	swap_ab_with_cd(ab ## 0, cd ## 0, RT0); \
	\
	do16bit_ror(32, xor, xor, Tx2, Tx3, RT0, RT1, ab ## 1, x ## 1); \
	do16bit_ror(16, xor, xor, Ty3, Ty0, RT0, RT1, ab ## 1, y ## 1); \
	swap_ab_with_cd(ab ## 1, cd ## 1, RT0); \
	\
	do16bit_ror(32, xor, xor, Tx2, Tx3, RT0, RT1, ab ## 2, x ## 2); \
	do16bit_ror(16, xor, xor, Ty3, Ty0, RT0, RT1, ab ## 2, y ## 2); \
	swap_ab_with_cd(ab ## 2, cd ## 2, RT0);

#define enc_round_end(ab, x, y, n) \
	addl y ## d,			x ## d; \
	addl x ## d,			y ## d; \
	addl k+4*(2*(n))(CTX),		x ## d; \
	xorl ab ## d,			x ## d; \
	addl k+4*(2*(n)+1)(CTX),	y ## d; \
	shrq $32,			ab; \
	roll $1,			ab ## d; \
	xorl y ## d,			ab ## d; \
	shlq $32,			ab; \
	rorl $1,			x ## d; \
	orq x,				ab;
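
To make the data flow of `enc_round_end` easier to follow, here is a small Python model of the same instruction sequence. It is a sketch under stated assumptions: the 64-bit register `ab` is treated as two packed 32-bit words (low and high), `k_even` and `k_odd` stand in for the two round-key words loaded from the context, and the function and helper names are invented for this illustration.

```python
MASK32 = 0xFFFFFFFF


def rol32(v, n):
    """Rotate a 32-bit value left by n bits."""
    return ((v << n) | (v >> (32 - n))) & MASK32


def ror32(v, n):
    """Rotate a 32-bit value right by n bits."""
    return ((v >> n) | (v << (32 - n))) & MASK32


def enc_round_end(ab, x, y, k_even, k_odd):
    """Model of the macro: returns the new 64-bit value of `ab`."""
    x = (x + y) & MASK32        # addl y##d, x##d
    y = (y + x) & MASK32        # addl x##d, y##d
    x = (x + k_even) & MASK32   # addl k+4*(2*(n))(CTX), x##d
    x ^= ab & MASK32            # xorl ab##d, x##d
    y = (y + k_odd) & MASK32    # addl k+4*(2*(n)+1)(CTX), y##d
    hi = ab >> 32               # shrq $32, ab
    hi = rol32(hi, 1) ^ y       # roll $1, ab##d; xorl y##d, ab##d
    # shlq $32, ab; rorl $1, x##d; orq x, ab
    return (hi << 32) | ror32(x, 1)


# Illustrative inputs only (zero round keys).
example = enc_round_end(0, 1, 2, 0, 0)
print(hex(example))
```

The low word ends up XORed with the first mixed value and rotated right by one, while the high word is rotated left by one and then XORed with the second mixed value.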

#define dec_round_end(ba, x, y, n) \
	addl y ## d,			x ## d; \
	addl x ## d,			y ## d; \
	addl k+4*(2*(n))(CTX),		x ## d; \
	addl k+4*(2*(n)+1)(CTX),	y ## d; \
	xorl ba ## d,			y ## d; \
	shrq $32,			ba; \
	roll $1,			ba ## d; \
	xorl x ## d,			ba ## d; \
	shlq $32,			ba; \
	rorl $1,			y ## d; \
	orq y,				ba;

#define encrypt_round3(ab, cd, n) \
	g1g2_3(ab, cd, s0, s1, s2, s3, s0, s1, s2, s3, RX, RY); \
	\
	enc_round_end(ab ## 0, RX0, RY0, n); \
	enc_round_end(ab ## 1, RX1, RY1, n); \
	enc_round_end(ab ## 2, RX2, RY2, n);

#define decrypt_round3(ba, dc, n) \
	g1g2_3(ba, dc, s1, s2, s3, s0, s3, s0, s1, s2, RY, RX); \
	\
	dec_round_end(ba ## 0, RX0, RY0, n); \
	dec_round_end(ba ## 1, RX1, RY1, n); \
	dec_round_end(ba ## 2, RX2, RY2, n);

#define encrypt_cycle3(ab, cd, n) \
	encrypt_round3(ab, cd, n*2); \
	encrypt_round3(ab, cd, (n*2)+1);

#define decrypt_cycle3(ba, dc, n) \
	decrypt_round3(ba, dc, (n*2)+1); \
	decrypt_round3(ba, dc, (n*2));

#define push_cd()	\
	pushq RCD2;	\
	pushq RCD1;	\
	pushq RCD0;

#define pop_cd()	\
	popq RCD0;	\
	popq RCD1;	\
	popq RCD2;

#define inpack3(in, n, xy, m) \
	movq 4*(n)(in),			xy ## 0; \
	xorq w+4*m(CTX),		xy ## 0; \
	\
	movq 4*(4+(n))(in),		xy ## 1; \
	xorq w+4*m(CTX),		xy ## 1; \
	\
	movq 4*(8+(n))(in),		xy ## 2; \
	xorq w+4*m(CTX),		xy ## 2;

#define outunpack3(op, out, n, xy, m) \
	xorq w+4*m(CTX),		xy ## 0; \
	op ## q xy ## 0,		4*(n)(out); \
	\
	xorq w+4*m(CTX),		xy ## 1; \
	op ## q xy ## 1,		4*(4+(n))(out); \
	\
	xorq w+4*m(CTX),		xy ## 2; \
	op ## q xy ## 2,		4*(8+(n))(out);

#define inpack_enc3() \
	inpack3(RIO, 0, RAB, 0); \
	inpack3(RIO, 2, RCD, 2);

#define outunpack_enc3(op) \
	outunpack3(op, RIO, 2, RAB, 6); \
	outunpack3(op, RIO, 0, RCD, 4);

#define inpack_dec3() \
	inpack3(RIO, 0, RAB, 4); \
	rorq $32,			RAB0; \
	rorq $32,			RAB1; \
	rorq $32,			RAB2; \
	inpack3(RIO, 2, RCD, 6); \
	rorq $32,			RCD0; \
	rorq $32,			RCD1; \
	rorq $32,			RCD2;

#define outunpack_dec3() \
	rorq $32,			RCD0; \
	rorq $32,			RCD1; \
	rorq $32,			RCD2; \
	outunpack3(mov, RIO, 0, RCD, 0); \
	rorq $32,			RAB0; \
	rorq $32,			RAB1; \
	rorq $32,			RAB2; \
	outunpack3(mov, RIO, 2, RAB, 2);

SYM_FUNC_START(__twofish_enc_blk_3way)
	/* input:
	 *	%rdi: ctx, CTX
	 *	%rsi: dst
	 *	%rdx: src, RIO
	 *	%rcx: bool, if true: xor output
	 */
	pushq %r13;
	pushq %r12;
	pushq %rbx;

	pushq %rcx; /* bool xor */
	pushq %rsi; /* dst */

	inpack_enc3();

	push_cd();
	encrypt_cycle3(RAB, CD, 0);
	encrypt_cycle3(RAB, CD, 1);
	encrypt_cycle3(RAB, CD, 2);
	encrypt_cycle3(RAB, CD, 3);
	encrypt_cycle3(RAB, CD, 4);
	encrypt_cycle3(RAB, CD, 5);
	encrypt_cycle3(RAB, CD, 6);
	encrypt_cycle3(RAB, CD, 7);
	pop_cd();

	popq RIO; /* dst */
	popq RT1; /* bool xor */

	testb RT1bl, RT1bl;
	jnz .L__enc_xor3;

	outunpack_enc3(mov);

	popq %rbx;
	popq %r12;
	popq %r13;
	RET;

.L__enc_xor3:
	outunpack_enc3(xor);

	popq %rbx;
	popq %r12;
	popq %r13;
	RET;
SYM_FUNC_END(__twofish_enc_blk_3way)

SYM_FUNC_START(twofish_dec_blk_3way)
	/* input:
	 *	%rdi: ctx, CTX
	 *	%rsi: dst
	 *	%rdx: src, RIO
	 */
	pushq %r13;
	pushq %r12;
	pushq %rbx;

	pushq %rsi; /* dst */

	inpack_dec3();

	push_cd();
	decrypt_cycle3(RAB, CD, 7);
	decrypt_cycle3(RAB, CD, 6);
	decrypt_cycle3(RAB, CD, 5);
	decrypt_cycle3(RAB, CD, 4);
	decrypt_cycle3(RAB, CD, 3);
	decrypt_cycle3(RAB, CD, 2);
	decrypt_cycle3(RAB, CD, 1);
	decrypt_cycle3(RAB, CD, 0);
	pop_cd();

	popq RIO; /* dst */

	outunpack_dec3();

	popq %rbx;
	popq %r12;
	popq %r13;
	RET;
SYM_FUNC_END(twofish_dec_blk_3way)