/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * linux/arch/arm64/crypto/aes-modes.S - chaining mode wrappers for AES
 *
 * Copyright (C) 2013 - 2017 Linaro Ltd <ard.biesheuvel@linaro.org>
 */
/* included by aes-ce.S and aes-neon.S */
.text
.align 4
#ifndef MAX_STRIDE
#define MAX_STRIDE	4
#endif
#if MAX_STRIDE == 4
#define ST4(x...) x
#define ST5(x...)
#else
#define ST4(x...)
#define ST5(x...) x
#endif
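/*
 * ST4() and ST5() emit their arguments only when MAX_STRIDE selects the
 * matching interleave factor, so the code below can be assembled either
 * as a 4-way or as a 5-way interleaved implementation.
 */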
SYM_FUNC_START_LOCAL(aes_encrypt_block4x)
	encrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
	ret
SYM_FUNC_END(aes_encrypt_block4x)

SYM_FUNC_START_LOCAL(aes_decrypt_block4x)
	decrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
	ret
SYM_FUNC_END(aes_decrypt_block4x)

#if MAX_STRIDE == 5
SYM_FUNC_START_LOCAL(aes_encrypt_block5x)
	encrypt_block5x	v0, v1, v2, v3, v4, w3, x2, x8, w7
	ret
SYM_FUNC_END(aes_encrypt_block5x)

SYM_FUNC_START_LOCAL(aes_decrypt_block5x)
	decrypt_block5x	v0, v1, v2, v3, v4, w3, x2, x8, w7
	ret
SYM_FUNC_END(aes_decrypt_block5x)
#endif
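/*
 * The local subroutines above wrap the encrypt/decrypt_block4x/5x macros
 * so that the interleaved ECB, CBC-decrypt and CTR paths below can share
 * a single copy via plain bl calls, operating on v0-v3 (and v4 when the
 * 5-way variant is built).
 */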
/*
 * aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
 *		   int blocks)
 * aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
 *		   int blocks)
 */
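/*
 * Both ECB routines handle MAX_STRIDE blocks per iteration of the Nx loop
 * and fall back to a single-block loop for any remaining blocks.
 */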
AES_FUNC_START(aes_ecb_encrypt)
	stp	x29, x30, [sp, #-16]!
	mov	x29, sp

	enc_prepare	w3, x2, x5
.LecbencloopNx:
	subs	w4, w4, #MAX_STRIDE
	bmi	.Lecbenc1x
	ld1	{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
ST4(	bl	aes_encrypt_block4x		)
ST5(	ld1	{v4.16b}, [x1], #16		)
ST5(	bl	aes_encrypt_block5x		)
	st1	{v0.16b-v3.16b}, [x0], #64
ST5(	st1	{v4.16b}, [x0], #16		)
	b	.LecbencloopNx
.Lecbenc1x:
	adds	w4, w4, #MAX_STRIDE
	beq	.Lecbencout
.Lecbencloop:
	ld1	{v0.16b}, [x1], #16		/* get next pt block */
	encrypt_block	v0, w3, x2, x5, w6
	st1	{v0.16b}, [x0], #16
	subs	w4, w4, #1
	bne	.Lecbencloop
.Lecbencout:
	ldp	x29, x30, [sp], #16
	ret
AES_FUNC_END(aes_ecb_encrypt)

AES_FUNC_START(aes_ecb_decrypt)
	stp	x29, x30, [sp, #-16]!
	mov	x29, sp

	dec_prepare	w3, x2, x5
.LecbdecloopNx:
	subs	w4, w4, #MAX_STRIDE
	bmi	.Lecbdec1x
	ld1	{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
ST4(	bl	aes_decrypt_block4x		)
ST5(	ld1	{v4.16b}, [x1], #16		)
ST5(	bl	aes_decrypt_block5x		)
	st1	{v0.16b-v3.16b}, [x0], #64
ST5(	st1	{v4.16b}, [x0], #16		)
	b	.LecbdecloopNx
.Lecbdec1x:
	adds	w4, w4, #MAX_STRIDE
	beq	.Lecbdecout
.Lecbdecloop:
	ld1	{v0.16b}, [x1], #16		/* get next ct block */
	decrypt_block	v0, w3, x2, x5, w6
	st1	{v0.16b}, [x0], #16
	subs	w4, w4, #1
	bne	.Lecbdecloop
.Lecbdecout:
	ldp	x29, x30, [sp], #16
	ret
AES_FUNC_END(aes_ecb_decrypt)
/*
 * aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
 *		   int blocks, u8 iv[])
 * aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
 *		   int blocks, u8 iv[])
 * aes_essiv_cbc_encrypt(u8 out[], u8 const in[], u32 const rk1[],
 *			 int rounds, int blocks, u8 iv[],
 *			 u32 const rk2[]);
 * aes_essiv_cbc_decrypt(u8 out[], u8 const in[], u32 const rk1[],
 *			 int rounds, int blocks, u8 iv[],
 *			 u32 const rk2[]);
 */
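/*
 * The ESSIV entry points derive the IV by encrypting the supplied sector
 * IV with the second key (expanded for AES-256, i.e. 14 rounds) before
 * dropping into the ordinary CBC code using the first key.
 */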
AES_FUNC_START(aes_essiv_cbc_encrypt)
	ld1	{v4.16b}, [x5]			/* get iv */

	mov	w8, #14				/* AES-256: 14 rounds */
	enc_prepare	w8, x6, x7
	encrypt_block	v4, w8, x6, x7, w9
	enc_switch_key	w3, x2, x6
	b	.Lcbcencloop4x
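/*
 * Plain CBC entry point: load the IV from memory. The ESSIV variant above
 * branches straight to the common loop with the derived IV already in v4.
 */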
AES_FUNC_START(aes_cbc_encrypt)
	ld1	{v4.16b}, [x5]			/* get iv */
	enc_prepare	w3, x2, x6

.Lcbcencloop4x:
	subs	w4, w4, #4
	bmi	.Lcbcenc1x
	ld1	{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
	eor	v0.16b, v0.16b, v4.16b		/* ..and xor with iv */
	encrypt_block	v0, w3, x2, x6, w7
	eor	v1.16b, v1.16b, v0.16b
	encrypt_block	v1, w3, x2, x6, w7
	eor	v2.16b, v2.16b, v1.16b
	encrypt_block	v2, w3, x2, x6, w7
	eor	v3.16b, v3.16b, v2.16b
	encrypt_block	v3, w3, x2, x6, w7
	st1	{v0.16b-v3.16b}, [x0], #64
	mov	v4.16b, v3.16b
	b	.Lcbcencloop4x
.Lcbcenc1x:
	adds	w4, w4, #4
	beq	.Lcbcencout
.Lcbcencloop:
	ld1	{v0.16b}, [x1], #16		/* get next pt block */
	eor	v4.16b, v4.16b, v0.16b		/* ..and xor with iv */
	encrypt_block	v4, w3, x2, x6, w7
	st1	{v4.16b}, [x0], #16
	subs	w4, w4, #1
	bne	.Lcbcencloop
.Lcbcencout:
	st1	{v4.16b}, [x5]			/* return iv */
	ret
AES_FUNC_END(aes_cbc_encrypt)
AES_FUNC_END(aes_essiv_cbc_encrypt)
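/*
 * CBC encryption is inherently serial (each block is xored with the
 * previous ciphertext before being encrypted), so only the decryption
 * path below is interleaved MAX_STRIDE blocks at a time.
 */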
AES_FUNC_START(aes_essiv_cbc_decrypt)
	stp	x29, x30, [sp, #-16]!
	mov	x29, sp
	ld1	{cbciv.16b}, [x5]		/* get iv */

	mov	w8, #14				/* AES-256: 14 rounds */
	enc_prepare	w8, x6, x7
	encrypt_block	cbciv, w8, x6, x7, w9
	b	.Lessivcbcdecstart

AES_FUNC_START(aes_cbc_decrypt)
	stp	x29, x30, [sp, #-16]!
	mov	x29, sp

	ld1	{cbciv.16b}, [x5]		/* get iv */
.Lessivcbcdecstart:
	dec_prepare	w3, x2, x6
.LcbcdecloopNx:
	subs	w4, w4, #MAX_STRIDE
	bmi	.Lcbcdec1x
	ld1	{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
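	/*
	 * MAX_STRIDE == 5: copies of only three ciphertext blocks fit in
	 * spare registers, so the last two are reloaded from memory (one
	 * reload more than the 4-way path) rather than spilled to the stack.
	 */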
#if MAX_STRIDE == 5
	ld1	{v4.16b}, [x1], #16		/* get 1 ct block */
	mov	v5.16b, v0.16b
	mov	v6.16b, v1.16b
	mov	v7.16b, v2.16b
	bl	aes_decrypt_block5x
	sub	x1, x1, #32
	eor	v0.16b, v0.16b, cbciv.16b
	eor	v1.16b, v1.16b, v5.16b
	ld1	{v5.16b}, [x1], #16		/* reload 1 ct block */
	ld1	{cbciv.16b}, [x1], #16		/* reload 1 ct block */
	eor	v2.16b, v2.16b, v6.16b
	eor	v3.16b, v3.16b, v7.16b
	eor	v4.16b, v4.16b, v5.16b
#else
	mov	v4.16b, v0.16b
	mov	v5.16b, v1.16b
	mov	v6.16b, v2.16b
	bl	aes_decrypt_block4x
	sub	x1, x1, #16
	eor	v0.16b, v0.16b, cbciv.16b
	eor	v1.16b, v1.16b, v4.16b
	ld1	{cbciv.16b}, [x1], #16		/* reload 1 ct block */
	eor	v2.16b, v2.16b, v5.16b
	eor	v3.16b, v3.16b, v6.16b
#endif
	st1	{v0.16b-v3.16b}, [x0], #64
ST5(	st1	{v4.16b}, [x0], #16		)
	b	.LcbcdecloopNx
.Lcbcdec1x:
	adds	w4, w4, #MAX_STRIDE
	beq	.Lcbcdecout
.Lcbcdecloop:
	ld1	{v1.16b}, [x1], #16		/* get next ct block */
	mov	v0.16b, v1.16b			/* ...and copy to v0 */
	decrypt_block	v0, w3, x2, x6, w7
	eor	v0.16b, v0.16b, cbciv.16b	/* xor with iv => pt */
	mov	cbciv.16b, v1.16b		/* ct is next iv */
	st1	{v0.16b}, [x0], #16
	subs	w4, w4, #1
	bne	.Lcbcdecloop
.Lcbcdecout:
	st1	{cbciv.16b}, [x5]		/* return iv */
	ldp	x29, x30, [sp], #16
	ret
AES_FUNC_END(aes_cbc_decrypt)
AES_FUNC_END(aes_essiv_cbc_decrypt)
/*
 * aes_cbc_cts_encrypt(u8 out[], u8 const in[], u32 const rk[],
 *		       int rounds, int bytes, u8 const iv[])
 * aes_cbc_cts_decrypt(u8 out[], u8 const in[], u32 const rk[],
 *		       int rounds, int bytes, u8 const iv[])
 */
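/*
 * Ciphertext stealing: the final two blocks overlap, so both routines use
 * overlapping loads/stores combined with tbl/tbx permutations driven by
 * .Lcts_permute_table to swap and splice the partial last block.
 */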
AES_FUNC_START(aes_cbc_cts_encrypt)
	adr_l	x8, .Lcts_permute_table
	sub	x4, x4, #16
	add	x9, x8, #32
	add	x8, x8, x4
	sub	x9, x9, x4
	ld1	{v3.16b}, [x8]
	ld1	{v4.16b}, [x9]
	ld1	{v0.16b}, [x1], x4		/* overlapping loads */
	ld1	{v1.16b}, [x1]
	ld1	{v5.16b}, [x5]			/* get iv */
	enc_prepare	w3, x2, x6
	eor	v0.16b, v0.16b, v5.16b		/* xor with iv */
	tbl	v1.16b, {v1.16b}, v4.16b
	encrypt_block	v0, w3, x2, x6, w7
	eor	v1.16b, v1.16b, v0.16b
	tbl	v0.16b, {v0.16b}, v3.16b
	encrypt_block	v1, w3, x2, x6, w7
	add	x4, x0, x4
	st1	{v0.16b}, [x4]			/* overlapping stores */
	st1	{v1.16b}, [x0]
	ret
AES_FUNC_END(aes_cbc_cts_encrypt)

AES_FUNC_START(aes_cbc_cts_decrypt)
	adr_l	x8, .Lcts_permute_table
	sub	x4, x4, #16
	add	x9, x8, #32
	add	x8, x8, x4
	sub	x9, x9, x4
	ld1	{v3.16b}, [x8]
	ld1	{v4.16b}, [x9]
	ld1	{v0.16b}, [x1], x4		/* overlapping loads */
	ld1	{v1.16b}, [x1]
	ld1	{v5.16b}, [x5]			/* get iv */
	dec_prepare	w3, x2, x6
	decrypt_block	v0, w3, x2, x6, w7
	tbl	v2.16b, {v0.16b}, v3.16b
	eor	v2.16b, v2.16b, v1.16b
	tbx	v0.16b, {v1.16b}, v4.16b
	decrypt_block	v0, w3, x2, x6, w7
	eor	v0.16b, v0.16b, v5.16b		/* xor with iv */
	add	x4, x0, x4
	st1	{v2.16b}, [x4]			/* overlapping stores */
	st1	{v0.16b}, [x0]
	ret
AES_FUNC_END(aes_cbc_cts_decrypt)
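/*
 * The permute table below is 16 bytes of 0xff, the identity permutation
 * 0x0-0xf, and another 16 bytes of 0xff. Indexing into it at an offset
 * derived from the tail length yields tbl/tbx masks that shift and select
 * the bytes belonging to the stolen partial block.
 */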
	.section	".rodata", "a"
	.align		6
.Lcts_permute_table:
	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	.byte		 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7
	.byte		 0x8,  0x9,  0xa,  0xb,  0xc,  0xd,  0xe,  0xf
	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	.previous
/*
 * aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
 *		   int bytes, u8 ctr[])
 */
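/*
 * The low 64 bits of the counter are kept byte-swapped in x12 so they can
 * be advanced with ordinary integer arithmetic; a carry into the upper
 * half is handled out of line in a subsection.
 */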
2020-02-18 19:58:26 +00:00
AES_ F U N C _ S T A R T ( a e s _ c t r _ e n c r y p t )
2018-09-10 16:41:13 +02:00
stp x29 , x30 , [ s p , #- 16 ] !
mov x29 , s p
crypto: arm64/aes-blk - move kernel mode neon en/disable into loop
When kernel mode NEON was first introduced on arm64, the preserve and
restore of the userland NEON state was completely unoptimized, and
involved saving all registers on each call to kernel_neon_begin(),
and restoring them on each call to kernel_neon_end(). For this reason,
the NEON crypto code that was introduced at the time keeps the NEON
enabled throughout the execution of the crypto API methods, which may
include calls back into the crypto API that could result in memory
allocation or other actions that we should avoid when running with
preemption disabled.
Since then, we have optimized the kernel mode NEON handling, which now
restores lazily (upon return to userland), and so the preserve action
is only costly the first time it is called after entering the kernel.
So let's put the kernel_neon_begin() and kernel_neon_end() calls around
the actual invocations of the NEON crypto code, and run the remainder of
the code with kernel mode NEON disabled (and preemption enabled)
Note that this requires some reshuffling of the registers in the asm
code, because the XTS routines can no longer rely on the registers to
retain their contents between invocations.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-03-10 15:21:48 +00:00
2020-12-17 19:55:16 +01:00
enc_ p r e p a r e w3 , x2 , x12
crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR
This implements 5-way interleaving for ECB, CBC decryption and CTR,
resulting in a speedup of ~11% on Marvell ThunderX2, which has a
very deep pipeline and therefore a high issue latency for NEON
instructions operating on the same registers.
Note that XTS is left alone: implementing 5-way interleave there
would either involve spilling of the calculated tweaks to the
stack, or recalculating them after the encryption operation, and
doing either of those would most likely penalize low end cores.
For ECB, this is not a concern at all, given that we have plenty
of spare registers. For CTR and CBC decryption, we take advantage
of the fact that v16 is not used by the CE version of the code
(which is the only one targeted by the optimization), and so we
can reshuffle the code a bit and avoid having to spill to memory
(with the exception of one extra reload in the CBC routine)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-24 19:38:31 +02:00
ld1 { v c t r . 1 6 b } , [ x5 ]
2017-01-17 13:46:29 +00:00
2020-12-17 19:55:16 +01:00
umov x12 , v c t r . d [ 1 ] / * k e e p s w a b b e d c t r i n r e g * /
rev x12 , x12
2014-03-21 10:19:17 +01:00
.LctrloopNx :
2020-12-17 19:55:16 +01:00
add w7 , w4 , #15
sub w4 , w4 , #M A X _ S T R I D E < < 4
lsr w7 , w7 , #4
mov w8 , #M A X _ S T R I D E
cmp w7 , w8
csel w7 , w7 , w8 , l t
adds x12 , x12 , x7
crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR
This implements 5-way interleaving for ECB, CBC decryption and CTR,
resulting in a speedup of ~11% on Marvell ThunderX2, which has a
very deep pipeline and therefore a high issue latency for NEON
instructions operating on the same registers.
Note that XTS is left alone: implementing 5-way interleave there
would either involve spilling of the calculated tweaks to the
stack, or recalculating them after the encryption operation, and
doing either of those would most likely penalize low end cores.
For ECB, this is not a concern at all, given that we have plenty
of spare registers. For CTR and CBC decryption, we take advantage
of the fact that v16 is not used by the CE version of the code
(which is the only one targeted by the optimization), and so we
can reshuffle the code a bit and avoid having to spill to memory
(with the exception of one extra reload in the CBC routine)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-24 19:38:31 +02:00
mov v0 . 1 6 b , v c t r . 1 6 b
mov v1 . 1 6 b , v c t r . 1 6 b
mov v2 . 1 6 b , v c t r . 1 6 b
mov v3 . 1 6 b , v c t r . 1 6 b
ST5 ( m o v v4 . 1 6 b , v c t r . 1 6 b )
2020-12-17 19:55:16 +01:00
bcs 0 f
.subsection 1
/* apply carry to outgoing counter */
0 : umov x8 , v c t r . d [ 0 ]
rev x8 , x8
add x8 , x8 , #1
rev x8 , x8
ins v c t r . d [ 0 ] , x8
/* apply carry to N counter blocks for N := x12 */
2021-04-06 16:25:23 +02:00
cbz x12 , 2 f
2020-12-17 19:55:16 +01:00
adr x16 , 1 f
sub x16 , x16 , x12 , l s l #3
br x16
2021-12-14 15:27:12 +00:00
bti c
2020-12-17 19:55:16 +01:00
mov v0 . d [ 0 ] , v c t r . d [ 0 ]
2021-12-14 15:27:12 +00:00
bti c
2020-12-17 19:55:16 +01:00
mov v1 . d [ 0 ] , v c t r . d [ 0 ]
2021-12-14 15:27:12 +00:00
bti c
2020-12-17 19:55:16 +01:00
mov v2 . d [ 0 ] , v c t r . d [ 0 ]
2021-12-14 15:27:12 +00:00
bti c
2020-12-17 19:55:16 +01:00
mov v3 . d [ 0 ] , v c t r . d [ 0 ]
2021-12-14 15:27:12 +00:00
ST5 ( b t i c )
2020-12-17 19:55:16 +01:00
ST5 ( m o v v4 . d [ 0 ] , v c t r . d [ 0 ] )
1 : b 2 f
.previous
2 : rev x7 , x12
ins v c t r . d [ 1 ] , x7
sub x7 , x12 , #M A X _ S T R I D E - 1
sub x8 , x12 , #M A X _ S T R I D E - 2
sub x9 , x12 , #M A X _ S T R I D E - 3
rev x7 , x7
rev x8 , x8
mov v1 . d [ 1 ] , x7
rev x9 , x9
ST5 ( s u b x10 , x12 , #M A X _ S T R I D E - 4 )
mov v2 . d [ 1 ] , x8
ST5 ( r e v x10 , x10 )
mov v3 . d [ 1 ] , x9
ST5 ( m o v v4 . d [ 1 ] , x10 )
tbnz w4 , #31 , . L c t r t a i l
ld1 { v5 . 1 6 b - v7 . 1 6 b } , [ x1 ] , #48
crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR
This implements 5-way interleaving for ECB, CBC decryption and CTR,
resulting in a speedup of ~11% on Marvell ThunderX2, which has a
very deep pipeline and therefore a high issue latency for NEON
instructions operating on the same registers.
Note that XTS is left alone: implementing 5-way interleave there
would either involve spilling of the calculated tweaks to the
stack, or recalculating them after the encryption operation, and
doing either of those would most likely penalize low end cores.
For ECB, this is not a concern at all, given that we have plenty
of spare registers. For CTR and CBC decryption, we take advantage
of the fact that v16 is not used by the CE version of the code
(which is the only one targeted by the optimization), and so we
can reshuffle the code a bit and avoid having to spill to memory
(with the exception of one extra reload in the CBC routine)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-24 19:38:31 +02:00
ST4 ( b l a e s _ e n c r y p t _ b l o c k 4 x )
ST5 ( b l a e s _ e n c r y p t _ b l o c k 5 x )
2014-03-21 10:19:17 +01:00
eor v0 . 1 6 b , v5 . 1 6 b , v0 . 1 6 b
crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR
This implements 5-way interleaving for ECB, CBC decryption and CTR,
resulting in a speedup of ~11% on Marvell ThunderX2, which has a
very deep pipeline and therefore a high issue latency for NEON
instructions operating on the same registers.
Note that XTS is left alone: implementing 5-way interleave there
would either involve spilling of the calculated tweaks to the
stack, or recalculating them after the encryption operation, and
doing either of those would most likely penalize low end cores.
For ECB, this is not a concern at all, given that we have plenty
of spare registers. For CTR and CBC decryption, we take advantage
of the fact that v16 is not used by the CE version of the code
(which is the only one targeted by the optimization), and so we
can reshuffle the code a bit and avoid having to spill to memory
(with the exception of one extra reload in the CBC routine)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-24 19:38:31 +02:00
ST4 ( l d1 { v5 . 1 6 b } , [ x1 ] , #16 )
2014-03-21 10:19:17 +01:00
eor v1 . 1 6 b , v6 . 1 6 b , v1 . 1 6 b
crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR
This implements 5-way interleaving for ECB, CBC decryption and CTR,
resulting in a speedup of ~11% on Marvell ThunderX2, which has a
very deep pipeline and therefore a high issue latency for NEON
instructions operating on the same registers.
Note that XTS is left alone: implementing 5-way interleave there
would either involve spilling of the calculated tweaks to the
stack, or recalculating them after the encryption operation, and
doing either of those would most likely penalize low end cores.
For ECB, this is not a concern at all, given that we have plenty
of spare registers. For CTR and CBC decryption, we take advantage
of the fact that v16 is not used by the CE version of the code
(which is the only one targeted by the optimization), and so we
can reshuffle the code a bit and avoid having to spill to memory
(with the exception of one extra reload in the CBC routine)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-24 19:38:31 +02:00
ST5 ( l d1 { v5 . 1 6 b - v6 . 1 6 b } , [ x1 ] , #32 )
2014-03-21 10:19:17 +01:00
eor v2 . 1 6 b , v7 . 1 6 b , v2 . 1 6 b
eor v3 . 1 6 b , v5 . 1 6 b , v3 . 1 6 b
crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR
This implements 5-way interleaving for ECB, CBC decryption and CTR,
resulting in a speedup of ~11% on Marvell ThunderX2, which has a
very deep pipeline and therefore a high issue latency for NEON
instructions operating on the same registers.
Note that XTS is left alone: implementing 5-way interleave there
would either involve spilling of the calculated tweaks to the
stack, or recalculating them after the encryption operation, and
doing either of those would most likely penalize low end cores.
For ECB, this is not a concern at all, given that we have plenty
of spare registers. For CTR and CBC decryption, we take advantage
of the fact that v16 is not used by the CE version of the code
(which is the only one targeted by the optimization), and so we
can reshuffle the code a bit and avoid having to spill to memory
(with the exception of one extra reload in the CBC routine)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-24 19:38:31 +02:00
ST5 ( e o r v4 . 1 6 b , v6 . 1 6 b , v4 . 1 6 b )
2018-09-10 16:41:13 +02:00
st1 { v0 . 1 6 b - v3 . 1 6 b } , [ x0 ] , #64
crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR
This implements 5-way interleaving for ECB, CBC decryption and CTR,
resulting in a speedup of ~11% on Marvell ThunderX2, which has a
very deep pipeline and therefore a high issue latency for NEON
instructions operating on the same registers.
Note that XTS is left alone: implementing 5-way interleave there
would either involve spilling of the calculated tweaks to the
stack, or recalculating them after the encryption operation, and
doing either of those would most likely penalize low end cores.
For ECB, this is not a concern at all, given that we have plenty
of spare registers. For CTR and CBC decryption, we take advantage
of the fact that v16 is not used by the CE version of the code
(which is the only one targeted by the optimization), and so we
can reshuffle the code a bit and avoid having to spill to memory
(with the exception of one extra reload in the CBC routine)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-24 19:38:31 +02:00
ST5 ( s t 1 { v4 . 1 6 b } , [ x0 ] , #16 )
2018-09-10 16:41:13 +02:00
cbz w4 , . L c t r o u t
2014-03-21 10:19:17 +01:00
b . L c t r l o o p N x
2017-01-17 13:46:29 +00:00
.Lctrout :
crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR
This implements 5-way interleaving for ECB, CBC decryption and CTR,
resulting in a speedup of ~11% on Marvell ThunderX2, which has a
very deep pipeline and therefore a high issue latency for NEON
instructions operating on the same registers.
Note that XTS is left alone: implementing 5-way interleave there
would either involve spilling of the calculated tweaks to the
stack, or recalculating them after the encryption operation, and
doing either of those would most likely penalize low end cores.
For ECB, this is not a concern at all, given that we have plenty
of spare registers. For CTR and CBC decryption, we take advantage
of the fact that v16 is not used by the CE version of the code
(which is the only one targeted by the optimization), and so we
can reshuffle the code a bit and avoid having to spill to memory
(with the exception of one extra reload in the CBC routine)
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-06-24 19:38:31 +02:00
st1 { v c t r . 1 6 b } , [ x5 ] / * r e t u r n n e x t C T R v a l u e * /
2018-09-10 16:41:13 +02:00
ldp x29 , x30 , [ s p ] , #16
2017-01-17 13:46:29 +00:00
ret
2020-12-17 19:55:16 +01:00
.Lctrtail :
/* XOR up to MAX_STRIDE * 16 - 1 bytes of in/output with v0 ... v3/v4 */
mov x16 , #16
2022-01-27 10:52:11 +01:00
ands x6 , x4 , #0xf
csel x13 , x6 , x16 , n e
2020-12-17 19:55:16 +01:00
ST5 ( c m p w4 , #64 - ( M A X _ S T R I D E < < 4 ) )
ST5 ( c s e l x14 , x16 , x z r , g t )
cmp w4 , #48 - ( M A X _ S T R I D E < < 4 )
csel x15 , x16 , x z r , g t
cmp w4 , #32 - ( M A X _ S T R I D E < < 4 )
csel x16 , x16 , x z r , g t
cmp w4 , #16 - ( M A X _ S T R I D E < < 4 )
adr_ l x12 , . L c t s _ p e r m u t e _ t a b l e
add x12 , x12 , x13
2022-01-27 10:52:11 +01:00
ble . L c t r t a i l 1 x
2020-12-17 19:55:16 +01:00
ST5 ( l d1 { v5 . 1 6 b } , [ x1 ] , x14 )
ld1 { v6 . 1 6 b } , [ x1 ] , x15
ld1 { v7 . 1 6 b } , [ x1 ] , x16
ST4 ( b l a e s _ e n c r y p t _ b l o c k 4 x )
ST5 ( b l a e s _ e n c r y p t _ b l o c k 5 x )
ld1 { v8 . 1 6 b } , [ x1 ] , x13
ld1 { v9 . 1 6 b } , [ x1 ]
ld1 { v10 . 1 6 b } , [ x12 ]
ST4 ( e o r v6 . 1 6 b , v6 . 1 6 b , v0 . 1 6 b )
ST4 ( e o r v7 . 1 6 b , v7 . 1 6 b , v1 . 1 6 b )
ST4 ( t b l v3 . 1 6 b , { v3 . 1 6 b } , v10 . 1 6 b )
ST4 ( e o r v8 . 1 6 b , v8 . 1 6 b , v2 . 1 6 b )
ST4 ( e o r v9 . 1 6 b , v9 . 1 6 b , v3 . 1 6 b )
ST5 ( e o r v5 . 1 6 b , v5 . 1 6 b , v0 . 1 6 b )
ST5 ( e o r v6 . 1 6 b , v6 . 1 6 b , v1 . 1 6 b )
ST5 ( t b l v4 . 1 6 b , { v4 . 1 6 b } , v10 . 1 6 b )
ST5 ( e o r v7 . 1 6 b , v7 . 1 6 b , v2 . 1 6 b )
ST5 ( e o r v8 . 1 6 b , v8 . 1 6 b , v3 . 1 6 b )
ST5 ( e o r v9 . 1 6 b , v9 . 1 6 b , v4 . 1 6 b )
ST5 ( s t 1 { v5 . 1 6 b } , [ x0 ] , x14 )
st1 { v6 . 1 6 b } , [ x0 ] , x15
st1 { v7 . 1 6 b } , [ x0 ] , x16
add x13 , x13 , x0
st1 { v9 . 1 6 b } , [ x13 ] / / o v e r l a p p i n g s t o r e s
st1 { v8 . 1 6 b } , [ x0 ]
2019-02-14 00:03:54 -08:00
b . L c t r o u t
2017-01-17 13:46:29 +00:00
2020-12-17 19:55:16 +01:00
.Lctrtail1x :
2022-01-27 10:52:11 +01:00
sub x7 , x6 , #16
csel x6 , x6 , x7 , e q
add x1 , x1 , x6
add x0 , x0 , x6
2020-12-17 19:55:16 +01:00
ld1 { v5 . 1 6 b } , [ x1 ]
2022-01-27 10:52:11 +01:00
ld1 { v6 . 1 6 b } , [ x0 ]
2020-12-17 19:55:16 +01:00
ST5 ( m o v v3 . 1 6 b , v4 . 1 6 b )
encrypt_ b l o c k v3 , w3 , x2 , x8 , w7
2022-01-27 10:52:11 +01:00
ld1 { v10 . 1 6 b - v11 . 1 6 b } , [ x12 ]
tbl v3 . 1 6 b , { v3 . 1 6 b } , v10 . 1 6 b
sshr v11 . 1 6 b , v11 . 1 6 b , #7
2020-12-17 19:55:16 +01:00
eor v5 . 1 6 b , v5 . 1 6 b , v3 . 1 6 b
2022-01-27 10:52:11 +01:00
bif v5 . 1 6 b , v6 . 1 6 b , v11 . 1 6 b
2020-12-17 19:55:16 +01:00
st1 { v5 . 1 6 b } , [ x0 ]
b . L c t r o u t
2020-02-18 19:58:26 +00:00
AES_FUNC_END(aes_ctr_encrypt)
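The tail logic above boils down to the usual C-level CTR pattern: keystream blocks are still generated a block at a time, and only the bytes that actually remain are XORed into the output (the assembly avoids per-byte loops by using the permute table and overlapping 16-byte loads and stores instead). A minimal sketch of that idea, where aes_ctr_keystream_block() is an assumed helper rather than anything exported by this file:

#include <stddef.h>
#include <stdint.h>

/* Assumed helper: writes E_K(counter) to ks and increments the counter. */
void aes_ctr_keystream_block(uint8_t ks[16], uint8_t ctr[16], const void *rk);

static void ctr_xor_tail(uint8_t *dst, const uint8_t *src, size_t bytes,
			 uint8_t ctr[16], const void *rk)
{
	uint8_t ks[16];

	while (bytes) {
		size_t n = bytes < 16 ? bytes : 16;

		aes_ctr_keystream_block(ks, ctr, rk);
		for (size_t i = 0; i < n; i++)	/* XOR only what is left */
			dst[i] = src[i] ^ ks[i];
		src += n;
		dst += n;
		bytes -= n;
	}
}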
2014-03-21 10:19:17 +01:00
/*
2019-09-03 09:43:33 -07:00
 * aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[], int rounds,
 *		   int bytes, u8 const rk2[], u8 iv[], int first)
2014-03-21 10:19:17 +01:00
 * aes_xts_decrypt(u8 out[], u8 const in[], u8 const rk1[], int rounds,
2019-09-03 09:43:33 -07:00
 *		   int bytes, u8 const rk2[], u8 iv[], int first)
2014-03-21 10:19:17 +01:00
 */
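A rough C model of what these entry points compute for whole blocks may help: the IV is encrypted with the second key to form the initial tweak (when 'first' is set), every data block is XOR-encrypt-XOR'd with the current tweak, and the tweak is multiplied by x in GF(2^128) between blocks. The block cipher is passed in as a callback so the sketch stays self-contained; this is an illustration, not the kernel glue code, and ciphertext stealing for a trailing partial block is shown separately after aes_xts_encrypt below.

#include <stddef.h>
#include <stdint.h>

typedef void (*aes_block_fn)(uint8_t out[16], const uint8_t in[16],
			     const void *key);

/* Multiply the 128-bit little-endian tweak by x (reduction poly 0x87). */
static void gf128_mul_x(uint8_t t[16])
{
	uint8_t carry = t[15] >> 7;

	for (int i = 15; i > 0; i--)
		t[i] = (uint8_t)((t[i] << 1) | (t[i - 1] >> 7));
	t[0] = (uint8_t)((t[0] << 1) ^ (carry ? 0x87 : 0));
}

static void xts_encrypt_blocks(uint8_t *dst, const uint8_t *src, size_t blocks,
			       aes_block_fn enc, const void *rk1,
			       const void *rk2, const uint8_t iv[16])
{
	uint8_t t[16], buf[16];

	enc(t, iv, rk2);			/* first tweak = E_K2(iv) */
	while (blocks--) {
		for (int i = 0; i < 16; i++)
			buf[i] = src[i] ^ t[i];	/* P xor T */
		enc(buf, buf, rk1);
		for (int i = 0; i < 16; i++)
			dst[i] = buf[i] ^ t[i];	/* .. xor T again */
		gf128_mul_x(t);			/* advance the tweak */
		src += 16;
		dst += 16;
	}
}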
2018-09-10 16:41:15 +02:00
	.macro		next_tweak, out, in, tmp
2014-03-21 10:19:17 +01:00
	sshr	\tmp\().2d,  \in\().2d,   #63
2018-09-10 16:41:15 +02:00
	and	\tmp\().16b, \tmp\().16b, xtsmask.16b
2014-03-21 10:19:17 +01:00
	add	\out\().2d,  \in\().2d,   \in\().2d
	ext	\tmp\().16b, \tmp\().16b, \tmp\().16b, #8
	eor	\out\().16b, \out\().16b, \tmp\().16b
	.endm
2018-09-10 16:41:15 +02:00
	.macro		xts_load_mask, tmp
	movi	xtsmask.2s, #0x1
	movi	\tmp\().2s, #0x87
	uzp1	xtsmask.4s, xtsmask.4s, \tmp\().4s
	.endm
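In other words, xts_load_mask leaves xtsmask holding the 64-bit lane pair { 0x1, 0x87 }, and next_tweak doubles each 64-bit lane of the tweak, carrying bit 63 of the low lane into the high lane and folding the carry out of the high lane back into the low lane as the reduction polynomial 0x87. A scalar C mirror of that sequence, for illustration only:

#include <stdint.h>

struct xts_tweak {
	uint64_t lo;	/* bytes 0..7 of the 16-byte tweak (little endian) */
	uint64_t hi;	/* bytes 8..15 */
};

static struct xts_tweak next_tweak_c(struct xts_tweak t)
{
	struct xts_tweak out;
	uint64_t carry_lo = t.lo >> 63;	/* sshr #63, masked with 0x1  */
	uint64_t carry_hi = t.hi >> 63;	/* sshr #63, masked with 0x87 */

	out.lo = (t.lo << 1) ^ (carry_hi ? 0x87 : 0);	/* add + eor after ext */
	out.hi = (t.hi << 1) ^ carry_lo;
	return out;
}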
2014-03-21 10:19:17 +01:00
2020-02-18 19:58:26 +00:00
AES_FUNC_START(aes_xts_encrypt)
2018-09-10 16:41:13 +02:00
	stp	x29, x30, [sp, #-16]!
	mov	x29, sp
2018-03-10 15:21:51 +00:00
2018-09-10 16:41:13 +02:00
	ld1	{v4.16b}, [x6]
2018-10-08 13:16:59 +02:00
	xts_load_mask	v8
2018-03-10 15:21:48 +00:00
	cbz	w7, .Lxtsencnotfirst
	enc_prepare	w3, x5, x8
2019-09-03 09:43:34 -07:00
	xts_cts_skip_tw	w7, .LxtsencNx
2018-03-10 15:21:48 +00:00
	encrypt_block	v4, w3, x5, x8, w7		/* first tweak */
	enc_switch_key	w3, x2, x8
2014-03-21 10:19:17 +01:00
	b	.LxtsencNx
2018-03-10 15:21:48 +00:00
.Lxtsencnotfirst:
2018-09-10 16:41:13 +02:00
	enc_prepare	w3, x2, x8
2014-03-21 10:19:17 +01:00
.LxtsencloopNx:
2018-09-10 16:41:15 +02:00
	next_tweak	v4, v4, v8
2014-03-21 10:19:17 +01:00
.LxtsencNx:
2019-09-03 09:43:33 -07:00
	subs	w4, w4, #64
2014-03-21 10:19:17 +01:00
	bmi	.Lxtsenc1x
2018-09-10 16:41:13 +02:00
	ld1	{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
2018-09-10 16:41:15 +02:00
	next_tweak	v5, v4, v8
2014-03-21 10:19:17 +01:00
	eor	v0.16b, v0.16b, v4.16b
2018-09-10 16:41:15 +02:00
	next_tweak	v6, v5, v8
2014-03-21 10:19:17 +01:00
	eor	v1.16b, v1.16b, v5.16b
	eor	v2.16b, v2.16b, v6.16b
2018-09-10 16:41:15 +02:00
	next_tweak	v7, v6, v8
2014-03-21 10:19:17 +01:00
	eor	v3.16b, v3.16b, v7.16b
2018-03-10 15:21:51 +00:00
	bl	aes_encrypt_block4x
2014-03-21 10:19:17 +01:00
	eor	v3.16b, v3.16b, v7.16b
	eor	v0.16b, v0.16b, v4.16b
	eor	v1.16b, v1.16b, v5.16b
	eor	v2.16b, v2.16b, v6.16b
2018-09-10 16:41:13 +02:00
	st1	{v0.16b-v3.16b}, [x0], #64
2014-03-21 10:19:17 +01:00
	mov	v4.16b, v7.16b
2019-09-03 09:43:33 -07:00
	cbz	w4, .Lxtsencret
2018-10-08 13:16:59 +02:00
	xts_reload_mask	v8
2014-03-21 10:19:17 +01:00
	b	.LxtsencloopNx
.Lxtsenc1x:
2019-09-03 09:43:33 -07:00
	adds	w4, w4, #64
2014-03-21 10:19:17 +01:00
	beq	.Lxtsencout
2019-09-03 09:43:33 -07:00
	subs	w4, w4, #16
	bmi	.LxtsencctsNx
2014-03-21 10:19:17 +01:00
.Lxtsencloop:
2019-09-03 09:43:33 -07:00
	ld1	{v0.16b}, [x1], #16
.Lxtsencctsout:
	eor	v0.16b, v0.16b, v4.16b
2018-09-10 16:41:13 +02:00
	encrypt_block	v0, w3, x2, x8, w7
2014-03-21 10:19:17 +01:00
	eor	v0.16b, v0.16b, v4.16b
2019-09-03 09:43:33 -07:00
	cbz	w4, .Lxtsencout
	subs	w4, w4, #16
2018-09-10 16:41:15 +02:00
	next_tweak	v4, v4, v8
2019-09-03 09:43:33 -07:00
	bmi	.Lxtsenccts
	st1	{v0.16b}, [x0], #16
2014-03-21 10:19:17 +01:00
	b	.Lxtsencloop
.Lxtsencout:
2019-09-03 09:43:33 -07:00
	st1	{v0.16b}, [x0]
.Lxtsencret:
2018-09-10 16:41:13 +02:00
	st1	{v4.16b}, [x6]
	ldp	x29, x30, [sp], #16
2014-03-21 10:19:17 +01:00
	ret
2019-09-03 09:43:33 -07:00
.LxtsencctsNx:
	mov	v0.16b, v3.16b
	sub	x0, x0, #16
.Lxtsenccts:
	adr_l	x8, .Lcts_permute_table
	add	x1, x1, w4, sxtw	/* rewind input pointer */
	add	w4, w4, #16		/* # bytes in final block */
	add	x9, x8, #32
	add	x8, x8, x4
	sub	x9, x9, x4
	add	x4, x0, x4		/* output address of final block */
	ld1	{v1.16b}, [x1]		/* load final block */
	ld1	{v2.16b}, [x8]
	ld1	{v3.16b}, [x9]
	tbl	v2.16b, {v0.16b}, v2.16b
	tbx	v0.16b, {v1.16b}, v3.16b
	st1	{v2.16b}, [x4]		/* overlapping stores */
	mov	w4, wzr
	b	.Lxtsencctsout
2020-02-18 19:58:26 +00:00
AES_FUNC_END(aes_xts_encrypt)
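The .Lxtsenccts path above implements ciphertext stealing: the already-encrypted last full block lends its tail bytes to pad the short final block, the padded block is encrypted with the next tweak, and the two results swap places in the output, which is what the tbl/tbx permutes and the overlapping store achieve without byte loops. A C model of that final step, where enc_xts_block() is an assumed helper standing in for the eor/encrypt_block/eor sequence with a given tweak:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed helper: out = E_K1(in xor tweak) xor tweak. */
void enc_xts_block(uint8_t out[16], const uint8_t in[16],
		   const uint8_t tweak[16], const void *rk1);

static void xts_cts_encrypt_tail(uint8_t *dst_last_full, uint8_t *dst_partial,
				 const uint8_t *src_partial, size_t tail_len,
				 const uint8_t cc[16],	/* encryption of last full block */
				 const uint8_t next_tweak[16], const void *rk1)
{
	uint8_t b[16];

	memcpy(dst_partial, cc, tail_len);	/* steal: short C_m = head of CC */
	memcpy(b, src_partial, tail_len);	/* pad P_m with the tail of CC   */
	memcpy(b + tail_len, cc + tail_len, 16 - tail_len);
	enc_xts_block(dst_last_full, b, next_tweak, rk1);	/* new C_{m-1} */
}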
2014-03-21 10:19:17 +01:00
2020-02-18 19:58:26 +00:00
AES_FUNC_START(aes_xts_decrypt)
2018-09-10 16:41:13 +02:00
	stp	x29, x30, [sp, #-16]!
	mov	x29, sp
2018-03-10 15:21:51 +00:00
2019-09-03 09:43:33 -07:00
	/* subtract 16 bytes if we are doing CTS */
	sub	w8, w4, #0x10
	tst	w4, #0xf
	csel	w4, w4, w8, eq
2018-09-10 16:41:13 +02:00
	ld1	{v4.16b}, [x6]
2018-10-08 13:16:59 +02:00
	xts_load_mask	v8
2019-09-03 09:43:34 -07:00
	xts_cts_skip_tw	w7, .Lxtsdecskiptw
2018-03-10 15:21:48 +00:00
	cbz	w7, .Lxtsdecnotfirst
	enc_prepare	w3, x5, x8
	encrypt_block	v4, w3, x5, x8, w7		/* first tweak */
2019-09-03 09:43:34 -07:00
.Lxtsdecskiptw:
2018-03-10 15:21:48 +00:00
	dec_prepare	w3, x2, x8
2014-03-21 10:19:17 +01:00
	b	.LxtsdecNx
2018-03-10 15:21:48 +00:00
.Lxtsdecnotfirst:
2018-09-10 16:41:13 +02:00
	dec_prepare	w3, x2, x8
2014-03-21 10:19:17 +01:00
.LxtsdecloopNx:
2018-09-10 16:41:15 +02:00
	next_tweak	v4, v4, v8
2014-03-21 10:19:17 +01:00
.LxtsdecNx:
2019-09-03 09:43:33 -07:00
	subs	w4, w4, #64
2014-03-21 10:19:17 +01:00
	bmi	.Lxtsdec1x
2018-09-10 16:41:13 +02:00
	ld1	{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
2018-09-10 16:41:15 +02:00
	next_tweak	v5, v4, v8
2014-03-21 10:19:17 +01:00
	eor	v0.16b, v0.16b, v4.16b
2018-09-10 16:41:15 +02:00
	next_tweak	v6, v5, v8
2014-03-21 10:19:17 +01:00
	eor	v1.16b, v1.16b, v5.16b
	eor	v2.16b, v2.16b, v6.16b
2018-09-10 16:41:15 +02:00
	next_tweak	v7, v6, v8
2014-03-21 10:19:17 +01:00
	eor	v3.16b, v3.16b, v7.16b
2018-03-10 15:21:51 +00:00
	bl	aes_decrypt_block4x
2014-03-21 10:19:17 +01:00
	eor	v3.16b, v3.16b, v7.16b
	eor	v0.16b, v0.16b, v4.16b
	eor	v1.16b, v1.16b, v5.16b
	eor	v2.16b, v2.16b, v6.16b
2018-09-10 16:41:13 +02:00
	st1	{v0.16b-v3.16b}, [x0], #64
2014-03-21 10:19:17 +01:00
	mov	v4.16b, v7.16b
2018-09-10 16:41:13 +02:00
	cbz	w4, .Lxtsdecout
2018-10-08 13:16:59 +02:00
	xts_reload_mask	v8
2014-03-21 10:19:17 +01:00
	b	.LxtsdecloopNx
.Lxtsdec1x:
2019-09-03 09:43:33 -07:00
	adds	w4, w4, #64
2014-03-21 10:19:17 +01:00
	beq	.Lxtsdecout
2019-09-03 09:43:33 -07:00
	subs	w4, w4, #16
2014-03-21 10:19:17 +01:00
.Lxtsdecloop:
2019-09-03 09:43:33 -07:00
	ld1	{v0.16b}, [x1], #16
	bmi	.Lxtsdeccts
.Lxtsdecctsout:
	eor	v0.16b, v0.16b, v4.16b
2018-09-10 16:41:13 +02:00
	decrypt_block	v0, w3, x2, x8, w7
2014-03-21 10:19:17 +01:00
	eor	v0.16b, v0.16b, v4.16b
2018-09-10 16:41:13 +02:00
	st1	{v0.16b}, [x0], #16
2019-09-03 09:43:33 -07:00
	cbz	w4, .Lxtsdecout
	subs	w4, w4, #16
2018-09-10 16:41:15 +02:00
	next_tweak	v4, v4, v8
2014-03-21 10:19:17 +01:00
	b	.Lxtsdecloop
.Lxtsdecout:
2018-09-10 16:41:13 +02:00
	st1	{v4.16b}, [x6]
	ldp	x29, x30, [sp], #16
2014-03-21 10:19:17 +01:00
	ret
2019-09-03 09:43:33 -07:00
.Lxtsdeccts:
	adr_l	x8, .Lcts_permute_table
	add	x1, x1, w4, sxtw	/* rewind input pointer */
	add	w4, w4, #16		/* # bytes in final block */
	add	x9, x8, #32
	add	x8, x8, x4
	sub	x9, x9, x4
	add	x4, x0, x4		/* output address of final block */
	next_tweak	v5, v4, v8
	ld1	{v1.16b}, [x1]		/* load final block */
	ld1	{v2.16b}, [x8]
	ld1	{v3.16b}, [x9]
	eor	v0.16b, v0.16b, v5.16b
	decrypt_block	v0, w3, x2, x8, w7
	eor	v0.16b, v0.16b, v5.16b
	tbl	v2.16b, {v0.16b}, v2.16b
	tbx	v0.16b, {v1.16b}, v3.16b
	st1	{v2.16b}, [x4]		/* overlapping stores */
	mov	w4, wzr
	b	.Lxtsdecctsout
2020-02-18 19:58:26 +00:00
AES_FUNC_END(aes_xts_decrypt)
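The decrypt side mirrors the stealing above, but the tweak order is reversed: the last full ciphertext block is decrypted with the following tweak (v5 in .Lxtsdeccts) to recover the short plaintext tail plus the stolen bytes, and the block rebuilt from the short ciphertext block plus those stolen bytes is then decrypted with the current tweak (v4). A sketch, with dec_xts_block() again an assumed helper:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed helper: out = D_K1(in xor tweak) xor tweak. */
void dec_xts_block(uint8_t out[16], const uint8_t in[16],
		   const uint8_t tweak[16], const void *rk1);

static void xts_cts_decrypt_tail(uint8_t *dst_last_full, uint8_t *dst_partial,
				 const uint8_t *ct_last_full,
				 const uint8_t *ct_partial, size_t tail_len,
				 const uint8_t tweak_cur[16],
				 const uint8_t tweak_next[16], const void *rk1)
{
	uint8_t pp[16], b[16];

	dec_xts_block(pp, ct_last_full, tweak_next, rk1); /* P_m plus stolen tail */
	memcpy(dst_partial, pp, tail_len);		  /* final short plaintext */
	memcpy(b, ct_partial, tail_len);		  /* rebuild the full CC   */
	memcpy(b + tail_len, pp + tail_len, 16 - tail_len);
	dec_xts_block(dst_last_full, b, tweak_cur, rk1);  /* recover P_{m-1}       */
}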
2017-02-03 14:49:37 +00:00
/*
 * aes_mac_update(u8 const in[], u32 const rk[], int rounds,
 *		  int blocks, u8 dg[], int enc_before, int enc_after)
 */
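At the C level the routine below is a plain CBC-MAC update: the running digest dg is XORed with each input block and passed through the cipher, except that encryption of the initial state (enc_before) and of the very last block (enc_after) can be requested or deferred by the caller, which lets the CMAC/XCBC glue finalize with a tweaked last block. A simplified sketch, with aes_encrypt_one() as an assumed single-block helper and the arguments reordered for brevity:

#include <stddef.h>
#include <stdint.h>

/* Assumed helper: encrypts blk in place with the expanded key rk. */
void aes_encrypt_one(uint8_t blk[16], const void *rk);

static void mac_update(const uint8_t *in, size_t blocks, uint8_t dg[16],
		       const void *rk, int enc_before, int enc_after)
{
	if (enc_before)
		aes_encrypt_one(dg, rk);

	while (blocks--) {
		for (int i = 0; i < 16; i++)
			dg[i] ^= in[i];
		in += 16;
		if (blocks || enc_after)	/* last block may stay unencrypted */
			aes_encrypt_one(dg, rk);
	}
}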
2020-02-18 19:58:26 +00:00
AES_FUNC_START(aes_mac_update)
2021-02-03 12:36:24 +01:00
	ld1	{v0.16b}, [x4]			/* get dg */
2017-02-03 14:49:37 +00:00
	enc_prepare	w2, x1, x7
2018-03-10 15:21:53 +00:00
	cbz	w5, .Lmacloop4x
2017-02-03 14:49:37 +00:00
2018-03-10 15:21:53 +00:00
	encrypt_block	v0, w2, x1, x7, w8
.Lmacloop4x:
2021-02-03 12:36:24 +01:00
	subs	w3, w3, #4
2018-03-10 15:21:53 +00:00
	bmi	.Lmac1x
2021-02-03 12:36:24 +01:00
	ld1	{v1.16b-v4.16b}, [x0], #64	/* get next pt block */
2018-03-10 15:21:53 +00:00
	eor	v0.16b, v0.16b, v1.16b		/* ..and xor with dg */
2021-02-03 12:36:24 +01:00
	encrypt_block	v0, w2, x1, x7, w8
2018-03-10 15:21:53 +00:00
	eor	v0.16b, v0.16b, v2.16b
2021-02-03 12:36:24 +01:00
	encrypt_block	v0, w2, x1, x7, w8
2018-03-10 15:21:53 +00:00
	eor	v0.16b, v0.16b, v3.16b
2021-02-03 12:36:24 +01:00
	encrypt_block	v0, w2, x1, x7, w8
2018-03-10 15:21:53 +00:00
	eor	v0.16b, v0.16b, v4.16b
2021-02-03 12:36:24 +01:00
	cmp	w3, wzr
	csinv	x5, x6, xzr, eq
2018-03-10 15:21:53 +00:00
	cbz	w5, .Lmacout
2021-02-03 12:36:24 +01:00
	encrypt_block	v0, w2, x1, x7, w8
	st1	{v0.16b}, [x4]			/* return dg */
2021-03-02 10:01:12 +01:00
	cond_yield	.Lmacout, x7, x8
2018-03-10 15:21:53 +00:00
	b	.Lmacloop4x
.Lmac1x:
2021-02-03 12:36:24 +01:00
	add	w3, w3, #4
2017-02-03 14:49:37 +00:00
.Lmacloop:
2021-02-03 12:36:24 +01:00
	cbz	w3, .Lmacout
	ld1	{v1.16b}, [x0], #16		/* get next pt block */
2017-02-03 14:49:37 +00:00
	eor	v0.16b, v0.16b, v1.16b		/* ..and xor with dg */
2021-02-03 12:36:24 +01:00
	subs	w3, w3, #1
	csinv	x5, x6, xzr, eq
2017-02-03 14:49:37 +00:00
	cbz	w5, .Lmacout
2018-04-30 18:18:24 +02:00
.Lmacenc:
2021-02-03 12:36:24 +01:00
	encrypt_block	v0, w2, x1, x7, w8
2017-02-03 14:49:37 +00:00
	b	.Lmacloop
.Lmacout:
2021-02-03 12:36:24 +01:00
	st1	{v0.16b}, [x4]			/* return dg */
	mov	w0, w3
2017-02-03 14:49:37 +00:00
	ret
2020-02-18 19:58:26 +00:00
AES_FUNC_END(aes_mac_update)