/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * linux/arch/arm64/crypto/aes-modes.S - chaining mode wrappers for AES
 *
 * Copyright (C) 2013 - 2017 Linaro Ltd <ard.biesheuvel@linaro.org>
 */
/* included by aes-ce.S and aes-neon.S */
.text
.align 4

#ifndef MAX_STRIDE
#define MAX_STRIDE	4
#endif

#if MAX_STRIDE == 4
#define ST4(x...) x
#define ST5(x...)
#else
#define ST4(x...)
#define ST5(x...) x
#endif
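
/*
 * MAX_STRIDE is the interleave factor: 4 by default, 5 when the including
 * implementation defines it (the Crypto Extensions code, which is the only
 * one using the 5-way interleave). ST4()/ST5() emit their argument only in
 * the 4-way/5-way build respectively, so both variants share this source.
 */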

aes_encrypt_block4x:
	encrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
	ret
ENDPROC(aes_encrypt_block4x)

aes_decrypt_block4x:
	decrypt_block4x	v0, v1, v2, v3, w3, x2, x8, w7
	ret
ENDPROC(aes_decrypt_block4x)

#if MAX_STRIDE == 5
aes_encrypt_block5x:
	encrypt_block5x	v0, v1, v2, v3, v4, w3, x2, x8, w7
	ret
ENDPROC(aes_encrypt_block5x)

aes_decrypt_block5x:
	decrypt_block5x	v0, v1, v2, v3, v4, w3, x2, x8, w7
	ret
ENDPROC(aes_decrypt_block5x)
#endif
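
/*
 * The blockNx helpers encrypt/decrypt v0-v3 (plus v4 for the 5-way
 * variants) in place; w3 holds the round count and x2 the round key
 * pointer, with x8 and w7 passed as temporaries to the underlying macros.
 */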

	/*
	 * aes_ecb_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
	 *		   int blocks)
	 * aes_ecb_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
	 *		   int blocks)
	 */
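	/*
	 * Per AAPCS64 the arguments arrive as x0 = out, x1 = in, x2 = rk,
	 * w3 = rounds and w4 = blocks; x5 and w6 are used as scratch below.
	 */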
AES_ENTRY(aes_ecb_encrypt)
	stp		x29, x30, [sp, #-16]!
	mov		x29, sp

	enc_prepare	w3, x2, x5

.LecbencloopNx:
	subs		w4, w4, #MAX_STRIDE
	bmi		.Lecbenc1x
	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
ST4(	bl		aes_encrypt_block4x		)
ST5(	ld1		{v4.16b}, [x1], #16		)
ST5(	bl		aes_encrypt_block5x		)
	st1		{v0.16b-v3.16b}, [x0], #64
ST5(	st1		{v4.16b}, [x0], #16		)
	b		.LecbencloopNx
.Lecbenc1x:
	adds		w4, w4, #MAX_STRIDE
	beq		.Lecbencout
.Lecbencloop:
	ld1		{v0.16b}, [x1], #16		/* get next pt block */
	encrypt_block	v0, w3, x2, x5, w6
	st1		{v0.16b}, [x0], #16
	subs		w4, w4, #1
	bne		.Lecbencloop
.Lecbencout:
	ldp		x29, x30, [sp], #16
	ret
AES_ENDPROC(aes_ecb_encrypt)

AES_ENTRY(aes_ecb_decrypt)
	stp		x29, x30, [sp, #-16]!
	mov		x29, sp

	dec_prepare	w3, x2, x5

.LecbdecloopNx:
	subs		w4, w4, #MAX_STRIDE
	bmi		.Lecbdec1x
	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
ST4(	bl		aes_decrypt_block4x		)
ST5(	ld1		{v4.16b}, [x1], #16		)
ST5(	bl		aes_decrypt_block5x		)
	st1		{v0.16b-v3.16b}, [x0], #64
ST5(	st1		{v4.16b}, [x0], #16		)
	b		.LecbdecloopNx
.Lecbdec1x:
	adds		w4, w4, #MAX_STRIDE
	beq		.Lecbdecout
.Lecbdecloop:
	ld1		{v0.16b}, [x1], #16		/* get next ct block */
	decrypt_block	v0, w3, x2, x5, w6
	st1		{v0.16b}, [x0], #16
	subs		w4, w4, #1
	bne		.Lecbdecloop
.Lecbdecout:
	ldp		x29, x30, [sp], #16
	ret
AES_ENDPROC(aes_ecb_decrypt)

	/*
	 * aes_cbc_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
	 *		   int blocks, u8 iv[])
	 * aes_cbc_decrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
	 *		   int blocks, u8 iv[])
	 */
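	/*
	 * Same register mapping as the ECB routines, plus x5 = iv: the IV is
	 * read from [x5] and the next IV is written back there so that
	 * consecutive calls can chain.
	 */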
AES_ENTRY(aes_cbc_encrypt)
	ld1		{v4.16b}, [x5]			/* get iv */
	enc_prepare	w3, x2, x6

.Lcbcencloop4x:
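	/*
	 * CBC encryption is inherently serial (each block's input depends on
	 * the previous ciphertext), so the 4x loop only batches the loads and
	 * stores; the encryptions themselves still run one block at a time.
	 */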
	subs		w4, w4, #4
	bmi		.Lcbcenc1x
	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
	eor		v0.16b, v0.16b, v4.16b		/* ..and xor with iv */
	encrypt_block	v0, w3, x2, x6, w7
	eor		v1.16b, v1.16b, v0.16b
	encrypt_block	v1, w3, x2, x6, w7
	eor		v2.16b, v2.16b, v1.16b
	encrypt_block	v2, w3, x2, x6, w7
	eor		v3.16b, v3.16b, v2.16b
	encrypt_block	v3, w3, x2, x6, w7
	st1		{v0.16b-v3.16b}, [x0], #64
	mov		v4.16b, v3.16b
	b		.Lcbcencloop4x
.Lcbcenc1x:
	adds		w4, w4, #4
	beq		.Lcbcencout
.Lcbcencloop:
	ld1		{v0.16b}, [x1], #16		/* get next pt block */
	eor		v4.16b, v4.16b, v0.16b		/* ..and xor with iv */
	encrypt_block	v4, w3, x2, x6, w7
	st1		{v4.16b}, [x0], #16
	subs		w4, w4, #1
	bne		.Lcbcencloop
.Lcbcencout:
	st1		{v4.16b}, [x5]			/* return iv */
	ret
AES_ENDPROC(aes_cbc_encrypt)

AES_ENTRY(aes_cbc_decrypt)
	stp		x29, x30, [sp, #-16]!
	mov		x29, sp

	ld1		{cbciv.16b}, [x5]		/* get iv */
	dec_prepare	w3, x2, x6

.LcbcdecloopNx:
	subs		w4, w4, #MAX_STRIDE
	bmi		.Lcbcdec1x
	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
#if MAX_STRIDE == 5
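	/*
	 * Only three spare registers (v5-v7) are available to hold ciphertext
	 * copies, so ct blocks 3 and 4 are reloaded from memory after the
	 * decryption; ct block 4 also becomes the next IV (cbciv).
	 */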
	ld1		{v4.16b}, [x1], #16		/* get 1 ct block */
	mov		v5.16b, v0.16b
	mov		v6.16b, v1.16b
	mov		v7.16b, v2.16b
	bl		aes_decrypt_block5x
	sub		x1, x1, #32
	eor		v0.16b, v0.16b, cbciv.16b
	eor		v1.16b, v1.16b, v5.16b
	ld1		{v5.16b}, [x1], #16		/* reload 1 ct block */
	ld1		{cbciv.16b}, [x1], #16		/* reload 1 ct block */
	eor		v2.16b, v2.16b, v6.16b
	eor		v3.16b, v3.16b, v7.16b
	eor		v4.16b, v4.16b, v5.16b
#else
	mov		v4.16b, v0.16b
	mov		v5.16b, v1.16b
	mov		v6.16b, v2.16b
	bl		aes_decrypt_block4x
	sub		x1, x1, #16
	eor		v0.16b, v0.16b, cbciv.16b
	eor		v1.16b, v1.16b, v4.16b
	ld1		{cbciv.16b}, [x1], #16		/* reload 1 ct block */
	eor		v2.16b, v2.16b, v5.16b
	eor		v3.16b, v3.16b, v6.16b
#endif
	st1		{v0.16b-v3.16b}, [x0], #64
ST5(	st1		{v4.16b}, [x0], #16		)
	b		.LcbcdecloopNx
.Lcbcdec1x:
	adds		w4, w4, #MAX_STRIDE
	beq		.Lcbcdecout
.Lcbcdecloop:
	ld1		{v1.16b}, [x1], #16		/* get next ct block */
	mov		v0.16b, v1.16b			/* ...and copy to v0 */
	decrypt_block	v0, w3, x2, x6, w7
	eor		v0.16b, v0.16b, cbciv.16b	/* xor with iv => pt */
	mov		cbciv.16b, v1.16b		/* ct is next iv */
	st1		{v0.16b}, [x0], #16
	subs		w4, w4, #1
	bne		.Lcbcdecloop
.Lcbcdecout:
	st1		{cbciv.16b}, [x5]		/* return iv */
	ldp		x29, x30, [sp], #16
	ret
AES_ENDPROC(aes_cbc_decrypt)

	/*
	 * aes_cbc_cts_encrypt(u8 out[], u8 const in[], u32 const rk[],
	 *		       int rounds, int bytes, u8 const iv[])
	 * aes_cbc_cts_decrypt(u8 out[], u8 const in[], u32 const rk[],
	 *		       int rounds, int bytes, u8 const iv[])
	 */
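	/*
	 * These implement ciphertext stealing for the final two, possibly
	 * partial, blocks of a CBC message: w4 is a byte count rather than a
	 * block count, and the tail is handled with overlapping loads and
	 * stores driven by .Lcts_permute_table.
	 */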
AES_ENTRY(aes_cbc_cts_encrypt)
	adr_l		x8, .Lcts_permute_table
	sub		x4, x4, #16
	add		x9, x8, #32
	add		x8, x8, x4
	sub		x9, x9, x4
	ld1		{v3.16b}, [x8]
	ld1		{v4.16b}, [x9]

	ld1		{v0.16b}, [x1], x4		/* overlapping loads */
	ld1		{v1.16b}, [x1]

	ld1		{v5.16b}, [x5]			/* get iv */
	enc_prepare	w3, x2, x6

	eor		v0.16b, v0.16b, v5.16b		/* xor with iv */
	tbl		v1.16b, {v1.16b}, v4.16b
	encrypt_block	v0, w3, x2, x6, w7

	eor		v1.16b, v1.16b, v0.16b
	tbl		v0.16b, {v0.16b}, v3.16b
	encrypt_block	v1, w3, x2, x6, w7

	add		x4, x0, x4
	st1		{v0.16b}, [x4]			/* overlapping stores */
	st1		{v1.16b}, [x0]
	ret
AES_ENDPROC(aes_cbc_cts_encrypt)

AES_ENTRY(aes_cbc_cts_decrypt)
	adr_l		x8, .Lcts_permute_table
	sub		x4, x4, #16
	add		x9, x8, #32
	add		x8, x8, x4
	sub		x9, x9, x4
	ld1		{v3.16b}, [x8]
	ld1		{v4.16b}, [x9]

	ld1		{v0.16b}, [x1], x4		/* overlapping loads */
	ld1		{v1.16b}, [x1]

	ld1		{v5.16b}, [x5]			/* get iv */
	dec_prepare	w3, x2, x6

	tbl		v2.16b, {v1.16b}, v4.16b
	decrypt_block	v0, w3, x2, x6, w7
	eor		v2.16b, v2.16b, v0.16b
	tbx		v0.16b, {v1.16b}, v4.16b
	tbl		v2.16b, {v2.16b}, v3.16b
	decrypt_block	v0, w3, x2, x6, w7
	eor		v0.16b, v0.16b, v5.16b		/* xor with iv */

	add		x4, x0, x4
	st1		{v2.16b}, [x4]			/* overlapping stores */
	st1		{v0.16b}, [x0]
	ret
AES_ENDPROC(aes_cbc_cts_decrypt)

	.section	".rodata", "a"
.align 6
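	/*
	 * Index vectors for tbl/tbx: a 16-byte load at an offset into this
	 * table yields a shifted identity permutation, with the out-of-range
	 * 0xff entries zeroing (tbl) or preserving (tbx) the corresponding
	 * output bytes.
	 */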
.Lcts_permute_table:
	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	.byte		 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7
	.byte		 0x8,  0x9,  0xa,  0xb,  0xc,  0xd,  0xe,  0xf
	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	.previous

	/*
	 * aes_ctr_encrypt(u8 out[], u8 const in[], u8 const rk[], int rounds,
	 *		   int blocks, u8 ctr[])
	 */
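	/*
	 * x5 points to the 16-byte big-endian counter block; vctr keeps a
	 * copy in a vector register and x6 holds its low 64 bits byte-swapped
	 * so the per-block increments below are plain integer adds.
	 */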
AES_ENTRY(aes_ctr_encrypt)
	stp		x29, x30, [sp, #-16]!
	mov		x29, sp

	enc_prepare	w3, x2, x6
	ld1		{vctr.16b}, [x5]
	umov		x6, vctr.d[1]		/* keep swabbed ctr in reg */
	rev		x6, x6
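	/*
	 * The interleaved path below only ever increments the low 32 bits of
	 * the counter, so take the single-block path at .Lctrloop instead if
	 * those 32 bits could wrap while processing this request.
	 */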
	cmn		w6, w4			/* 32 bit overflow? */
	bcs		.Lctrloop
.LctrloopNx:
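	/*
	 * Build MAX_STRIDE counter blocks: copy vctr into v0-v3 (and v4 for
	 * the 5-way build), then overwrite the last 32-bit lane of each
	 * subsequent copy with the incremented, re-byte-swapped counter value
	 * held in w7-w9 (and w10).
	 */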
	subs		w4, w4, #MAX_STRIDE
	bmi		.Lctr1x
	add		w7, w6, #1
	mov		v0.16b, vctr.16b
	add		w8, w6, #2
	mov		v1.16b, vctr.16b
	add		w9, w6, #3
	mov		v2.16b, vctr.16b
	add		w9, w6, #3
	rev		w7, w7
	mov		v3.16b, vctr.16b
	rev		w8, w8
ST5(	mov		v4.16b, vctr.16b	)
	mov		v1.s[3], w7
	rev		w9, w9
ST5(	add		w10, w6, #4		)
	mov		v2.s[3], w8
ST5(	rev		w10, w10		)
	mov		v3.s[3], w9
ST5(	mov		v4.s[3], w10		)
	ld1		{v5.16b-v7.16b}, [x1], #48	/* get 3 input blocks */
ST4(	bl		aes_encrypt_block4x		)
ST5(	bl		aes_encrypt_block5x		)
	eor		v0.16b, v5.16b, v0.16b
ST4(	ld1		{v5.16b}, [x1], #16		)
	eor		v1.16b, v6.16b, v1.16b
ST5(	ld1		{v5.16b-v6.16b}, [x1], #32	)
	eor		v2.16b, v7.16b, v2.16b
	eor		v3.16b, v5.16b, v3.16b
ST5(	eor		v4.16b, v6.16b, v4.16b		)
	st1		{v0.16b-v3.16b}, [x0], #64
ST5(	st1		{v4.16b}, [x0], #16		)
	add		x6, x6, #MAX_STRIDE
	rev		x7, x6
	ins		vctr.d[1], x7
	cbz		w4, .Lctrout
	b		.LctrloopNx
.Lctr1x:
	adds		w4, w4, #MAX_STRIDE
	beq		.Lctrout
.Lctrloop:
	mov		v0.16b, vctr.16b
	encrypt_block	v0, w3, x2, x8, w7
	adds		x6, x6, #1		/* increment BE ctr */
	rev		x7, x6
	ins		vctr.d[1], x7
	bcs		.Lctrcarry		/* overflow? */

.Lctrcarrydone:
	subs		w4, w4, #1
	bmi		.Lctrtailblock		/* blocks <0 means tail block */
	ld1		{v3.16b}, [x1], #16
	eor		v3.16b, v0.16b, v3.16b
	st1		{v3.16b}, [x0], #16
	bne		.Lctrloop
.Lctrout:
	st1		{vctr.16b}, [x5]	/* return next CTR value */
	ldp		x29, x30, [sp], #16
	ret

.Lctrtailblock:
	st1		{v0.16b}, [x0]
	b		.Lctrout
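	/*
	 * The counter lives in vctr in big-endian byte order; x6 shadows its
	 * low 64 bits in CPU order. When incrementing x6 wraps around, the
	 * upper 64 bits of the counter are byte swapped, incremented and
	 * written back here before resuming the loop.
	 */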
.Lctrcarry:
	umov		x7, vctr.d[0]		/* load upper word of ctr  */
	rev		x7, x7			/* ... to handle the carry */
	add		x7, x7, #1
	rev		x7, x7
	ins		vctr.d[0], x7
	b		.Lctrcarrydone
AES_ENDPROC(aes_ctr_encrypt)
	/*
	 * aes_xts_encrypt(u8 out[], u8 const in[], u8 const rk1[], int rounds,
	 *		   int blocks, u8 const rk2[], u8 iv[], int first)
	 * aes_xts_decrypt(u8 out[], u8 const in[], u8 const rk1[], int rounds,
	 *		   int blocks, u8 const rk2[], u8 iv[], int first)
	 */
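	/*
	 * next_tweak computes the tweak for the following block by doubling
	 * the current 128-bit tweak in GF(2^128): each 64-bit half is shifted
	 * left by one, the carry out of the low half is propagated into the
	 * high half, and the carry out of the high half is folded back into
	 * the low half using the reduction constant 0x87. xts_load_mask sets
	 * up that constant (and the carry bit) in the xtsmask vector register.
	 */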
	.macro		next_tweak, out, in, tmp
	sshr		\tmp\().2d,  \in\().2d,   #63
	and		\tmp\().16b, \tmp\().16b, xtsmask.16b
	add		\out\().2d,  \in\().2d,   \in\().2d
	ext		\tmp\().16b, \tmp\().16b, \tmp\().16b, #8
	eor		\out\().16b, \out\().16b, \tmp\().16b
	.endm

	.macro		xts_load_mask, tmp
	movi		xtsmask.2s, #0x1
	movi		\tmp\().2s, #0x87
	uzp1		xtsmask.4s, xtsmask.4s, \tmp\().4s
	.endm
AES_ENTRY(aes_xts_encrypt)
	stp		x29, x30, [sp, #-16]!
	mov		x29, sp

	ld1		{v4.16b}, [x6]
	xts_load_mask	v8
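	/*
	 * On the first call (w7 != 0), derive the initial tweak by encrypting
	 * the IV with the second key schedule (rk2, in x5), then switch to
	 * the first key schedule (rk1, in x2) for the data blocks.
	 */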
	cbz		w7, .Lxtsencnotfirst
	enc_prepare	w3, x5, x8
	encrypt_block	v4, w3, x5, x8, w7		/* first tweak */
	enc_switch_key	w3, x2, x8
	b		.LxtsencNx
.Lxtsencnotfirst:
	enc_prepare	w3, x2, x8
.LxtsencloopNx:
	next_tweak	v4, v4, v8
.LxtsencNx:
	subs		w4, w4, #4
	bmi		.Lxtsenc1x
	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 pt blocks */
	next_tweak	v5, v4, v8
	eor		v0.16b, v0.16b, v4.16b
	next_tweak	v6, v5, v8
	eor		v1.16b, v1.16b, v5.16b
	eor		v2.16b, v2.16b, v6.16b
	next_tweak	v7, v6, v8
	eor		v3.16b, v3.16b, v7.16b
	bl		aes_encrypt_block4x
	eor		v3.16b, v3.16b, v7.16b
	eor		v0.16b, v0.16b, v4.16b
	eor		v1.16b, v1.16b, v5.16b
	eor		v2.16b, v2.16b, v6.16b
	st1		{v0.16b-v3.16b}, [x0], #64
	mov		v4.16b, v7.16b
	cbz		w4, .Lxtsencout
	xts_reload_mask	v8
	b		.LxtsencloopNx
.Lxtsenc1x:
	adds		w4, w4, #4
	beq		.Lxtsencout
.Lxtsencloop:
	ld1		{v1.16b}, [x1], #16
	eor		v0.16b, v1.16b, v4.16b
	encrypt_block	v0, w3, x2, x8, w7
	eor		v0.16b, v0.16b, v4.16b
	st1		{v0.16b}, [x0], #16
	subs		w4, w4, #1
	beq		.Lxtsencout
	next_tweak	v4, v4, v8
	b		.Lxtsencloop
.Lxtsencout:
	st1		{v4.16b}, [x6]
	ldp		x29, x30, [sp], #16
	ret
AES_ENDPROC(aes_xts_encrypt)
AES_ENTRY(aes_xts_decrypt)
	stp		x29, x30, [sp, #-16]!
	mov		x29, sp

	ld1		{v4.16b}, [x6]
	xts_load_mask	v8
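	/* as for encryption: derive the first tweak with rk2, then load the rk1 decryption keys */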
	cbz		w7, .Lxtsdecnotfirst
	enc_prepare	w3, x5, x8
	encrypt_block	v4, w3, x5, x8, w7		/* first tweak */
	dec_prepare	w3, x2, x8
	b		.LxtsdecNx
.Lxtsdecnotfirst:
	dec_prepare	w3, x2, x8
.LxtsdecloopNx:
	next_tweak	v4, v4, v8
.LxtsdecNx:
	subs		w4, w4, #4
	bmi		.Lxtsdec1x
	ld1		{v0.16b-v3.16b}, [x1], #64	/* get 4 ct blocks */
	next_tweak	v5, v4, v8
	eor		v0.16b, v0.16b, v4.16b
	next_tweak	v6, v5, v8
	eor		v1.16b, v1.16b, v5.16b
	eor		v2.16b, v2.16b, v6.16b
	next_tweak	v7, v6, v8
	eor		v3.16b, v3.16b, v7.16b
	bl		aes_decrypt_block4x
	eor		v3.16b, v3.16b, v7.16b
	eor		v0.16b, v0.16b, v4.16b
	eor		v1.16b, v1.16b, v5.16b
	eor		v2.16b, v2.16b, v6.16b
	st1		{v0.16b-v3.16b}, [x0], #64
	mov		v4.16b, v7.16b
	cbz		w4, .Lxtsdecout
	xts_reload_mask	v8
	b		.LxtsdecloopNx
.Lxtsdec1x:
	adds		w4, w4, #4
	beq		.Lxtsdecout
.Lxtsdecloop:
	ld1		{v1.16b}, [x1], #16
	eor		v0.16b, v1.16b, v4.16b
	decrypt_block	v0, w3, x2, x8, w7
	eor		v0.16b, v0.16b, v4.16b
	st1		{v0.16b}, [x0], #16
	subs		w4, w4, #1
	beq		.Lxtsdecout
	next_tweak	v4, v4, v8
	b		.Lxtsdecloop
.Lxtsdecout:
	st1		{v4.16b}, [x6]
	ldp		x29, x30, [sp], #16
	ret
AES_ENDPROC(aes_xts_decrypt)
	/*
	 * aes_mac_update(u8 const in[], u32 const rk[], int rounds,
	 *		  int blocks, u8 dg[], int enc_before, int enc_after)
	 */
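	/*
	 * CBC-MAC style chaining: each input block is xor'ed into the running
	 * digest dg[], which is then encrypted in place. A non-zero enc_before
	 * requests an extra encryption of the digest before the first block,
	 * and enc_after decides whether the digest is encrypted after
	 * absorbing the final block of this call or left for a later call.
	 */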
AES_ENTRY(aes_mac_update)
	frame_push	6

	mov		x19, x0
	mov		x20, x1
	mov		x21, x2
	mov		x22, x3
	mov		x23, x4
	mov		x24, x6

	ld1		{v0.16b}, [x23]			/* get dg */
	enc_prepare	w2, x1, x7
	cbz		w5, .Lmacloop4x

	encrypt_block	v0, w2, x1, x7, w8

.Lmacloop4x:
	subs		w22, w22, #4
	bmi		.Lmac1x
	ld1		{v1.16b-v4.16b}, [x19], #64	/* get next pt block */
	eor		v0.16b, v0.16b, v1.16b		/* ..and xor with dg */
	encrypt_block	v0, w21, x20, x7, w8
	eor		v0.16b, v0.16b, v2.16b
	encrypt_block	v0, w21, x20, x7, w8
	eor		v0.16b, v0.16b, v3.16b
	encrypt_block	v0, w21, x20, x7, w8
	eor		v0.16b, v0.16b, v4.16b
	cmp		w22, wzr
	csinv		x5, x24, xzr, eq
	cbz		w5, .Lmacout
	encrypt_block	v0, w21, x20, x7, w8
	st1		{v0.16b}, [x23]			/* return dg */
	cond_yield_neon	.Lmacrestart
	b		.Lmacloop4x
.Lmac1x:
	add		w22, w22, #4
.Lmacloop:
	cbz		w22, .Lmacout
	ld1		{v1.16b}, [x19], #16		/* get next pt block */
	eor		v0.16b, v0.16b, v1.16b		/* ..and xor with dg */

	subs		w22, w22, #1
	csinv		x5, x24, xzr, eq
	cbz		w5, .Lmacout

.Lmacenc:
	encrypt_block	v0, w21, x20, x7, w8
	b		.Lmacloop

.Lmacout:
	st1		{v0.16b}, [x23]			/* return dg */
	frame_pop
	ret
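	/*
	 * Resume point used by cond_yield_neon above: after the NEON unit has
	 * been yielded, reload the digest and the round keys before
	 * re-entering the 4x loop.
	 */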
.Lmacrestart:
	ld1		{v0.16b}, [x23]			/* get dg */
	enc_prepare	w21, x20, x0
	b		.Lmacloop4x
AES_ENDPROC(aes_mac_update)