2009-01-18 08:28:34 +03:00
/*
 * Implement AES algorithm in Intel AES-NI instructions.
 *
 * The white paper of AES-NI instructions can be downloaded from:
 *   http://softwarecommunity.intel.com/isn/downloads/intelavx/AES-Instructions-Set_WP.pdf
 *
 * Copyright (C) 2008, Intel Corp.
 *    Author: Huang Ying <ying.huang@intel.com>
 *            Vinodh Gopal <vinodh.gopal@intel.com>
 *            Kahraman Akdemir
 *
 * Added RFC4106 AES-GCM support for 128-bit keys under the AEAD
 * interface for 64-bit kernels.
 *    Authors: Erdinc Ozturk (erdinc.ozturk@intel.com)
 *             Aidan O'Mahony (aidan.o.mahony@intel.com)
 *             Adrian Hoban <adrian.hoban@intel.com>
 *             James Guilford (james.guilford@intel.com)
 *             Gabriele Paoloni <gabriele.paoloni@intel.com>
 *             Tadeusz Struk (tadeusz.struk@intel.com)
 *             Wajdi Feghali (wajdi.k.feghali@intel.com)
 *    Copyright (c) 2010, Intel Corporation.
 *
crypto: aesni-intel - Ported implementation to x86-32
The AES-NI instructions are also available in legacy mode so the 32-bit
architecture may profit from those, too.
To illustrate the performance gain here's a short summary of a dm-crypt
speed test on a Core i7 M620 running at 2.67GHz comparing both assembler
implementations:
x86:                 i586        aes-ni    delta
ECB, 256 bit:     93.8 MB/s   123.3 MB/s   +31.4%
CBC, 256 bit:     84.8 MB/s   262.3 MB/s  +209.3%
LRW, 256 bit:    108.6 MB/s   222.1 MB/s  +104.5%
XTS, 256 bit:    105.0 MB/s   205.5 MB/s   +95.7%
Additionally, due to some minor optimizations, the 64-bit version also
got a minor performance gain as seen below:
x86-64:           old impl.    new impl.    delta
ECB, 256 bit:    121.1 MB/s   123.0 MB/s    +1.5%
CBC, 256 bit:    285.3 MB/s   290.8 MB/s    +1.9%
LRW, 256 bit:    263.7 MB/s   265.3 MB/s    +0.6%
XTS, 256 bit:    251.1 MB/s   255.3 MB/s    +1.7%
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
 * Ported x86_64 version to x86:
 *    Author: Mathias Krause <minipli@googlemail.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 */
#include <linux/linkage.h>
#include <asm/inst.h>
#include <asm/frame.h>
#include <asm/nospec-branch.h>

/*
 * The following macros are used to move an (un)aligned 16 byte value to/from
 * an XMM register.  This can be done for either FP or integer values: for FP
 * use movaps (move aligned packed single), for integer use movdqa (move double
 * quad aligned).  It doesn't make a performance difference which instruction
 * is used since Nehalem (original Core i7) was released.  However, movaps is
 * a byte shorter, so that is the one we'll use for now (same for unaligned).
 */
#define MOVADQ	movaps
#define MOVUDQ	movups

#ifdef __x86_64__
crypto: x86 - make constants readonly, allow linker to merge them
A lot of asm-optimized routines in arch/x86/crypto/ keep their
constants in .data. This is wrong, they should be in .rodata.
Many of these constants are the same in different modules.
For example, 128-bit shuffle mask 0x000102030405060708090A0B0C0D0E0F
exists in at least half a dozen places.
There is a way to let linker merge them and use just one copy.
The rules are as follows: mergeable objects of different sizes
should not share sections. You can't put them all in one .rodata
section, they will lose "mergeability".
GCC puts its mergeable constants in ".rodata.cstSIZE" sections,
or ".rodata.cstSIZE.<object_name>" if -fdata-sections is used.
This patch does the same:
.section .rodata.cst16.SHUF_MASK, "aM", @progbits, 16
It is important that all data in such a section consists of
16-byte elements, not larger ones, and that there is no implicit
use of one element from another.
When this is not the case, use non-mergeable section:
.section .rodata[.VAR_NAME], "a", @progbits
This reduces .data by ~15 kbytes:
    text    data     bss      dec    hex filename
11097415 2705840 2630712 16433967 fac32f vmlinux-prev.o
11112095 2690672 2630712 16433479 fac147 vmlinux.o
Merged objects are visible in System.map:
ffffffff81a28810 r POLY
ffffffff81a28810 r POLY
ffffffff81a28820 r TWOONE
ffffffff81a28820 r TWOONE
ffffffff81a28830 r PSHUFFLE_BYTE_FLIP_MASK <- merged regardless of
ffffffff81a28830 r SHUF_MASK <------------- the name difference
ffffffff81a28830 r SHUF_MASK
ffffffff81a28830 r SHUF_MASK
..
ffffffff81a28d00 r K512 <- merged three identical 640-byte tables
ffffffff81a28d00 r K512
ffffffff81a28d00 r K512
Use of object names in section name suffixes is not strictly necessary,
but might help if someday link stage will use garbage collection
to eliminate unused sections (ld --gc-sections).
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Josh Poimboeuf <jpoimboe@redhat.com>
CC: Xiaodong Liu <xiaodong.liu@intel.com>
CC: Megha Dey <megha.dey@intel.com>
CC: linux-crypto@vger.kernel.org
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
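The merging rule described in the commit message above can be pictured with a toy model. The C sketch below is not the linker's actual algorithm, and `struct cst16` and `merge_cst16` are hypothetical names introduced only for illustration. It shows the one property the message relies on: fixed-size entries with identical bytes can share a single address regardless of their symbol names, as PSHUFFLE_BYTE_FLIP_MASK and SHUF_MASK do in the System.map excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 16-byte constant as it might appear in a .rodata.cst16.* section. */
struct cst16 {
    const char *name;   /* symbol name; ignored when merging */
    uint8_t bytes[16];  /* contents of the entry */
    size_t addr;        /* offset assigned in the merged output */
};

/*
 * Toy model of merging fixed-size entries: entries whose 16 bytes are
 * identical receive the same output offset. Returns the merged size in
 * bytes. Illustration only, not how a real linker is implemented.
 */
size_t merge_cst16(struct cst16 *e, size_t n)
{
    size_t used = 0;

    for (size_t i = 0; i < n; i++) {
        size_t j;

        /* Look for an earlier entry with the same contents. */
        for (j = 0; j < i; j++) {
            if (memcmp(e[i].bytes, e[j].bytes, 16) == 0)
                break;
        }
        if (j < i) {
            e[i].addr = e[j].addr;  /* duplicate: reuse the earlier copy */
        } else {
            e[i].addr = used;       /* first occurrence: allocate 16 bytes */
            used += 16;
        }
    }
    return used;
}
```

Entries of a different size would need their own table in this model, which mirrors the rule that mergeable objects of different sizes should not share a section.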
# constants in mergeable sections, linker can reorder and merge
.section	.rodata.cst16.gf128mul_x_ble_mask, "aM", @progbits, 16
.align 16
.Lgf128mul_x_ble_mask:
	.octa 0x00000000000000010000000000000087
.section	.rodata.cst16.POLY, "aM", @progbits, 16
.align 16
POLY:   .octa 0xC2000000000000000000000000000001
.section	.rodata.cst16.TWOONE, "aM", @progbits, 16
.align 16
TWOONE: .octa 0x00000001000000000000000000000001
.section	.rodata.cst16.SHUF_MASK, "aM", @progbits, 16
.align 16
SHUF_MASK:  .octa 0x000102030405060708090A0B0C0D0E0F
.section	.rodata.cst16.MASK1, "aM", @progbits, 16
.align 16
MASK1:      .octa 0x0000000000000000ffffffffffffffff
.section	.rodata.cst16.MASK2, "aM", @progbits, 16
.align 16
MASK2:      .octa 0xffffffffffffffff0000000000000000
.section	.rodata.cst16.ONE, "aM", @progbits, 16
.align 16
ONE:        .octa 0x00000000000000000000000000000001
.section	.rodata.cst16.F_MIN_MASK, "aM", @progbits, 16
.align 16
F_MIN_MASK: .octa 0xf1f2f3f4f5f6f7f8f9fafbfcfdfeff0
.section	.rodata.cst16.dec, "aM", @progbits, 16
.align 16
dec:        .octa 0x1
.section	.rodata.cst16.enc, "aM", @progbits, 16
.align 16
enc:        .octa 0x2

# order of these constants should not change.
# more specifically, ALL_F should follow SHIFT_MASK,
# and zero should follow ALL_F
.section	.rodata, "a", @progbits
.align 16
SHIFT_MASK: .octa 0x0f0e0d0c0b0a09080706050403020100
ALL_F:      .octa 0xffffffffffffffffffffffffffffffff
            .octa 0x00000000000000000000000000000000
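
One way an ordering constraint like this can arise is when the constants are read with an unaligned 16-byte load at a variable offset, so that a single load straddles two neighbours. That usage is assumed here rather than taken from the lines above. The C sketch below models it with a hypothetical 48-byte `table` and hypothetical helpers `keep_low_bytes_mask` and `shift_control`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical mirror of the three adjacent constants in memory order.
 * x86 is little-endian, so each .octa value is stored low byte first.
 */
static const uint8_t table[48] = {
    /* SHIFT_MASK */
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
    0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
    /* ALL_F */
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    /* zero */
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

/* Copy the 16-byte window that starts `off` bytes into the table. */
static void load16(uint8_t out[16], size_t off)
{
    memcpy(out, table + off, 16);
}

/*
 * Mask that keeps the low n bytes (0 <= n <= 16) of a 16-byte block.
 * The window starts 16 - n bytes into ALL_F, so it holds n bytes of
 * 0xff followed by 16 - n bytes taken from the zero constant.
 */
void keep_low_bytes_mask(uint8_t out[16], unsigned n)
{
    load16(out, 16 + (16 - n));
}

/*
 * Byte-shuffle control that shifts a block right by n bytes
 * (0 <= n <= 16). The window starts n bytes into SHIFT_MASK and runs
 * into ALL_F; a control byte with its top bit set makes PSHUFB write
 * zero, so the vacated positions are cleared.
 */
void shift_control(uint8_t out[16], unsigned n)
{
    load16(out, n);
}
```

If the three constants were reordered or merged with unrelated data, both windows would pick up the wrong neighbouring bytes, which is consistent with placing them in a plain, non-mergeable `.rodata` section.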

.text

#define	STACK_OFFSET	8*3

#define AadHash 16*0
#define AadLen 16*1
#define InLen (16*1)+8
#define PBlockEncKey 16*2
#define OrigIV 16*3
#define CurCount 16*4
#define PBlockLen 16*5
#define HashKey		16*6	// store HashKey <<1 mod poly here
#define HashKey_2	16*7	// store HashKey^2 <<1 mod poly here
#define HashKey_3	16*8	// store HashKey^3 <<1 mod poly here
#define HashKey_4	16*9	// store HashKey^4 <<1 mod poly here
#define HashKey_k	16*10	// store XOR of High 64 bits and Low 64
				// bits of HashKey <<1 mod poly here
				// (for Karatsuba purposes)
#define HashKey_2_k	16*11	// store XOR of High 64 bits and Low 64
				// bits of HashKey^2 <<1 mod poly here
				// (for Karatsuba purposes)
#define HashKey_3_k	16*12	// store XOR of High 64 bits and Low 64
				// bits of HashKey^3 <<1 mod poly here
				// (for Karatsuba purposes)
#define HashKey_4_k	16*13	// store XOR of High 64 bits and Low 64
				// bits of HashKey^4 <<1 mod poly here
				// (for Karatsuba purposes)
2018-02-14 20:39:23 +03:00
2010-11-04 22:00:45 +03:00
#define arg1 rdi
#define arg2 rsi
#define arg3 rdx
#define arg4 rcx
#define arg5 r8
#define arg6 r9
2018-02-14 20:40:10 +03:00
#define arg7 STACK_OFFSET+8(%rsp)
#define arg8 STACK_OFFSET+16(%rsp)
#define arg9 STACK_OFFSET+24(%rsp)
#define arg10 STACK_OFFSET+32(%rsp)
#define arg11 STACK_OFFSET+40(%rsp)
2015-01-13 21:16:43 +03:00
#define keysize 2*15*16(%arg1)
2010-11-29 03:35:39 +03:00
#endif
2010-11-04 22:00:45 +03:00
2009-01-18 08:28:34 +03:00
#define STATE1	%xmm0
#define STATE2	%xmm4
#define STATE3	%xmm5
#define STATE4	%xmm6
#define STATE	STATE1
#define IN1	%xmm1
#define IN2	%xmm7
#define IN3	%xmm8
#define IN4	%xmm9
#define IN	IN1
#define KEY	%xmm2
#define IV	%xmm3
2010-11-27 11:34:46 +03:00
2010-03-10 13:28:55 +03:00
#define BSWAP_MASK %xmm10
#define CTR	%xmm11
#define INC	%xmm12
2009-01-18 08:28:34 +03:00
2013-04-08 22:51:16 +04:00
#define GF128MUL_MASK %xmm10
2010-11-27 11:34:46 +03:00
#ifdef __x86_64__
#define AREG	%rax
2009-01-18 08:28:34 +03:00
#define KEYP	%rdi
#define OUTP	%rsi
2010-11-27 11:34:46 +03:00
#define UKEYP	OUTP
2009-01-18 08:28:34 +03:00
#define INP	%rdx
#define LEN	%rcx
#define IVP	%r8
#define KLEN	%r9d
#define T1	%r10
#define TKEYP	T1
#define T2	%r11
2010-03-10 13:28:55 +03:00
#define TCTR_LOW T2
2010-11-27 11:34:46 +03:00
#else
#define AREG	%eax
#define KEYP	%edi
#define OUTP	AREG
#define UKEYP	OUTP
#define INP	%edx
#define LEN	%esi
#define IVP	%ebp
#define KLEN	%ebx
#define T1	%ecx
#define TKEYP	T1
#endif
2009-01-18 08:28:34 +03:00
2018-02-14 20:38:35 +03:00
.macro FUNC_SAVE
	push	%r12
	push	%r13
	push	%r14
#
# states of %xmm registers %xmm6:%xmm15 not saved
# all %xmm registers are clobbered
#
.endm
.macro FUNC_RESTORE
	pop	%r14
	pop	%r13
	pop	%r12
.endm
2010-11-04 22:00:45 +03:00
2018-02-14 20:40:10 +03:00
# Precompute hashkeys.
# Input: Hash subkey.
# Output: HashKeys stored in gcm_context_data.  Only needs to be called
# once per key.
# clobbers r12, and tmp xmm registers.
.macro PRECOMPUTE TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 TMP7
	mov	arg7, %r12
	movdqu	(%r12), \TMP3
	movdqa	SHUF_MASK(%rip), \TMP2
	PSHUFB_XMM \TMP2, \TMP3
	# precompute HashKey<<1 mod poly from the HashKey (required for GHASH)
	movdqa	\TMP3, \TMP2
	psllq	$1, \TMP3
	psrlq	$63, \TMP2
	movdqa	\TMP2, \TMP1
	pslldq	$8, \TMP2
	psrldq	$8, \TMP1
	por	\TMP2, \TMP3
	# reduce HashKey<<1
	pshufd	$0x24, \TMP1, \TMP2
	pcmpeqd	TWOONE(%rip), \TMP2
	pand	POLY(%rip), \TMP2
	pxor	\TMP2, \TMP3
	movdqa	\TMP3, HashKey(%arg2)
	movdqa	\TMP3, \TMP5
	pshufd	$78, \TMP3, \TMP1
	pxor	\TMP3, \TMP1
	movdqa	\TMP1, HashKey_k(%arg2)
	GHASH_MUL \TMP5, \TMP3, \TMP1, \TMP2, \TMP4, \TMP6, \TMP7
# TMP5 = HashKey^2<<1 (mod poly)
	movdqa	\TMP5, HashKey_2(%arg2)
# HashKey_2 = HashKey^2<<1 (mod poly)
	pshufd	$78, \TMP5, \TMP1
	pxor	\TMP5, \TMP1
	movdqa	\TMP1, HashKey_2_k(%arg2)
	GHASH_MUL \TMP5, \TMP3, \TMP1, \TMP2, \TMP4, \TMP6, \TMP7
# TMP5 = HashKey^3<<1 (mod poly)
	movdqa	\TMP5, HashKey_3(%arg2)
	pshufd	$78, \TMP5, \TMP1
	pxor	\TMP5, \TMP1
	movdqa	\TMP1, HashKey_3_k(%arg2)
	GHASH_MUL \TMP5, \TMP3, \TMP1, \TMP2, \TMP4, \TMP6, \TMP7
# TMP5 = HashKey^4<<1 (mod poly)
	movdqa	\TMP5, HashKey_4(%arg2)
	pshufd	$78, \TMP5, \TMP1
	pxor	\TMP5, \TMP1
	movdqa	\TMP1, HashKey_4_k(%arg2)
.endm
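The "HashKey<<1 mod poly" step at the top of PRECOMPUTE can be restated in scalar C as a sketch. It assumes POLY is the 128-bit constant 0xC2000000000000000000000000000001 (its definition is not shown in this part of the file, so treat the value as an assumption), and it uses a hypothetical two-word `u128` type for the byte-reflected hash subkey.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value split into two 64-bit halves. */
typedef struct { uint64_t hi, lo; } u128;

/* Shift the 128-bit value left by one bit; if a bit falls off the top,
 * fold it back in by XORing with POLY (assumed value, see lead-in). */
static u128 hashkey_shl1_mod_poly(u128 h)
{
    uint64_t carry = h.hi >> 63;           /* bit shifted out of bit 127 */
    u128 r;

    r.hi = (h.hi << 1) | (h.lo >> 63);     /* psllq/psrlq/pslldq/por sequence */
    r.lo = h.lo << 1;
    if (carry) {                           /* pshufd/pcmpeqd/pand select POLY or 0 */
        r.hi ^= 0xC200000000000000ULL;
        r.lo ^= 0x0000000000000001ULL;
    }
    return r;
}

/* Small usage example. */
static void hashkey_self_check(void)
{
    u128 a = hashkey_shl1_mod_poly((u128){ 0, 0x8000000000000000ULL });
    assert(a.hi == 1 && a.lo == 0);        /* carry between halves only */
}
```

The assembly version avoids the branch by building an all-ones or all-zero mask with `pcmpeqd` and ANDing it with POLY; the `if` here is only for readability.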
2018-02-14 20:38:45 +03:00
# GCM_INIT initializes a gcm_context struct to prepare for encoding/decoding.
# Clobbers rax, r10-r13 and xmm0-xmm7, %xmm13, %xmm14
.macro GCM_INIT
2018-02-14 20:39:45 +03:00
	mov	arg9, %r11
	mov	%r11, AadLen(%arg2)		# ctx_data.aad_length = aad_length
	xor	%r11, %r11
	mov	%r11, InLen(%arg2)		# ctx_data.in_length = 0
	mov	%r11, PBlockLen(%arg2)		# ctx_data.partial_block_length = 0
	mov	%r11, PBlockEncKey(%arg2)	# ctx_data.partial_block_enc_key = 0
	mov	%arg6, %rax
	movdqu	(%rax), %xmm0
	movdqu	%xmm0, OrigIV(%arg2)		# ctx_data.orig_IV = iv
	movdqa	SHUF_MASK(%rip), %xmm2
	PSHUFB_XMM %xmm2, %xmm0
	movdqu	%xmm0, CurCount(%arg2)		# ctx_data.current_counter = iv
2018-02-14 20:40:10 +03:00
	PRECOMPUTE %xmm1 %xmm2 %xmm3 %xmm4 %xmm5 %xmm6 %xmm7
	movdqa	HashKey(%arg2), %xmm13
2018-02-14 20:39:36 +03:00
	CALC_AAD_HASH %xmm13 %xmm0 %xmm1 %xmm2 %xmm3 %xmm4 \
	%xmm5 %xmm6
2018-02-14 20:38:45 +03:00
.endm
2018-02-14 20:39:10 +03:00
# GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context
# struct has been initialized by GCM_INIT.
# Requires the input data be at least 1 byte long because of READ_PARTIAL_BLOCK
# Clobbers rax, r10-r13, and xmm0-xmm15
.macro GCM_ENC_DEC operation
2018-02-14 20:39:45 +03:00
	movdqu	AadHash(%arg2), %xmm8
2018-02-14 20:40:10 +03:00
	movdqu	HashKey(%arg2), %xmm13
2018-02-14 20:39:45 +03:00
	add	%arg5, InLen(%arg2)
2018-02-14 20:40:19 +03:00
	xor	%r11, %r11	# initialise the data pointer offset as zero
	PARTIAL_BLOCK %arg3 %arg4 %arg5 %r11 %xmm8 \operation
	sub	%r11, %arg5	# sub partial block data used
2018-02-14 20:39:45 +03:00
	mov	%arg5, %r13	# save the number of bytes
2018-02-14 20:40:19 +03:00
2018-02-14 20:39:45 +03:00
	and	$-16, %r13	# %r13 = %r13 - (%r13 mod 16)
	mov	%r13, %r12
2018-02-14 20:39:10 +03:00
	# Encrypt/Decrypt first few blocks
	and	$(3<<4), %r12
	jz	_initial_num_blocks_is_0_\@
	cmp	$(2<<4), %r12
	jb	_initial_num_blocks_is_1_\@
	je	_initial_num_blocks_is_2_\@
_initial_num_blocks_is_3_\@:
	INITIAL_BLOCKS_ENC_DEC	%xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 5, 678, \operation
	sub	$48, %r13
	jmp	_initial_blocks_\@
_initial_num_blocks_is_2_\@:
	INITIAL_BLOCKS_ENC_DEC	%xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 6, 78, \operation
	sub	$32, %r13
	jmp	_initial_blocks_\@
_initial_num_blocks_is_1_\@:
	INITIAL_BLOCKS_ENC_DEC	%xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 7, 8, \operation
	sub	$16, %r13
	jmp	_initial_blocks_\@
_initial_num_blocks_is_0_\@:
	INITIAL_BLOCKS_ENC_DEC	%xmm9, %xmm10, %xmm13, %xmm11, %xmm12, %xmm0, \
%xmm1, %xmm2, %xmm3, %xmm4, %xmm8, %xmm5, %xmm6, 8, 0, \operation
_initial_blocks_\@:
	# Main loop - Encrypt/Decrypt remaining blocks
	cmp	$0, %r13
	je	_zero_cipher_left_\@
	sub	$64, %r13
	je	_four_cipher_left_\@
_crypt_by_4_\@:
	GHASH_4_ENCRYPT_4_PARALLEL_\operation	%xmm9, %xmm10, %xmm11, %xmm12, \
	%xmm13, %xmm14, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, \
	%xmm7, %xmm8, enc
	add	$64, %r11
	sub	$64, %r13
	jne	_crypt_by_4_\@
_four_cipher_left_\@:
	GHASH_LAST_4	%xmm9, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, \
%xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm8
_zero_cipher_left_\@:
2018-02-14 20:39:45 +03:00
	movdqu	%xmm8, AadHash(%arg2)
	movdqu	%xmm0, CurCount(%arg2)
2018-02-14 20:39:23 +03:00
	mov	%arg5, %r13
	and	$15, %r13			# %r13 = arg5 (mod 16)
2018-02-14 20:39:10 +03:00
	je	_multiple_of_16_bytes_\@
2018-02-14 20:39:45 +03:00
	mov	%r13, PBlockLen(%arg2)
2018-02-14 20:39:10 +03:00
	# Handle the last <16 Byte block separately
	paddd	ONE(%rip), %xmm0		# INCR CNT to get Yn
2018-02-14 20:39:45 +03:00
	movdqu	%xmm0, CurCount(%arg2)
2018-02-14 20:39:23 +03:00
	movdqa	SHUF_MASK(%rip), %xmm10
2018-02-14 20:39:10 +03:00
	PSHUFB_XMM %xmm10, %xmm0
	ENCRYPT_SINGLE_BLOCK	%xmm0, %xmm1	# Encrypt(K, Yn)
2018-02-14 20:39:45 +03:00
	movdqu	%xmm0, PBlockEncKey(%arg2)
2018-02-14 20:39:10 +03:00
2018-02-14 20:40:31 +03:00
	cmp	$16, %arg5
	jge	_large_enough_update_\@
2018-02-14 20:39:23 +03:00
	lea	(%arg4,%r11,1), %r10
2018-02-14 20:39:10 +03:00
	mov	%r13, %r12
	READ_PARTIAL_BLOCK %r10 %r12 %xmm2 %xmm1
2018-02-14 20:40:31 +03:00
	jmp	_data_read_\@
_large_enough_update_\@:
	sub	$16, %r11
	add	%r13, %r11
	# receive the last <16 Byte block
	movdqu	(%arg4, %r11, 1), %xmm1
2018-02-14 20:39:10 +03:00
2018-02-14 20:40:31 +03:00
	sub	%r13, %r11
	add	$16, %r11
	lea	SHIFT_MASK+16(%rip), %r12
	# adjust the shuffle mask pointer to be able to shift 16-r13 bytes
	# (r13 is the number of bytes in plaintext mod 16)
	sub	%r13, %r12
	# get the appropriate shuffle mask
	movdqu	(%r12), %xmm2
	# shift right 16-r13 bytes
	PSHUFB_XMM %xmm2, %xmm1
_data_read_\@:
2018-02-14 20:39:10 +03:00
	lea	ALL_F+16(%rip), %r12
	sub	%r13, %r12
2018-02-14 20:40:31 +03:00
2018-02-14 20:39:10 +03:00
.ifc \operation, dec
	movdqa	%xmm1, %xmm2
.endif
	pxor	%xmm1, %xmm0		# XOR Encrypt(K, Yn)
	movdqu	(%r12), %xmm1
	# get the appropriate mask to mask out top 16-r13 bytes of xmm0
	pand	%xmm1, %xmm0		# mask out top 16-r13 bytes of xmm0
.ifc \operation, dec
	pand	%xmm1, %xmm2
	movdqa	SHUF_MASK(%rip), %xmm10
	PSHUFB_XMM %xmm10, %xmm2
	pxor	%xmm2, %xmm8
.else
	movdqa	SHUF_MASK(%rip), %xmm10
	PSHUFB_XMM %xmm10, %xmm0
	pxor	%xmm0, %xmm8
.endif
2018-02-14 20:39:45 +03:00
	movdqu	%xmm8, AadHash(%arg2)
2018-02-14 20:39:10 +03:00
.ifc \operation, enc
	# GHASH computation for the last <16 byte block
	movdqa	SHUF_MASK(%rip), %xmm10
	# shuffle xmm0 back to output as ciphertext
	PSHUFB_XMM %xmm10, %xmm0
.endif
	# Output %r13 bytes
	MOVQ_R64_XMM %xmm0, %rax
	cmp	$8, %r13
	jle	_less_than_8_bytes_left_\@
2018-02-14 20:39:23 +03:00
	mov	%rax, (%arg3, %r11, 1)
2018-02-14 20:39:10 +03:00
	add	$8, %r11
	psrldq	$8, %xmm0
	MOVQ_R64_XMM %xmm0, %rax
	sub	$8, %r13
_less_than_8_bytes_left_\@:
2018-02-14 20:39:23 +03:00
	mov	%al, (%arg3, %r11, 1)
2018-02-14 20:39:10 +03:00
	add	$1, %r11
	shr	$8, %rax
	sub	$1, %r13
	jne	_less_than_8_bytes_left_\@
_multiple_of_16_bytes_\@:
.endm
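The length arithmetic in GCM_ENC_DEC (after the carried-over partial block has been consumed) can be summarised by the following C sketch. The struct and function names are hypothetical; the sketch only models how the byte count is split into 0 to 3 initial blocks, a multiple-of-64 bulk region for the four-block loop, and a tail of fewer than 16 bytes. It does not model the extra blocks that INITIAL_BLOCKS_ENC_DEC itself may process.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of how a byte count is carved up. */
struct gcm_split {
    unsigned initial_blocks;  /* 0..3 blocks handled before the 4-way loop */
    uint64_t bulk_bytes;      /* bytes left for the 4-blocks-at-a-time loop */
    unsigned tail_bytes;      /* final partial block, 0..15 bytes */
};

static struct gcm_split split_length(uint64_t len)
{
    struct gcm_split s;
    uint64_t full = len & ~(uint64_t)15;                    /* and $-16, %r13 */

    s.initial_blocks = (unsigned)((full & (3u << 4)) >> 4); /* and $(3<<4), %r12 */
    s.bulk_bytes = full - 16u * s.initial_blocks;           /* sub $48/$32/$16, %r13 */
    s.tail_bytes = (unsigned)(len & 15);                    /* and $15, %r13 */

    /* What remains for the main loop is always a whole number of 4-block groups. */
    assert(s.bulk_bytes % 64 == 0);
    return s;
}
```

For example, 127 bytes split into 3 initial blocks, 64 bulk bytes and a 15-byte tail.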
2018-02-14 20:38:57 +03:00
# GCM_COMPLETE Finishes update of tag of last partial block
# Output: Authentication Tag (AUTH_TAG)
# Clobbers rax, r10-r12, and xmm0, xmm1, xmm5-xmm15
.macro GCM_COMPLETE
2018-02-14 20:39:45 +03:00
	movdqu	AadHash(%arg2), %xmm8
2018-02-14 20:40:10 +03:00
	movdqu	HashKey(%arg2), %xmm13
2018-02-14 20:39:55 +03:00
	mov	PBlockLen(%arg2), %r12
	cmp	$0, %r12
	je	_partial_done\@
	GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
_partial_done\@:
2018-02-14 20:39:45 +03:00
	mov	AadLen(%arg2), %r12	# %r12 = aadLen (number of bytes)
2018-02-14 20:38:57 +03:00
	shl	$3, %r12		# convert into number of bits
	movd	%r12d, %xmm15		# len(A) in %xmm15
2018-02-14 20:39:45 +03:00
	mov	InLen(%arg2), %r12
	shl	$3, %r12		# len(C) in bits (bytes * 8)
	MOVQ_R64_XMM %r12, %xmm1
2018-02-14 20:38:57 +03:00
	pslldq	$8, %xmm15		# %xmm15 = len(A)||0x0000000000000000
	pxor	%xmm1, %xmm15		# %xmm15 = len(A)||len(C)
	pxor	%xmm15, %xmm8
	GHASH_MUL %xmm8, %xmm13, %xmm9, %xmm10, %xmm11, %xmm5, %xmm6
	# final GHASH computation
	movdqa	SHUF_MASK(%rip), %xmm10
	PSHUFB_XMM %xmm10, %xmm8
2018-02-14 20:39:45 +03:00
	movdqu	OrigIV(%arg2), %xmm0	# %xmm0 = Y0
2018-02-14 20:38:57 +03:00
	ENCRYPT_SINGLE_BLOCK	%xmm0, %xmm1	# E(K, Y0)
	pxor	%xmm8, %xmm0
_return_T_\@:
2018-02-14 20:39:23 +03:00
	mov	arg10, %r10		# %r10 = authTag
	mov	arg11, %r11		# %r11 = auth_tag_len
2018-02-14 20:38:57 +03:00
	cmp	$16, %r11
	je	_T_16_\@
	cmp	$8, %r11
	jl	_T_4_\@
_T_8_\@:
	MOVQ_R64_XMM %xmm0, %rax
	mov	%rax, (%r10)
	add	$8, %r10
	sub	$8, %r11
	psrldq	$8, %xmm0
	cmp	$0, %r11
	je	_return_T_done_\@
_T_4_\@:
	movd	%xmm0, %eax
	mov	%eax, (%r10)
	add	$4, %r10
	sub	$4, %r11
	psrldq	$4, %xmm0
	cmp	$0, %r11
	je	_return_T_done_\@
_T_123_\@:
	movd	%xmm0, %eax
	cmp	$2, %r11
	jl	_T_1_\@
	mov	%ax, (%r10)
	cmp	$2, %r11
	je	_return_T_done_\@
	add	$2, %r10
	sar	$16, %eax
_T_1_\@:
	mov	%al, (%r10)
	jmp	_return_T_done_\@
_T_16_\@:
	movdqu	%xmm0, (%r10)
_return_T_done_\@:
.endm
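The len(A)||len(C) block that GCM_COMPLETE folds into the hash before the final GHASH_MUL can be sketched in C as below. The struct and function names are hypothetical. The 32-bit truncation of the AAD bit length mirrors the `movd %r12d, %xmm15` instruction; the precondition in the assert is an assumption made so that the truncation is lossless.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of the 128-bit length block as two 64-bit halves. */
struct len_block {
    uint64_t hi;  /* len(A) in bits, placed in the upper qword by pslldq $8 */
    uint64_t lo;  /* len(C) in bits, placed in the lower qword */
};

static struct len_block gcm_len_block(uint64_t aad_len, uint64_t in_len)
{
    struct len_block b;

    /* Assumption: the AAD bit length fits in 32 bits. */
    assert(aad_len < ((uint64_t)1 << 29));

    b.hi = (uint32_t)(aad_len << 3);  /* shl $3, then movd keeps the low 32 bits */
    b.lo = in_len << 3;               /* shl $3: bytes to bits */
    return b;
}
```

With 20 bytes of AAD and 60 bytes of text, the block holds 160 in the upper half and 480 in the lower half.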
2010-11-29 03:35:39 +03:00
#ifdef __x86_64__
2010-11-04 22:00:45 +03:00
/* GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
*
*
* Input: A and B (128-bits each, bit-reflected)
* Output: C = A*B*x mod poly, (i.e. >>1 )
* To compute GH = GH*HashKey mod poly, give HK = HashKey<<1 mod poly as input
* GH = GH * HK * x mod poly which is equivalent to GH*HashKey mod poly.
*
*/
.macro GHASH_MUL GH HK TMP1 TMP2 TMP3 TMP4 TMP5
	movdqa	  \GH, \TMP1
	pshufd	  $78, \GH, \TMP2
	pshufd	  $78, \HK, \TMP3
	pxor	  \GH, \TMP2            # TMP2 = a1+a0
	pxor	  \HK, \TMP3            # TMP3 = b1+b0
	PCLMULQDQ 0x11, \HK, \TMP1     # TMP1 = a1*b1
	PCLMULQDQ 0x00, \HK, \GH       # GH = a0*b0
	PCLMULQDQ 0x00, \TMP3, \TMP2   # TMP2 = (a0+a1)*(b1+b0)
	pxor	  \GH, \TMP2
	pxor	  \TMP1, \TMP2          # TMP2 = (a0*b1)+(a1*b0)
	movdqa	  \TMP2, \TMP3
	pslldq	  $8, \TMP3             # left shift TMP3 2 DWs
	psrldq	  $8, \TMP2             # right shift TMP2 2 DWs
	pxor	  \TMP3, \GH
	pxor	  \TMP2, \TMP1          # TMP1:GH holds the result of GH*HK
	# first phase of the reduction
	movdqa    \GH, \TMP2
	movdqa    \GH, \TMP3
	movdqa    \GH, \TMP4            # copy GH into TMP2, TMP3 and TMP4
					# in order to perform
					# independent shifts
	pslld     $31, \TMP2            # packed left shift <<31
	pslld     $30, \TMP3            # packed left shift <<30
	pslld     $25, \TMP4            # packed left shift <<25
	pxor      \TMP3, \TMP2          # xor the shifted versions
	pxor      \TMP4, \TMP2
	movdqa    \TMP2, \TMP5
	psrldq    $4, \TMP5             # right shift TMP5 1 DW
	pslldq    $12, \TMP2            # left shift TMP2 3 DWs
	pxor      \TMP2, \GH
	# second phase of the reduction
	movdqa    \GH,\TMP2             # copy GH into TMP2, TMP3 and TMP4
					# in order to perform
					# independent shifts
	movdqa    \GH,\TMP3
	movdqa    \GH,\TMP4
	psrld     $1,\TMP2              # packed right shift >>1
	psrld     $2,\TMP3              # packed right shift >>2
	psrld     $7,\TMP4              # packed right shift >>7
	pxor      \TMP3,\TMP2		# xor the shifted versions
	pxor      \TMP4,\TMP2
	pxor      \TMP5, \TMP2
	pxor      \TMP2, \GH
	pxor      \TMP1, \GH            # result is in GH
.endm
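The first half of GHASH_MUL is a Karatsuba carry-less multiplication: three 64x64 products instead of four. The C sketch below reproduces only that multiplication step (not the two-phase reduction) with a portable bit-loop standing in for PCLMULQDQ. All names here are hypothetical and the code is an illustration, not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value split into two 64-bit halves. */
typedef struct { uint64_t hi, lo; } u128;

/* Carry-less (GF(2)[x]) product of two 64-bit polynomials;
 * a slow software stand-in for PCLMULQDQ. */
static u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = { 0, 0 };
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i)
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* 128x128 -> 256-bit carry-less product, out[0] least significant. */
static void clmul128_karatsuba(u128 a, u128 b, uint64_t out[4])
{
    u128 hi  = clmul64(a.hi, b.hi);                /* a1*b1 */
    u128 lo  = clmul64(a.lo, b.lo);                /* a0*b0 */
    u128 mid = clmul64(a.hi ^ a.lo, b.hi ^ b.lo);  /* (a1+a0)*(b1+b0) */

    /* mid ^= a1*b1 ^ a0*b0 leaves the cross terms a1*b0 + a0*b1 */
    mid.hi ^= hi.hi ^ lo.hi;
    mid.lo ^= hi.lo ^ lo.lo;

    out[0] = lo.lo;
    out[1] = lo.hi ^ mid.lo;   /* cross terms are shifted left by 64 bits */
    out[2] = hi.lo ^ mid.hi;
    out[3] = hi.hi;
}

/* Small usage example: (x + 1)^2 = x^2 + 1 over GF(2). */
static void karatsuba_self_check(void)
{
    uint64_t out[4];
    clmul128_karatsuba((u128){ 0, 3 }, (u128){ 0, 3 }, out);
    assert(out[0] == 5 && out[1] == 0 && out[2] == 0 && out[3] == 0);
}
```

This is why the HashKey_k style constants exist: they cache the XOR of the high and low halves of each key power so the macro callers do not have to recompute it for every block.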
2017-12-21 04:08:37 +03:00
# Reads DLEN bytes starting at DPTR and stores in XMMDst
# where 0 < DLEN < 16
# Clobbers %rax, DLEN and XMM1
.macro READ_PARTIAL_BLOCK DPTR DLEN XMM1 XMMDst
	cmp	$8, \DLEN
	jl	_read_lt8_\@
	mov	(\DPTR), %rax
	MOVQ_R64_XMM %rax, \XMMDst
	sub	$8, \DLEN
	jz	_done_read_partial_block_\@
	xor	%eax, %eax
_read_next_byte_\@:
	shl	$8, %rax
	mov	7(\DPTR, \DLEN, 1), %al
	dec	\DLEN
	jnz	_read_next_byte_\@
	MOVQ_R64_XMM %rax, \XMM1
	pslldq	$8, \XMM1
	por	\XMM1, \XMMDst
	jmp	_done_read_partial_block_\@
_read_lt8_\@:
	xor	%eax, %eax
_read_next_byte_lt8_\@:
	shl	$8, %rax
	mov	-1(\DPTR, \DLEN, 1), %al
	dec	\DLEN
	jnz	_read_next_byte_lt8_\@
	MOVQ_R64_XMM %rax, \XMMDst
_done_read_partial_block_\@:
.endm
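READ_PARTIAL_BLOCK never touches memory past DPTR + DLEN: trailing bytes are gathered one at a time, last byte first, into a register. A rough C equivalent is sketched below with hypothetical helper names. It returns the two 64-bit halves as little-endian integers rather than an XMM register, and for simplicity it uses the byte loop for the first eight bytes as well, where the macro does a single 8-byte load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulate n (0..8) bytes as a little-endian integer, last byte first,
 * like the "shl $8, %rax; mov ..., %al" loops. */
static uint64_t load_le_bytes(const uint8_t *p, size_t n)
{
    uint64_t acc = 0;
    while (n) {
        acc = (acc << 8) | p[n - 1];
        n--;
    }
    return acc;
}

/* Read len bytes (0 < len < 16) without reading past src + len.
 * *lo receives bytes 0..7, *hi receives bytes 8..15, zero padded. */
static void read_partial_block(const uint8_t *src, size_t len,
                               uint64_t *lo, uint64_t *hi)
{
    assert(len > 0 && len < 16);
    if (len >= 8) {
        *lo = load_le_bytes(src, 8);
        *hi = load_le_bytes(src + 8, len - 8);
    } else {
        *lo = load_le_bytes(src, len);
        *hi = 0;
    }
}
```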
2018-02-14 20:39:36 +03:00
# CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted.
# clobbers r10-11, xmm14
.macro CALC_AAD_HASH HASHKEY TMP1 TMP2 TMP3 TMP4 TMP5 \
	TMP6 TMP7
	MOVADQ	   SHUF_MASK(%rip), %xmm14
	mov	   arg8, %r10		# %r10 = AAD
	mov	   arg9, %r11		# %r11 = aadLen
	pxor	   \TMP7, \TMP7
	pxor	   \TMP6, \TMP6
2017-04-28 19:11:56 +03:00
	cmp	   $16, %r11
2018-02-14 20:38:12 +03:00
	jl	   _get_AAD_rest\@
_get_AAD_blocks\@:
2018-02-14 20:39:36 +03:00
	movdqu	   (%r10), \TMP7
	PSHUFB_XMM %xmm14, \TMP7	# byte-reflect the AAD data
	pxor	   \TMP7, \TMP6
	GHASH_MUL  \TMP6, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5
2017-04-28 19:11:56 +03:00
	add	   $16, %r10
	sub	   $16, %r11
	cmp	   $16, %r11
2018-02-14 20:38:12 +03:00
	jge	   _get_AAD_blocks\@
2017-04-28 19:11:56 +03:00
2018-02-14 20:39:36 +03:00
	movdqu	   \TMP6, \TMP7
2017-12-21 04:08:38 +03:00
	/* read the last <16B of AAD */
2018-02-14 20:38:12 +03:00
_get_AAD_rest\@:
2017-04-28 19:11:56 +03:00
	cmp	   $0, %r11
2018-02-14 20:38:12 +03:00
	je	   _get_AAD_done\@
2017-04-28 19:11:56 +03:00
2018-02-14 20:39:36 +03:00
	READ_PARTIAL_BLOCK %r10, %r11, \TMP1, \TMP7
	PSHUFB_XMM %xmm14, \TMP7	# byte-reflect the AAD data
	pxor	   \TMP6, \TMP7
	GHASH_MUL  \TMP7, \HASHKEY, \TMP1, \TMP2, \TMP3, \TMP4, \TMP5
	movdqu	   \TMP7, \TMP6
2010-12-13 14:51:15 +03:00
2018-02-14 20:38:12 +03:00
_get_AAD_done\@:
2018-02-14 20:39:36 +03:00
	movdqu	   \TMP6, AadHash(%arg2)
.endm
2018-02-14 20:40:19 +03:00
# PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks
# between update calls.
# Requires the input data be at least 1 byte long due to READ_PARTIAL_BLOCK
# Outputs encrypted bytes, and updates hash and partial info in gcm_context_data
# Clobbers rax, r10, r12, r13, xmm0-6, xmm9-13
.macro PARTIAL_BLOCK CYPH_PLAIN_OUT PLAIN_CYPH_IN PLAIN_CYPH_LEN DATA_OFFSET \
	AAD_HASH operation
	mov	PBlockLen(%arg2), %r13
	cmp	$0, %r13
	je	_partial_block_done_\@	# Leave Macro if no partial blocks
	# Read in input data without over reading
	cmp	$16, \PLAIN_CYPH_LEN
	jl	_fewer_than_16_bytes_\@
	movups	(\PLAIN_CYPH_IN), %xmm1	# If at least 16 bytes, just fill xmm
	jmp	_data_read_\@
_fewer_than_16_bytes_\@:
	lea	(\PLAIN_CYPH_IN, \DATA_OFFSET, 1), %r10
	mov	\PLAIN_CYPH_LEN, %r12
	READ_PARTIAL_BLOCK %r10 %r12 %xmm0 %xmm1
	mov	PBlockLen(%arg2), %r13
_data_read_\@:				# Finished reading in data
	movdqu	PBlockEncKey(%arg2), %xmm9
	movdqu	HashKey(%arg2), %xmm13
	lea	SHIFT_MASK(%rip), %r12
	# adjust the shuffle mask pointer to be able to shift r13 bytes
	# (16-r13 is the number of bytes in plaintext mod 16)
	add	%r13, %r12
	movdqu	(%r12), %xmm2		# get the appropriate shuffle mask
	PSHUFB_XMM %xmm2, %xmm9		# shift right r13 bytes
.ifc \operation, dec
	movdqa	%xmm1, %xmm3
	pxor	%xmm1, %xmm9		# Cyphertext XOR E(K, Yn)
	mov	\PLAIN_CYPH_LEN, %r10
	add	%r13, %r10
	# Set r10 to be the amount of data left in PLAIN_CYPH_IN after filling
	sub	$16, %r10
	# Determine if partial block is not being filled and
	# shift mask accordingly
	jge	_no_extra_mask_1_\@
	sub	%r10, %r12
_no_extra_mask_1_\@:
	movdqu	ALL_F-SHIFT_MASK(%r12), %xmm1
	# get the appropriate mask to mask out bottom r13 bytes of xmm9
	pand	%xmm1, %xmm9		# mask out bottom r13 bytes of xmm9
	pand	%xmm1, %xmm3
	movdqa	SHUF_MASK(%rip), %xmm10
	PSHUFB_XMM %xmm10, %xmm3
	PSHUFB_XMM %xmm2, %xmm3
	pxor	%xmm3, \AAD_HASH
	cmp	$0, %r10
	jl	_partial_incomplete_1_\@
	# GHASH computation for the last <16 Byte block
	GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
	xor	%rax,%rax
	mov	%rax, PBlockLen(%arg2)
	jmp	_dec_done_\@
_partial_incomplete_1_\@:
	add	\PLAIN_CYPH_LEN, PBlockLen(%arg2)
_dec_done_\@:
	movdqu	\AAD_HASH, AadHash(%arg2)
.else
	pxor	%xmm1, %xmm9		# Plaintext XOR E(K, Yn)
	mov	\PLAIN_CYPH_LEN, %r10
	add	%r13, %r10
	# Set r10 to be the amount of data left in PLAIN_CYPH_IN after filling
	sub	$16, %r10
	# Determine if partial block is not being filled and
	# shift mask accordingly
	jge	_no_extra_mask_2_\@
	sub	%r10, %r12
_no_extra_mask_2_\@:
	movdqu	ALL_F-SHIFT_MASK(%r12), %xmm1
	# get the appropriate mask to mask out bottom r13 bytes of xmm9
	pand	%xmm1, %xmm9
	movdqa	SHUF_MASK(%rip), %xmm1
	PSHUFB_XMM %xmm1, %xmm9
	PSHUFB_XMM %xmm2, %xmm9
	pxor	%xmm9, \AAD_HASH
	cmp	$0, %r10
	jl	_partial_incomplete_2_\@
	# GHASH computation for the last <16 Byte block
	GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6
	xor	%rax,%rax
	mov	%rax, PBlockLen(%arg2)
	jmp	_encode_done_\@
_partial_incomplete_2_\@:
	add	\PLAIN_CYPH_LEN, PBlockLen(%arg2)
_encode_done_\@:
	movdqu	\AAD_HASH, AadHash(%arg2)
	movdqa	SHUF_MASK(%rip), %xmm10
	# shuffle xmm9 back to output as ciphertext
	PSHUFB_XMM %xmm10, %xmm9
	PSHUFB_XMM %xmm2, %xmm9
.endif
	# output encrypted Bytes
	cmp	$0, %r10
	jl	_partial_fill_\@
	mov	%r13, %r12
	mov	$16, %r13
	# Set r13 to be the number of bytes to write out
	sub	%r12, %r13
	jmp	_count_set_\@
_partial_fill_\@:
	mov	\PLAIN_CYPH_LEN, %r13
_count_set_\@:
	movdqa	%xmm9, %xmm0
	MOVQ_R64_XMM %xmm0, %rax
	cmp	$8, %r13
	jle	_less_than_8_bytes_left_\@
	mov	%rax, (\CYPH_PLAIN_OUT, \DATA_OFFSET, 1)
	add	$8, \DATA_OFFSET
	psrldq	$8, %xmm0
	MOVQ_R64_XMM %xmm0, %rax
	sub	$8, %r13
_less_than_8_bytes_left_\@:
	movb	%al, (\CYPH_PLAIN_OUT, \DATA_OFFSET, 1)
	add	$1, \DATA_OFFSET
	shr	$8, %rax
	sub	$1, %r13
	jne	_less_than_8_bytes_left_\@
_partial_block_done_\@:
.endm # PARTIAL_BLOCK
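The bookkeeping in PARTIAL_BLOCK comes down to one comparison: does the new data, added to the bytes already buffered from the previous update call, reach 16 bytes? A small C sketch of that decision follows; the struct and function names are hypothetical and the sketch covers only the length accounting, not the masking, XOR or GHASH work.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of one partial-block step. */
struct pblock_step {
    unsigned bytes_out;       /* bytes written to the output this step */
    unsigned new_pblock_len;  /* value stored back into PBlockLen */
    int block_complete;       /* 1 if GHASH_MUL runs on the filled block */
};

static struct pblock_step partial_block_step(unsigned pblock_len, uint64_t len)
{
    struct pblock_step s;

    /* The macro returns early when PBlockLen is 0, and needs >= 1 input byte. */
    assert(pblock_len > 0 && pblock_len < 16 && len > 0);

    if (len + pblock_len >= 16) {      /* r10 = len + r13 - 16 >= 0 */
        s.bytes_out = 16 - pblock_len; /* fill the block, hash it */
        s.new_pblock_len = 0;
        s.block_complete = 1;
    } else {                           /* _partial_incomplete / _partial_fill */
        s.bytes_out = (unsigned)len;
        s.new_pblock_len = pblock_len + (unsigned)len;
        s.block_complete = 0;
    }
    return s;
}
```

With 5 bytes already buffered, a 3-byte update leaves 8 bytes pending, while any update of 11 bytes or more completes the block and consumes exactly 11 bytes of the new input.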
2018-02-14 20:39:36 +03:00
/*
* if a = number of total plaintext bytes
* b = floor(a/16)
* num_initial_blocks = b mod 4
* encrypt the initial num_initial_blocks blocks and apply ghash on
* the ciphertext
* %r10, %r11, %r12, %rax, %xmm5, %xmm6, %xmm7, %xmm8, %xmm9 registers
* are clobbered
2018-02-14 20:40:10 +03:00
* arg1, %arg2, %arg3 are used as a pointer only, not modified
2018-02-14 20:39:36 +03:00
*/
.macro INITIAL_BLOCKS_ENC_DEC TMP1 TMP2 TMP3 TMP4 TMP5 XMM0 XMM1 \
	XMM2 XMM3 XMM4 XMMDst TMP6 TMP7 i i_seq operation
2018-02-14 20:39:45 +03:00
	MOVADQ		SHUF_MASK(%rip), %xmm14
2018-02-14 20:39:36 +03:00
	movdqu		AadHash(%arg2), %xmm\i	# %xmm\i = AadHash
2017-04-28 19:11:56 +03:00
	# start AES for num_initial_blocks blocks
2010-12-13 14:51:15 +03:00
2018-02-14 20:39:45 +03:00
	movdqu		CurCount(%arg2), \XMM0	# XMM0 = Y0
2010-12-13 14:51:15 +03:00
.if ( \ i = = 5 ) | | ( \ i = = 6 ) | | ( \ i = = 7 )
2015-01-13 21:16:43 +03:00
MOVADQ O N E ( % R I P ) ,\ T M P 1
MOVADQ 0 ( % a r g 1 ) ,\ T M P 2
2010-12-13 14:51:15 +03:00
.irpc index, \ i _ s e q
2015-01-13 21:16:43 +03:00
paddd \ T M P 1 , \ X M M 0 # I N C R Y 0
2018-02-14 20:38:12 +03:00
.ifc \ operation, d e c
movdqa \ X M M 0 , % x m m \ i n d e x
.else
2015-01-13 21:16:43 +03:00
MOVADQ \ X M M 0 , % x m m \ i n d e x
2018-02-14 20:38:12 +03:00
.endif
2015-01-13 21:16:43 +03:00
PSHUFB_ X M M % x m m 1 4 , % x m m \ i n d e x # p e r f o r m a 16 b y t e s w a p
pxor \ T M P 2 , % x m m \ i n d e x
2010-12-13 14:51:15 +03:00
.endr
2015-01-13 21:16:43 +03:00
lea 0 x10 ( % a r g 1 ) ,% r10
mov k e y s i z e ,% e a x
shr $ 2 ,% e a x # 128 - > 4 , 1 9 2 - > 6 , 2 5 6 - > 8
add $ 5 ,% e a x # 128 - > 9 , 1 9 2 - > 1 1 , 2 5 6 - > 1 3
2018-02-14 20:38:12 +03:00
aes_ l o o p _ i n i t i a l _ \ @:
2015-01-13 21:16:43 +03:00
MOVADQ ( % r10 ) ,\ T M P 1
.irpc index, \ i _ s e q
AESENC \ T M P 1 , % x m m \ i n d e x
2010-12-13 14:51:15 +03:00
.endr
2015-01-13 21:16:43 +03:00
add $ 1 6 ,% r10
sub $ 1 ,% e a x
2018-02-14 20:38:12 +03:00
jnz a e s _ l o o p _ i n i t i a l _ \ @
2015-01-13 21:16:43 +03:00
MOVADQ ( % r10 ) , \ T M P 1
2010-12-13 14:51:15 +03:00
.irpc index, \ i _ s e q
2015-01-13 21:16:43 +03:00
AESENCLAST \ T M P 1 , % x m m \ i n d e x # L a s t R o u n d
2010-12-13 14:51:15 +03:00
.endr
.irpc index, \ i _ s e q
2018-02-14 20:39:23 +03:00
movdqu ( % a r g 4 , % r11 , 1 ) , \ T M P 1
2010-12-13 14:51:15 +03:00
pxor \ T M P 1 , % x m m \ i n d e x
2018-02-14 20:39:23 +03:00
movdqu % x m m \ i n d e x , ( % a r g 3 , % r11 , 1 )
2010-12-13 14:51:15 +03:00
# write b a c k p l a i n t e x t / c i p h e r t e x t f o r n u m _ i n i t i a l _ b l o c k s
add $ 1 6 , % r11
2018-02-14 20:38:12 +03:00
.ifc \ operation, d e c
movdqa \ T M P 1 , % x m m \ i n d e x
.endif
2010-12-13 14:51:15 +03:00
PSHUFB_ X M M % x m m 1 4 , % x m m \ i n d e x
# prepare p l a i n t e x t / c i p h e r t e x t f o r G H A S H c o m p u t a t i o n
.endr
.endif
2017-04-28 19:11:56 +03:00
	# apply GHASH on num_initial_blocks blocks

.if \i == 5
	pxor	%xmm5, %xmm6
	GHASH_MUL %xmm6, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
	pxor	%xmm6, %xmm7
	GHASH_MUL %xmm7, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
	pxor	%xmm7, %xmm8
	GHASH_MUL %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
.elseif \i == 6
	pxor	%xmm6, %xmm7
	GHASH_MUL %xmm7, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
	pxor	%xmm7, %xmm8
	GHASH_MUL %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
.elseif \i == 7
	pxor	%xmm7, %xmm8
	GHASH_MUL %xmm8, \TMP3, \TMP1, \TMP2, \TMP4, \TMP5, \XMM1
.endif
	cmp	$64, %r13
	jl	_initial_blocks_done\@
	# no need for precomputed values
/*
*
* Precomputations for HashKey parallel with encryption of first 4 blocks.
* HashKey_i_k holds XORed values of the low and high parts of the HashKey_i
*/
	MOVADQ	ONE(%RIP),\TMP1
	paddd	\TMP1, \XMM0			# INCR Y0
	MOVADQ	\XMM0, \XMM1
	PSHUFB_XMM %xmm14, \XMM1		# perform a 16 byte swap

	paddd	\TMP1, \XMM0			# INCR Y0
	MOVADQ	\XMM0, \XMM2
	PSHUFB_XMM %xmm14, \XMM2		# perform a 16 byte swap

	paddd	\TMP1, \XMM0			# INCR Y0
	MOVADQ	\XMM0, \XMM3
	PSHUFB_XMM %xmm14, \XMM3		# perform a 16 byte swap

	paddd	\TMP1, \XMM0			# INCR Y0
	MOVADQ	\XMM0, \XMM4
	PSHUFB_XMM %xmm14, \XMM4		# perform a 16 byte swap

	MOVADQ	0(%arg1),\TMP1
	pxor	\TMP1, \XMM1
	pxor	\TMP1, \XMM2
	pxor	\TMP1, \XMM3
	pxor	\TMP1, \XMM4
.irpc index, 1234 # do 4 rounds
	movaps	0x10*\index(%arg1), \TMP1
	AESENC	\TMP1, \XMM1
	AESENC	\TMP1, \XMM2
	AESENC	\TMP1, \XMM3
	AESENC	\TMP1, \XMM4
.endr
.irpc index, 56789 # do next 5 rounds
	movaps	0x10*\index(%arg1), \TMP1
	AESENC	\TMP1, \XMM1
	AESENC	\TMP1, \XMM2
	AESENC	\TMP1, \XMM3
	AESENC	\TMP1, \XMM4
.endr
	lea	0xa0(%arg1),%r10
	mov	keysize,%eax
	shr	$2,%eax				# 128->4, 192->6, 256->8
	sub	$4,%eax				# 128->0, 192->2, 256->4
	jz	aes_loop_pre_done\@

aes_loop_pre_\@:
	MOVADQ	(%r10),\TMP2
.irpc index, 1234
	AESENC	\TMP2, %xmm\index
.endr
	add	$16,%r10
	sub	$1,%eax
	jnz	aes_loop_pre_\@

aes_loop_pre_done\@:
	MOVADQ	(%r10), \TMP2
	AESENCLAST \TMP2, \XMM1
	AESENCLAST \TMP2, \XMM2
	AESENCLAST \TMP2, \XMM3
	AESENCLAST \TMP2, \XMM4
	movdqu	16*0(%arg4 , %r11 , 1), \TMP1
	pxor	\TMP1, \XMM1
.ifc \operation, dec
	movdqu	\XMM1, 16*0(%arg3 , %r11 , 1)
	movdqa	\TMP1, \XMM1
.endif
	movdqu	16*1(%arg4 , %r11 , 1), \TMP1
	pxor	\TMP1, \XMM2
.ifc \operation, dec
	movdqu	\XMM2, 16*1(%arg3 , %r11 , 1)
	movdqa	\TMP1, \XMM2
.endif
	movdqu	16*2(%arg4 , %r11 , 1), \TMP1
	pxor	\TMP1, \XMM3
.ifc \operation, dec
	movdqu	\XMM3, 16*2(%arg3 , %r11 , 1)
	movdqa	\TMP1, \XMM3
.endif
	movdqu	16*3(%arg4 , %r11 , 1), \TMP1
	pxor	\TMP1, \XMM4
.ifc \operation, dec
	movdqu	\XMM4, 16*3(%arg3 , %r11 , 1)
	movdqa	\TMP1, \XMM4
.else
	movdqu	\XMM1, 16*0(%arg3 , %r11 , 1)
	movdqu	\XMM2, 16*1(%arg3 , %r11 , 1)
	movdqu	\XMM3, 16*2(%arg3 , %r11 , 1)
	movdqu	\XMM4, 16*3(%arg3 , %r11 , 1)
.endif

	add	$64, %r11
	PSHUFB_XMM %xmm14, \XMM1		# perform a 16 byte swap
	pxor	\XMMDst, \XMM1
	# combine GHASHed value with the corresponding ciphertext
	PSHUFB_XMM %xmm14, \XMM2		# perform a 16 byte swap
	PSHUFB_XMM %xmm14, \XMM3		# perform a 16 byte swap
	PSHUFB_XMM %xmm14, \XMM4		# perform a 16 byte swap

_initial_blocks_done\@:

.endm

/*
* encrypt 4 blocks at a time
* ghash the 4 previously encrypted ciphertext blocks
* %arg1, %arg3, %arg4 are used as pointers only, not modified
* %r11 is the data offset value
*/
.macro GHASH_4_ENCRYPT_4_PARALLEL_ENC TMP1 TMP2 TMP3 TMP4 TMP5 \
TMP6 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 operation

	movdqa	\XMM1, \XMM5
	movdqa	\XMM2, \XMM6
	movdqa	\XMM3, \XMM7
	movdqa	\XMM4, \XMM8

	movdqa	SHUF_MASK(%rip), %xmm15
	# multiply XMM5 * HashKey_4 using karatsuba

	movdqa	\XMM5, \TMP4
	pshufd	$78, \XMM5, \TMP6
	pxor	\XMM5, \TMP6
	paddd	ONE(%rip), \XMM0		# INCR CNT
	movdqa	HashKey_4(%arg2), \TMP5
	PCLMULQDQ 0x11, \TMP5, \TMP4		# TMP4 = a1*b1
	movdqa	\XMM0, \XMM1
	paddd	ONE(%rip), \XMM0		# INCR CNT
	movdqa	\XMM0, \XMM2
	paddd	ONE(%rip), \XMM0		# INCR CNT
	movdqa	\XMM0, \XMM3
	paddd	ONE(%rip), \XMM0		# INCR CNT
	movdqa	\XMM0, \XMM4
	PSHUFB_XMM %xmm15, \XMM1		# perform a 16 byte swap
	PCLMULQDQ 0x00, \TMP5, \XMM5		# XMM5 = a0*b0
	PSHUFB_XMM %xmm15, \XMM2		# perform a 16 byte swap
	PSHUFB_XMM %xmm15, \XMM3		# perform a 16 byte swap
	PSHUFB_XMM %xmm15, \XMM4		# perform a 16 byte swap

	pxor	(%arg1), \XMM1
	pxor	(%arg1), \XMM2
	pxor	(%arg1), \XMM3
	pxor	(%arg1), \XMM4
	movdqa	HashKey_4_k(%arg2), \TMP5
	PCLMULQDQ 0x00, \TMP5, \TMP6		# TMP6 = (a1+a0)*(b1+b0)
	movaps	0x10(%arg1), \TMP1
	AESENC	\TMP1, \XMM1			# Round 1
	AESENC	\TMP1, \XMM2
	AESENC	\TMP1, \XMM3
	AESENC	\TMP1, \XMM4
	movaps	0x20(%arg1), \TMP1
	AESENC	\TMP1, \XMM1			# Round 2
	AESENC	\TMP1, \XMM2
	AESENC	\TMP1, \XMM3
	AESENC	\TMP1, \XMM4
	movdqa	\XMM6, \TMP1
	pshufd	$78, \XMM6, \TMP2
	pxor	\XMM6, \TMP2
	movdqa	HashKey_3(%arg2), \TMP5
	PCLMULQDQ 0x11, \TMP5, \TMP1		# TMP1 = a1 * b1
	movaps	0x30(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 3
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	PCLMULQDQ 0x00, \TMP5, \XMM6		# XMM6 = a0*b0
	movaps	0x40(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 4
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	movdqa	HashKey_3_k(%arg2), \TMP5
	PCLMULQDQ 0x00, \TMP5, \TMP2		# TMP2 = (a1+a0)*(b1+b0)
	movaps	0x50(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 5
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	pxor	\TMP1, \TMP4
	# accumulate the results in TMP4:XMM5, TMP6 holds the middle part
	pxor	\XMM6, \XMM5
	pxor	\TMP2, \TMP6
	movdqa	\XMM7, \TMP1
	pshufd	$78, \XMM7, \TMP2
	pxor	\XMM7, \TMP2
	movdqa	HashKey_2(%arg2), \TMP5

	# Multiply XMM7 * HashKey_2 using karatsuba

	PCLMULQDQ 0x11, \TMP5, \TMP1		# TMP1 = a1*b1
	movaps	0x60(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 6
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	PCLMULQDQ 0x00, \TMP5, \XMM7		# XMM7 = a0*b0
	movaps	0x70(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 7
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	movdqa	HashKey_2_k(%arg2), \TMP5
	PCLMULQDQ 0x00, \TMP5, \TMP2		# TMP2 = (a1+a0)*(b1+b0)
	movaps	0x80(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 8
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	pxor	\TMP1, \TMP4
	# accumulate the results in TMP4:XMM5, TMP6 holds the middle part
	pxor	\XMM7, \XMM5
	pxor	\TMP2, \TMP6

	# Multiply XMM8 * HashKey
	# XMM8 and TMP5 hold the values for the two operands

	movdqa	\XMM8, \TMP1
	pshufd	$78, \XMM8, \TMP2
	pxor	\XMM8, \TMP2
	movdqa	HashKey(%arg2), \TMP5
	PCLMULQDQ 0x11, \TMP5, \TMP1		# TMP1 = a1*b1
	movaps	0x90(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 9
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	PCLMULQDQ 0x00, \TMP5, \XMM8		# XMM8 = a0*b0
	lea	0xa0(%arg1),%r10
	mov	keysize,%eax
	shr	$2,%eax				# 128->4, 192->6, 256->8
	sub	$4,%eax				# 128->0, 192->2, 256->4
	jz	aes_loop_par_enc_done

aes_loop_par_enc:
	MOVADQ	(%r10),\TMP3
.irpc index, 1234
	AESENC	\TMP3, %xmm\index
.endr
	add	$16,%r10
	sub	$1,%eax
	jnz	aes_loop_par_enc

aes_loop_par_enc_done:
	MOVADQ	(%r10), \TMP3
	AESENCLAST \TMP3, \XMM1			# last round
	AESENCLAST \TMP3, \XMM2
	AESENCLAST \TMP3, \XMM3
	AESENCLAST \TMP3, \XMM4
	movdqa	HashKey_k(%arg2), \TMP5
	PCLMULQDQ 0x00, \TMP5, \TMP2		# TMP2 = (a1+a0)*(b1+b0)
	movdqu	(%arg4,%r11,1), \TMP3
	pxor	\TMP3, \XMM1			# Ciphertext/Plaintext XOR EK
	movdqu	16(%arg4,%r11,1), \TMP3
	pxor	\TMP3, \XMM2			# Ciphertext/Plaintext XOR EK
	movdqu	32(%arg4,%r11,1), \TMP3
	pxor	\TMP3, \XMM3			# Ciphertext/Plaintext XOR EK
	movdqu	48(%arg4,%r11,1), \TMP3
	pxor	\TMP3, \XMM4			# Ciphertext/Plaintext XOR EK
	movdqu	\XMM1, (%arg3,%r11,1)		# Write to the ciphertext buffer
	movdqu	\XMM2, 16(%arg3,%r11,1)		# Write to the ciphertext buffer
	movdqu	\XMM3, 32(%arg3,%r11,1)		# Write to the ciphertext buffer
	movdqu	\XMM4, 48(%arg3,%r11,1)		# Write to the ciphertext buffer
	PSHUFB_XMM %xmm15, \XMM1		# perform a 16 byte swap
	PSHUFB_XMM %xmm15, \XMM2		# perform a 16 byte swap
	PSHUFB_XMM %xmm15, \XMM3		# perform a 16 byte swap
	PSHUFB_XMM %xmm15, \XMM4		# perform a 16 byte swap

	pxor	\TMP4, \TMP1
	pxor	\XMM8, \XMM5
	pxor	\TMP6, \TMP2
	pxor	\TMP1, \TMP2
	pxor	\XMM5, \TMP2
	movdqa	\TMP2, \TMP3
	pslldq	$8, \TMP3			# left shift TMP3 2 DWs
	psrldq	$8, \TMP2			# right shift TMP2 2 DWs
	pxor	\TMP3, \XMM5
	pxor	\TMP2, \TMP1	# accumulate the results in TMP1:XMM5

	# first phase of reduction

	movdqa	\XMM5, \TMP2
	movdqa	\XMM5, \TMP3
	movdqa	\XMM5, \TMP4
	# move XMM5 into TMP2, TMP3, TMP4 in order to perform shifts independently
	pslld	$31, \TMP2			# packed left shift << 31
	pslld	$30, \TMP3			# packed left shift << 30
	pslld	$25, \TMP4			# packed left shift << 25
	pxor	\TMP3, \TMP2			# xor the shifted versions
	pxor	\TMP4, \TMP2
	movdqa	\TMP2, \TMP5
	psrldq	$4, \TMP5			# right shift TMP5 1 DW
	pslldq	$12, \TMP2			# left shift TMP2 3 DWs
	pxor	\TMP2, \XMM5

	# second phase of reduction

	movdqa	\XMM5,\TMP2	# make 3 copies of XMM5 into TMP2, TMP3, TMP4
	movdqa	\XMM5,\TMP3
	movdqa	\XMM5,\TMP4
	psrld	$1, \TMP2			# packed right shift >> 1
	psrld	$2, \TMP3			# packed right shift >> 2
	psrld	$7, \TMP4			# packed right shift >> 7
	pxor	\TMP3,\TMP2			# xor the shifted versions
	pxor	\TMP4,\TMP2
	pxor	\TMP5, \TMP2
	pxor	\TMP2, \XMM5
	pxor	\TMP1, \XMM5			# reduced result is in XMM5

	pxor	\XMM5, \XMM1
.endm

/*
* decrypt 4 blocks at a time
* ghash the 4 previously decrypted ciphertext blocks
* %arg1, %arg3, %arg4 are used as pointers only, not modified
* %r11 is the data offset value
*/
.macro GHASH_4_ENCRYPT_4_PARALLEL_DEC TMP1 TMP2 TMP3 TMP4 TMP5 \
TMP6 XMM0 XMM1 XMM2 XMM3 XMM4 XMM5 XMM6 XMM7 XMM8 operation

	movdqa	\XMM1, \XMM5
	movdqa	\XMM2, \XMM6
	movdqa	\XMM3, \XMM7
	movdqa	\XMM4, \XMM8

	movdqa	SHUF_MASK(%rip), %xmm15
	# multiply XMM5 * HashKey_4 using karatsuba

	movdqa	\XMM5, \TMP4
	pshufd	$78, \XMM5, \TMP6
	pxor	\XMM5, \TMP6
	paddd	ONE(%rip), \XMM0		# INCR CNT
	movdqa	HashKey_4(%arg2), \TMP5
	PCLMULQDQ 0x11, \TMP5, \TMP4		# TMP4 = a1*b1
	movdqa	\XMM0, \XMM1
	paddd	ONE(%rip), \XMM0		# INCR CNT
	movdqa	\XMM0, \XMM2
	paddd	ONE(%rip), \XMM0		# INCR CNT
	movdqa	\XMM0, \XMM3
	paddd	ONE(%rip), \XMM0		# INCR CNT
	movdqa	\XMM0, \XMM4
	PSHUFB_XMM %xmm15, \XMM1		# perform a 16 byte swap
	PCLMULQDQ 0x00, \TMP5, \XMM5		# XMM5 = a0*b0
	PSHUFB_XMM %xmm15, \XMM2		# perform a 16 byte swap
	PSHUFB_XMM %xmm15, \XMM3		# perform a 16 byte swap
	PSHUFB_XMM %xmm15, \XMM4		# perform a 16 byte swap

	pxor	(%arg1), \XMM1
	pxor	(%arg1), \XMM2
	pxor	(%arg1), \XMM3
	pxor	(%arg1), \XMM4
	movdqa	HashKey_4_k(%arg2), \TMP5
	PCLMULQDQ 0x00, \TMP5, \TMP6		# TMP6 = (a1+a0)*(b1+b0)
	movaps	0x10(%arg1), \TMP1
	AESENC	\TMP1, \XMM1			# Round 1
	AESENC	\TMP1, \XMM2
	AESENC	\TMP1, \XMM3
	AESENC	\TMP1, \XMM4
	movaps	0x20(%arg1), \TMP1
	AESENC	\TMP1, \XMM1			# Round 2
	AESENC	\TMP1, \XMM2
	AESENC	\TMP1, \XMM3
	AESENC	\TMP1, \XMM4
	movdqa	\XMM6, \TMP1
	pshufd	$78, \XMM6, \TMP2
	pxor	\XMM6, \TMP2
	movdqa	HashKey_3(%arg2), \TMP5
	PCLMULQDQ 0x11, \TMP5, \TMP1		# TMP1 = a1 * b1
	movaps	0x30(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 3
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	PCLMULQDQ 0x00, \TMP5, \XMM6		# XMM6 = a0*b0
	movaps	0x40(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 4
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	movdqa	HashKey_3_k(%arg2), \TMP5
	PCLMULQDQ 0x00, \TMP5, \TMP2		# TMP2 = (a1+a0)*(b1+b0)
	movaps	0x50(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 5
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	pxor	\TMP1, \TMP4
	# accumulate the results in TMP4:XMM5, TMP6 holds the middle part
	pxor	\XMM6, \XMM5
	pxor	\TMP2, \TMP6
	movdqa	\XMM7, \TMP1
	pshufd	$78, \XMM7, \TMP2
	pxor	\XMM7, \TMP2
	movdqa	HashKey_2(%arg2), \TMP5

	# Multiply XMM7 * HashKey_2 using karatsuba

	PCLMULQDQ 0x11, \TMP5, \TMP1		# TMP1 = a1*b1
	movaps	0x60(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 6
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	PCLMULQDQ 0x00, \TMP5, \XMM7		# XMM7 = a0*b0
	movaps	0x70(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 7
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	movdqa	HashKey_2_k(%arg2), \TMP5
	PCLMULQDQ 0x00, \TMP5, \TMP2		# TMP2 = (a1+a0)*(b1+b0)
	movaps	0x80(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 8
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	pxor	\TMP1, \TMP4
	# accumulate the results in TMP4:XMM5, TMP6 holds the middle part
	pxor	\XMM7, \XMM5
	pxor	\TMP2, \TMP6

	# Multiply XMM8 * HashKey
	# XMM8 and TMP5 hold the values for the two operands

	movdqa	\XMM8, \TMP1
	pshufd	$78, \XMM8, \TMP2
	pxor	\XMM8, \TMP2
	movdqa	HashKey(%arg2), \TMP5
	PCLMULQDQ 0x11, \TMP5, \TMP1		# TMP1 = a1*b1
	movaps	0x90(%arg1), \TMP3
	AESENC	\TMP3, \XMM1			# Round 9
	AESENC	\TMP3, \XMM2
	AESENC	\TMP3, \XMM3
	AESENC	\TMP3, \XMM4
	PCLMULQDQ 0x00, \TMP5, \XMM8		# XMM8 = a0*b0
	lea	0xa0(%arg1),%r10
	mov	keysize,%eax
	shr	$2,%eax				# 128->4, 192->6, 256->8
	sub	$4,%eax				# 128->0, 192->2, 256->4
	jz	aes_loop_par_dec_done

aes_loop_par_dec:
	MOVADQ	(%r10),\TMP3
.irpc index, 1234
	AESENC	\TMP3, %xmm\index
.endr
	add	$16,%r10
	sub	$1,%eax
	jnz	aes_loop_par_dec

aes_loop_par_dec_done:
	MOVADQ	(%r10), \TMP3
	AESENCLAST \TMP3, \XMM1			# last round
	AESENCLAST \TMP3, \XMM2
	AESENCLAST \TMP3, \XMM3
	AESENCLAST \TMP3, \XMM4
	movdqa	HashKey_k(%arg2), \TMP5
	PCLMULQDQ 0x00, \TMP5, \TMP2		# TMP2 = (a1+a0)*(b1+b0)
	movdqu	(%arg4,%r11,1), \TMP3
	pxor	\TMP3, \XMM1			# Ciphertext/Plaintext XOR EK
	movdqu	\XMM1, (%arg3,%r11,1)		# Write to plaintext buffer
	movdqa	\TMP3, \XMM1
	movdqu	16(%arg4,%r11,1), \TMP3
	pxor	\TMP3, \XMM2			# Ciphertext/Plaintext XOR EK
	movdqu	\XMM2, 16(%arg3,%r11,1)		# Write to plaintext buffer
	movdqa	\TMP3, \XMM2
	movdqu	32(%arg4,%r11,1), \TMP3
	pxor	\TMP3, \XMM3			# Ciphertext/Plaintext XOR EK
	movdqu	\XMM3, 32(%arg3,%r11,1)		# Write to plaintext buffer
	movdqa	\TMP3, \XMM3
	movdqu	48(%arg4,%r11,1), \TMP3
	pxor	\TMP3, \XMM4			# Ciphertext/Plaintext XOR EK
	movdqu	\XMM4, 48(%arg3,%r11,1)		# Write to plaintext buffer
	movdqa	\TMP3, \XMM4
	PSHUFB_XMM %xmm15, \XMM1		# perform a 16 byte swap
	PSHUFB_XMM %xmm15, \XMM2		# perform a 16 byte swap
	PSHUFB_XMM %xmm15, \XMM3		# perform a 16 byte swap
	PSHUFB_XMM %xmm15, \XMM4		# perform a 16 byte swap

	pxor	\TMP4, \TMP1
	pxor	\XMM8, \XMM5
	pxor	\TMP6, \TMP2
	pxor	\TMP1, \TMP2
	pxor	\XMM5, \TMP2
	movdqa	\TMP2, \TMP3
	pslldq	$8, \TMP3			# left shift TMP3 2 DWs
	psrldq	$8, \TMP2			# right shift TMP2 2 DWs
	pxor	\TMP3, \XMM5
	pxor	\TMP2, \TMP1	# accumulate the results in TMP1:XMM5

	# first phase of reduction

	movdqa	\XMM5, \TMP2
	movdqa	\XMM5, \TMP3
	movdqa	\XMM5, \TMP4
	# move XMM5 into TMP2, TMP3, TMP4 in order to perform shifts independently
	pslld	$31, \TMP2			# packed left shift << 31
	pslld	$30, \TMP3			# packed left shift << 30
	pslld	$25, \TMP4			# packed left shift << 25
	pxor	\TMP3, \TMP2			# xor the shifted versions
	pxor	\TMP4, \TMP2
	movdqa	\TMP2, \TMP5
	psrldq	$4, \TMP5			# right shift TMP5 1 DW
	pslldq	$12, \TMP2			# left shift TMP2 3 DWs
	pxor	\TMP2, \XMM5

	# second phase of reduction

	movdqa	\XMM5,\TMP2	# make 3 copies of XMM5 into TMP2, TMP3, TMP4
	movdqa	\XMM5,\TMP3
	movdqa	\XMM5,\TMP4
	psrld	$1, \TMP2			# packed right shift >> 1
	psrld	$2, \TMP3			# packed right shift >> 2
	psrld	$7, \TMP4			# packed right shift >> 7
	pxor	\TMP3,\TMP2			# xor the shifted versions
	pxor	\TMP4,\TMP2
	pxor	\TMP5, \TMP2
	pxor	\TMP2, \XMM5
	pxor	\TMP1, \XMM5			# reduced result is in XMM5

	pxor	\XMM5, \XMM1
.endm
/* GHASH the last 4 ciphertext blocks. */
.macro GHASH_LAST_4 TMP1 TMP2 TMP3 TMP4 TMP5 TMP6 \
TMP7 XMM1 XMM2 XMM3 XMM4 XMMDst

	# Multiply XMM1 * HashKey_4 (using Karatsuba)

	movdqa	\XMM1, \TMP6
	pshufd	$78, \XMM1, \TMP2
	pxor	\XMM1, \TMP2
	movdqa	HashKey_4(%arg2), \TMP5
	PCLMULQDQ 0x11, \TMP5, \TMP6		# TMP6 = a1*b1
	PCLMULQDQ 0x00, \TMP5, \XMM1		# XMM1 = a0*b0
	movdqa	HashKey_4_k(%arg2), \TMP4
	PCLMULQDQ 0x00, \TMP4, \TMP2		# TMP2 = (a1+a0)*(b1+b0)
	movdqa	\XMM1, \XMMDst
	movdqa	\TMP2, \XMM1		# result in TMP6, XMMDst, XMM1

	# Multiply XMM2 * HashKey_3 (using Karatsuba)

	movdqa	\XMM2, \TMP1
	pshufd	$78, \XMM2, \TMP2
	pxor	\XMM2, \TMP2
	movdqa	HashKey_3(%arg2), \TMP5
	PCLMULQDQ 0x11, \TMP5, \TMP1		# TMP1 = a1*b1
	PCLMULQDQ 0x00, \TMP5, \XMM2		# XMM2 = a0*b0
	movdqa	HashKey_3_k(%arg2), \TMP4
	PCLMULQDQ 0x00, \TMP4, \TMP2		# TMP2 = (a1+a0)*(b1+b0)
	pxor	\TMP1, \TMP6
	pxor	\XMM2, \XMMDst
	pxor	\TMP2, \XMM1
	# results accumulated in TMP6, XMMDst, XMM1

	# Multiply XMM3 * HashKey_2 (using Karatsuba)

	movdqa	\XMM3, \TMP1
	pshufd	$78, \XMM3, \TMP2
	pxor	\XMM3, \TMP2
	movdqa	HashKey_2(%arg2), \TMP5
	PCLMULQDQ 0x11, \TMP5, \TMP1		# TMP1 = a1*b1
	PCLMULQDQ 0x00, \TMP5, \XMM3		# XMM3 = a0*b0
	movdqa	HashKey_2_k(%arg2), \TMP4
	PCLMULQDQ 0x00, \TMP4, \TMP2		# TMP2 = (a1+a0)*(b1+b0)
	pxor	\TMP1, \TMP6
	pxor	\XMM3, \XMMDst
	pxor	\TMP2, \XMM1	# results accumulated in TMP6, XMMDst, XMM1

	# Multiply XMM4 * HashKey (using Karatsuba)

	movdqa	\XMM4, \TMP1
	pshufd	$78, \XMM4, \TMP2
	pxor	\XMM4, \TMP2
	movdqa	HashKey(%arg2), \TMP5
	PCLMULQDQ 0x11, \TMP5, \TMP1		# TMP1 = a1*b1
	PCLMULQDQ 0x00, \TMP5, \XMM4		# XMM4 = a0*b0
	movdqa	HashKey_k(%arg2), \TMP4
	PCLMULQDQ 0x00, \TMP4, \TMP2		# TMP2 = (a1+a0)*(b1+b0)
	pxor	\TMP1, \TMP6
	pxor	\XMM4, \XMMDst
	pxor	\XMM1, \TMP2
	pxor	\TMP6, \TMP2
	pxor	\XMMDst, \TMP2
	# middle section of the temp results combined as in karatsuba algorithm
	movdqa	\TMP2, \TMP4
	pslldq	$8, \TMP4			# left shift TMP4 2 DWs
	psrldq	$8, \TMP2			# right shift TMP2 2 DWs
	pxor	\TMP4, \XMMDst
	pxor	\TMP2, \TMP6
	# TMP6:XMMDst holds the result of the accumulated carry-less multiplications
	# first phase of the reduction
	movdqa	\XMMDst, \TMP2
	movdqa	\XMMDst, \TMP3
	movdqa	\XMMDst, \TMP4
	# move XMMDst into TMP2, TMP3, TMP4 in order to perform 3 shifts independently
	pslld	$31, \TMP2			# packed left shift << 31
	pslld	$30, \TMP3			# packed left shift << 30
	pslld	$25, \TMP4			# packed left shift << 25
	pxor	\TMP3, \TMP2			# xor the shifted versions
	pxor	\TMP4, \TMP2
	movdqa	\TMP2, \TMP7
	psrldq	$4, \TMP7			# right shift TMP7 1 DW
	pslldq	$12, \TMP2			# left shift TMP2 3 DWs
	pxor	\TMP2, \XMMDst

	# second phase of the reduction
	movdqa	\XMMDst, \TMP2
	# make 3 copies of XMMDst for doing 3 shift operations
	movdqa	\XMMDst, \TMP3
	movdqa	\XMMDst, \TMP4
	psrld	$1, \TMP2			# packed right shift >> 1
	psrld	$2, \TMP3			# packed right shift >> 2
	psrld	$7, \TMP4			# packed right shift >> 7
	pxor	\TMP3, \TMP2			# xor the shifted versions
	pxor	\TMP4, \TMP2
	pxor	\TMP7, \TMP2
	pxor	\TMP2, \XMMDst
	pxor	\TMP6, \XMMDst			# reduced result is in XMMDst
.endm

/* Encryption of a single block
* uses eax & r10
*/

.macro ENCRYPT_SINGLE_BLOCK XMM0 TMP1

	pxor		(%arg1), \XMM0
	mov		keysize,%eax
	shr		$2,%eax			# 128->4, 192->6, 256->8
	add		$5,%eax			# 128->9, 192->11, 256->13
	lea		16(%arg1), %r10		# get first expanded key address

_esb_loop_\@:
	MOVADQ		(%r10),\TMP1
	AESENC		\TMP1,\XMM0
	add		$16,%r10
	sub		$1,%eax
	jnz		_esb_loop_\@

	MOVADQ		(%r10),\TMP1
	AESENCLAST	\TMP1,\XMM0
.endm
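
/*
 * A minimal C sketch (illustrative only, not part of this file) of the
 * round-count arithmetic used by ENCRYPT_SINGLE_BLOCK. It assumes keysize
 * holds the AES key length in bytes (16, 24 or 32); aesenc_rounds() is a
 * hypothetical helper name.
 *
 *	static unsigned int aesenc_rounds(unsigned int key_len)
 *	{
 *		// 16 -> 9, 24 -> 11, 32 -> 13 AESENC rounds;
 *		// a single AESENCLAST round follows the loop
 *		return (key_len >> 2) + 5;
 *	}
 */
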
/*****************************************************************************
* void aesni_gcm_dec(void *aes_ctx,     // AES Key schedule. Starts on a 16 byte boundary.
*                    struct gcm_context_data *data,
*                                       // Context data
*                    u8 *out,           // Plaintext output. In-place decryption is allowed.
*                    const u8 *in,      // Ciphertext input
*                    u64 plaintext_len, // Length of data in bytes for decryption.
*                    u8 *iv,            // Pre-counter block j0: 4 byte salt (from Security Association)
*                                       // concatenated with 8 byte Initialisation Vector (from IPSec ESP Payload)
*                                       // concatenated with 0x00000001. 16-byte aligned pointer.
*                    u8 *hash_subkey,   // H, the hash subkey input. Data starts on a 16-byte boundary.
*                    const u8 *aad,     // Additional Authentication Data (AAD)
*                    u64 aad_len,       // Length of AAD in bytes. With RFC4106 this is going to be 8 or 12 bytes
*                    u8 *auth_tag,      // Authenticated Tag output. The driver will compare this to the
*                                       // given authentication tag and only return the plaintext if they match.
*                    u64 auth_tag_len); // Authenticated Tag Length in bytes. Valid values are 16
*                                       // (most likely), 12 or 8.
*
* Assumptions:
*
* keys:
*       Keys are pre-expanded and aligned to 16 bytes. We are using the first
*       set of 11 keys in the data structure void *aes_ctx.
*
* iv:
*        0                   1                   2                   3
*        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                      Salt (From the SA)                       |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                     Initialization Vector                     |
*       |        (This is the sequence number from IPSec header)        |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                              0x1                              |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*
* AAD:
*       AAD padded to 128 bits with 0
*       for example, assume AAD is a u32 vector
*
*       if AAD is 8 bytes:
*       AAD[3] = {A0, A1};
*       padded AAD in xmm register = {A1 A0 0 0}
*
*        0                   1                   2                   3
*        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                           SPI (A1)                            |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                  32-bit Sequence Number (A0)                  |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                              0x0                              |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*
*                       AAD Format with 32-bit Sequence Number
*
*       if AAD is 12 bytes:
*       AAD[3] = {A0, A1, A2};
*       padded AAD in xmm register = {A2 A1 A0 0}
*
*        0                   1                   2                   3
*        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                           SPI (A2)                            |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |            64-bit Extended Sequence Number {A1,A0}            |
*       |                                                               |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                              0x0                              |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*
*                       AAD Format with 64-bit Extended Sequence Number
*
* poly = x^128 + x^127 + x^126 + x^121 + 1
*
*****************************************************************************/
ENTRY(aesni_gcm_dec)
	FUNC_SAVE

	GCM_INIT
	GCM_ENC_DEC dec
	GCM_COMPLETE
	FUNC_RESTORE
	ret
ENDPROC(aesni_gcm_dec)
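
/*
 * A minimal C sketch (illustrative only, not part of this file) of how a
 * caller could assemble the pre-counter block j0 that the iv argument
 * describes: 4 byte salt, then the 8 byte IV, then the 32-bit big-endian
 * value 0x00000001. build_j0() is a hypothetical helper name, and the
 * kernel-style u8 type and memcpy() are assumed to be available.
 *
 *	static void build_j0(u8 j0[16], const u8 salt[4], const u8 iv[8])
 *	{
 *		memcpy(j0, salt, 4);		// bytes 0..3: salt from the SA
 *		memcpy(j0 + 4, iv, 8);		// bytes 4..11: per-packet IV
 *		j0[12] = 0;			// bytes 12..15: 0x00000001
 *		j0[13] = 0;
 *		j0[14] = 0;
 *		j0[15] = 1;
 *	}
 *
 * The buffer handed to the assembly routine must also satisfy the 16-byte
 * alignment noted in the parameter description.
 */
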
/*****************************************************************************
* void aesni_gcm_enc(void *aes_ctx,     // AES Key schedule. Starts on a 16 byte boundary.
*                    struct gcm_context_data *data,
*                                       // Context data
*                    u8 *out,           // Ciphertext output. Encrypt in-place is allowed.
*                    const u8 *in,      // Plaintext input
*                    u64 plaintext_len, // Length of data in bytes for encryption.
*                    u8 *iv,            // Pre-counter block j0: 4 byte salt (from Security Association)
*                                       // concatenated with 8 byte Initialisation Vector (from IPSec ESP Payload)
*                                       // concatenated with 0x00000001. 16-byte aligned pointer.
*                    u8 *hash_subkey,   // H, the hash subkey input. Data starts on a 16-byte boundary.
*                    const u8 *aad,     // Additional Authentication Data (AAD)
*                    u64 aad_len,       // Length of AAD in bytes. With RFC4106 this is going to be 8 or 12 bytes
*                    u8 *auth_tag,      // Authenticated Tag output.
*                    u64 auth_tag_len); // Authenticated Tag Length in bytes. Valid values are 16 (most likely),
*                                       // 12 or 8.
*
* Assumptions:
*
* keys:
*       Keys are pre-expanded and aligned to 16 bytes. We are using the
*       first set of 11 keys in the data structure void *aes_ctx.
*
* iv:
*        0                   1                   2                   3
*        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                      Salt (From the SA)                       |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                     Initialization Vector                     |
*       |        (This is the sequence number from IPSec header)        |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                              0x1                              |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*
* AAD:
*       AAD padded to 128 bits with 0
*       for example, assume AAD is a u32 vector
*
*       if AAD is 8 bytes:
*       AAD[3] = {A0, A1};
*       padded AAD in xmm register = {A1 A0 0 0}
*
*        0                   1                   2                   3
*        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                           SPI (A1)                            |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                  32-bit Sequence Number (A0)                  |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                              0x0                              |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*
*                       AAD Format with 32-bit Sequence Number
*
*       if AAD is 12 bytes:
*       AAD[3] = {A0, A1, A2};
*       padded AAD in xmm register = {A2 A1 A0 0}
*
*        0                   1                   2                   3
*        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                           SPI (A2)                            |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |            64-bit Extended Sequence Number {A1,A0}            |
*       |                                                               |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*       |                              0x0                              |
*       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*
*                       AAD Format with 64-bit Extended Sequence Number
*
* poly = x^128 + x^127 + x^126 + x^121 + 1
***************************************************************************/
ENTRY(aesni_gcm_enc)
	FUNC_SAVE

	GCM_INIT
	GCM_ENC_DEC enc
	GCM_COMPLETE
	FUNC_RESTORE
	ret
ENDPROC(aesni_gcm_enc)

#endif

.align 4
_key_expansion_128:
_key_expansion_256a:
	pshufd $0b11111111, %xmm1, %xmm1
	shufps $0b00010000, %xmm0, %xmm4
	pxor %xmm4, %xmm0
	shufps $0b10001100, %xmm0, %xmm4
	pxor %xmm4, %xmm0
	pxor %xmm1, %xmm0
	movaps %xmm0, (TKEYP)
	add $0x10, TKEYP
	ret
ENDPROC(_key_expansion_128)
ENDPROC(_key_expansion_256a)
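
/*
 * A minimal C sketch (illustrative only, not part of this file) of what one
 * _key_expansion_128 / _key_expansion_256a call computes. It assumes w[] is
 * the previous round key in %xmm0 viewed as four 32-bit words (dword 0
 * first) and t is dword 3 of the AESKEYGENASSIST result in %xmm1, i.e.
 * RotWord(SubWord(X3)) ^ rcon for dword X3 of the AESKEYGENASSIST source
 * operand. key_expand_step() is a hypothetical helper name.
 *
 *	static void key_expand_step(u32 w[4], u32 t)
 *	{
 *		w[0] ^= t;
 *		w[1] ^= w[0];
 *		w[2] ^= w[1];
 *		w[3] ^= w[2];	// w[] now holds the next round key
 *	}
 *
 * The two shufps/pxor pairs build the same running XOR without leaving the
 * SSE registers, which relies on %xmm4 starting out as zero (aesni_set_key
 * clears it before the first call).
 */
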
.align 4
_key_expansion_192a:
	pshufd $0b01010101, %xmm1, %xmm1
	shufps $0b00010000, %xmm0, %xmm4
	pxor %xmm4, %xmm0
	shufps $0b10001100, %xmm0, %xmm4
	pxor %xmm4, %xmm0
	pxor %xmm1, %xmm0

	movaps %xmm2, %xmm5
	movaps %xmm2, %xmm6
	pslldq $4, %xmm5
	pshufd $0b11111111, %xmm0, %xmm3
	pxor %xmm3, %xmm2
	pxor %xmm5, %xmm2

	movaps %xmm0, %xmm1
	shufps $0b01000100, %xmm0, %xmm6
	movaps %xmm6, (TKEYP)
	shufps $0b01001110, %xmm2, %xmm1
	movaps %xmm1, 0x10(TKEYP)
	add $0x20, TKEYP
	ret
ENDPROC(_key_expansion_192a)
.align 4
_key_expansion_192b:
	pshufd $0b01010101, %xmm1, %xmm1
	shufps $0b00010000, %xmm0, %xmm4
	pxor %xmm4, %xmm0
	shufps $0b10001100, %xmm0, %xmm4
	pxor %xmm4, %xmm0
	pxor %xmm1, %xmm0

	movaps %xmm2, %xmm5
	pslldq $4, %xmm5
	pshufd $0b11111111, %xmm0, %xmm3
	pxor %xmm3, %xmm2
	pxor %xmm5, %xmm2

	movaps %xmm0, (TKEYP)
	add $0x10, TKEYP
	ret
ENDPROC(_key_expansion_192b)
.align 4
_key_expansion_256b:
	pshufd $0b10101010, %xmm1, %xmm1
	shufps $0b00010000, %xmm2, %xmm4
	pxor %xmm4, %xmm2
	shufps $0b10001100, %xmm2, %xmm4
	pxor %xmm4, %xmm2
	pxor %xmm1, %xmm2
	movaps %xmm2, (TKEYP)
	add $0x10, TKEYP
	ret
ENDPROC(_key_expansion_256b)

/*
 * int aesni_set_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
 *                   unsigned int key_len)
 */
ENTRY(aesni_set_key)
	FRAME_BEGIN
#ifndef __x86_64__
	pushl KEYP
	movl (FRAME_OFFSET+8)(%esp), KEYP	# ctx
	movl (FRAME_OFFSET+12)(%esp), UKEYP	# in_key
	movl (FRAME_OFFSET+16)(%esp), %edx	# key_len
#endif
	movups (UKEYP), %xmm0		# user key (first 16 bytes)
	movaps %xmm0, (KEYP)
	lea 0x10(KEYP), TKEYP		# key addr
	movl %edx, 480(KEYP)		# store key_len in the context
	pxor %xmm4, %xmm4		# xmm4 is assumed 0 in _key_expansion_x
	cmp $24, %dl			# key_len: 16, 24 or 32 bytes
	jb .Lenc_key128
	je .Lenc_key192
	movups 0x10(UKEYP), %xmm2	# other user key
	movaps %xmm2, (TKEYP)
	add $0x10, TKEYP
	AESKEYGENASSIST 0x1 %xmm2 %xmm1		# round 1
	call _key_expansion_256a
	AESKEYGENASSIST 0x1 %xmm0 %xmm1
	call _key_expansion_256b
	AESKEYGENASSIST 0x2 %xmm2 %xmm1		# round 2
	call _key_expansion_256a
	AESKEYGENASSIST 0x2 %xmm0 %xmm1
	call _key_expansion_256b
	AESKEYGENASSIST 0x4 %xmm2 %xmm1		# round 3
	call _key_expansion_256a
	AESKEYGENASSIST 0x4 %xmm0 %xmm1
	call _key_expansion_256b
	AESKEYGENASSIST 0x8 %xmm2 %xmm1		# round 4
	call _key_expansion_256a
	AESKEYGENASSIST 0x8 %xmm0 %xmm1
	call _key_expansion_256b
	AESKEYGENASSIST 0x10 %xmm2 %xmm1	# round 5
	call _key_expansion_256a
	AESKEYGENASSIST 0x10 %xmm0 %xmm1
	call _key_expansion_256b
	AESKEYGENASSIST 0x20 %xmm2 %xmm1	# round 6
	call _key_expansion_256a
	AESKEYGENASSIST 0x20 %xmm0 %xmm1
	call _key_expansion_256b
	AESKEYGENASSIST 0x40 %xmm2 %xmm1	# round 7
	call _key_expansion_256a
	jmp .Ldec_key
.Lenc_key192:
	movq 0x10(UKEYP), %xmm2		# other user key
	AESKEYGENASSIST 0x1 %xmm2 %xmm1		# round 1
	call _key_expansion_192a
	AESKEYGENASSIST 0x2 %xmm2 %xmm1		# round 2
	call _key_expansion_192b
	AESKEYGENASSIST 0x4 %xmm2 %xmm1		# round 3
	call _key_expansion_192a
	AESKEYGENASSIST 0x8 %xmm2 %xmm1		# round 4
	call _key_expansion_192b
	AESKEYGENASSIST 0x10 %xmm2 %xmm1	# round 5
	call _key_expansion_192a
	AESKEYGENASSIST 0x20 %xmm2 %xmm1	# round 6
	call _key_expansion_192b
	AESKEYGENASSIST 0x40 %xmm2 %xmm1	# round 7
	call _key_expansion_192a
	AESKEYGENASSIST 0x80 %xmm2 %xmm1	# round 8
	call _key_expansion_192b
	jmp .Ldec_key
.Lenc_key128:
	AESKEYGENASSIST 0x1 %xmm0 %xmm1		# round 1
	call _key_expansion_128
	AESKEYGENASSIST 0x2 %xmm0 %xmm1		# round 2
	call _key_expansion_128
	AESKEYGENASSIST 0x4 %xmm0 %xmm1		# round 3
	call _key_expansion_128
	AESKEYGENASSIST 0x8 %xmm0 %xmm1		# round 4
	call _key_expansion_128
	AESKEYGENASSIST 0x10 %xmm0 %xmm1	# round 5
	call _key_expansion_128
	AESKEYGENASSIST 0x20 %xmm0 %xmm1	# round 6
	call _key_expansion_128
	AESKEYGENASSIST 0x40 %xmm0 %xmm1	# round 7
	call _key_expansion_128
	AESKEYGENASSIST 0x80 %xmm0 %xmm1	# round 8
	call _key_expansion_128
	AESKEYGENASSIST 0x1b %xmm0 %xmm1	# round 9
	call _key_expansion_128
	AESKEYGENASSIST 0x36 %xmm0 %xmm1	# round 10
	call _key_expansion_128
.Ldec_key :
	sub $0x10, TKEYP
	movaps (KEYP), %xmm0
	movaps (TKEYP), %xmm1
	movaps %xmm0, 240(TKEYP)
	movaps %xmm1, 240(KEYP)
	add $0x10, KEYP
	lea 240-16(TKEYP), UKEYP
.align 4
.Ldec_key_loop:
	movaps (KEYP), %xmm0
	AESIMC %xmm0 %xmm1
	movaps %xmm1, (UKEYP)
	add $0x10, KEYP
	sub $0x10, UKEYP
	cmp TKEYP, KEYP
	jb .Ldec_key_loop
	xor AREG, AREG
#ifndef __x86_64__
	popl KEYP
#endif
	FRAME_END
	ret
ENDPROC(aesni_set_key)
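
The tail of aesni_set_key derives the decryption schedule from the encryption one: the first and last round keys swap places unchanged, and each key in between goes through AESIMC (InvMixColumns) and is stored in reverse order. The following is a minimal portable C sketch of that index mapping, not the kernel's code. The names `derive_dec_schedule`, `inv_mix_columns`, `gmul` and `xtime` are hypothetical, and the software InvMixColumns only stands in for the AESIMC instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AES_BLOCK 16

/* Multiply by x in GF(2^8) modulo the AES polynomial 0x11b. */
static unsigned char xtime(unsigned char x)
{
	return (unsigned char)((x << 1) ^ ((x & 0x80) ? 0x1b : 0x00));
}

/* Multiply two field elements with shift-and-add. */
static unsigned char gmul(unsigned char a, unsigned char b)
{
	unsigned char r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = xtime(a);
		b >>= 1;
	}
	return r;
}

/* Software stand-in for AESIMC: InvMixColumns on one 16-byte round key. */
static void inv_mix_columns(unsigned char out[AES_BLOCK],
			    const unsigned char in[AES_BLOCK])
{
	int c;

	for (c = 0; c < 4; c++) {
		const unsigned char *s = in + 4 * c;

		out[4 * c + 0] = gmul(s[0], 14) ^ gmul(s[1], 11) ^ gmul(s[2], 13) ^ gmul(s[3], 9);
		out[4 * c + 1] = gmul(s[0], 9) ^ gmul(s[1], 14) ^ gmul(s[2], 11) ^ gmul(s[3], 13);
		out[4 * c + 2] = gmul(s[0], 13) ^ gmul(s[1], 9) ^ gmul(s[2], 14) ^ gmul(s[3], 11);
		out[4 * c + 3] = gmul(s[0], 11) ^ gmul(s[1], 13) ^ gmul(s[2], 9) ^ gmul(s[3], 14);
	}
}

/*
 * enc holds nr + 1 round keys; dec receives nr + 1 round keys.
 * Mirrors .Ldec_key and .Ldec_key_loop: dec[nr] = enc[0], dec[0] = enc[nr],
 * and dec[nr - i] = InvMixColumns(enc[i]) for 0 < i < nr.
 */
static void derive_dec_schedule(unsigned char *dec, const unsigned char *enc,
				size_t nr)
{
	size_t i;

	assert(nr == 10 || nr == 12 || nr == 14);
	memcpy(dec + nr * AES_BLOCK, enc, AES_BLOCK);
	memcpy(dec, enc + nr * AES_BLOCK, AES_BLOCK);
	for (i = 1; i < nr; i++)
		inv_mix_columns(dec + (nr - i) * AES_BLOCK, enc + i * AES_BLOCK);
}
```

In the assembly the two schedules live in the same context structure, 240 bytes apart, which is why the stores use `240(TKEYP)` and `240(KEYP)`. The sketch takes two separate buffers instead.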

/*
 * void aesni_enc(struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src)
 */
ENTRY(aesni_enc)
	FRAME_BEGIN
#ifndef __x86_64__
	pushl KEYP
	pushl KLEN
	movl (FRAME_OFFSET+12)(%esp), KEYP	# ctx
	movl (FRAME_OFFSET+16)(%esp), OUTP	# dst
	movl (FRAME_OFFSET+20)(%esp), INP	# src
#endif
	movl 480(KEYP), KLEN		# key length
	movups (INP), STATE		# input
	call _aesni_enc1
	movups STATE, (OUTP)		# output
#ifndef __x86_64__
	popl KLEN
	popl KEYP
#endif
	FRAME_END
	ret
ENDPROC(aesni_enc)

/*
 * _aesni_enc1:		internal ABI
 * input:
 *	KEYP:		key struct pointer
 *	KLEN:		key length
 *	STATE:		initial state (input)
 * output:
 *	STATE:		final state (output)
 * changed:
 *	KEY
 *	TKEYP (T1)
 */
.align 4
_aesni_enc1:
	movaps (KEYP), KEY		# key
	mov KEYP, TKEYP
	pxor KEY, STATE			# round 0
	add $0x30, TKEYP
	cmp $24, KLEN
	jb .Lenc128
	lea 0x20(TKEYP), TKEYP
	je .Lenc192
	add $0x20, TKEYP
	movaps -0x60(TKEYP), KEY
	AESENC KEY STATE
	movaps -0x50(TKEYP), KEY
	AESENC KEY STATE
.align 4
.Lenc192:
	movaps -0x40(TKEYP), KEY
	AESENC KEY STATE
	movaps -0x30(TKEYP), KEY
	AESENC KEY STATE
.align 4
.Lenc128:
	movaps -0x20(TKEYP), KEY
	AESENC KEY STATE
	movaps -0x10(TKEYP), KEY
	AESENC KEY STATE
	movaps (TKEYP), KEY
	AESENC KEY STATE
	movaps 0x10(TKEYP), KEY
	AESENC KEY STATE
	movaps 0x20(TKEYP), KEY
	AESENC KEY STATE
	movaps 0x30(TKEYP), KEY
	AESENC KEY STATE
	movaps 0x40(TKEYP), KEY
	AESENC KEY STATE
	movaps 0x50(TKEYP), KEY
	AESENC KEY STATE
	movaps 0x60(TKEYP), KEY
	AESENC KEY STATE
	movaps 0x70(TKEYP), KEY
	AESENCLAST KEY STATE
	ret
ENDPROC(_aesni_enc1)
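
The dispatch at the top of _aesni_enc1 biases TKEYP by a different amount for each key length, so that all three key sizes share the same tail of the unrolled rounds and the last round key always sits at `0x70(TKEYP)`. The C sketch below only models that offset arithmetic, assuming KLEN is the key length in bytes (16, 24 or 32) as the `cmp $24, KLEN` test suggests. The function names are hypothetical.

```c
#include <assert.h>

/* Offset of TKEYP from KEYP once the key-length dispatch has run. */
static unsigned int tkeyp_offset(unsigned int klen)
{
	unsigned int off = 0x30;	/* add $0x30, TKEYP */

	assert(klen == 16 || klen == 24 || klen == 32);
	if (klen >= 24)
		off += 0x20;		/* lea 0x20(TKEYP), TKEYP */
	if (klen > 24)
		off += 0x20;		/* add $0x20, TKEYP */
	return off;
}

/* Offset from KEYP of the first round key loaded after round 0. */
static int first_round_key_offset(unsigned int klen)
{
	/* Entry displacements: -0x20 (128-bit), -0x40 (192), -0x60 (256). */
	int disp = klen == 16 ? -0x20 : klen == 24 ? -0x40 : -0x60;

	return (int)tkeyp_offset(klen) + disp;
}

/* Offset from KEYP of the key consumed by AESENCLAST: 0x70(TKEYP). */
static unsigned int last_round_key_offset(unsigned int klen)
{
	return tkeyp_offset(klen) + 0x70;
}
```

Under these assumptions every key size starts at round key 1 (offset 0x10) and ends at round key 10, 12 or 14, that is at offsets 0xa0, 0xc0 and 0xe0.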

/*
 * _aesni_enc4:		internal ABI
 * input:
 *	KEYP:		key struct pointer
 *	KLEN:		key length
 *	STATE1:		initial state (input)
 *	STATE2
 *	STATE3
 *	STATE4
 * output:
 *	STATE1:		final state (output)
 *	STATE2
 *	STATE3
 *	STATE4
 * changed:
 *	KEY
 *	TKEYP (T1)
 */
.align 4
_aesni_enc4:
	movaps (KEYP), KEY		# key
	mov KEYP, TKEYP
	pxor KEY, STATE1		# round 0
	pxor KEY, STATE2
	pxor KEY, STATE3
	pxor KEY, STATE4
	add $0x30, TKEYP
	cmp $24, KLEN
	jb .L4enc128
	lea 0x20(TKEYP), TKEYP
	je .L4enc192
	add $0x20, TKEYP
	movaps -0x60(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps -0x50(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
#.align 4
.L4enc192:
	movaps -0x40(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps -0x30(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
#.align 4
.L4enc128:
	movaps -0x20(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps -0x10(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps (TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps 0x10(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps 0x20(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps 0x30(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps 0x40(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps 0x50(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps 0x60(TKEYP), KEY
	AESENC KEY STATE1
	AESENC KEY STATE2
	AESENC KEY STATE3
	AESENC KEY STATE4
	movaps 0x70(TKEYP), KEY
	AESENCLAST KEY STATE1		# last round
	AESENCLAST KEY STATE2
	AESENCLAST KEY STATE3
	AESENCLAST KEY STATE4
	ret
ENDPROC(_aesni_enc4)

/*
 * void aesni_dec(struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src)
 */
ENTRY(aesni_dec)
	FRAME_BEGIN
#ifndef __x86_64__
	pushl KEYP
	pushl KLEN
	movl (FRAME_OFFSET+12)(%esp), KEYP	# ctx
	movl (FRAME_OFFSET+16)(%esp), OUTP	# dst
	movl (FRAME_OFFSET+20)(%esp), INP	# src
#endif
	mov 480(KEYP), KLEN		# key length
	add $240, KEYP
	movups (INP), STATE		# input
	call _aesni_dec1
	movups STATE, (OUTP)		# output
#ifndef __x86_64__
	popl KLEN
	popl KEYP
#endif
	FRAME_END
	ret
ENDPROC(aesni_dec)
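
The constants 240 and 480 used above (`add $240, KEYP` and `480(KEYP)`) are byte offsets into the AES context. The sketch below shows where they would come from, assuming the context has the shape the kernel's `struct crypto_aes_ctx` had when this code was written: two 60-word key schedules followed by the key length. The struct and function names here are hypothetical mirrors, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_KEYLENGTH_U32 60	/* 15 round keys of 4 words each (assumed) */

/* Assumed mirror of the context layout the assembly relies on. */
struct aes_ctx_sketch {
	uint32_t key_enc[SKETCH_MAX_KEYLENGTH_U32];	/* encryption schedule */
	uint32_t key_dec[SKETCH_MAX_KEYLENGTH_U32];	/* decryption schedule */
	uint32_t key_length;				/* key length in bytes */
};

/* Byte offset of the decryption schedule: the 240 in "add $240, KEYP". */
static size_t dec_schedule_offset(void)
{
	return offsetof(struct aes_ctx_sketch, key_dec);
}

/* Byte offset of the key length: the 480 in "mov 480(KEYP), KLEN". */
static size_t key_length_offset(void)
{
	return offsetof(struct aes_ctx_sketch, key_length);
}

/* Abort if the assumed layout does not match the constants in the assembly. */
static void check_layout(void)
{
	assert(dec_schedule_offset() == 240);
	assert(key_length_offset() == 480);
}
```

This also explains the instruction order in aesni_dec: the key length is read first, while KEYP still points at the start of the context, and only then is KEYP advanced to the decryption schedule.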

/*
 * _aesni_dec1:		internal ABI
 * input:
 *	KEYP:		key struct pointer
 *	KLEN:		key length
 *	STATE:		initial state (input)
 * output:
 *	STATE:		final state (output)
 * changed:
 *	KEY
 *	TKEYP (T1)
 */
.align 4
_aesni_dec1:
	movaps (KEYP), KEY		# key
	mov KEYP, TKEYP
	pxor KEY, STATE			# round 0
	add $0x30, TKEYP
	cmp $24, KLEN
	jb .Ldec128
	lea 0x20(TKEYP), TKEYP
	je .Ldec192
	add $0x20, TKEYP
	movaps -0x60(TKEYP), KEY
	AESDEC KEY STATE
	movaps -0x50(TKEYP), KEY
	AESDEC KEY STATE
.align 4
.Ldec192:
	movaps -0x40(TKEYP), KEY
	AESDEC KEY STATE
	movaps -0x30(TKEYP), KEY
	AESDEC KEY STATE
.align 4
.Ldec128:
	movaps -0x20(TKEYP), KEY
	AESDEC KEY STATE
	movaps -0x10(TKEYP), KEY
	AESDEC KEY STATE
	movaps (TKEYP), KEY
	AESDEC KEY STATE
	movaps 0x10(TKEYP), KEY
	AESDEC KEY STATE
	movaps 0x20(TKEYP), KEY
	AESDEC KEY STATE
	movaps 0x30(TKEYP), KEY
	AESDEC KEY STATE
	movaps 0x40(TKEYP), KEY
	AESDEC KEY STATE
	movaps 0x50(TKEYP), KEY
	AESDEC KEY STATE
	movaps 0x60(TKEYP), KEY
	AESDEC KEY STATE
	movaps 0x70(TKEYP), KEY
	AESDECLAST KEY STATE
	ret
ENDPROC(_aesni_dec1)

/*
 * _aesni_dec4:		internal ABI
 * input:
 *	KEYP:		key struct pointer
 *	KLEN:		key length
 *	STATE1:		initial state (input)
 *	STATE2
 *	STATE3
 *	STATE4
 * output:
 *	STATE1:		final state (output)
 *	STATE2
 *	STATE3
 *	STATE4
 * changed:
 *	KEY
 *	TKEYP (T1)
 */
.align 4
_aesni_dec4:
	movaps (KEYP), KEY		# key
	mov KEYP, TKEYP
	pxor KEY, STATE1		# round 0
	pxor KEY, STATE2
	pxor KEY, STATE3
	pxor KEY, STATE4
	add $0x30, TKEYP
	cmp $24, KLEN
	jb .L4dec128
	lea 0x20(TKEYP), TKEYP
	je .L4dec192
	add $0x20, TKEYP
	movaps -0x60(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps -0x50(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
.align 4
.L4dec192:
	movaps -0x40(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps -0x30(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
.align 4
.L4dec128:
	movaps -0x20(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps -0x10(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps (TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps 0x10(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps 0x20(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps 0x30(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps 0x40(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps 0x50(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps 0x60(TKEYP), KEY
	AESDEC KEY STATE1
	AESDEC KEY STATE2
	AESDEC KEY STATE3
	AESDEC KEY STATE4
	movaps 0x70(TKEYP), KEY
	AESDECLAST KEY STATE1		# last round
	AESDECLAST KEY STATE2
	AESDECLAST KEY STATE3
	AESDECLAST KEY STATE4
	ret
ENDPROC(_aesni_dec4)

/*
 * void aesni_ecb_enc(struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src,
 *		      size_t len)
 */
ENTRY(aesni_ecb_enc)
	FRAME_BEGIN
#ifndef __x86_64__
	pushl LEN
	pushl KEYP
	pushl KLEN
	movl (FRAME_OFFSET+16)(%esp), KEYP	# ctx
	movl (FRAME_OFFSET+20)(%esp), OUTP	# dst
	movl (FRAME_OFFSET+24)(%esp), INP	# src
	movl (FRAME_OFFSET+28)(%esp), LEN	# len
#endif
	test LEN, LEN		# check length
	jz .Lecb_enc_ret
	mov 480(KEYP), KLEN
	cmp $16, LEN
	jb .Lecb_enc_ret
	cmp $64, LEN
	jb .Lecb_enc_loop1
.align 4
.Lecb_enc_loop4:
	movups (INP), STATE1
	movups 0x10(INP), STATE2
	movups 0x20(INP), STATE3
	movups 0x30(INP), STATE4
	call _aesni_enc4
	movups STATE1, (OUTP)
	movups STATE2, 0x10(OUTP)
	movups STATE3, 0x20(OUTP)
	movups STATE4, 0x30(OUTP)
	sub $64, LEN
	add $64, INP
	add $64, OUTP
	cmp $64, LEN
	jge .Lecb_enc_loop4
	cmp $16, LEN
	jb .Lecb_enc_ret
.align 4
.Lecb_enc_loop1:
	movups (INP), STATE1
	call _aesni_enc1
	movups STATE1, (OUTP)
	sub $16, LEN
	add $16, INP
	add $16, OUTP
	cmp $16, LEN
	jge .Lecb_enc_loop1
.Lecb_enc_ret:
#ifndef __x86_64__
	popl KLEN
	popl KEYP
	popl LEN
#endif
	FRAME_END
	ret
ENDPROC(aesni_ecb_enc)
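The control flow of the ECB routine above (four blocks per iteration while at least 64 bytes remain, then one block at a time, with a tail shorter than 16 bytes left untouched) can be sketched in C. This is only an illustration under stated assumptions: `block_fn`, `toy_block` and `ecb_walk` are hypothetical names, and the toy transform is a stand-in for `_aesni_enc1`/`_aesni_enc4`, not AES.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 16

/* Hypothetical per-block transform standing in for the AES-NI helpers. */
typedef void (*block_fn)(uint8_t *dst, const uint8_t *src);

/* Toy transform (NOT AES): flips every bit of the block. */
static void toy_block(uint8_t *dst, const uint8_t *src)
{
	for (size_t i = 0; i < BLOCK; i++)
		dst[i] = (uint8_t)~src[i];
}

/*
 * Walk the buffer in 64-byte batches first, then in single blocks.
 * Returns the number of bytes processed; a tail shorter than one
 * block is left untouched.
 */
static size_t ecb_walk(block_fn fn, uint8_t *dst, const uint8_t *src,
		       size_t len)
{
	size_t done = 0;

	assert(fn && dst && src);
	while (len - done >= 4 * BLOCK) {	/* mirrors .Lecb_enc_loop4 */
		for (int i = 0; i < 4; i++)
			fn(dst + done + i * BLOCK, src + done + i * BLOCK);
		done += 4 * BLOCK;
	}
	while (len - done >= BLOCK) {		/* mirrors .Lecb_enc_loop1 */
		fn(dst + done, src + done);
		done += BLOCK;
	}
	return done;
}
```

Calling `ecb_walk(toy_block, dst, src, 100)` in this sketch processes 96 bytes (one batch of four blocks plus two single blocks) and ignores the last 4.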
/*
 * void aesni_ecb_dec(struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src,
 *		      size_t len);
 */
ENTRY(aesni_ecb_dec)
	FRAME_BEGIN
#ifndef __x86_64__
	pushl LEN
	pushl KEYP
	pushl KLEN
	movl (FRAME_OFFSET+16)(%esp), KEYP	# ctx
	movl (FRAME_OFFSET+20)(%esp), OUTP	# dst
	movl (FRAME_OFFSET+24)(%esp), INP	# src
	movl (FRAME_OFFSET+28)(%esp), LEN	# len
#endif
	test LEN, LEN
	jz .Lecb_dec_ret
	mov 480(KEYP), KLEN
	add $240, KEYP
	cmp $16, LEN
	jb .Lecb_dec_ret
	cmp $64, LEN
	jb .Lecb_dec_loop1
.align 4
.Lecb_dec_loop4:
	movups (INP), STATE1
	movups 0x10(INP), STATE2
	movups 0x20(INP), STATE3
	movups 0x30(INP), STATE4
	call _aesni_dec4
	movups STATE1, (OUTP)
	movups STATE2, 0x10(OUTP)
	movups STATE3, 0x20(OUTP)
	movups STATE4, 0x30(OUTP)
	sub $64, LEN
	add $64, INP
	add $64, OUTP
	cmp $64, LEN
	jge .Lecb_dec_loop4
	cmp $16, LEN
	jb .Lecb_dec_ret
.align 4
.Lecb_dec_loop1:
	movups (INP), STATE1
	call _aesni_dec1
	movups STATE1, (OUTP)
	sub $16, LEN
	add $16, INP
	add $16, OUTP
	cmp $16, LEN
	jge .Lecb_dec_loop1
.Lecb_dec_ret:
#ifndef __x86_64__
	popl KLEN
	popl KEYP
	popl LEN
#endif
	FRAME_END
	ret
ENDPROC(aesni_ecb_dec)

/*
 * void aesni_cbc_enc(struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src,
 *		      size_t len, u8 *iv)
 */
ENTRY(aesni_cbc_enc)
	FRAME_BEGIN
#ifndef __x86_64__
	pushl IVP
	pushl LEN
	pushl KEYP
	pushl KLEN
	movl (FRAME_OFFSET+20)(%esp), KEYP	# ctx
	movl (FRAME_OFFSET+24)(%esp), OUTP	# dst
	movl (FRAME_OFFSET+28)(%esp), INP	# src
	movl (FRAME_OFFSET+32)(%esp), LEN	# len
	movl (FRAME_OFFSET+36)(%esp), IVP	# iv
#endif
	cmp $16, LEN
	jb .Lcbc_enc_ret
	mov 480(KEYP), KLEN
	movups (IVP), STATE	# load iv as initial state
.align 4
.Lcbc_enc_loop:
	movups (INP), IN	# load input
	pxor IN, STATE
	call _aesni_enc1
	movups STATE, (OUTP)	# store output
	sub $16, LEN
	add $16, INP
	add $16, OUTP
	cmp $16, LEN
	jge .Lcbc_enc_loop
	movups STATE, (IVP)
.Lcbc_enc_ret:
#ifndef __x86_64__
	popl KLEN
	popl KEYP
	popl LEN
	popl IVP
#endif
	FRAME_END
	ret
ENDPROC(aesni_cbc_enc)

/*
 * void aesni_cbc_dec(struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src,
 *		      size_t len, u8 *iv)
 */
ENTRY(aesni_cbc_dec)
	FRAME_BEGIN
#ifndef __x86_64__
	pushl IVP
	pushl LEN
	pushl KEYP
	pushl KLEN
	movl (FRAME_OFFSET+20)(%esp), KEYP	# ctx
	movl (FRAME_OFFSET+24)(%esp), OUTP	# dst
	movl (FRAME_OFFSET+28)(%esp), INP	# src
	movl (FRAME_OFFSET+32)(%esp), LEN	# len
	movl (FRAME_OFFSET+36)(%esp), IVP	# iv
#endif
	cmp $16, LEN
	jb .Lcbc_dec_just_ret
	mov 480(KEYP), KLEN
	add $240, KEYP
	movups (IVP), IV
	cmp $64, LEN
	jb .Lcbc_dec_loop1
.align 4
.Lcbc_dec_loop4:
	movups (INP), IN1
	movaps IN1, STATE1
	movups 0x10(INP), IN2
	movaps IN2, STATE2
#ifdef __x86_64__
	movups 0x20(INP), IN3
	movaps IN3, STATE3
	movups 0x30(INP), IN4
	movaps IN4, STATE4
#else
	movups 0x20(INP), IN1
	movaps IN1, STATE3
	movups 0x30(INP), IN2
	movaps IN2, STATE4
#endif
	call _aesni_dec4
	pxor IV, STATE1
#ifdef __x86_64__
	pxor IN1, STATE2
	pxor IN2, STATE3
	pxor IN3, STATE4
	movaps IN4, IV
#else
	pxor IN1, STATE4
	movaps IN2, IV
	movups (INP), IN1
	pxor IN1, STATE2
	movups 0x10(INP), IN2
	pxor IN2, STATE3
#endif
	movups STATE1, (OUTP)
	movups STATE2, 0x10(OUTP)
	movups STATE3, 0x20(OUTP)
	movups STATE4, 0x30(OUTP)
	sub $64, LEN
	add $64, INP
	add $64, OUTP
	cmp $64, LEN
	jge .Lcbc_dec_loop4
	cmp $16, LEN
	jb .Lcbc_dec_ret
.align 4
.Lcbc_dec_loop1:
	movups (INP), IN
	movaps IN, STATE
	call _aesni_dec1
	pxor IV, STATE
	movups STATE, (OUTP)
	movaps IN, IV
	sub $16, LEN
	add $16, INP
	add $16, OUTP
	cmp $16, LEN
	jge .Lcbc_dec_loop1
.Lcbc_dec_ret:
	movups IV, (IVP)
.Lcbc_dec_just_ret:
#ifndef __x86_64__
	popl KLEN
	popl KEYP
	popl LEN
	popl IVP
#endif
	FRAME_END
	ret
ENDPROC(aesni_cbc_dec)
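The chaining done by the two CBC routines can be sketched in C. Encryption XORs each plaintext block into the running state before the cipher call, so blocks must be processed one at a time. Decryption runs the inverse cipher first and XORs with the previous ciphertext block afterwards, which is what lets the assembly batch four blocks through `_aesni_dec4`. Both directions write the last ciphertext block back as the next IV. The sketch below is an assumption-laden illustration: `toy_enc`, `toy_dec`, `cbc_enc` and `cbc_dec` are hypothetical names, and the byte rotation is a placeholder for AES, not a cipher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Toy invertible block transform (NOT AES): rotate each byte left by 1. */
static void toy_enc(uint8_t *b)
{
	for (size_t i = 0; i < BLOCK; i++)
		b[i] = (uint8_t)((b[i] << 1) | (b[i] >> 7));
}

/* Inverse of toy_enc: rotate each byte right by 1. */
static void toy_dec(uint8_t *b)
{
	for (size_t i = 0; i < BLOCK; i++)
		b[i] = (uint8_t)((b[i] >> 1) | (b[i] << 7));
}

static void cbc_enc(uint8_t *dst, const uint8_t *src, size_t len, uint8_t *iv)
{
	uint8_t state[BLOCK];

	assert(dst && src && iv);
	if (len < BLOCK)
		return;			/* IV is left unchanged */
	memcpy(state, iv, BLOCK);	/* load iv as initial state */
	for (; len >= BLOCK; len -= BLOCK, src += BLOCK, dst += BLOCK) {
		for (size_t i = 0; i < BLOCK; i++)
			state[i] ^= src[i];
		toy_enc(state);
		memcpy(dst, state, BLOCK);
	}
	memcpy(iv, state, BLOCK);	/* last ciphertext block is the next IV */
}

static void cbc_dec(uint8_t *dst, const uint8_t *src, size_t len, uint8_t *iv)
{
	uint8_t prev[BLOCK], in[BLOCK], state[BLOCK];

	assert(dst && src && iv);
	if (len < BLOCK)
		return;			/* IV is left unchanged */
	memcpy(prev, iv, BLOCK);
	for (; len >= BLOCK; len -= BLOCK, src += BLOCK, dst += BLOCK) {
		memcpy(in, src, BLOCK);	/* keep ciphertext: dst may alias src */
		memcpy(state, in, BLOCK);
		toy_dec(state);
		for (size_t i = 0; i < BLOCK; i++)
			state[i] ^= prev[i];
		memcpy(dst, state, BLOCK);
		memcpy(prev, in, BLOCK);
	}
	memcpy(iv, prev, BLOCK);	/* last ciphertext block is the next IV */
}
```

Encrypting a buffer with `cbc_enc` and feeding the result to `cbc_dec` with the same starting IV should reproduce the input, and both IV buffers should end up holding the final ciphertext block.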
#ifdef __x86_64__
x86/asm/crypto: Move .Lbswap_mask data to .rodata section
stacktool reports the following warning:
stacktool: arch/x86/crypto/aesni-intel_asm.o: _aesni_inc_init(): can't find starting instruction
stacktool gets confused when it tries to disassemble the following data
in the .text section:
.Lbswap_mask:
.byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
Move it to .rodata which is a more appropriate section for read-only
data.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/b6a2f3f8bda705143e127c025edb2b53c86e6eb4.1453405861.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-22 01:49:15 +03:00
.pushsection .rodata
.align 16
.Lbswap_mask:
	.byte 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
.popsection

/*
 * _aesni_inc_init:	internal ABI
 *	setup registers used by _aesni_inc
 * input:
 *	IV
 * output:
 *	CTR:	== IV, in little endian
 *	TCTR_LOW: == lower qword of CTR
 *	INC:	== 1, in little endian
 *	BSWAP_MASK == endian swapping mask
 */
.align 4
_aesni_inc_init:
	movaps .Lbswap_mask, BSWAP_MASK
	movaps IV, CTR
	PSHUFB_XMM BSWAP_MASK CTR
	mov $1, TCTR_LOW
	MOVQ_R64_XMM TCTR_LOW INC
	MOVQ_R64_XMM CTR TCTR_LOW
	ret
ENDPROC(_aesni_inc_init)

/*
 * _aesni_inc:		internal ABI
 *	Increase IV by 1, IV is in big endian
 * input:
 *	IV
 *	CTR:	== IV, in little endian
 *	TCTR_LOW: == lower qword of CTR
 *	INC:	== 1, in little endian
 *	BSWAP_MASK == endian swapping mask
 * output:
 *	IV:	Increase by 1
 * changed:
 *	CTR:	== output IV, in little endian
 *	TCTR_LOW: == lower qword of CTR
 */
.align 4
_aesni_inc:
	paddq INC, CTR
	add $1, TCTR_LOW
	jnc .Linc_low
	pslldq $8, INC
	paddq INC, CTR
	psrldq $8, INC
.Linc_low:
	movaps CTR, IV
	PSHUFB_XMM BSWAP_MASK IV
	ret
ENDPROC(_aesni_inc)
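The effect of `_aesni_inc` on the big-endian counter block can be sketched in C: add 1 to the low quadword and, only if that wraps to zero, carry into the high quadword. The assembly gets the same result by keeping a byte-swapped copy in `CTR` and a scalar copy of the low quadword in `TCTR_LOW` to detect the carry. The helper names `load_be64`, `store_be64` and `ctr128_inc` are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 64-bit big-endian value from 8 bytes. */
static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Write a 64-bit value as 8 big-endian bytes. */
static void store_be64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)v;
		v >>= 8;
	}
}

/* Increment a 16-byte big-endian counter block by 1. */
static void ctr128_inc(uint8_t iv[16])
{
	uint64_t hi, lo;

	assert(iv);
	hi = load_be64(iv);
	lo = load_be64(iv + 8);
	lo += 1;
	if (lo == 0)		/* carry out of the low qword */
		hi += 1;
	store_be64(iv, hi);
	store_be64(iv + 8, lo);
}
```

Starting from a block whose low eight bytes are all `0xFF`, one call to `ctr128_inc` clears those bytes and bumps byte 7, which is the carry path the `jnc .Linc_low` branch skips in the common case.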
/*
 * void aesni_ctr_enc(struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src,
 *		      size_t len, u8 *iv)
 */
ENTRY(aesni_ctr_enc)
	FRAME_BEGIN
	cmp $16, LEN
	jb .Lctr_enc_just_ret
	mov 480(KEYP), KLEN
	movups (IVP), IV
	call _aesni_inc_init
	cmp $64, LEN
	jb .Lctr_enc_loop1
.align 4
.Lctr_enc_loop4:
	movaps IV, STATE1
	call _aesni_inc
	movups (INP), IN1
	movaps IV, STATE2
	call _aesni_inc
	movups 0x10(INP), IN2
	movaps IV, STATE3
	call _aesni_inc
	movups 0x20(INP), IN3
	movaps IV, STATE4
	call _aesni_inc
	movups 0x30(INP), IN4
	call _aesni_enc4
	pxor IN1, STATE1
	movups STATE1, (OUTP)
	pxor IN2, STATE2
	movups STATE2, 0x10(OUTP)
	pxor IN3, STATE3
	movups STATE3, 0x20(OUTP)
	pxor IN4, STATE4
	movups STATE4, 0x30(OUTP)
	sub $64, LEN
	add $64, INP
	add $64, OUTP
	cmp $64, LEN
	jge .Lctr_enc_loop4
	cmp $16, LEN
	jb .Lctr_enc_ret
.align 4
.Lctr_enc_loop1:
	movaps IV, STATE
	call _aesni_inc
	movups (INP), IN
	call _aesni_enc1
	pxor IN, STATE
	movups STATE, (OUTP)
	sub $16, LEN
	add $16, INP
	add $16, OUTP
	cmp $16, LEN
	jge .Lctr_enc_loop1
.Lctr_enc_ret:
	movups IV, (IVP)
.Lctr_enc_just_ret:
	FRAME_END
	ret
ENDPROC(aesni_ctr_enc)

/*
 * _aesni_gf128mul_x_ble:		internal ABI
 *	Multiply in GF(2^128) for XTS IVs
 * input:
 *	IV:	current IV
 *	GF128MUL_MASK == mask with 0x87 and 0x01
 * output:
 *	IV:	next IV
 * changed:
 *	CTR:	== temporary value
 */
#define _aesni_gf128mul_x_ble() \
	pshufd $0x13, IV, CTR; \
	paddq IV, IV; \
	psrad $31, CTR; \
	pand GF128MUL_MASK, CTR; \
	pxor CTR, IV;

/*
 * void aesni_xts_crypt8(struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src,
 *			 bool enc, u8 *iv)
 */
ENTRY(aesni_xts_crypt8)
	FRAME_BEGIN
	cmpb $0, %cl
	movl $0, %ecx
	movl $240, %r10d
	leaq _aesni_enc4, %r11
	leaq _aesni_dec4, %rax
	cmovel %r10d, %ecx
	cmoveq %rax, %r11

	movdqa .Lgf128mul_x_ble_mask, GF128MUL_MASK
	movups (IVP), IV

	mov 480(KEYP), KLEN
	addq %rcx, KEYP

	movdqa IV, STATE1
	movdqu 0x00(INP), INC
	pxor INC, STATE1
	movdqu IV, 0x00(OUTP)

	_aesni_gf128mul_x_ble()
	movdqa IV, STATE2
	movdqu 0x10(INP), INC
	pxor INC, STATE2
	movdqu IV, 0x10(OUTP)

	_aesni_gf128mul_x_ble()
	movdqa IV, STATE3
	movdqu 0x20(INP), INC
	pxor INC, STATE3
	movdqu IV, 0x20(OUTP)

	_aesni_gf128mul_x_ble()
	movdqa IV, STATE4
	movdqu 0x30(INP), INC
	pxor INC, STATE4
	movdqu IV, 0x30(OUTP)

	CALL_NOSPEC %r11

	movdqu 0x00(OUTP), INC
	pxor INC, STATE1
	movdqu STATE1, 0x00(OUTP)

	_aesni_gf128mul_x_ble()
	movdqa IV, STATE1
	movdqu 0x40(INP), INC
	pxor INC, STATE1
	movdqu IV, 0x40(OUTP)

	movdqu 0x10(OUTP), INC
	pxor INC, STATE2
	movdqu STATE2, 0x10(OUTP)

	_aesni_gf128mul_x_ble()
	movdqa IV, STATE2
	movdqu 0x50(INP), INC
	pxor INC, STATE2
	movdqu IV, 0x50(OUTP)

	movdqu 0x20(OUTP), INC
	pxor INC, STATE3
	movdqu STATE3, 0x20(OUTP)

	_aesni_gf128mul_x_ble()
	movdqa IV, STATE3
	movdqu 0x60(INP), INC
	pxor INC, STATE3
	movdqu IV, 0x60(OUTP)

	movdqu 0x30(OUTP), INC
	pxor INC, STATE4
	movdqu STATE4, 0x30(OUTP)

	_aesni_gf128mul_x_ble()
	movdqa IV, STATE4
	movdqu 0x70(INP), INC
	pxor INC, STATE4
	movdqu IV, 0x70(OUTP)

	_aesni_gf128mul_x_ble()
	movups IV, (IVP)

	CALL_NOSPEC %r11

	movdqu 0x40(OUTP), INC
	pxor INC, STATE1
	movdqu STATE1, 0x40(OUTP)

	movdqu 0x50(OUTP), INC
	pxor INC, STATE2
	movdqu STATE2, 0x50(OUTP)

	movdqu 0x60(OUTP), INC
	pxor INC, STATE3
	movdqu STATE3, 0x60(OUTP)

	movdqu 0x70(OUTP), INC
	pxor INC, STATE4
	movdqu STATE4, 0x70(OUTP)

	FRAME_END
	ret
ENDPROC(aesni_xts_crypt8)

#endif