crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM

From:   Eric Biggers <ebiggers@google.com>
Date:   2024-06-07 19:47:58 +08:00
Commit: b06affb1cb

Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
AVX10.  There are two implementations, sharing most source code: one
using 256-bit vectors and one using 512-bit vectors.  This patch
improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.

I wrote the new AES-GCM assembly code from scratch, focusing on
correctness, performance, code size (both source and binary), and
documenting the source.  The new assembly file aes-gcm-avx10-x86_64.S is
about 1200 lines including extensive comments, and it generates less
than 8 KB of binary code.  The main loop does 4 vectors at a time, with
the AES and GHASH instructions interleaved.  Any remainder is handled
using a simple 1 vector at a time loop, with masking.
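
As a rough illustration of that structure, here is a minimal,
self-contained C sketch using AVX512F/BW intrinsics.  This is not the
actual assembly: process_vec() is a hypothetical placeholder for the
real AES + GHASH work, and the instruction-level interleaving is
omitted; only the "4 vectors per iteration, masked 1-vector remainder"
loop shape is shown.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical stand-in for one 64-byte vector's worth of the real
     * work (AES-CTR encryption plus GHASH accumulation); here it just
     * XORs a constant so the sketch is self-contained.
     */
    static __m512i process_vec(__m512i v)
    {
            return _mm512_xor_si512(v, _mm512_set1_epi8(0x42));
    }

    /* Compile with -mavx512f -mavx512bw. */
    void process_buffer(uint8_t *dst, const uint8_t *src, size_t len)
    {
            /* Main loop: 4 x 64-byte vectors per iteration. */
            while (len >= 4 * 64) {
                    for (int i = 0; i < 4; i++) {
                            __m512i v = _mm512_loadu_si512(src + i * 64);
                            _mm512_storeu_si512(dst + i * 64, process_vec(v));
                    }
                    src += 4 * 64;
                    dst += 4 * 64;
                    len -= 4 * 64;
            }

            /* Remainder: 1 vector at a time, with a byte mask for the tail. */
            while (len) {
                    size_t n = len < 64 ? len : 64;
                    __mmask64 m = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
                    __m512i v = _mm512_maskz_loadu_epi8(m, src);
                    _mm512_mask_storeu_epi8(dst, m, process_vec(v));
                    src += n;
                    dst += n;
                    len -= n;
            }
    }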

Several VAES + AVX512 implementations of AES-GCM exist from Intel,
including one in OpenSSL and one proposed for inclusion in Linux in 2021
(https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
These aren't really suitable to be used, though, due to the massive
amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
as well as the significantly larger amount of assembly source (4978
lines for OpenSSL, 1788 lines for Linux).  Also, Intel's code does not
support 256-bit vectors, which makes it not usable on future
AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have
downclocking issues.  So I ended up starting from scratch.  Usually my
much shorter code is actually slightly faster than Intel's AVX512 code,
though it depends on message length and on which of Intel's
implementations is used; for details, see Tables 3 and 4 below.

To facilitate potential integration into other projects, I've
dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
the same as the recently added RISC-V crypto code.

The following two tables summarize the performance improvement over the
existing AES-GCM code in Linux that uses AES-NI and AVX2:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   42% |   48% |   60% |   62% |   70% |   69% |
Intel Sapphire Rapids |  157% |  145% |  162% |  119% |   96% |   96% |
Intel Emerald Rapids  |  156% |  144% |  161% |  115% |   95% |  100% |
AMD Zen 4             |  103% |   89% |   78% |   56% |   54% |   54% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   66% |   48% |   49% |   70% |   53% |
Intel Sapphire Rapids |   80% |   60% |   41% |   62% |   38% |
Intel Emerald Rapids  |   79% |   60% |   41% |   62% |   38% |
AMD Zen 4             |   51% |   35% |   27% |   32% |   25% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   42% |   48% |   59% |   63% |   67% |   71% |
Intel Sapphire Rapids |  159% |  145% |  161% |  125% |  102% |  100% |
Intel Emerald Rapids  |  158% |  144% |  161% |  124% |  100% |  103% |
AMD Zen 4             |  110% |   95% |   80% |   59% |   56% |   54% |

                      |   300 |   200 |    64 |    63 |    16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake        |   67% |   56% |   46% |   70% |   56% |
Intel Sapphire Rapids |   79% |   62% |   39% |   61% |   39% |
Intel Emerald Rapids  |   80% |   62% |   40% |   58% |   40% |
AMD Zen 4             |   49% |   36% |   30% |   35% |   28% |

The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
listed as 50%.  They were collected by directly measuring the Linux
crypto API performance using a custom kernel module.  Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference.  All
these benchmarks used an associated data length of 16 bytes.  Note that
AES-GCM is almost always used with short associated data lengths.
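
For reference, here is a minimal sketch of how such a measurement can
be made through the in-kernel AEAD API.  This is not the actual
benchmark module (which is not part of this patch); the message length,
iteration count, and error handling are simplified for illustration.

    #include <crypto/aead.h>
    #include <linux/ktime.h>
    #include <linux/module.h>
    #include <linux/scatterlist.h>
    #include <linux/slab.h>

    static int __init aes_gcm_bench_init(void)
    {
            const size_t msglen = 16384, assoclen = 16;
            const int niter = 10000;
            u8 key[32] = {}, iv[12] = {};
            struct crypto_aead *tfm;
            struct aead_request *req = NULL;
            struct scatterlist sg;
            DECLARE_CRYPTO_WAIT(wait);
            u8 *buf = NULL;
            ktime_t start;
            u64 ns;
            int i, err;

            tfm = crypto_alloc_aead("gcm(aes)", 0, 0);
            if (IS_ERR(tfm))
                    return PTR_ERR(tfm);
            err = crypto_aead_setkey(tfm, key, sizeof(key));
            if (err)
                    goto out;
            err = crypto_aead_setauthsize(tfm, 16);
            if (err)
                    goto out;

            req = aead_request_alloc(tfm, GFP_KERNEL);
            buf = kzalloc(assoclen + msglen + 16, GFP_KERNEL);
            if (!req || !buf) {
                    err = -ENOMEM;
                    goto out;
            }

            /* In-place encryption: [assoc data | plaintext | room for tag] */
            sg_init_one(&sg, buf, assoclen + msglen + 16);
            aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                                      crypto_req_done, &wait);
            aead_request_set_ad(req, assoclen);
            aead_request_set_crypt(req, &sg, &sg, msglen, iv);

            start = ktime_get();
            for (i = 0; i < niter; i++) {
                    err = crypto_wait_req(crypto_aead_encrypt(req), &wait);
                    if (err)
                            goto out;
            }
            ns = ktime_to_ns(ktime_sub(ktime_get(), start));
            if (!ns)
                    ns = 1;
            pr_info("AES-256-GCM encryption: ~%llu MB/s\n",
                    (u64)msglen * niter * 1000 / ns);
    out:
            kfree(buf);
            aead_request_free(req);
            crypto_free_aead(tfm);
            /* Fail on purpose so the module never stays loaded. */
            return err ?: -EAGAIN;
    }

    module_init(aes_gcm_bench_init);
    MODULE_DESCRIPTION("Toy AES-GCM throughput measurement");
    MODULE_LICENSE("GPL");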

The following two tables summarize how the performance of my code
compares with Intel's AVX512 AES-GCM code, both the version that is in
OpenSSL and the version that was proposed for inclusion in Linux.
Neither version exists in Linux currently, but these are alternative
AES-GCM implementations that could be chosen instead of mine.  I
collected the following numbers on Emerald Rapids using a userspace
benchmark program that calls the assembly functions directly.

I've also included a comparison with Cloudflare's AES-GCM implementation
from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.

Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
         implementation name vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation  | 14171 | 12956 | 12318 |  9588 |  7293 |  6449 |
AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 |  9107 |  5891 |  6472 |
AVX512_Intel_Linux   | 13954 | 12277 | 11530 |  8712 |  6627 |  5898 |
AVX512_Cloudflare    | 12564 | 11050 | 10905 |  8152 |  5345 |  5202 |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
This implementation  |  4939 |  3688 |  1846 |  1821 |   738 |
AVX512_Intel_OpenSSL |  4629 |  4532 |  2734 |  2332 |  1131 |
AVX512_Intel_Linux   |  4035 |  2966 |  1567 |  1330 |   639 |
AVX512_Cloudflare    |  3344 |  2485 |  1141 |  1127 |   456 |

Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
         implementation name vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation  | 14276 | 13311 | 13007 | 11086 |  8268 |  8086 |
AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 |  9587 |  5954 |  7060 |
AVX512_Intel_Linux   | 14116 | 12795 | 11778 |  9269 |  7735 |  6455 |
AVX512_Cloudflare    | 13301 | 12018 | 11919 |  9182 |  7189 |  6726 |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
This implementation  |  6454 |  5020 |  2635 |  2602 |  1079 |
AVX512_Intel_OpenSSL |  5184 |  5799 |  2957 |  2545 |  1228 |
AVX512_Intel_Linux   |  4394 |  4247 |  2235 |  1635 |   922 |
AVX512_Cloudflare    |  4289 |  3851 |  1435 |  1417 |   574 |

So, usually my code is actually slightly faster than Intel's code,
though the OpenSSL implementation has a slight edge on messages shorter
than 256 bytes in this microbenchmark.  (This also holds true when
running the same tests on AMD Zen 4.)  The much larger code size of the
Intel implementations (up to 94x larger!) doesn't seem to bring much
benefit, so starting from scratch with much smaller code, as I've done,
seems appropriate.  The performance of my code on messages shorter than
256 bytes could be improved through a limited amount of unrolling, but
it's unclear whether it would be worth it, given code size
considerations (e.g. instruction cache pressure) that don't get
measured in microbenchmarks.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>