crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
#!/usr/bin/env perl
# SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
2020-01-05 22:40:47 -05:00
#
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
# Copyright (C) 2017-2018 Samuel Neves <sneves@dei.uc.pt>. All Rights Reserved.
# Copyright (C) 2017-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
# Copyright (C) 2006-2017 CRYPTOGAMS by <appro@openssl.org>. All Rights Reserved.
#
# This code is taken from the OpenSSL project but the author, Andy Polyakov,
# has relicensed it under the licenses specified in the SPDX header above.
# The original headers, including the original license headers, are
# included below for completeness.
2020-01-05 22:40:47 -05:00
#
# ====================================================================
# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
# project. The module is, however, dual licensed under OpenSSL and
# CRYPTOGAMS licenses depending on where you obtain it. For further
# details see http://www.openssl.org/~appro/cryptogams/.
# ====================================================================
#
# This module implements Poly1305 hash for x86_64.
#
# March 2015
#
# Initial release.
#
# December 2016
#
# Add AVX512F+VL+BW code path.
#
# November 2017
#
# Convert AVX512F+VL+BW code path to pure AVX512F, so that it can be
# executed even on Knights Landing. Trigger for modification was
# observation that AVX512 code paths can negatively affect overall
# Skylake-X system performance. Since we are likely to suppress
# AVX512F capability flag [at least on Skylake-X], conversion serves
# as kind of "investment protection". Note that next *lake processor,
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
# Cannonlake, has AVX512IFMA code path to execute...
2020-01-05 22:40:47 -05:00
#
# Numbers are cycles per processed byte with poly1305_blocks alone,
# measured with rdtsc at fixed clock frequency.
#
# IALU/gcc-4.8(*) AVX(**) AVX2 AVX-512
# P4 4.46/+120% -
# Core 2 2.41/+90% -
# Westmere 1.88/+120% -
# Sandy Bridge 1.39/+140% 1.10
# Haswell 1.14/+175% 1.11 0.65
# Skylake[-X] 1.13/+120% 0.96 0.51 [0.35]
# Silvermont 2.83/+95% -
# Knights L 3.60/? 1.65 1.10 0.41(***)
# Goldmont 1.70/+180% -
# VIA Nano 1.82/+150% -
# Sledgehammer 1.38/+160% -
# Bulldozer 2.30/+130% 0.97
# Ryzen 1.15/+200% 1.08 1.18
#
# (*) improvement coefficients relative to clang are more modest and
# are ~50% on most processors, in both cases we are comparing to
# __int128 code;
# (**) SSE2 implementation was attempted, but among non-AVX processors
# it was faster than integer-only code only on older Intel P4 and
# Core processors, 50-30%, less newer processor is, but slower on
# contemporary ones, for example almost 2x slower on Atom, and as
# former are naturally disappearing, SSE2 is deemed unnecessary;
# (***) strangely enough performance seems to vary from core to core,
# listed result is best case;
$ flavour = shift ;
$ output = shift ;
if ( $ flavour =~ /\./ ) { $ output = $ flavour ; undef $ flavour ; }
$ win64 = 0 ; $ win64 = 1 if ( $ flavour =~ /[nm]asm|mingw64/ || $ output =~ /\.asm$/ ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
$ kernel = 0 ; $ kernel = 1 if ( ! $ flavour && ! $ output ) ;
if ( ! $ kernel ) {
$ 0 =~ m/(.*[\/\\])[^\/\\]+$/ ; $ dir = $ 1 ;
( $ xlate = "${dir}x86_64-xlate.pl" and - f $ xlate ) or
( $ xlate = "${dir}../../perlasm/x86_64-xlate.pl" and - f $ xlate ) or
die "can't locate x86_64-xlate.pl" ;
open OUT , "| \"$^X\" \"$xlate\" $flavour \"$output\"" ;
* STDOUT = * OUT ;
if ( `$ENV{CC} -Wa,-v -c -o /dev/null -x assembler /dev/null 2>&1`
=~ /GNU assembler version ([2-9]\.[0-9]+)/ ) {
$ avx = ( $ 1 >= 2.19 ) + ( $ 1 >= 2.22 ) + ( $ 1 >= 2.25 ) ;
}
if ( ! $ avx && $ win64 && ( $ flavour =~ /nasm/ || $ ENV { ASM } =~ /nasm/ ) &&
`nasm -v 2>&1` =~ /NASM version ([2-9]\.[0-9]+)(?:\.([0-9]+))?/ ) {
$ avx = ( $ 1 >= 2.09 ) + ( $ 1 >= 2.10 ) + ( $ 1 >= 2.12 ) ;
$ avx += 1 if ( $ 1 == 2.11 && $ 2 >= 8 ) ;
}
if ( ! $ avx && $ win64 && ( $ flavour =~ /masm/ || $ ENV { ASM } =~ /ml64/ ) &&
`ml64 2>&1` =~ /Version ([0-9]+)\./ ) {
$ avx = ( $ 1 >= 10 ) + ( $ 1 >= 11 ) ;
}
if ( ! $ avx && `$ENV{CC} -v 2>&1` =~ /((?:^clang|LLVM) version|.*based on LLVM) ([3-9]\.[0-9]+)/ ) {
$ avx = ( $ 2 >= 3.0 ) + ( $ 2 > 3.0 ) ;
}
} else {
$ avx = 4 ; # The kernel uses ifdefs for this.
2020-01-05 22:40:47 -05:00
}
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
sub declare_function () {
my ( $ name , $ align , $ nargs ) = @ _ ;
if ( $ kernel ) {
$ code . = ".align $align\n" ;
$ code . = "SYM_FUNC_START($name)\n" ;
$ code . = ".L$name:\n" ;
} else {
$ code . = ".globl $name\n" ;
$ code . = ".type $name,\@function,$nargs\n" ;
$ code . = ".align $align\n" ;
$ code . = "$name:\n" ;
}
2020-01-05 22:40:47 -05:00
}
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
sub end_function () {
my ( $ name ) = @ _ ;
if ( $ kernel ) {
$ code . = "SYM_FUNC_END($name)\n" ;
} else {
$ code . = ".size $name,.-$name\n" ;
}
2020-01-05 22:40:47 -05:00
}
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
$ code . = << ___ if $ kernel ;
#include <linux/linkage.h>
___
if ( $ avx ) {
$ code . = << ___ if $ kernel ;
. section . rodata
___
$ code . = << ___ ;
. align 64
. Lconst:
. Lmask24:
. long 0x0ffffff , 0 , 0x0ffffff , 0 , 0x0ffffff , 0 , 0x0ffffff , 0
. L129:
. long `1<<24` , 0 , `1<<24` , 0 , `1<<24` , 0 , `1<<24` , 0
. Lmask26:
. long 0x3ffffff , 0 , 0x3ffffff , 0 , 0x3ffffff , 0 , 0x3ffffff , 0
. Lpermd_avx2:
. long 2 , 2 , 2 , 3 , 2 , 0 , 2 , 1
. Lpermd_avx512:
. long 0 , 0 , 0 , 1 , 0 , 2 , 0 , 3 , 0 , 4 , 0 , 5 , 0 , 6 , 0 , 7
. L2_44_inp_permd:
. long 0 , 1 , 1 , 2 , 2 , 3 , 7 , 7
. L2_44_inp_shift:
. quad 0 , 12 , 24 , 64
. L2_44_mask:
. quad 0xfffffffffff , 0xfffffffffff , 0x3ffffffffff , 0xffffffffffffffff
. L2_44_shift_rgt:
. quad 44 , 44 , 42 , 64
. L2_44_shift_lft:
. quad 8 , 8 , 10 , 64
2020-01-05 22:40:47 -05:00
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. align 64
. Lx_mask44:
. quad 0xfffffffffff , 0xfffffffffff , 0xfffffffffff , 0xfffffffffff
. quad 0xfffffffffff , 0xfffffffffff , 0xfffffffffff , 0xfffffffffff
. Lx_mask42:
. quad 0x3ffffffffff , 0x3ffffffffff , 0x3ffffffffff , 0x3ffffffffff
. quad 0x3ffffffffff , 0x3ffffffffff , 0x3ffffffffff , 0x3ffffffffff
___
}
$ code . = << ___ if ( ! $ kernel ) ;
. asciz "Poly1305 for x86_64, CRYPTOGAMS by <appro\@openssl.org>"
. align 16
___
2020-01-05 22:40:47 -05:00
my ( $ ctx , $ inp , $ len , $ padbit ) = ( "%rdi" , "%rsi" , "%rdx" , "%rcx" ) ;
my ( $ mac , $ nonce ) = ( $ inp , $ len ) ; # *_emit arguments
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
my ( $ d1 , $ d2 , $ d3 , $ r0 , $ r1 , $ s1 ) = ( "%r8" , "%r9" , "%rdi" , "%r11" , "%r12" , "%r13" ) ;
my ( $ h0 , $ h1 , $ h2 ) = ( "%r14" , "%rbx" , "%r10" ) ;
2020-01-05 22:40:47 -05:00
sub poly1305_iteration {
# input: copy of $r1 in %rax, $h0-$h2, $r0-$r1
# output: $h0-$h2 *= $r0-$r1
$ code . = << ___ ;
mulq $ h0 # h0*r1
mov % rax , $ d2
mov $ r0 , % rax
mov % rdx , $ d3
mulq $ h0 # h0*r0
mov % rax , $ h0 # future $h0
mov $ r0 , % rax
mov % rdx , $ d1
mulq $ h1 # h1*r0
add % rax , $ d2
mov $ s1 , % rax
adc % rdx , $ d3
mulq $ h1 # h1*s1
mov $ h2 , $ h1 # borrow $h1
add % rax , $ h0
adc % rdx , $ d1
imulq $ s1 , $ h1 # h2*s1
add $ h1 , $ d2
mov $ d1 , $ h1
adc \ $ 0 , $ d3
imulq $ r0 , $ h2 # h2*r0
add $ d2 , $ h1
mov \ $ - 4 , % rax # mask value
adc $ h2 , $ d3
and $ d3 , % rax # last reduction step
mov $ d3 , $ h2
shr \ $ 2 , $ d3
and \ $ 3 , $ h2
add $ d3 , % rax
add % rax , $ h0
adc \ $ 0 , $ h1
adc \ $ 0 , $ h2
___
}
########################################################################
# Layout of opaque area is following.
#
# unsigned __int64 h[3]; # current hash value base 2^64
# unsigned __int64 r[2]; # key value base 2^64
$ code . = << ___ ;
. text
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
___
$ code . = << ___ if ( ! $ kernel ) ;
2020-01-05 22:40:47 -05:00
. extern OPENSSL_ia32cap_P
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. globl poly1305_init_x86_64
. hidden poly1305_init_x86_64
. globl poly1305_blocks_x86_64
. hidden poly1305_blocks_x86_64
. globl poly1305_emit_x86_64
. hidden poly1305_emit_x86_64
___
& declare_function ( "poly1305_init_x86_64" , 32 , 3 ) ;
$ code . = << ___ ;
2020-08-27 19:38:31 +02:00
xor % eax , % eax
2020-01-05 22:40:47 -05:00
mov % rax , 0 ( $ ctx ) # initialize hash value
mov % rax , 8 ( $ ctx )
mov % rax , 16 ( $ ctx )
2020-11-27 11:18:12 +01:00
test $ inp , $ inp
2020-01-05 22:40:47 -05:00
je . Lno_key
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
$ code . = << ___ if ( ! $ kernel ) ;
lea poly1305_blocks_x86_64 ( % rip ) , % r10
lea poly1305_emit_x86_64 ( % rip ) , % r11
___
$ code . = << ___ if ( ! $ kernel && $ avx ) ;
2020-01-05 22:40:47 -05:00
mov OPENSSL_ia32cap_P + 4 ( % rip ) , % r9
lea poly1305_blocks_avx ( % rip ) , % rax
lea poly1305_emit_avx ( % rip ) , % rcx
bt \ $ ` 60 - 32 ` , % r9 # AVX?
cmovc % rax , % r10
cmovc % rcx , % r11
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
$ code . = << ___ if ( ! $ kernel && $ avx > 1 ) ;
2020-01-05 22:40:47 -05:00
lea poly1305_blocks_avx2 ( % rip ) , % rax
bt \ $ ` 5 + 32 ` , % r9 # AVX2?
cmovc % rax , % r10
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
$ code . = << ___ if ( ! $ kernel && $ avx > 3 ) ;
2020-01-05 22:40:47 -05:00
mov \ $` ( 1 << 31 | 1 << 21 | 1 << 16 ) ` , % rax
shr \ $ 32 , % r9
and % rax , % r9
cmp % rax , % r9
je . Linit_base2_44
___
$ code . = << ___ ;
mov \ $ 0x0ffffffc0fffffff , % rax
mov \ $ 0x0ffffffc0ffffffc , % rcx
and 0 ( $ inp ) , % rax
and 8 ( $ inp ) , % rcx
mov % rax , 24 ( $ ctx )
mov % rcx , 32 ( $ ctx )
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
$ code . = << ___ if ( ! $ kernel && $ flavour !~ /elf32/ ) ;
2020-01-05 22:40:47 -05:00
mov % r10 , 0 ( % rdx )
mov % r11 , 8 ( % rdx )
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
$ code . = << ___ if ( ! $ kernel && $ flavour =~ /elf32/ ) ;
2020-01-05 22:40:47 -05:00
mov % r10d , 0 ( % rdx )
mov % r11d , 4 ( % rdx )
___
$ code . = << ___ ;
mov \ $ 1 , % eax
. Lno_key:
ret
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
___
& end_function ( "poly1305_init_x86_64" ) ;
2020-01-05 22:40:47 -05:00
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
& declare_function ( "poly1305_blocks_x86_64" , 32 , 4 ) ;
$ code . = << ___ ;
2020-01-05 22:40:47 -05:00
. cfi_startproc
. Lblocks:
shr \ $ 4 , $ len
jz . Lno_data # too short
push % rbx
. cfi_push % rbx
push % r12
. cfi_push % r12
push % r13
. cfi_push % r13
push % r14
. cfi_push % r14
push % r15
. cfi_push % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
push $ ctx
. cfi_push $ ctx
2020-01-05 22:40:47 -05:00
. Lblocks_body:
mov $ len , % r15 # reassign $len
mov 24 ( $ ctx ) , $ r0 # load r
mov 32 ( $ ctx ) , $ s1
mov 0 ( $ ctx ) , $ h0 # load hash value
mov 8 ( $ ctx ) , $ h1
mov 16 ( $ ctx ) , $ h2
mov $ s1 , $ r1
shr \ $ 2 , $ s1
mov $ r1 , % rax
add $ r1 , $ s1 # s1 = r1 + (r1 >> 2)
jmp . Loop
. align 32
. Loop:
add 0 ( $ inp ) , $ h0 # accumulate input
adc 8 ( $ inp ) , $ h1
lea 16 ( $ inp ) , $ inp
adc $ padbit , $ h2
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
2020-01-05 22:40:47 -05:00
& poly1305_iteration ( ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
2020-01-05 22:40:47 -05:00
$ code . = << ___ ;
mov $ r1 , % rax
dec % r15 # len-=16
jnz . Loop
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
mov 0 ( % rsp ) , $ ctx
. cfi_restore $ ctx
2020-01-05 22:40:47 -05:00
mov $ h0 , 0 ( $ ctx ) # store hash value
mov $ h1 , 8 ( $ ctx )
mov $ h2 , 16 ( $ ctx )
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
mov 8 ( % rsp ) , % r15
2020-01-05 22:40:47 -05:00
. cfi_restore % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
mov 16 ( % rsp ) , % r14
2020-01-05 22:40:47 -05:00
. cfi_restore % r14
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
mov 24 ( % rsp ) , % r13
2020-01-05 22:40:47 -05:00
. cfi_restore % r13
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
mov 32 ( % rsp ) , % r12
2020-01-05 22:40:47 -05:00
. cfi_restore % r12
mov 40 ( % rsp ) , % rbx
. cfi_restore % rbx
lea 48 ( % rsp ) , % rsp
. cfi_adjust_cfa_offset - 48
. Lno_data:
. Lblocks_epilogue:
ret
. cfi_endproc
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
___
& end_function ( "poly1305_blocks_x86_64" ) ;
2020-01-05 22:40:47 -05:00
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
& declare_function ( "poly1305_emit_x86_64" , 32 , 3 ) ;
$ code . = << ___ ;
2020-01-05 22:40:47 -05:00
. Lemit:
mov 0 ( $ ctx ) , % r8 # load hash value
mov 8 ( $ ctx ) , % r9
mov 16 ( $ ctx ) , % r10
mov % r8 , % rax
add \ $ 5 , % r8 # compare to modulus
mov % r9 , % rcx
adc \ $ 0 , % r9
adc \ $ 0 , % r10
shr \ $ 2 , % r10 # did 130-bit value overflow?
cmovnz % r8 , % rax
cmovnz % r9 , % rcx
add 0 ( $ nonce ) , % rax # accumulate nonce
adc 8 ( $ nonce ) , % rcx
mov % rax , 0 ( $ mac ) # write result
mov % rcx , 8 ( $ mac )
ret
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
& end_function ( "poly1305_emit_x86_64" ) ;
2020-01-05 22:40:47 -05:00
if ( $ avx ) {
########################################################################
# Layout of opaque area is following.
#
# unsigned __int32 h[5]; # current hash value base 2^26
# unsigned __int32 is_base2_26;
# unsigned __int64 r[2]; # key value base 2^64
# unsigned __int64 pad;
# struct { unsigned __int32 r^2, r^1, r^4, r^3; } r[9];
#
# where r^n are base 2^26 digits of degrees of multiplier key. There are
# 5 digits, but last four are interleaved with multiples of 5, totalling
# in 9 elements: r0, r1, 5*r1, r2, 5*r2, r3, 5*r3, r4, 5*r4.
my ( $ H0 , $ H1 , $ H2 , $ H3 , $ H4 , $ T0 , $ T1 , $ T2 , $ T3 , $ T4 , $ D0 , $ D1 , $ D2 , $ D3 , $ D4 , $ MASK ) =
map ( "%xmm$_" , ( 0 .. 15 ) ) ;
$ code . = << ___ ;
. type __poly1305_block , \ @ abi - omnipotent
. align 32
__poly1305_block:
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
push $ ctx
2020-01-05 22:40:47 -05:00
___
& poly1305_iteration ( ) ;
$ code . = << ___ ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop $ ctx
2020-01-05 22:40:47 -05:00
ret
. size __poly1305_block , . - __poly1305_block
. type __poly1305_init_avx , \ @ abi - omnipotent
. align 32
__poly1305_init_avx:
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
push % rbp
mov % rsp , % rbp
2020-01-05 22:40:47 -05:00
mov $ r0 , $ h0
mov $ r1 , $ h1
xor $ h2 , $ h2
lea 48 + 64 ( $ ctx ) , $ ctx # size optimization
mov $ r1 , % rax
call __poly1305_block # r^2
mov \ $ 0x3ffffff , % eax # save interleaved r^2 and r base 2^26
mov \ $ 0x3ffffff , % edx
mov $ h0 , $ d1
and $ h0 #d,%eax
mov $ r0 , $ d2
and $ r0 #d,%edx
mov % eax , `16*0+0-64` ( $ ctx )
shr \ $ 26 , $ d1
mov % edx , `16*0+4-64` ( $ ctx )
shr \ $ 26 , $ d2
mov \ $ 0x3ffffff , % eax
mov \ $ 0x3ffffff , % edx
and $ d1 #d,%eax
and $ d2 #d,%edx
mov % eax , `16*1+0-64` ( $ ctx )
lea ( % rax , % rax , 4 ) , % eax # *5
mov % edx , `16*1+4-64` ( $ ctx )
lea ( % rdx , % rdx , 4 ) , % edx # *5
mov % eax , `16*2+0-64` ( $ ctx )
shr \ $ 26 , $ d1
mov % edx , `16*2+4-64` ( $ ctx )
shr \ $ 26 , $ d2
mov $ h1 , % rax
mov $ r1 , % rdx
shl \ $ 12 , % rax
shl \ $ 12 , % rdx
or $ d1 , % rax
or $ d2 , % rdx
and \ $ 0x3ffffff , % eax
and \ $ 0x3ffffff , % edx
mov % eax , `16*3+0-64` ( $ ctx )
lea ( % rax , % rax , 4 ) , % eax # *5
mov % edx , `16*3+4-64` ( $ ctx )
lea ( % rdx , % rdx , 4 ) , % edx # *5
mov % eax , `16*4+0-64` ( $ ctx )
mov $ h1 , $ d1
mov % edx , `16*4+4-64` ( $ ctx )
mov $ r1 , $ d2
mov \ $ 0x3ffffff , % eax
mov \ $ 0x3ffffff , % edx
shr \ $ 14 , $ d1
shr \ $ 14 , $ d2
and $ d1 #d,%eax
and $ d2 #d,%edx
mov % eax , `16*5+0-64` ( $ ctx )
lea ( % rax , % rax , 4 ) , % eax # *5
mov % edx , `16*5+4-64` ( $ ctx )
lea ( % rdx , % rdx , 4 ) , % edx # *5
mov % eax , `16*6+0-64` ( $ ctx )
shr \ $ 26 , $ d1
mov % edx , `16*6+4-64` ( $ ctx )
shr \ $ 26 , $ d2
mov $ h2 , % rax
shl \ $ 24 , % rax
or % rax , $ d1
mov $ d1 #d,`16*7+0-64`($ctx)
lea ( $ d1 , $ d1 , 4 ) , $ d1 # *5
mov $ d2 #d,`16*7+4-64`($ctx)
lea ( $ d2 , $ d2 , 4 ) , $ d2 # *5
mov $ d1 #d,`16*8+0-64`($ctx)
mov $ d2 #d,`16*8+4-64`($ctx)
mov $ r1 , % rax
call __poly1305_block # r^3
mov \ $ 0x3ffffff , % eax # save r^3 base 2^26
mov $ h0 , $ d1
and $ h0 #d,%eax
shr \ $ 26 , $ d1
mov % eax , `16*0+12-64` ( $ ctx )
mov \ $ 0x3ffffff , % edx
and $ d1 #d,%edx
mov % edx , `16*1+12-64` ( $ ctx )
lea ( % rdx , % rdx , 4 ) , % edx # *5
shr \ $ 26 , $ d1
mov % edx , `16*2+12-64` ( $ ctx )
mov $ h1 , % rax
shl \ $ 12 , % rax
or $ d1 , % rax
and \ $ 0x3ffffff , % eax
mov % eax , `16*3+12-64` ( $ ctx )
lea ( % rax , % rax , 4 ) , % eax # *5
mov $ h1 , $ d1
mov % eax , `16*4+12-64` ( $ ctx )
mov \ $ 0x3ffffff , % edx
shr \ $ 14 , $ d1
and $ d1 #d,%edx
mov % edx , `16*5+12-64` ( $ ctx )
lea ( % rdx , % rdx , 4 ) , % edx # *5
shr \ $ 26 , $ d1
mov % edx , `16*6+12-64` ( $ ctx )
mov $ h2 , % rax
shl \ $ 24 , % rax
or % rax , $ d1
mov $ d1 #d,`16*7+12-64`($ctx)
lea ( $ d1 , $ d1 , 4 ) , $ d1 # *5
mov $ d1 #d,`16*8+12-64`($ctx)
mov $ r1 , % rax
call __poly1305_block # r^4
mov \ $ 0x3ffffff , % eax # save r^4 base 2^26
mov $ h0 , $ d1
and $ h0 #d,%eax
shr \ $ 26 , $ d1
mov % eax , `16*0+8-64` ( $ ctx )
mov \ $ 0x3ffffff , % edx
and $ d1 #d,%edx
mov % edx , `16*1+8-64` ( $ ctx )
lea ( % rdx , % rdx , 4 ) , % edx # *5
shr \ $ 26 , $ d1
mov % edx , `16*2+8-64` ( $ ctx )
mov $ h1 , % rax
shl \ $ 12 , % rax
or $ d1 , % rax
and \ $ 0x3ffffff , % eax
mov % eax , `16*3+8-64` ( $ ctx )
lea ( % rax , % rax , 4 ) , % eax # *5
mov $ h1 , $ d1
mov % eax , `16*4+8-64` ( $ ctx )
mov \ $ 0x3ffffff , % edx
shr \ $ 14 , $ d1
and $ d1 #d,%edx
mov % edx , `16*5+8-64` ( $ ctx )
lea ( % rdx , % rdx , 4 ) , % edx # *5
shr \ $ 26 , $ d1
mov % edx , `16*6+8-64` ( $ ctx )
mov $ h2 , % rax
shl \ $ 24 , % rax
or % rax , $ d1
mov $ d1 #d,`16*7+8-64`($ctx)
lea ( $ d1 , $ d1 , 4 ) , $ d1 # *5
mov $ d1 #d,`16*8+8-64`($ctx)
lea - 48 - 64 ( $ ctx ) , $ ctx # size [de-]optimization
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % rbp
2020-01-05 22:40:47 -05:00
ret
. size __poly1305_init_avx , . - __poly1305_init_avx
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
___
2020-01-05 22:40:47 -05:00
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
& declare_function ( "poly1305_blocks_avx" , 32 , 4 ) ;
$ code . = << ___ ;
2020-01-05 22:40:47 -05:00
. cfi_startproc
mov 20 ( $ ctx ) , % r8d # is_base2_26
cmp \ $ 128 , $ len
jae . Lblocks_avx
test % r8d , % r8d
jz . Lblocks
. Lblocks_avx:
and \ $ - 16 , $ len
jz . Lno_data_avx
vzeroupper
test % r8d , % r8d
jz . Lbase2_64_avx
test \ $ 31 , $ len
jz . Leven_avx
push % rbp
. cfi_push % rbp
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
mov % rsp , % rbp
push % rbx
. cfi_push % rbx
2020-01-05 22:40:47 -05:00
push % r12
. cfi_push % r12
push % r13
. cfi_push % r13
push % r14
. cfi_push % r14
push % r15
. cfi_push % r15
. Lblocks_avx_body:
mov $ len , % r15 # reassign $len
mov 0 ( $ ctx ) , $ d1 # load hash value
mov 8 ( $ ctx ) , $ d2
mov 16 ( $ ctx ) , $ h2 #d
mov 24 ( $ ctx ) , $ r0 # load r
mov 32 ( $ ctx ) , $ s1
################################# base 2^26 -> base 2^64
mov $ d1 #d,$h0#d
and \ $` - 1 * ( 1 << 31 ) ` , $ d1
mov $ d2 , $ r1 # borrow $r1
mov $ d2 #d,$h1#d
and \ $` - 1 * ( 1 << 31 ) ` , $ d2
shr \ $ 6 , $ d1
shl \ $ 52 , $ r1
add $ d1 , $ h0
shr \ $ 12 , $ h1
shr \ $ 18 , $ d2
add $ r1 , $ h0
adc $ d2 , $ h1
mov $ h2 , $ d1
shl \ $ 40 , $ d1
shr \ $ 24 , $ h2
add $ d1 , $ h1
adc \ $ 0 , $ h2 # can be partially reduced...
mov \ $ - 4 , $ d2 # ... so reduce
mov $ h2 , $ d1
and $ h2 , $ d2
shr \ $ 2 , $ d1
and \ $ 3 , $ h2
add $ d2 , $ d1 # =*5
add $ d1 , $ h0
adc \ $ 0 , $ h1
adc \ $ 0 , $ h2
mov $ s1 , $ r1
mov $ s1 , % rax
shr \ $ 2 , $ s1
add $ r1 , $ s1 # s1 = r1 + (r1 >> 2)
add 0 ( $ inp ) , $ h0 # accumulate input
adc 8 ( $ inp ) , $ h1
lea 16 ( $ inp ) , $ inp
adc $ padbit , $ h2
call __poly1305_block
test $ padbit , $ padbit # if $padbit is zero,
jz . Lstore_base2_64_avx # store hash in base 2^64 format
################################# base 2^64 -> base 2^26
mov $ h0 , % rax
mov $ h0 , % rdx
shr \ $ 52 , $ h0
mov $ h1 , $ r0
mov $ h1 , $ r1
shr \ $ 26 , % rdx
and \ $ 0x3ffffff , % rax # h[0]
shl \ $ 12 , $ r0
and \ $ 0x3ffffff , % rdx # h[1]
shr \ $ 14 , $ h1
or $ r0 , $ h0
shl \ $ 24 , $ h2
and \ $ 0x3ffffff , $ h0 # h[2]
shr \ $ 40 , $ r1
and \ $ 0x3ffffff , $ h1 # h[3]
or $ r1 , $ h2 # h[4]
sub \ $ 1 6 , % r15
jz . Lstore_base2_26_avx
vmovd % rax #d,$H0
vmovd % rdx #d,$H1
vmovd $ h0 #d,$H2
vmovd $ h1 #d,$H3
vmovd $ h2 #d,$H4
jmp . Lproceed_avx
. align 32
. Lstore_base2_64_avx:
mov $ h0 , 0 ( $ ctx )
mov $ h1 , 8 ( $ ctx )
mov $ h2 , 16 ( $ ctx ) # note that is_base2_26 is zeroed
jmp . Ldone_avx
. align 16
. Lstore_base2_26_avx:
mov % rax #d,0($ctx) # store hash value base 2^26
mov % rdx #d,4($ctx)
mov $ h0 #d,8($ctx)
mov $ h1 #d,12($ctx)
mov $ h2 #d,16($ctx)
. align 16
. Ldone_avx:
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r15
2020-01-05 22:40:47 -05:00
. cfi_restore % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r14
2020-01-05 22:40:47 -05:00
. cfi_restore % r14
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r13
2020-01-05 22:40:47 -05:00
. cfi_restore % r13
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r12
2020-01-05 22:40:47 -05:00
. cfi_restore % r12
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % rbx
2020-01-05 22:40:47 -05:00
. cfi_restore % rbx
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % rbp
. cfi_restore % rbp
2020-01-05 22:40:47 -05:00
. Lno_data_avx:
. Lblocks_avx_epilogue:
ret
. cfi_endproc
. align 32
. Lbase2_64_avx:
. cfi_startproc
push % rbp
. cfi_push % rbp
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
mov % rsp , % rbp
push % rbx
. cfi_push % rbx
2020-01-05 22:40:47 -05:00
push % r12
. cfi_push % r12
push % r13
. cfi_push % r13
push % r14
. cfi_push % r14
push % r15
. cfi_push % r15
. Lbase2_64_avx_body:
mov $ len , % r15 # reassign $len
mov 24 ( $ ctx ) , $ r0 # load r
mov 32 ( $ ctx ) , $ s1
mov 0 ( $ ctx ) , $ h0 # load hash value
mov 8 ( $ ctx ) , $ h1
mov 16 ( $ ctx ) , $ h2 #d
mov $ s1 , $ r1
mov $ s1 , % rax
shr \ $ 2 , $ s1
add $ r1 , $ s1 # s1 = r1 + (r1 >> 2)
test \ $ 31 , $ len
jz . Linit_avx
add 0 ( $ inp ) , $ h0 # accumulate input
adc 8 ( $ inp ) , $ h1
lea 16 ( $ inp ) , $ inp
adc $ padbit , $ h2
sub \ $ 1 6 , % r15
call __poly1305_block
. Linit_avx:
################################# base 2^64 -> base 2^26
mov $ h0 , % rax
mov $ h0 , % rdx
shr \ $ 52 , $ h0
mov $ h1 , $ d1
mov $ h1 , $ d2
shr \ $ 26 , % rdx
and \ $ 0x3ffffff , % rax # h[0]
shl \ $ 12 , $ d1
and \ $ 0x3ffffff , % rdx # h[1]
shr \ $ 14 , $ h1
or $ d1 , $ h0
shl \ $ 24 , $ h2
and \ $ 0x3ffffff , $ h0 # h[2]
shr \ $ 40 , $ d2
and \ $ 0x3ffffff , $ h1 # h[3]
or $ d2 , $ h2 # h[4]
vmovd % rax #d,$H0
vmovd % rdx #d,$H1
vmovd $ h0 #d,$H2
vmovd $ h1 #d,$H3
vmovd $ h2 #d,$H4
movl \ $ 1 , 20 ( $ ctx ) # set is_base2_26
call __poly1305_init_avx
. Lproceed_avx:
mov % r15 , $ len
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r15
2020-01-05 22:40:47 -05:00
. cfi_restore % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r14
2020-01-05 22:40:47 -05:00
. cfi_restore % r14
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r13
2020-01-05 22:40:47 -05:00
. cfi_restore % r13
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r12
2020-01-05 22:40:47 -05:00
. cfi_restore % r12
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % rbx
2020-01-05 22:40:47 -05:00
. cfi_restore % rbx
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % rbp
. cfi_restore % rbp
2020-01-05 22:40:47 -05:00
. Lbase2_64_avx_epilogue:
jmp . Ldo_avx
. cfi_endproc
. align 32
. Leven_avx:
. cfi_startproc
vmovd 4 * 0 ( $ ctx ) , $ H0 # load hash value
vmovd 4 * 1 ( $ ctx ) , $ H1
vmovd 4 * 2 ( $ ctx ) , $ H2
vmovd 4 * 3 ( $ ctx ) , $ H3
vmovd 4 * 4 ( $ ctx ) , $ H4
. Ldo_avx:
___
$ code . = << ___ if ( ! $ win64 ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
lea 8 ( % rsp ) , % r10
. cfi_def_cfa_register % r10
and \ $ - 32 , % rsp
sub \ $ - 8 , % rsp
2020-01-05 22:40:47 -05:00
lea - 0x58 ( % rsp ) , % r11
sub \ $ 0 x178 , % rsp
___
$ code . = << ___ if ( $ win64 ) ;
lea - 0xf8 ( % rsp ) , % r11
sub \ $ 0 x218 , % rsp
vmovdqa % xmm6 , 0x50 ( % r11 )
vmovdqa % xmm7 , 0x60 ( % r11 )
vmovdqa % xmm8 , 0x70 ( % r11 )
vmovdqa % xmm9 , 0x80 ( % r11 )
vmovdqa % xmm10 , 0x90 ( % r11 )
vmovdqa % xmm11 , 0xa0 ( % r11 )
vmovdqa % xmm12 , 0xb0 ( % r11 )
vmovdqa % xmm13 , 0xc0 ( % r11 )
vmovdqa % xmm14 , 0xd0 ( % r11 )
vmovdqa % xmm15 , 0xe0 ( % r11 )
. Ldo_avx_body:
___
$ code . = << ___ ;
sub \ $ 6 4 , $ len
lea - 32 ( $ inp ) , % rax
cmovc % rax , $ inp
vmovdqu `16*3` ( $ ctx ) , $ D4 # preload r0^2
lea `16*3+64` ( $ ctx ) , $ ctx # size optimization
lea . Lconst ( % rip ) , % rcx
################################################################
# load input
vmovdqu 16 * 2 ( $ inp ) , $ T0
vmovdqu 16 * 3 ( $ inp ) , $ T1
vmovdqa 64 ( % rcx ) , $ MASK # .Lmask26
vpsrldq \ $ 6 , $ T0 , $ T2 # splat input
vpsrldq \ $ 6 , $ T1 , $ T3
vpunpckhqdq $ T1 , $ T0 , $ T4 # 4
vpunpcklqdq $ T1 , $ T0 , $ T0 # 0:1
vpunpcklqdq $ T3 , $ T2 , $ T3 # 2:3
vpsrlq \ $ 40 , $ T4 , $ T4 # 4
vpsrlq \ $ 26 , $ T0 , $ T1
vpand $ MASK , $ T0 , $ T0 # 0
vpsrlq \ $ 4 , $ T3 , $ T2
vpand $ MASK , $ T1 , $ T1 # 1
vpsrlq \ $ 30 , $ T3 , $ T3
vpand $ MASK , $ T2 , $ T2 # 2
vpand $ MASK , $ T3 , $ T3 # 3
vpor 32 ( % rcx ) , $ T4 , $ T4 # padbit, yes, always
jbe . Lskip_loop_avx
# expand and copy pre-calculated table to stack
vmovdqu `16*1-64` ( $ ctx ) , $ D1
vmovdqu `16*2-64` ( $ ctx ) , $ D2
vpshufd \ $ 0xEE , $ D4 , $ D3 # 34xx -> 3434
vpshufd \ $ 0x44 , $ D4 , $ D0 # xx12 -> 1212
vmovdqa $ D3 , - 0x90 ( % r11 )
vmovdqa $ D0 , 0x00 ( % rsp )
vpshufd \ $ 0xEE , $ D1 , $ D4
vmovdqu `16*3-64` ( $ ctx ) , $ D0
vpshufd \ $ 0x44 , $ D1 , $ D1
vmovdqa $ D4 , - 0x80 ( % r11 )
vmovdqa $ D1 , 0x10 ( % rsp )
vpshufd \ $ 0xEE , $ D2 , $ D3
vmovdqu `16*4-64` ( $ ctx ) , $ D1
vpshufd \ $ 0x44 , $ D2 , $ D2
vmovdqa $ D3 , - 0x70 ( % r11 )
vmovdqa $ D2 , 0x20 ( % rsp )
vpshufd \ $ 0xEE , $ D0 , $ D4
vmovdqu `16*5-64` ( $ ctx ) , $ D2
vpshufd \ $ 0x44 , $ D0 , $ D0
vmovdqa $ D4 , - 0x60 ( % r11 )
vmovdqa $ D0 , 0x30 ( % rsp )
vpshufd \ $ 0xEE , $ D1 , $ D3
vmovdqu `16*6-64` ( $ ctx ) , $ D0
vpshufd \ $ 0x44 , $ D1 , $ D1
vmovdqa $ D3 , - 0x50 ( % r11 )
vmovdqa $ D1 , 0x40 ( % rsp )
vpshufd \ $ 0xEE , $ D2 , $ D4
vmovdqu `16*7-64` ( $ ctx ) , $ D1
vpshufd \ $ 0x44 , $ D2 , $ D2
vmovdqa $ D4 , - 0x40 ( % r11 )
vmovdqa $ D2 , 0x50 ( % rsp )
vpshufd \ $ 0xEE , $ D0 , $ D3
vmovdqu `16*8-64` ( $ ctx ) , $ D2
vpshufd \ $ 0x44 , $ D0 , $ D0
vmovdqa $ D3 , - 0x30 ( % r11 )
vmovdqa $ D0 , 0x60 ( % rsp )
vpshufd \ $ 0xEE , $ D1 , $ D4
vpshufd \ $ 0x44 , $ D1 , $ D1
vmovdqa $ D4 , - 0x20 ( % r11 )
vmovdqa $ D1 , 0x70 ( % rsp )
vpshufd \ $ 0xEE , $ D2 , $ D3
vmovdqa 0x00 ( % rsp ) , $ D4 # preload r0^2
vpshufd \ $ 0x44 , $ D2 , $ D2
vmovdqa $ D3 , - 0x10 ( % r11 )
vmovdqa $ D2 , 0x80 ( % rsp )
jmp . Loop_avx
. align 32
. Loop_avx:
################################################################
# ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2
# ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^3+inp[7]*r
# \___________________/
# ((inp[0]*r^4+inp[2]*r^2+inp[4])*r^4+inp[6]*r^2+inp[8])*r^2
# ((inp[1]*r^4+inp[3]*r^2+inp[5])*r^4+inp[7]*r^2+inp[9])*r
# \___________________/ \____________________/
#
# Note that we start with inp[2:3]*r^2. This is because it
# doesn't depend on reduction in previous iteration.
################################################################
# d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4
# d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4
# d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4
# d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4
# d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
#
# though note that $Tx and $Hx are "reversed" in this section,
# and $D4 is preloaded with r0^2...
vpmuludq $ T0 , $ D4 , $ D0 # d0 = h0*r0
vpmuludq $ T1 , $ D4 , $ D1 # d1 = h1*r0
vmovdqa $ H2 , 0x20 ( % r11 ) # offload hash
vpmuludq $ T2 , $ D4 , $ D2 # d3 = h2*r0
vmovdqa 0x10 ( % rsp ) , $ H2 # r1^2
vpmuludq $ T3 , $ D4 , $ D3 # d3 = h3*r0
vpmuludq $ T4 , $ D4 , $ D4 # d4 = h4*r0
vmovdqa $ H0 , 0x00 ( % r11 ) #
vpmuludq 0x20 ( % rsp ) , $ T4 , $ H0 # h4*s1
vmovdqa $ H1 , 0x10 ( % r11 ) #
vpmuludq $ T3 , $ H2 , $ H1 # h3*r1
vpaddq $ H0 , $ D0 , $ D0 # d0 += h4*s1
vpaddq $ H1 , $ D4 , $ D4 # d4 += h3*r1
vmovdqa $ H3 , 0x30 ( % r11 ) #
vpmuludq $ T2 , $ H2 , $ H0 # h2*r1
vpmuludq $ T1 , $ H2 , $ H1 # h1*r1
vpaddq $ H0 , $ D3 , $ D3 # d3 += h2*r1
vmovdqa 0x30 ( % rsp ) , $ H3 # r2^2
vpaddq $ H1 , $ D2 , $ D2 # d2 += h1*r1
vmovdqa $ H4 , 0x40 ( % r11 ) #
vpmuludq $ T0 , $ H2 , $ H2 # h0*r1
vpmuludq $ T2 , $ H3 , $ H0 # h2*r2
vpaddq $ H2 , $ D1 , $ D1 # d1 += h0*r1
vmovdqa 0x40 ( % rsp ) , $ H4 # s2^2
vpaddq $ H0 , $ D4 , $ D4 # d4 += h2*r2
vpmuludq $ T1 , $ H3 , $ H1 # h1*r2
vpmuludq $ T0 , $ H3 , $ H3 # h0*r2
vpaddq $ H1 , $ D3 , $ D3 # d3 += h1*r2
vmovdqa 0x50 ( % rsp ) , $ H2 # r3^2
vpaddq $ H3 , $ D2 , $ D2 # d2 += h0*r2
vpmuludq $ T4 , $ H4 , $ H0 # h4*s2
vpmuludq $ T3 , $ H4 , $ H4 # h3*s2
vpaddq $ H0 , $ D1 , $ D1 # d1 += h4*s2
vmovdqa 0x60 ( % rsp ) , $ H3 # s3^2
vpaddq $ H4 , $ D0 , $ D0 # d0 += h3*s2
vmovdqa 0x80 ( % rsp ) , $ H4 # s4^2
vpmuludq $ T1 , $ H2 , $ H1 # h1*r3
vpmuludq $ T0 , $ H2 , $ H2 # h0*r3
vpaddq $ H1 , $ D4 , $ D4 # d4 += h1*r3
vpaddq $ H2 , $ D3 , $ D3 # d3 += h0*r3
vpmuludq $ T4 , $ H3 , $ H0 # h4*s3
vpmuludq $ T3 , $ H3 , $ H1 # h3*s3
vpaddq $ H0 , $ D2 , $ D2 # d2 += h4*s3
vmovdqu 16 * 0 ( $ inp ) , $ H0 # load input
vpaddq $ H1 , $ D1 , $ D1 # d1 += h3*s3
vpmuludq $ T2 , $ H3 , $ H3 # h2*s3
vpmuludq $ T2 , $ H4 , $ T2 # h2*s4
vpaddq $ H3 , $ D0 , $ D0 # d0 += h2*s3
vmovdqu 16 * 1 ( $ inp ) , $ H1 #
vpaddq $ T2 , $ D1 , $ D1 # d1 += h2*s4
vpmuludq $ T3 , $ H4 , $ T3 # h3*s4
vpmuludq $ T4 , $ H4 , $ T4 # h4*s4
vpsrldq \ $ 6 , $ H0 , $ H2 # splat input
vpaddq $ T3 , $ D2 , $ D2 # d2 += h3*s4
vpaddq $ T4 , $ D3 , $ D3 # d3 += h4*s4
vpsrldq \ $ 6 , $ H1 , $ H3 #
vpmuludq 0x70 ( % rsp ) , $ T0 , $ T4 # h0*r4
vpmuludq $ T1 , $ H4 , $ T0 # h1*s4
vpunpckhqdq $ H1 , $ H0 , $ H4 # 4
vpaddq $ T4 , $ D4 , $ D4 # d4 += h0*r4
vmovdqa - 0x90 ( % r11 ) , $ T4 # r0^4
vpaddq $ T0 , $ D0 , $ D0 # d0 += h1*s4
vpunpcklqdq $ H1 , $ H0 , $ H0 # 0:1
vpunpcklqdq $ H3 , $ H2 , $ H3 # 2:3
#vpsrlq \$40,$H4,$H4 # 4
vpsrldq \ $ ` 40 / 8 ` , $ H4 , $ H4 # 4
vpsrlq \ $ 26 , $ H0 , $ H1
vpand $ MASK , $ H0 , $ H0 # 0
vpsrlq \ $ 4 , $ H3 , $ H2
vpand $ MASK , $ H1 , $ H1 # 1
vpand 0 ( % rcx ) , $ H4 , $ H4 # .Lmask24
vpsrlq \ $ 30 , $ H3 , $ H3
vpand $ MASK , $ H2 , $ H2 # 2
vpand $ MASK , $ H3 , $ H3 # 3
vpor 32 ( % rcx ) , $ H4 , $ H4 # padbit, yes, always
vpaddq 0x00 ( % r11 ) , $ H0 , $ H0 # add hash value
vpaddq 0x10 ( % r11 ) , $ H1 , $ H1
vpaddq 0x20 ( % r11 ) , $ H2 , $ H2
vpaddq 0x30 ( % r11 ) , $ H3 , $ H3
vpaddq 0x40 ( % r11 ) , $ H4 , $ H4
lea 16 * 2 ( $ inp ) , % rax
lea 16 * 4 ( $ inp ) , $ inp
sub \ $ 6 4 , $ len
cmovc % rax , $ inp
################################################################
# Now we accumulate (inp[0:1]+hash)*r^4
################################################################
# d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4
# d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4
# d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4
# d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4
# d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
vpmuludq $ H0 , $ T4 , $ T0 # h0*r0
vpmuludq $ H1 , $ T4 , $ T1 # h1*r0
vpaddq $ T0 , $ D0 , $ D0
vpaddq $ T1 , $ D1 , $ D1
vmovdqa - 0x80 ( % r11 ) , $ T2 # r1^4
vpmuludq $ H2 , $ T4 , $ T0 # h2*r0
vpmuludq $ H3 , $ T4 , $ T1 # h3*r0
vpaddq $ T0 , $ D2 , $ D2
vpaddq $ T1 , $ D3 , $ D3
vpmuludq $ H4 , $ T4 , $ T4 # h4*r0
vpmuludq - 0x70 ( % r11 ) , $ H4 , $ T0 # h4*s1
vpaddq $ T4 , $ D4 , $ D4
vpaddq $ T0 , $ D0 , $ D0 # d0 += h4*s1
vpmuludq $ H2 , $ T2 , $ T1 # h2*r1
vpmuludq $ H3 , $ T2 , $ T0 # h3*r1
vpaddq $ T1 , $ D3 , $ D3 # d3 += h2*r1
vmovdqa - 0x60 ( % r11 ) , $ T3 # r2^4
vpaddq $ T0 , $ D4 , $ D4 # d4 += h3*r1
vpmuludq $ H1 , $ T2 , $ T1 # h1*r1
vpmuludq $ H0 , $ T2 , $ T2 # h0*r1
vpaddq $ T1 , $ D2 , $ D2 # d2 += h1*r1
vpaddq $ T2 , $ D1 , $ D1 # d1 += h0*r1
vmovdqa - 0x50 ( % r11 ) , $ T4 # s2^4
vpmuludq $ H2 , $ T3 , $ T0 # h2*r2
vpmuludq $ H1 , $ T3 , $ T1 # h1*r2
vpaddq $ T0 , $ D4 , $ D4 # d4 += h2*r2
vpaddq $ T1 , $ D3 , $ D3 # d3 += h1*r2
vmovdqa - 0x40 ( % r11 ) , $ T2 # r3^4
vpmuludq $ H0 , $ T3 , $ T3 # h0*r2
vpmuludq $ H4 , $ T4 , $ T0 # h4*s2
vpaddq $ T3 , $ D2 , $ D2 # d2 += h0*r2
vpaddq $ T0 , $ D1 , $ D1 # d1 += h4*s2
vmovdqa - 0x30 ( % r11 ) , $ T3 # s3^4
vpmuludq $ H3 , $ T4 , $ T4 # h3*s2
vpmuludq $ H1 , $ T2 , $ T1 # h1*r3
vpaddq $ T4 , $ D0 , $ D0 # d0 += h3*s2
vmovdqa - 0x10 ( % r11 ) , $ T4 # s4^4
vpaddq $ T1 , $ D4 , $ D4 # d4 += h1*r3
vpmuludq $ H0 , $ T2 , $ T2 # h0*r3
vpmuludq $ H4 , $ T3 , $ T0 # h4*s3
vpaddq $ T2 , $ D3 , $ D3 # d3 += h0*r3
vpaddq $ T0 , $ D2 , $ D2 # d2 += h4*s3
vmovdqu 16 * 2 ( $ inp ) , $ T0 # load input
vpmuludq $ H3 , $ T3 , $ T2 # h3*s3
vpmuludq $ H2 , $ T3 , $ T3 # h2*s3
vpaddq $ T2 , $ D1 , $ D1 # d1 += h3*s3
vmovdqu 16 * 3 ( $ inp ) , $ T1 #
vpaddq $ T3 , $ D0 , $ D0 # d0 += h2*s3
vpmuludq $ H2 , $ T4 , $ H2 # h2*s4
vpmuludq $ H3 , $ T4 , $ H3 # h3*s4
vpsrldq \ $ 6 , $ T0 , $ T2 # splat input
vpaddq $ H2 , $ D1 , $ D1 # d1 += h2*s4
vpmuludq $ H4 , $ T4 , $ H4 # h4*s4
vpsrldq \ $ 6 , $ T1 , $ T3 #
vpaddq $ H3 , $ D2 , $ H2 # h2 = d2 + h3*s4
vpaddq $ H4 , $ D3 , $ H3 # h3 = d3 + h4*s4
vpmuludq - 0x20 ( % r11 ) , $ H0 , $ H4 # h0*r4
vpmuludq $ H1 , $ T4 , $ H0
vpunpckhqdq $ T1 , $ T0 , $ T4 # 4
vpaddq $ H4 , $ D4 , $ H4 # h4 = d4 + h0*r4
vpaddq $ H0 , $ D0 , $ H0 # h0 = d0 + h1*s4
vpunpcklqdq $ T1 , $ T0 , $ T0 # 0:1
vpunpcklqdq $ T3 , $ T2 , $ T3 # 2:3
#vpsrlq \$40,$T4,$T4 # 4
vpsrldq \ $ ` 40 / 8 ` , $ T4 , $ T4 # 4
vpsrlq \ $ 26 , $ T0 , $ T1
vmovdqa 0x00 ( % rsp ) , $ D4 # preload r0^2
vpand $ MASK , $ T0 , $ T0 # 0
vpsrlq \ $ 4 , $ T3 , $ T2
vpand $ MASK , $ T1 , $ T1 # 1
vpand 0 ( % rcx ) , $ T4 , $ T4 # .Lmask24
vpsrlq \ $ 30 , $ T3 , $ T3
vpand $ MASK , $ T2 , $ T2 # 2
vpand $ MASK , $ T3 , $ T3 # 3
vpor 32 ( % rcx ) , $ T4 , $ T4 # padbit, yes, always
################################################################
# lazy reduction as discussed in "NEON crypto" by D.J. Bernstein
# and P. Schwabe
vpsrlq \ $ 26 , $ H3 , $ D3
vpand $ MASK , $ H3 , $ H3
vpaddq $ D3 , $ H4 , $ H4 # h3 -> h4
vpsrlq \ $ 26 , $ H0 , $ D0
vpand $ MASK , $ H0 , $ H0
vpaddq $ D0 , $ D1 , $ H1 # h0 -> h1
vpsrlq \ $ 26 , $ H4 , $ D0
vpand $ MASK , $ H4 , $ H4
vpsrlq \ $ 26 , $ H1 , $ D1
vpand $ MASK , $ H1 , $ H1
vpaddq $ D1 , $ H2 , $ H2 # h1 -> h2
vpaddq $ D0 , $ H0 , $ H0
vpsllq \ $ 2 , $ D0 , $ D0
vpaddq $ D0 , $ H0 , $ H0 # h4 -> h0
vpsrlq \ $ 26 , $ H2 , $ D2
vpand $ MASK , $ H2 , $ H2
vpaddq $ D2 , $ H3 , $ H3 # h2 -> h3
vpsrlq \ $ 26 , $ H0 , $ D0
vpand $ MASK , $ H0 , $ H0
vpaddq $ D0 , $ H1 , $ H1 # h0 -> h1
vpsrlq \ $ 26 , $ H3 , $ D3
vpand $ MASK , $ H3 , $ H3
vpaddq $ D3 , $ H4 , $ H4 # h3 -> h4
ja . Loop_avx
. Lskip_loop_avx:
################################################################
# multiply (inp[0:1]+hash) or inp[2:3] by r^2:r^1
vpshufd \ $ 0x10 , $ D4 , $ D4 # r0^n, xx12 -> x1x2
add \ $ 32 , $ len
jnz . Long_tail_avx
vpaddq $ H2 , $ T2 , $ T2
vpaddq $ H0 , $ T0 , $ T0
vpaddq $ H1 , $ T1 , $ T1
vpaddq $ H3 , $ T3 , $ T3
vpaddq $ H4 , $ T4 , $ T4
. Long_tail_avx:
vmovdqa $ H2 , 0x20 ( % r11 )
vmovdqa $ H0 , 0x00 ( % r11 )
vmovdqa $ H1 , 0x10 ( % r11 )
vmovdqa $ H3 , 0x30 ( % r11 )
vmovdqa $ H4 , 0x40 ( % r11 )
# d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4
# d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4
# d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4
# d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4
# d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
vpmuludq $ T2 , $ D4 , $ D2 # d2 = h2*r0
vpmuludq $ T0 , $ D4 , $ D0 # d0 = h0*r0
vpshufd \ $ 0x10 , `16*1-64` ( $ ctx ) , $ H2 # r1^n
vpmuludq $ T1 , $ D4 , $ D1 # d1 = h1*r0
vpmuludq $ T3 , $ D4 , $ D3 # d3 = h3*r0
vpmuludq $ T4 , $ D4 , $ D4 # d4 = h4*r0
vpmuludq $ T3 , $ H2 , $ H0 # h3*r1
vpaddq $ H0 , $ D4 , $ D4 # d4 += h3*r1
vpshufd \ $ 0x10 , `16*2-64` ( $ ctx ) , $ H3 # s1^n
vpmuludq $ T2 , $ H2 , $ H1 # h2*r1
vpaddq $ H1 , $ D3 , $ D3 # d3 += h2*r1
vpshufd \ $ 0x10 , `16*3-64` ( $ ctx ) , $ H4 # r2^n
vpmuludq $ T1 , $ H2 , $ H0 # h1*r1
vpaddq $ H0 , $ D2 , $ D2 # d2 += h1*r1
vpmuludq $ T0 , $ H2 , $ H2 # h0*r1
vpaddq $ H2 , $ D1 , $ D1 # d1 += h0*r1
vpmuludq $ T4 , $ H3 , $ H3 # h4*s1
vpaddq $ H3 , $ D0 , $ D0 # d0 += h4*s1
vpshufd \ $ 0x10 , `16*4-64` ( $ ctx ) , $ H2 # s2^n
vpmuludq $ T2 , $ H4 , $ H1 # h2*r2
vpaddq $ H1 , $ D4 , $ D4 # d4 += h2*r2
vpmuludq $ T1 , $ H4 , $ H0 # h1*r2
vpaddq $ H0 , $ D3 , $ D3 # d3 += h1*r2
vpshufd \ $ 0x10 , `16*5-64` ( $ ctx ) , $ H3 # r3^n
vpmuludq $ T0 , $ H4 , $ H4 # h0*r2
vpaddq $ H4 , $ D2 , $ D2 # d2 += h0*r2
vpmuludq $ T4 , $ H2 , $ H1 # h4*s2
vpaddq $ H1 , $ D1 , $ D1 # d1 += h4*s2
vpshufd \ $ 0x10 , `16*6-64` ( $ ctx ) , $ H4 # s3^n
vpmuludq $ T3 , $ H2 , $ H2 # h3*s2
vpaddq $ H2 , $ D0 , $ D0 # d0 += h3*s2
vpmuludq $ T1 , $ H3 , $ H0 # h1*r3
vpaddq $ H0 , $ D4 , $ D4 # d4 += h1*r3
vpmuludq $ T0 , $ H3 , $ H3 # h0*r3
vpaddq $ H3 , $ D3 , $ D3 # d3 += h0*r3
vpshufd \ $ 0x10 , `16*7-64` ( $ ctx ) , $ H2 # r4^n
vpmuludq $ T4 , $ H4 , $ H1 # h4*s3
vpaddq $ H1 , $ D2 , $ D2 # d2 += h4*s3
vpshufd \ $ 0x10 , `16*8-64` ( $ ctx ) , $ H3 # s4^n
vpmuludq $ T3 , $ H4 , $ H0 # h3*s3
vpaddq $ H0 , $ D1 , $ D1 # d1 += h3*s3
vpmuludq $ T2 , $ H4 , $ H4 # h2*s3
vpaddq $ H4 , $ D0 , $ D0 # d0 += h2*s3
vpmuludq $ T0 , $ H2 , $ H2 # h0*r4
vpaddq $ H2 , $ D4 , $ D4 # h4 = d4 + h0*r4
vpmuludq $ T4 , $ H3 , $ H1 # h4*s4
vpaddq $ H1 , $ D3 , $ D3 # h3 = d3 + h4*s4
vpmuludq $ T3 , $ H3 , $ H0 # h3*s4
vpaddq $ H0 , $ D2 , $ D2 # h2 = d2 + h3*s4
vpmuludq $ T2 , $ H3 , $ H1 # h2*s4
vpaddq $ H1 , $ D1 , $ D1 # h1 = d1 + h2*s4
vpmuludq $ T1 , $ H3 , $ H3 # h1*s4
vpaddq $ H3 , $ D0 , $ D0 # h0 = d0 + h1*s4
jz . Lshort_tail_avx
vmovdqu 16 * 0 ( $ inp ) , $ H0 # load input
vmovdqu 16 * 1 ( $ inp ) , $ H1
vpsrldq \ $ 6 , $ H0 , $ H2 # splat input
vpsrldq \ $ 6 , $ H1 , $ H3
vpunpckhqdq $ H1 , $ H0 , $ H4 # 4
vpunpcklqdq $ H1 , $ H0 , $ H0 # 0:1
vpunpcklqdq $ H3 , $ H2 , $ H3 # 2:3
vpsrlq \ $ 40 , $ H4 , $ H4 # 4
vpsrlq \ $ 26 , $ H0 , $ H1
vpand $ MASK , $ H0 , $ H0 # 0
vpsrlq \ $ 4 , $ H3 , $ H2
vpand $ MASK , $ H1 , $ H1 # 1
vpsrlq \ $ 30 , $ H3 , $ H3
vpand $ MASK , $ H2 , $ H2 # 2
vpand $ MASK , $ H3 , $ H3 # 3
vpor 32 ( % rcx ) , $ H4 , $ H4 # padbit, yes, always
vpshufd \ $ 0x32 , `16*0-64` ( $ ctx ) , $ T4 # r0^n, 34xx -> x3x4
vpaddq 0x00 ( % r11 ) , $ H0 , $ H0
vpaddq 0x10 ( % r11 ) , $ H1 , $ H1
vpaddq 0x20 ( % r11 ) , $ H2 , $ H2
vpaddq 0x30 ( % r11 ) , $ H3 , $ H3
vpaddq 0x40 ( % r11 ) , $ H4 , $ H4
################################################################
# multiply (inp[0:1]+hash) by r^4:r^3 and accumulate
vpmuludq $ H0 , $ T4 , $ T0 # h0*r0
vpaddq $ T0 , $ D0 , $ D0 # d0 += h0*r0
vpmuludq $ H1 , $ T4 , $ T1 # h1*r0
vpaddq $ T1 , $ D1 , $ D1 # d1 += h1*r0
vpmuludq $ H2 , $ T4 , $ T0 # h2*r0
vpaddq $ T0 , $ D2 , $ D2 # d2 += h2*r0
vpshufd \ $ 0x32 , `16*1-64` ( $ ctx ) , $ T2 # r1^n
vpmuludq $ H3 , $ T4 , $ T1 # h3*r0
vpaddq $ T1 , $ D3 , $ D3 # d3 += h3*r0
vpmuludq $ H4 , $ T4 , $ T4 # h4*r0
vpaddq $ T4 , $ D4 , $ D4 # d4 += h4*r0
vpmuludq $ H3 , $ T2 , $ T0 # h3*r1
vpaddq $ T0 , $ D4 , $ D4 # d4 += h3*r1
vpshufd \ $ 0x32 , `16*2-64` ( $ ctx ) , $ T3 # s1
vpmuludq $ H2 , $ T2 , $ T1 # h2*r1
vpaddq $ T1 , $ D3 , $ D3 # d3 += h2*r1
vpshufd \ $ 0x32 , `16*3-64` ( $ ctx ) , $ T4 # r2
vpmuludq $ H1 , $ T2 , $ T0 # h1*r1
vpaddq $ T0 , $ D2 , $ D2 # d2 += h1*r1
vpmuludq $ H0 , $ T2 , $ T2 # h0*r1
vpaddq $ T2 , $ D1 , $ D1 # d1 += h0*r1
vpmuludq $ H4 , $ T3 , $ T3 # h4*s1
vpaddq $ T3 , $ D0 , $ D0 # d0 += h4*s1
vpshufd \ $ 0x32 , `16*4-64` ( $ ctx ) , $ T2 # s2
vpmuludq $ H2 , $ T4 , $ T1 # h2*r2
vpaddq $ T1 , $ D4 , $ D4 # d4 += h2*r2
vpmuludq $ H1 , $ T4 , $ T0 # h1*r2
vpaddq $ T0 , $ D3 , $ D3 # d3 += h1*r2
vpshufd \ $ 0x32 , `16*5-64` ( $ ctx ) , $ T3 # r3
vpmuludq $ H0 , $ T4 , $ T4 # h0*r2
vpaddq $ T4 , $ D2 , $ D2 # d2 += h0*r2
vpmuludq $ H4 , $ T2 , $ T1 # h4*s2
vpaddq $ T1 , $ D1 , $ D1 # d1 += h4*s2
vpshufd \ $ 0x32 , `16*6-64` ( $ ctx ) , $ T4 # s3
vpmuludq $ H3 , $ T2 , $ T2 # h3*s2
vpaddq $ T2 , $ D0 , $ D0 # d0 += h3*s2
vpmuludq $ H1 , $ T3 , $ T0 # h1*r3
vpaddq $ T0 , $ D4 , $ D4 # d4 += h1*r3
vpmuludq $ H0 , $ T3 , $ T3 # h0*r3
vpaddq $ T3 , $ D3 , $ D3 # d3 += h0*r3
vpshufd \ $ 0x32 , `16*7-64` ( $ ctx ) , $ T2 # r4
vpmuludq $ H4 , $ T4 , $ T1 # h4*s3
vpaddq $ T1 , $ D2 , $ D2 # d2 += h4*s3
vpshufd \ $ 0x32 , `16*8-64` ( $ ctx ) , $ T3 # s4
vpmuludq $ H3 , $ T4 , $ T0 # h3*s3
vpaddq $ T0 , $ D1 , $ D1 # d1 += h3*s3
vpmuludq $ H2 , $ T4 , $ T4 # h2*s3
vpaddq $ T4 , $ D0 , $ D0 # d0 += h2*s3
vpmuludq $ H0 , $ T2 , $ T2 # h0*r4
vpaddq $ T2 , $ D4 , $ D4 # d4 += h0*r4
vpmuludq $ H4 , $ T3 , $ T1 # h4*s4
vpaddq $ T1 , $ D3 , $ D3 # d3 += h4*s4
vpmuludq $ H3 , $ T3 , $ T0 # h3*s4
vpaddq $ T0 , $ D2 , $ D2 # d2 += h3*s4
vpmuludq $ H2 , $ T3 , $ T1 # h2*s4
vpaddq $ T1 , $ D1 , $ D1 # d1 += h2*s4
vpmuludq $ H1 , $ T3 , $ T3 # h1*s4
vpaddq $ T3 , $ D0 , $ D0 # d0 += h1*s4
. Lshort_tail_avx:
################################################################
# horizontal addition
vpsrldq \ $ 8 , $ D4 , $ T4
vpsrldq \ $ 8 , $ D3 , $ T3
vpsrldq \ $ 8 , $ D1 , $ T1
vpsrldq \ $ 8 , $ D0 , $ T0
vpsrldq \ $ 8 , $ D2 , $ T2
vpaddq $ T3 , $ D3 , $ D3
vpaddq $ T4 , $ D4 , $ D4
vpaddq $ T0 , $ D0 , $ D0
vpaddq $ T1 , $ D1 , $ D1
vpaddq $ T2 , $ D2 , $ D2
################################################################
# lazy reduction
vpsrlq \ $ 26 , $ D3 , $ H3
vpand $ MASK , $ D3 , $ D3
vpaddq $ H3 , $ D4 , $ D4 # h3 -> h4
vpsrlq \ $ 26 , $ D0 , $ H0
vpand $ MASK , $ D0 , $ D0
vpaddq $ H0 , $ D1 , $ D1 # h0 -> h1
vpsrlq \ $ 26 , $ D4 , $ H4
vpand $ MASK , $ D4 , $ D4
vpsrlq \ $ 26 , $ D1 , $ H1
vpand $ MASK , $ D1 , $ D1
vpaddq $ H1 , $ D2 , $ D2 # h1 -> h2
vpaddq $ H4 , $ D0 , $ D0
vpsllq \ $ 2 , $ H4 , $ H4
vpaddq $ H4 , $ D0 , $ D0 # h4 -> h0
vpsrlq \ $ 26 , $ D2 , $ H2
vpand $ MASK , $ D2 , $ D2
vpaddq $ H2 , $ D3 , $ D3 # h2 -> h3
vpsrlq \ $ 26 , $ D0 , $ H0
vpand $ MASK , $ D0 , $ D0
vpaddq $ H0 , $ D1 , $ D1 # h0 -> h1
vpsrlq \ $ 26 , $ D3 , $ H3
vpand $ MASK , $ D3 , $ D3
vpaddq $ H3 , $ D4 , $ D4 # h3 -> h4
vmovd $ D0 , `4*0-48-64` ( $ ctx ) # save partially reduced
vmovd $ D1 , `4*1-48-64` ( $ ctx )
vmovd $ D2 , `4*2-48-64` ( $ ctx )
vmovd $ D3 , `4*3-48-64` ( $ ctx )
vmovd $ D4 , `4*4-48-64` ( $ ctx )
___
$ code . = << ___ if ( $ win64 ) ;
vmovdqa 0x50 ( % r11 ) , % xmm6
vmovdqa 0x60 ( % r11 ) , % xmm7
vmovdqa 0x70 ( % r11 ) , % xmm8
vmovdqa 0x80 ( % r11 ) , % xmm9
vmovdqa 0x90 ( % r11 ) , % xmm10
vmovdqa 0xa0 ( % r11 ) , % xmm11
vmovdqa 0xb0 ( % r11 ) , % xmm12
vmovdqa 0xc0 ( % r11 ) , % xmm13
vmovdqa 0xd0 ( % r11 ) , % xmm14
vmovdqa 0xe0 ( % r11 ) , % xmm15
lea 0xf8 ( % r11 ) , % rsp
. Ldo_avx_epilogue:
___
$ code . = << ___ if ( ! $ win64 ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
lea - 8 ( % r10 ) , % rsp
. cfi_def_cfa_register % rsp
2020-01-05 22:40:47 -05:00
___
$ code . = << ___ ;
vzeroupper
ret
. cfi_endproc
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
___
& end_function ( "poly1305_blocks_avx" ) ;
2020-01-05 22:40:47 -05:00
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
& declare_function ( "poly1305_emit_avx" , 32 , 3 ) ;
$ code . = << ___ ;
2020-01-05 22:40:47 -05:00
cmpl \ $ 0 , 20 ( $ ctx ) # is_base2_26?
je . Lemit
mov 0 ( $ ctx ) , % eax # load hash value base 2^26
mov 4 ( $ ctx ) , % ecx
mov 8 ( $ ctx ) , % r8d
mov 12 ( $ ctx ) , % r11d
mov 16 ( $ ctx ) , % r10d
shl \ $ 26 , % rcx # base 2^26 -> base 2^64
mov % r8 , % r9
shl \ $ 52 , % r8
add % rcx , % rax
shr \ $ 12 , % r9
add % rax , % r8 # h0
adc \ $ 0 , % r9
shl \ $ 14 , % r11
mov % r10 , % rax
shr \ $ 24 , % r10
add % r11 , % r9
shl \ $ 40 , % rax
add % rax , % r9 # h1
adc \ $ 0 , % r10 # h2
mov % r10 , % rax # could be partially reduced, so reduce
mov % r10 , % rcx
and \ $ 3 , % r10
shr \ $ 2 , % rax
and \ $ - 4 , % rcx
add % rcx , % rax
add % rax , % r8
adc \ $ 0 , % r9
adc \ $ 0 , % r10
mov % r8 , % rax
add \ $ 5 , % r8 # compare to modulus
mov % r9 , % rcx
adc \ $ 0 , % r9
adc \ $ 0 , % r10
shr \ $ 2 , % r10 # did 130-bit value overflow?
cmovnz % r8 , % rax
cmovnz % r9 , % rcx
add 0 ( $ nonce ) , % rax # accumulate nonce
adc 8 ( $ nonce ) , % rcx
mov % rax , 0 ( $ mac ) # write result
mov % rcx , 8 ( $ mac )
ret
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
& end_function ( "poly1305_emit_avx" ) ;
2020-01-05 22:40:47 -05:00
if ( $ avx > 1 ) {
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
2020-01-05 22:40:47 -05:00
my ( $ H0 , $ H1 , $ H2 , $ H3 , $ H4 , $ MASK , $ T4 , $ T0 , $ T1 , $ T2 , $ T3 , $ D0 , $ D1 , $ D2 , $ D3 , $ D4 ) =
map ( "%ymm$_" , ( 0 .. 15 ) ) ;
my $ S4 = $ MASK ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
sub poly1305_blocks_avxN {
my ( $ avx512 ) = @ _ ;
my $ suffix = $ avx512 ? "_avx512" : "" ;
2020-01-05 22:40:47 -05:00
$ code . = << ___ ;
. cfi_startproc
mov 20 ( $ ctx ) , % r8d # is_base2_26
cmp \ $ 128 , $ len
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jae . Lblocks_avx2 $ suffix
2020-01-05 22:40:47 -05:00
test % r8d , % r8d
jz . Lblocks
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Lblocks_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
and \ $ - 16 , $ len
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jz . Lno_data_avx2 $ suffix
2020-01-05 22:40:47 -05:00
vzeroupper
test % r8d , % r8d
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jz . Lbase2_64_avx2 $ suffix
2020-01-05 22:40:47 -05:00
test \ $ 63 , $ len
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jz . Leven_avx2 $ suffix
2020-01-05 22:40:47 -05:00
push % rbp
. cfi_push % rbp
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
mov % rsp , % rbp
push % rbx
. cfi_push % rbx
2020-01-05 22:40:47 -05:00
push % r12
. cfi_push % r12
push % r13
. cfi_push % r13
push % r14
. cfi_push % r14
push % r15
. cfi_push % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Lblocks_avx2_body $ suffix:
2020-01-05 22:40:47 -05:00
mov $ len , % r15 # reassign $len
mov 0 ( $ ctx ) , $ d1 # load hash value
mov 8 ( $ ctx ) , $ d2
mov 16 ( $ ctx ) , $ h2 #d
mov 24 ( $ ctx ) , $ r0 # load r
mov 32 ( $ ctx ) , $ s1
################################# base 2^26 -> base 2^64
mov $ d1 #d,$h0#d
and \ $` - 1 * ( 1 << 31 ) ` , $ d1
mov $ d2 , $ r1 # borrow $r1
mov $ d2 #d,$h1#d
and \ $` - 1 * ( 1 << 31 ) ` , $ d2
shr \ $ 6 , $ d1
shl \ $ 52 , $ r1
add $ d1 , $ h0
shr \ $ 12 , $ h1
shr \ $ 18 , $ d2
add $ r1 , $ h0
adc $ d2 , $ h1
mov $ h2 , $ d1
shl \ $ 40 , $ d1
shr \ $ 24 , $ h2
add $ d1 , $ h1
adc \ $ 0 , $ h2 # can be partially reduced...
mov \ $ - 4 , $ d2 # ... so reduce
mov $ h2 , $ d1
and $ h2 , $ d2
shr \ $ 2 , $ d1
and \ $ 3 , $ h2
add $ d2 , $ d1 # =*5
add $ d1 , $ h0
adc \ $ 0 , $ h1
adc \ $ 0 , $ h2
mov $ s1 , $ r1
mov $ s1 , % rax
shr \ $ 2 , $ s1
add $ r1 , $ s1 # s1 = r1 + (r1 >> 2)
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Lbase2_26_pre_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
add 0 ( $ inp ) , $ h0 # accumulate input
adc 8 ( $ inp ) , $ h1
lea 16 ( $ inp ) , $ inp
adc $ padbit , $ h2
sub \ $ 1 6 , % r15
call __poly1305_block
mov $ r1 , % rax
test \ $ 63 , % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jnz . Lbase2_26_pre_avx2 $ suffix
2020-01-05 22:40:47 -05:00
test $ padbit , $ padbit # if $padbit is zero,
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jz . Lstore_base2_64_avx2 $ suffix # store hash in base 2^64 format
2020-01-05 22:40:47 -05:00
################################# base 2^64 -> base 2^26
mov $ h0 , % rax
mov $ h0 , % rdx
shr \ $ 52 , $ h0
mov $ h1 , $ r0
mov $ h1 , $ r1
shr \ $ 26 , % rdx
and \ $ 0x3ffffff , % rax # h[0]
shl \ $ 12 , $ r0
and \ $ 0x3ffffff , % rdx # h[1]
shr \ $ 14 , $ h1
or $ r0 , $ h0
shl \ $ 24 , $ h2
and \ $ 0x3ffffff , $ h0 # h[2]
shr \ $ 40 , $ r1
and \ $ 0x3ffffff , $ h1 # h[3]
or $ r1 , $ h2 # h[4]
test % r15 , % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jz . Lstore_base2_26_avx2 $ suffix
2020-01-05 22:40:47 -05:00
vmovd % rax #d,%x#$H0
vmovd % rdx #d,%x#$H1
vmovd $ h0 #d,%x#$H2
vmovd $ h1 #d,%x#$H3
vmovd $ h2 #d,%x#$H4
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jmp . Lproceed_avx2 $ suffix
2020-01-05 22:40:47 -05:00
. align 32
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Lstore_base2_64_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
mov $ h0 , 0 ( $ ctx )
mov $ h1 , 8 ( $ ctx )
mov $ h2 , 16 ( $ ctx ) # note that is_base2_26 is zeroed
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jmp . Ldone_avx2 $ suffix
2020-01-05 22:40:47 -05:00
. align 16
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Lstore_base2_26_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
mov % rax #d,0($ctx) # store hash value base 2^26
mov % rdx #d,4($ctx)
mov $ h0 #d,8($ctx)
mov $ h1 #d,12($ctx)
mov $ h2 #d,16($ctx)
. align 16
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Ldone_avx2 $ suffix:
pop % r15
2020-01-05 22:40:47 -05:00
. cfi_restore % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r14
2020-01-05 22:40:47 -05:00
. cfi_restore % r14
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r13
2020-01-05 22:40:47 -05:00
. cfi_restore % r13
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r12
2020-01-05 22:40:47 -05:00
. cfi_restore % r12
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % rbx
2020-01-05 22:40:47 -05:00
. cfi_restore % rbx
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % rbp
. cfi_restore % rbp
. Lno_data_avx2 $ suffix:
. Lblocks_avx2_epilogue $ suffix:
2020-01-05 22:40:47 -05:00
ret
. cfi_endproc
. align 32
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Lbase2_64_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
. cfi_startproc
push % rbp
. cfi_push % rbp
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
mov % rsp , % rbp
push % rbx
. cfi_push % rbx
2020-01-05 22:40:47 -05:00
push % r12
. cfi_push % r12
push % r13
. cfi_push % r13
push % r14
. cfi_push % r14
push % r15
. cfi_push % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Lbase2_64_avx2_body $ suffix:
2020-01-05 22:40:47 -05:00
mov $ len , % r15 # reassign $len
mov 24 ( $ ctx ) , $ r0 # load r
mov 32 ( $ ctx ) , $ s1
mov 0 ( $ ctx ) , $ h0 # load hash value
mov 8 ( $ ctx ) , $ h1
mov 16 ( $ ctx ) , $ h2 #d
mov $ s1 , $ r1
mov $ s1 , % rax
shr \ $ 2 , $ s1
add $ r1 , $ s1 # s1 = r1 + (r1 >> 2)
test \ $ 63 , $ len
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jz . Linit_avx2 $ suffix
2020-01-05 22:40:47 -05:00
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Lbase2_64_pre_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
add 0 ( $ inp ) , $ h0 # accumulate input
adc 8 ( $ inp ) , $ h1
lea 16 ( $ inp ) , $ inp
adc $ padbit , $ h2
sub \ $ 1 6 , % r15
call __poly1305_block
mov $ r1 , % rax
test \ $ 63 , % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jnz . Lbase2_64_pre_avx2 $ suffix
2020-01-05 22:40:47 -05:00
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Linit_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
################################# base 2^64 -> base 2^26
mov $ h0 , % rax
mov $ h0 , % rdx
shr \ $ 52 , $ h0
mov $ h1 , $ d1
mov $ h1 , $ d2
shr \ $ 26 , % rdx
and \ $ 0x3ffffff , % rax # h[0]
shl \ $ 12 , $ d1
and \ $ 0x3ffffff , % rdx # h[1]
shr \ $ 14 , $ h1
or $ d1 , $ h0
shl \ $ 24 , $ h2
and \ $ 0x3ffffff , $ h0 # h[2]
shr \ $ 40 , $ d2
and \ $ 0x3ffffff , $ h1 # h[3]
or $ d2 , $ h2 # h[4]
vmovd % rax #d,%x#$H0
vmovd % rdx #d,%x#$H1
vmovd $ h0 #d,%x#$H2
vmovd $ h1 #d,%x#$H3
vmovd $ h2 #d,%x#$H4
movl \ $ 1 , 20 ( $ ctx ) # set is_base2_26
call __poly1305_init_avx
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Lproceed_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
mov % r15 , $ len # restore $len
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
___
$ code . = << ___ if ( ! $ kernel ) ;
mov OPENSSL_ia32cap_P + 8 ( % rip ) , % r9d
2020-01-05 22:40:47 -05:00
mov \ $` ( 1 << 31 | 1 << 30 | 1 << 16 ) ` , % r11d
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
___
$ code . = << ___ ;
pop % r15
2020-01-05 22:40:47 -05:00
. cfi_restore % r15
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r14
2020-01-05 22:40:47 -05:00
. cfi_restore % r14
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r13
2020-01-05 22:40:47 -05:00
. cfi_restore % r13
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % r12
2020-01-05 22:40:47 -05:00
. cfi_restore % r12
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % rbx
2020-01-05 22:40:47 -05:00
. cfi_restore % rbx
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
pop % rbp
. cfi_restore % rbp
. Lbase2_64_avx2_epilogue $ suffix:
jmp . Ldo_avx2 $ suffix
2020-01-05 22:40:47 -05:00
. cfi_endproc
. align 32
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Leven_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
. cfi_startproc
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
___
$ code . = << ___ if ( ! $ kernel ) ;
mov OPENSSL_ia32cap_P + 8 ( % rip ) , % r9d
___
$ code . = << ___ ;
2020-01-05 22:40:47 -05:00
vmovd 4 * 0 ( $ ctx ) , % x #$H0 # load hash value base 2^26
vmovd 4 * 1 ( $ ctx ) , % x #$H1
vmovd 4 * 2 ( $ ctx ) , % x #$H2
vmovd 4 * 3 ( $ ctx ) , % x #$H3
vmovd 4 * 4 ( $ ctx ) , % x #$H4
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Ldo_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
$ code . = << ___ if ( ! $ kernel && $ avx > 2 ) ;
2020-01-05 22:40:47 -05:00
cmp \ $ 512 , $ len
jb . Lskip_avx512
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
and % r11d , % r9d
test \ $ ` 1 << 16 ` , % r9d # check for AVX512F
2020-01-05 22:40:47 -05:00
jnz . Lblocks_avx512
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Lskip_avx512 $ suffix:
___
$ code . = << ___ if ( $ avx > 2 && $ avx512 && $ kernel ) ;
cmp \ $ 512 , $ len
jae . Lblocks_avx512
2020-01-05 22:40:47 -05:00
___
$ code . = << ___ if ( ! $ win64 ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
lea 8 ( % rsp ) , % r10
. cfi_def_cfa_register % r10
2020-01-05 22:40:47 -05:00
sub \ $ 0 x128 , % rsp
___
$ code . = << ___ if ( $ win64 ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
lea 8 ( % rsp ) , % r10
2020-01-05 22:40:47 -05:00
sub \ $ 0 x1c8 , % rsp
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
vmovdqa % xmm6 , - 0xb0 ( % r10 )
vmovdqa % xmm7 , - 0xa0 ( % r10 )
vmovdqa % xmm8 , - 0x90 ( % r10 )
vmovdqa % xmm9 , - 0x80 ( % r10 )
vmovdqa % xmm10 , - 0x70 ( % r10 )
vmovdqa % xmm11 , - 0x60 ( % r10 )
vmovdqa % xmm12 , - 0x50 ( % r10 )
vmovdqa % xmm13 , - 0x40 ( % r10 )
vmovdqa % xmm14 , - 0x30 ( % r10 )
vmovdqa % xmm15 , - 0x20 ( % r10 )
. Ldo_avx2_body $ suffix:
2020-01-05 22:40:47 -05:00
___
$ code . = << ___ ;
lea . Lconst ( % rip ) , % rcx
lea 48 + 64 ( $ ctx ) , $ ctx # size optimization
vmovdqa 96 ( % rcx ) , $ T0 # .Lpermd_avx2
# expand and copy pre-calculated table to stack
vmovdqu `16*0-64` ( $ ctx ) , % x #$T2
and \ $ - 512 , % rsp
vmovdqu `16*1-64` ( $ ctx ) , % x #$T3
vmovdqu `16*2-64` ( $ ctx ) , % x #$T4
vmovdqu `16*3-64` ( $ ctx ) , % x #$D0
vmovdqu `16*4-64` ( $ ctx ) , % x #$D1
vmovdqu `16*5-64` ( $ ctx ) , % x #$D2
lea 0x90 ( % rsp ) , % rax # size optimization
vmovdqu `16*6-64` ( $ ctx ) , % x #$D3
vpermd $ T2 , $ T0 , $ T2 # 00003412 -> 14243444
vmovdqu `16*7-64` ( $ ctx ) , % x #$D4
vpermd $ T3 , $ T0 , $ T3
vmovdqu `16*8-64` ( $ ctx ) , % x #$MASK
vpermd $ T4 , $ T0 , $ T4
vmovdqa $ T2 , 0x00 ( % rsp )
vpermd $ D0 , $ T0 , $ D0
vmovdqa $ T3 , 0x20 - 0x90 ( % rax )
vpermd $ D1 , $ T0 , $ D1
vmovdqa $ T4 , 0x40 - 0x90 ( % rax )
vpermd $ D2 , $ T0 , $ D2
vmovdqa $ D0 , 0x60 - 0x90 ( % rax )
vpermd $ D3 , $ T0 , $ D3
vmovdqa $ D1 , 0x80 - 0x90 ( % rax )
vpermd $ D4 , $ T0 , $ D4
vmovdqa $ D2 , 0xa0 - 0x90 ( % rax )
vpermd $ MASK , $ T0 , $ MASK
vmovdqa $ D3 , 0xc0 - 0x90 ( % rax )
vmovdqa $ D4 , 0xe0 - 0x90 ( % rax )
vmovdqa $ MASK , 0x100 - 0x90 ( % rax )
vmovdqa 64 ( % rcx ) , $ MASK # .Lmask26
################################################################
# load input
vmovdqu 16 * 0 ( $ inp ) , % x #$T0
vmovdqu 16 * 1 ( $ inp ) , % x #$T1
vinserti128 \ $ 1 , 16 * 2 ( $ inp ) , $ T0 , $ T0
vinserti128 \ $ 1 , 16 * 3 ( $ inp ) , $ T1 , $ T1
lea 16 * 4 ( $ inp ) , $ inp
vpsrldq \ $ 6 , $ T0 , $ T2 # splat input
vpsrldq \ $ 6 , $ T1 , $ T3
vpunpckhqdq $ T1 , $ T0 , $ T4 # 4
vpunpcklqdq $ T3 , $ T2 , $ T2 # 2:3
vpunpcklqdq $ T1 , $ T0 , $ T0 # 0:1
vpsrlq \ $ 30 , $ T2 , $ T3
vpsrlq \ $ 4 , $ T2 , $ T2
vpsrlq \ $ 26 , $ T0 , $ T1
vpsrlq \ $ 40 , $ T4 , $ T4 # 4
vpand $ MASK , $ T2 , $ T2 # 2
vpand $ MASK , $ T0 , $ T0 # 0
vpand $ MASK , $ T1 , $ T1 # 1
vpand $ MASK , $ T3 , $ T3 # 3
vpor 32 ( % rcx ) , $ T4 , $ T4 # padbit, yes, always
vpaddq $ H2 , $ T2 , $ H2 # accumulate input
sub \ $ 6 4 , $ len
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jz . Ltail_avx2 $ suffix
jmp . Loop_avx2 $ suffix
2020-01-05 22:40:47 -05:00
. align 32
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Loop_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
################################################################
# ((inp[0]*r^4+inp[4])*r^4+inp[ 8])*r^4
# ((inp[1]*r^4+inp[5])*r^4+inp[ 9])*r^3
# ((inp[2]*r^4+inp[6])*r^4+inp[10])*r^2
# ((inp[3]*r^4+inp[7])*r^4+inp[11])*r^1
# \________/\__________/
################################################################
#vpaddq $H2,$T2,$H2 # accumulate input
vpaddq $ H0 , $ T0 , $ H0
vmovdqa `32*0` ( % rsp ) , $ T0 # r0^4
vpaddq $ H1 , $ T1 , $ H1
vmovdqa `32*1` ( % rsp ) , $ T1 # r1^4
vpaddq $ H3 , $ T3 , $ H3
vmovdqa `32*3` ( % rsp ) , $ T2 # r2^4
vpaddq $ H4 , $ T4 , $ H4
vmovdqa `32*6-0x90` ( % rax ) , $ T3 # s3^4
vmovdqa `32*8-0x90` ( % rax ) , $ S4 # s4^4
# d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4
# d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4
# d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4
# d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4
# d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
#
# however, as h2 is "chronologically" first one available pull
# corresponding operations up, so it's
#
# d4 = h2*r2 + h4*r0 + h3*r1 + h1*r3 + h0*r4
# d3 = h2*r1 + h3*r0 + h1*r2 + h0*r3 + h4*5*r4
# d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4
# d1 = h2*5*r4 + h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3
# d0 = h2*5*r3 + h0*r0 + h4*5*r1 + h3*5*r2 + h1*5*r4
vpmuludq $ H2 , $ T0 , $ D2 # d2 = h2*r0
vpmuludq $ H2 , $ T1 , $ D3 # d3 = h2*r1
vpmuludq $ H2 , $ T2 , $ D4 # d4 = h2*r2
vpmuludq $ H2 , $ T3 , $ D0 # d0 = h2*s3
vpmuludq $ H2 , $ S4 , $ D1 # d1 = h2*s4
vpmuludq $ H0 , $ T1 , $ T4 # h0*r1
vpmuludq $ H1 , $ T1 , $ H2 # h1*r1, borrow $H2 as temp
vpaddq $ T4 , $ D1 , $ D1 # d1 += h0*r1
vpaddq $ H2 , $ D2 , $ D2 # d2 += h1*r1
vpmuludq $ H3 , $ T1 , $ T4 # h3*r1
vpmuludq `32*2` ( % rsp ) , $ H4 , $ H2 # h4*s1
vpaddq $ T4 , $ D4 , $ D4 # d4 += h3*r1
vpaddq $ H2 , $ D0 , $ D0 # d0 += h4*s1
vmovdqa `32*4-0x90` ( % rax ) , $ T1 # s2
vpmuludq $ H0 , $ T0 , $ T4 # h0*r0
vpmuludq $ H1 , $ T0 , $ H2 # h1*r0
vpaddq $ T4 , $ D0 , $ D0 # d0 += h0*r0
vpaddq $ H2 , $ D1 , $ D1 # d1 += h1*r0
vpmuludq $ H3 , $ T0 , $ T4 # h3*r0
vpmuludq $ H4 , $ T0 , $ H2 # h4*r0
vmovdqu 16 * 0 ( $ inp ) , % x #$T0 # load input
vpaddq $ T4 , $ D3 , $ D3 # d3 += h3*r0
vpaddq $ H2 , $ D4 , $ D4 # d4 += h4*r0
vinserti128 \ $ 1 , 16 * 2 ( $ inp ) , $ T0 , $ T0
vpmuludq $ H3 , $ T1 , $ T4 # h3*s2
vpmuludq $ H4 , $ T1 , $ H2 # h4*s2
vmovdqu 16 * 1 ( $ inp ) , % x #$T1
vpaddq $ T4 , $ D0 , $ D0 # d0 += h3*s2
vpaddq $ H2 , $ D1 , $ D1 # d1 += h4*s2
vmovdqa `32*5-0x90` ( % rax ) , $ H2 # r3
vpmuludq $ H1 , $ T2 , $ T4 # h1*r2
vpmuludq $ H0 , $ T2 , $ T2 # h0*r2
vpaddq $ T4 , $ D3 , $ D3 # d3 += h1*r2
vpaddq $ T2 , $ D2 , $ D2 # d2 += h0*r2
vinserti128 \ $ 1 , 16 * 3 ( $ inp ) , $ T1 , $ T1
lea 16 * 4 ( $ inp ) , $ inp
vpmuludq $ H1 , $ H2 , $ T4 # h1*r3
vpmuludq $ H0 , $ H2 , $ H2 # h0*r3
vpsrldq \ $ 6 , $ T0 , $ T2 # splat input
vpaddq $ T4 , $ D4 , $ D4 # d4 += h1*r3
vpaddq $ H2 , $ D3 , $ D3 # d3 += h0*r3
vpmuludq $ H3 , $ T3 , $ T4 # h3*s3
vpmuludq $ H4 , $ T3 , $ H2 # h4*s3
vpsrldq \ $ 6 , $ T1 , $ T3
vpaddq $ T4 , $ D1 , $ D1 # d1 += h3*s3
vpaddq $ H2 , $ D2 , $ D2 # d2 += h4*s3
vpunpckhqdq $ T1 , $ T0 , $ T4 # 4
vpmuludq $ H3 , $ S4 , $ H3 # h3*s4
vpmuludq $ H4 , $ S4 , $ H4 # h4*s4
vpunpcklqdq $ T1 , $ T0 , $ T0 # 0:1
vpaddq $ H3 , $ D2 , $ H2 # h2 = d2 + h3*r4
vpaddq $ H4 , $ D3 , $ H3 # h3 = d3 + h4*r4
vpunpcklqdq $ T3 , $ T2 , $ T3 # 2:3
vpmuludq `32*7-0x90` ( % rax ) , $ H0 , $ H4 # h0*r4
vpmuludq $ H1 , $ S4 , $ H0 # h1*s4
vmovdqa 64 ( % rcx ) , $ MASK # .Lmask26
vpaddq $ H4 , $ D4 , $ H4 # h4 = d4 + h0*r4
vpaddq $ H0 , $ D0 , $ H0 # h0 = d0 + h1*s4
################################################################
# lazy reduction (interleaved with tail of input splat)
vpsrlq \ $ 26 , $ H3 , $ D3
vpand $ MASK , $ H3 , $ H3
vpaddq $ D3 , $ H4 , $ H4 # h3 -> h4
vpsrlq \ $ 26 , $ H0 , $ D0
vpand $ MASK , $ H0 , $ H0
vpaddq $ D0 , $ D1 , $ H1 # h0 -> h1
vpsrlq \ $ 26 , $ H4 , $ D4
vpand $ MASK , $ H4 , $ H4
vpsrlq \ $ 4 , $ T3 , $ T2
vpsrlq \ $ 26 , $ H1 , $ D1
vpand $ MASK , $ H1 , $ H1
vpaddq $ D1 , $ H2 , $ H2 # h1 -> h2
vpaddq $ D4 , $ H0 , $ H0
vpsllq \ $ 2 , $ D4 , $ D4
vpaddq $ D4 , $ H0 , $ H0 # h4 -> h0
vpand $ MASK , $ T2 , $ T2 # 2
vpsrlq \ $ 26 , $ T0 , $ T1
vpsrlq \ $ 26 , $ H2 , $ D2
vpand $ MASK , $ H2 , $ H2
vpaddq $ D2 , $ H3 , $ H3 # h2 -> h3
vpaddq $ T2 , $ H2 , $ H2 # modulo-scheduled
vpsrlq \ $ 30 , $ T3 , $ T3
vpsrlq \ $ 26 , $ H0 , $ D0
vpand $ MASK , $ H0 , $ H0
vpaddq $ D0 , $ H1 , $ H1 # h0 -> h1
vpsrlq \ $ 40 , $ T4 , $ T4 # 4
vpsrlq \ $ 26 , $ H3 , $ D3
vpand $ MASK , $ H3 , $ H3
vpaddq $ D3 , $ H4 , $ H4 # h3 -> h4
vpand $ MASK , $ T0 , $ T0 # 0
vpand $ MASK , $ T1 , $ T1 # 1
vpand $ MASK , $ T3 , $ T3 # 3
vpor 32 ( % rcx ) , $ T4 , $ T4 # padbit, yes, always
sub \ $ 6 4 , $ len
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jnz . Loop_avx2 $ suffix
2020-01-05 22:40:47 -05:00
. byte 0x66 , 0x90
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. Ltail_avx2 $ suffix:
2020-01-05 22:40:47 -05:00
################################################################
# while above multiplications were by r^4 in all lanes, in last
# iteration we multiply least significant lane by r^4 and most
# significant one by r, so copy of above except that references
# to the precomputed table are displaced by 4...
#vpaddq $H2,$T2,$H2 # accumulate input
vpaddq $ H0 , $ T0 , $ H0
vmovdqu `32*0+4` ( % rsp ) , $ T0 # r0^4
vpaddq $ H1 , $ T1 , $ H1
vmovdqu `32*1+4` ( % rsp ) , $ T1 # r1^4
vpaddq $ H3 , $ T3 , $ H3
vmovdqu `32*3+4` ( % rsp ) , $ T2 # r2^4
vpaddq $ H4 , $ T4 , $ H4
vmovdqu `32*6+4-0x90` ( % rax ) , $ T3 # s3^4
vmovdqu `32*8+4-0x90` ( % rax ) , $ S4 # s4^4
vpmuludq $ H2 , $ T0 , $ D2 # d2 = h2*r0
vpmuludq $ H2 , $ T1 , $ D3 # d3 = h2*r1
vpmuludq $ H2 , $ T2 , $ D4 # d4 = h2*r2
vpmuludq $ H2 , $ T3 , $ D0 # d0 = h2*s3
vpmuludq $ H2 , $ S4 , $ D1 # d1 = h2*s4
vpmuludq $ H0 , $ T1 , $ T4 # h0*r1
vpmuludq $ H1 , $ T1 , $ H2 # h1*r1
vpaddq $ T4 , $ D1 , $ D1 # d1 += h0*r1
vpaddq $ H2 , $ D2 , $ D2 # d2 += h1*r1
vpmuludq $ H3 , $ T1 , $ T4 # h3*r1
vpmuludq `32*2+4` ( % rsp ) , $ H4 , $ H2 # h4*s1
vpaddq $ T4 , $ D4 , $ D4 # d4 += h3*r1
vpaddq $ H2 , $ D0 , $ D0 # d0 += h4*s1
vpmuludq $ H0 , $ T0 , $ T4 # h0*r0
vpmuludq $ H1 , $ T0 , $ H2 # h1*r0
vpaddq $ T4 , $ D0 , $ D0 # d0 += h0*r0
vmovdqu `32*4+4-0x90` ( % rax ) , $ T1 # s2
vpaddq $ H2 , $ D1 , $ D1 # d1 += h1*r0
vpmuludq $ H3 , $ T0 , $ T4 # h3*r0
vpmuludq $ H4 , $ T0 , $ H2 # h4*r0
vpaddq $ T4 , $ D3 , $ D3 # d3 += h3*r0
vpaddq $ H2 , $ D4 , $ D4 # d4 += h4*r0
vpmuludq $ H3 , $ T1 , $ T4 # h3*s2
vpmuludq $ H4 , $ T1 , $ H2 # h4*s2
vpaddq $ T4 , $ D0 , $ D0 # d0 += h3*s2
vpaddq $ H2 , $ D1 , $ D1 # d1 += h4*s2
vmovdqu `32*5+4-0x90` ( % rax ) , $ H2 # r3
vpmuludq $ H1 , $ T2 , $ T4 # h1*r2
vpmuludq $ H0 , $ T2 , $ T2 # h0*r2
vpaddq $ T4 , $ D3 , $ D3 # d3 += h1*r2
vpaddq $ T2 , $ D2 , $ D2 # d2 += h0*r2
vpmuludq $ H1 , $ H2 , $ T4 # h1*r3
vpmuludq $ H0 , $ H2 , $ H2 # h0*r3
vpaddq $ T4 , $ D4 , $ D4 # d4 += h1*r3
vpaddq $ H2 , $ D3 , $ D3 # d3 += h0*r3
vpmuludq $ H3 , $ T3 , $ T4 # h3*s3
vpmuludq $ H4 , $ T3 , $ H2 # h4*s3
vpaddq $ T4 , $ D1 , $ D1 # d1 += h3*s3
vpaddq $ H2 , $ D2 , $ D2 # d2 += h4*s3
vpmuludq $ H3 , $ S4 , $ H3 # h3*s4
vpmuludq $ H4 , $ S4 , $ H4 # h4*s4
vpaddq $ H3 , $ D2 , $ H2 # h2 = d2 + h3*r4
vpaddq $ H4 , $ D3 , $ H3 # h3 = d3 + h4*r4
vpmuludq `32*7+4-0x90` ( % rax ) , $ H0 , $ H4 # h0*r4
vpmuludq $ H1 , $ S4 , $ H0 # h1*s4
vmovdqa 64 ( % rcx ) , $ MASK # .Lmask26
vpaddq $ H4 , $ D4 , $ H4 # h4 = d4 + h0*r4
vpaddq $ H0 , $ D0 , $ H0 # h0 = d0 + h1*s4
################################################################
# horizontal addition
vpsrldq \ $ 8 , $ D1 , $ T1
vpsrldq \ $ 8 , $ H2 , $ T2
vpsrldq \ $ 8 , $ H3 , $ T3
vpsrldq \ $ 8 , $ H4 , $ T4
vpsrldq \ $ 8 , $ H0 , $ T0
vpaddq $ T1 , $ D1 , $ D1
vpaddq $ T2 , $ H2 , $ H2
vpaddq $ T3 , $ H3 , $ H3
vpaddq $ T4 , $ H4 , $ H4
vpaddq $ T0 , $ H0 , $ H0
vpermq \ $ 0x2 , $ H3 , $ T3
vpermq \ $ 0x2 , $ H4 , $ T4
vpermq \ $ 0x2 , $ H0 , $ T0
vpermq \ $ 0x2 , $ D1 , $ T1
vpermq \ $ 0x2 , $ H2 , $ T2
vpaddq $ T3 , $ H3 , $ H3
vpaddq $ T4 , $ H4 , $ H4
vpaddq $ T0 , $ H0 , $ H0
vpaddq $ T1 , $ D1 , $ D1
vpaddq $ T2 , $ H2 , $ H2
################################################################
# lazy reduction
vpsrlq \ $ 26 , $ H3 , $ D3
vpand $ MASK , $ H3 , $ H3
vpaddq $ D3 , $ H4 , $ H4 # h3 -> h4
vpsrlq \ $ 26 , $ H0 , $ D0
vpand $ MASK , $ H0 , $ H0
vpaddq $ D0 , $ D1 , $ H1 # h0 -> h1
vpsrlq \ $ 26 , $ H4 , $ D4
vpand $ MASK , $ H4 , $ H4
vpsrlq \ $ 26 , $ H1 , $ D1
vpand $ MASK , $ H1 , $ H1
vpaddq $ D1 , $ H2 , $ H2 # h1 -> h2
vpaddq $ D4 , $ H0 , $ H0
vpsllq \ $ 2 , $ D4 , $ D4
vpaddq $ D4 , $ H0 , $ H0 # h4 -> h0
vpsrlq \ $ 26 , $ H2 , $ D2
vpand $ MASK , $ H2 , $ H2
vpaddq $ D2 , $ H3 , $ H3 # h2 -> h3
vpsrlq \ $ 26 , $ H0 , $ D0
vpand $ MASK , $ H0 , $ H0
vpaddq $ D0 , $ H1 , $ H1 # h0 -> h1
vpsrlq \ $ 26 , $ H3 , $ D3
vpand $ MASK , $ H3 , $ H3
vpaddq $ D3 , $ H4 , $ H4 # h3 -> h4
vmovd % x #$H0,`4*0-48-64`($ctx)# save partially reduced
vmovd % x #$H1,`4*1-48-64`($ctx)
vmovd % x #$H2,`4*2-48-64`($ctx)
vmovd % x #$H3,`4*3-48-64`($ctx)
vmovd % x #$H4,`4*4-48-64`($ctx)
___
$ code . = << ___ if ( $ win64 ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
vmovdqa - 0xb0 ( % r10 ) , % xmm6
vmovdqa - 0xa0 ( % r10 ) , % xmm7
vmovdqa - 0x90 ( % r10 ) , % xmm8
vmovdqa - 0x80 ( % r10 ) , % xmm9
vmovdqa - 0x70 ( % r10 ) , % xmm10
vmovdqa - 0x60 ( % r10 ) , % xmm11
vmovdqa - 0x50 ( % r10 ) , % xmm12
vmovdqa - 0x40 ( % r10 ) , % xmm13
vmovdqa - 0x30 ( % r10 ) , % xmm14
vmovdqa - 0x20 ( % r10 ) , % xmm15
lea - 8 ( % r10 ) , % rsp
. Ldo_avx2_epilogue $ suffix:
2020-01-05 22:40:47 -05:00
___
$ code . = << ___ if ( ! $ win64 ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
lea - 8 ( % r10 ) , % rsp
. cfi_def_cfa_register % rsp
2020-01-05 22:40:47 -05:00
___
$ code . = << ___ ;
vzeroupper
ret
. cfi_endproc
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
if ( $ avx > 2 && $ avx512 ) {
2020-01-05 22:40:47 -05:00
my ( $ R0 , $ R1 , $ R2 , $ R3 , $ R4 , $ S1 , $ S2 , $ S3 , $ S4 ) = map ( "%zmm$_" , ( 16 .. 24 ) ) ;
my ( $ M0 , $ M1 , $ M2 , $ M3 , $ M4 ) = map ( "%zmm$_" , ( 25 .. 29 ) ) ;
my $ PADBIT = "%zmm30" ;
map ( s/%y/%z/ , ( $ T4 , $ T0 , $ T1 , $ T2 , $ T3 ) ) ; # switch to %zmm domain
map ( s/%y/%z/ , ( $ D0 , $ D1 , $ D2 , $ D3 , $ D4 ) ) ;
map ( s/%y/%z/ , ( $ H0 , $ H1 , $ H2 , $ H3 , $ H4 ) ) ;
map ( s/%y/%z/ , ( $ MASK ) ) ;
$ code . = << ___ ;
. cfi_startproc
. Lblocks_avx512:
mov \ $ 15 , % eax
kmovw % eax , % k2
___
$ code . = << ___ if ( ! $ win64 ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
lea 8 ( % rsp ) , % r10
. cfi_def_cfa_register % r10
2020-01-05 22:40:47 -05:00
sub \ $ 0 x128 , % rsp
___
$ code . = << ___ if ( $ win64 ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
lea 8 ( % rsp ) , % r10
2020-01-05 22:40:47 -05:00
sub \ $ 0 x1c8 , % rsp
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
vmovdqa % xmm6 , - 0xb0 ( % r10 )
vmovdqa % xmm7 , - 0xa0 ( % r10 )
vmovdqa % xmm8 , - 0x90 ( % r10 )
vmovdqa % xmm9 , - 0x80 ( % r10 )
vmovdqa % xmm10 , - 0x70 ( % r10 )
vmovdqa % xmm11 , - 0x60 ( % r10 )
vmovdqa % xmm12 , - 0x50 ( % r10 )
vmovdqa % xmm13 , - 0x40 ( % r10 )
vmovdqa % xmm14 , - 0x30 ( % r10 )
vmovdqa % xmm15 , - 0x20 ( % r10 )
2020-01-05 22:40:47 -05:00
. Ldo_avx512_body:
___
$ code . = << ___ ;
lea . Lconst ( % rip ) , % rcx
lea 48 + 64 ( $ ctx ) , $ ctx # size optimization
vmovdqa 96 ( % rcx ) , % y #$T2 # .Lpermd_avx2
# expand pre-calculated table
vmovdqu `16*0-64` ( $ ctx ) , % x #$D0 # will become expanded ${R0}
and \ $ - 512 , % rsp
vmovdqu `16*1-64` ( $ ctx ) , % x #$D1 # will become ... ${R1}
mov \ $ 0x20 , % rax
vmovdqu `16*2-64` ( $ ctx ) , % x #$T0 # ... ${S1}
vmovdqu `16*3-64` ( $ ctx ) , % x #$D2 # ... ${R2}
vmovdqu `16*4-64` ( $ ctx ) , % x #$T1 # ... ${S2}
vmovdqu `16*5-64` ( $ ctx ) , % x #$D3 # ... ${R3}
vmovdqu `16*6-64` ( $ ctx ) , % x #$T3 # ... ${S3}
vmovdqu `16*7-64` ( $ ctx ) , % x #$D4 # ... ${R4}
vmovdqu `16*8-64` ( $ ctx ) , % x #$T4 # ... ${S4}
vpermd $ D0 , $ T2 , $ R0 # 00003412 -> 14243444
vpbroadcastq 64 ( % rcx ) , $ MASK # .Lmask26
vpermd $ D1 , $ T2 , $ R1
vpermd $ T0 , $ T2 , $ S1
vpermd $ D2 , $ T2 , $ R2
vmovdqa64 $ R0 , 0x00 ( % rsp ) { % k2 } # save in case $len%128 != 0
vpsrlq \ $ 32 , $ R0 , $ T0 # 14243444 -> 01020304
vpermd $ T1 , $ T2 , $ S2
vmovdqu64 $ R1 , 0x00 ( % rsp , % rax ) { % k2 }
vpsrlq \ $ 32 , $ R1 , $ T1
vpermd $ D3 , $ T2 , $ R3
vmovdqa64 $ S1 , 0x40 ( % rsp ) { % k2 }
vpermd $ T3 , $ T2 , $ S3
vpermd $ D4 , $ T2 , $ R4
vmovdqu64 $ R2 , 0x40 ( % rsp , % rax ) { % k2 }
vpermd $ T4 , $ T2 , $ S4
vmovdqa64 $ S2 , 0x80 ( % rsp ) { % k2 }
vmovdqu64 $ R3 , 0x80 ( % rsp , % rax ) { % k2 }
vmovdqa64 $ S3 , 0xc0 ( % rsp ) { % k2 }
vmovdqu64 $ R4 , 0xc0 ( % rsp , % rax ) { % k2 }
vmovdqa64 $ S4 , 0x100 ( % rsp ) { % k2 }
################################################################
# calculate 5th through 8th powers of the key
#
# d0 = r0'*r0 + r1'*5*r4 + r2'*5*r3 + r3'*5*r2 + r4'*5*r1
# d1 = r0'*r1 + r1'*r0 + r2'*5*r4 + r3'*5*r3 + r4'*5*r2
# d2 = r0'*r2 + r1'*r1 + r2'*r0 + r3'*5*r4 + r4'*5*r3
# d3 = r0'*r3 + r1'*r2 + r2'*r1 + r3'*r0 + r4'*5*r4
# d4 = r0'*r4 + r1'*r3 + r2'*r2 + r3'*r1 + r4'*r0
vpmuludq $ T0 , $ R0 , $ D0 # d0 = r0'*r0
vpmuludq $ T0 , $ R1 , $ D1 # d1 = r0'*r1
vpmuludq $ T0 , $ R2 , $ D2 # d2 = r0'*r2
vpmuludq $ T0 , $ R3 , $ D3 # d3 = r0'*r3
vpmuludq $ T0 , $ R4 , $ D4 # d4 = r0'*r4
vpsrlq \ $ 32 , $ R2 , $ T2
vpmuludq $ T1 , $ S4 , $ M0
vpmuludq $ T1 , $ R0 , $ M1
vpmuludq $ T1 , $ R1 , $ M2
vpmuludq $ T1 , $ R2 , $ M3
vpmuludq $ T1 , $ R3 , $ M4
vpsrlq \ $ 32 , $ R3 , $ T3
vpaddq $ M0 , $ D0 , $ D0 # d0 += r1'*5*r4
vpaddq $ M1 , $ D1 , $ D1 # d1 += r1'*r0
vpaddq $ M2 , $ D2 , $ D2 # d2 += r1'*r1
vpaddq $ M3 , $ D3 , $ D3 # d3 += r1'*r2
vpaddq $ M4 , $ D4 , $ D4 # d4 += r1'*r3
vpmuludq $ T2 , $ S3 , $ M0
vpmuludq $ T2 , $ S4 , $ M1
vpmuludq $ T2 , $ R1 , $ M3
vpmuludq $ T2 , $ R2 , $ M4
vpmuludq $ T2 , $ R0 , $ M2
vpsrlq \ $ 32 , $ R4 , $ T4
vpaddq $ M0 , $ D0 , $ D0 # d0 += r2'*5*r3
vpaddq $ M1 , $ D1 , $ D1 # d1 += r2'*5*r4
vpaddq $ M3 , $ D3 , $ D3 # d3 += r2'*r1
vpaddq $ M4 , $ D4 , $ D4 # d4 += r2'*r2
vpaddq $ M2 , $ D2 , $ D2 # d2 += r2'*r0
vpmuludq $ T3 , $ S2 , $ M0
vpmuludq $ T3 , $ R0 , $ M3
vpmuludq $ T3 , $ R1 , $ M4
vpmuludq $ T3 , $ S3 , $ M1
vpmuludq $ T3 , $ S4 , $ M2
vpaddq $ M0 , $ D0 , $ D0 # d0 += r3'*5*r2
vpaddq $ M3 , $ D3 , $ D3 # d3 += r3'*r0
vpaddq $ M4 , $ D4 , $ D4 # d4 += r3'*r1
vpaddq $ M1 , $ D1 , $ D1 # d1 += r3'*5*r3
vpaddq $ M2 , $ D2 , $ D2 # d2 += r3'*5*r4
vpmuludq $ T4 , $ S4 , $ M3
vpmuludq $ T4 , $ R0 , $ M4
vpmuludq $ T4 , $ S1 , $ M0
vpmuludq $ T4 , $ S2 , $ M1
vpmuludq $ T4 , $ S3 , $ M2
vpaddq $ M3 , $ D3 , $ D3 # d3 += r2'*5*r4
vpaddq $ M4 , $ D4 , $ D4 # d4 += r2'*r0
vpaddq $ M0 , $ D0 , $ D0 # d0 += r2'*5*r1
vpaddq $ M1 , $ D1 , $ D1 # d1 += r2'*5*r2
vpaddq $ M2 , $ D2 , $ D2 # d2 += r2'*5*r3
################################################################
# load input
vmovdqu64 16 * 0 ( $ inp ) , % z #$T3
vmovdqu64 16 * 4 ( $ inp ) , % z #$T4
lea 16 * 8 ( $ inp ) , $ inp
################################################################
# lazy reduction
vpsrlq \ $ 26 , $ D3 , $ M3
vpandq $ MASK , $ D3 , $ D3
vpaddq $ M3 , $ D4 , $ D4 # d3 -> d4
vpsrlq \ $ 26 , $ D0 , $ M0
vpandq $ MASK , $ D0 , $ D0
vpaddq $ M0 , $ D1 , $ D1 # d0 -> d1
vpsrlq \ $ 26 , $ D4 , $ M4
vpandq $ MASK , $ D4 , $ D4
vpsrlq \ $ 26 , $ D1 , $ M1
vpandq $ MASK , $ D1 , $ D1
vpaddq $ M1 , $ D2 , $ D2 # d1 -> d2
vpaddq $ M4 , $ D0 , $ D0
vpsllq \ $ 2 , $ M4 , $ M4
vpaddq $ M4 , $ D0 , $ D0 # d4 -> d0
vpsrlq \ $ 26 , $ D2 , $ M2
vpandq $ MASK , $ D2 , $ D2
vpaddq $ M2 , $ D3 , $ D3 # d2 -> d3
vpsrlq \ $ 26 , $ D0 , $ M0
vpandq $ MASK , $ D0 , $ D0
vpaddq $ M0 , $ D1 , $ D1 # d0 -> d1
vpsrlq \ $ 26 , $ D3 , $ M3
vpandq $ MASK , $ D3 , $ D3
vpaddq $ M3 , $ D4 , $ D4 # d3 -> d4
################################################################
# at this point we have 14243444 in $R0-$S4 and 05060708 in
# $D0-$D4, ...
vpunpcklqdq $ T4 , $ T3 , $ T0 # transpose input
vpunpckhqdq $ T4 , $ T3 , $ T4
# ... since input 64-bit lanes are ordered as 73625140, we could
# "vperm" it to 76543210 (here and in each loop iteration), *or*
# we could just flow along, hence the goal for $R0-$S4 is
# 1858286838784888 ...
vmovdqa32 128 ( % rcx ) , $ M0 # .Lpermd_avx512:
mov \ $ 0x7777 , % eax
kmovw % eax , % k1
vpermd $ R0 , $ M0 , $ R0 # 14243444 -> 1---2---3---4---
vpermd $ R1 , $ M0 , $ R1
vpermd $ R2 , $ M0 , $ R2
vpermd $ R3 , $ M0 , $ R3
vpermd $ R4 , $ M0 , $ R4
vpermd $ D0 , $ M0 , $ { R0 } { % k1 } # 05060708 -> 1858286838784888
vpermd $ D1 , $ M0 , $ { R1 } { % k1 }
vpermd $ D2 , $ M0 , $ { R2 } { % k1 }
vpermd $ D3 , $ M0 , $ { R3 } { % k1 }
vpermd $ D4 , $ M0 , $ { R4 } { % k1 }
vpslld \ $ 2 , $ R1 , $ S1 # *5
vpslld \ $ 2 , $ R2 , $ S2
vpslld \ $ 2 , $ R3 , $ S3
vpslld \ $ 2 , $ R4 , $ S4
vpaddd $ R1 , $ S1 , $ S1
vpaddd $ R2 , $ S2 , $ S2
vpaddd $ R3 , $ S3 , $ S3
vpaddd $ R4 , $ S4 , $ S4
vpbroadcastq 32 ( % rcx ) , $ PADBIT # .L129
vpsrlq \ $ 52 , $ T0 , $ T2 # splat input
vpsllq \ $ 12 , $ T4 , $ T3
vporq $ T3 , $ T2 , $ T2
vpsrlq \ $ 26 , $ T0 , $ T1
vpsrlq \ $ 14 , $ T4 , $ T3
vpsrlq \ $ 40 , $ T4 , $ T4 # 4
vpandq $ MASK , $ T2 , $ T2 # 2
vpandq $ MASK , $ T0 , $ T0 # 0
#vpandq $MASK,$T1,$T1 # 1
#vpandq $MASK,$T3,$T3 # 3
#vporq $PADBIT,$T4,$T4 # padbit, yes, always
vpaddq $ H2 , $ T2 , $ H2 # accumulate input
sub \ $ 1 9 2 , $ len
jbe . Ltail_avx512
jmp . Loop_avx512
. align 32
. Loop_avx512:
################################################################
# ((inp[0]*r^8+inp[ 8])*r^8+inp[16])*r^8
# ((inp[1]*r^8+inp[ 9])*r^8+inp[17])*r^7
# ((inp[2]*r^8+inp[10])*r^8+inp[18])*r^6
# ((inp[3]*r^8+inp[11])*r^8+inp[19])*r^5
# ((inp[4]*r^8+inp[12])*r^8+inp[20])*r^4
# ((inp[5]*r^8+inp[13])*r^8+inp[21])*r^3
# ((inp[6]*r^8+inp[14])*r^8+inp[22])*r^2
# ((inp[7]*r^8+inp[15])*r^8+inp[23])*r^1
# \________/\___________/
################################################################
#vpaddq $H2,$T2,$H2 # accumulate input
# d4 = h4*r0 + h3*r1 + h2*r2 + h1*r3 + h0*r4
# d3 = h3*r0 + h2*r1 + h1*r2 + h0*r3 + h4*5*r4
# d2 = h2*r0 + h1*r1 + h0*r2 + h4*5*r3 + h3*5*r4
# d1 = h1*r0 + h0*r1 + h4*5*r2 + h3*5*r3 + h2*5*r4
# d0 = h0*r0 + h4*5*r1 + h3*5*r2 + h2*5*r3 + h1*5*r4
#
# however, as h2 is "chronologically" first one available pull
# corresponding operations up, so it's
#
# d3 = h2*r1 + h0*r3 + h1*r2 + h3*r0 + h4*5*r4
# d4 = h2*r2 + h0*r4 + h1*r3 + h3*r1 + h4*r0
# d0 = h2*5*r3 + h0*r0 + h1*5*r4 + h3*5*r2 + h4*5*r1
# d1 = h2*5*r4 + h0*r1 + h1*r0 + h3*5*r3 + h4*5*r2
# d2 = h2*r0 + h0*r2 + h1*r1 + h3*5*r4 + h4*5*r3
vpmuludq $ H2 , $ R1 , $ D3 # d3 = h2*r1
vpaddq $ H0 , $ T0 , $ H0
vpmuludq $ H2 , $ R2 , $ D4 # d4 = h2*r2
vpandq $ MASK , $ T1 , $ T1 # 1
vpmuludq $ H2 , $ S3 , $ D0 # d0 = h2*s3
vpandq $ MASK , $ T3 , $ T3 # 3
vpmuludq $ H2 , $ S4 , $ D1 # d1 = h2*s4
vporq $ PADBIT , $ T4 , $ T4 # padbit, yes, always
vpmuludq $ H2 , $ R0 , $ D2 # d2 = h2*r0
vpaddq $ H1 , $ T1 , $ H1 # accumulate input
vpaddq $ H3 , $ T3 , $ H3
vpaddq $ H4 , $ T4 , $ H4
vmovdqu64 16 * 0 ( $ inp ) , $ T3 # load input
vmovdqu64 16 * 4 ( $ inp ) , $ T4
lea 16 * 8 ( $ inp ) , $ inp
vpmuludq $ H0 , $ R3 , $ M3
vpmuludq $ H0 , $ R4 , $ M4
vpmuludq $ H0 , $ R0 , $ M0
vpmuludq $ H0 , $ R1 , $ M1
vpaddq $ M3 , $ D3 , $ D3 # d3 += h0*r3
vpaddq $ M4 , $ D4 , $ D4 # d4 += h0*r4
vpaddq $ M0 , $ D0 , $ D0 # d0 += h0*r0
vpaddq $ M1 , $ D1 , $ D1 # d1 += h0*r1
vpmuludq $ H1 , $ R2 , $ M3
vpmuludq $ H1 , $ R3 , $ M4
vpmuludq $ H1 , $ S4 , $ M0
vpmuludq $ H0 , $ R2 , $ M2
vpaddq $ M3 , $ D3 , $ D3 # d3 += h1*r2
vpaddq $ M4 , $ D4 , $ D4 # d4 += h1*r3
vpaddq $ M0 , $ D0 , $ D0 # d0 += h1*s4
vpaddq $ M2 , $ D2 , $ D2 # d2 += h0*r2
vpunpcklqdq $ T4 , $ T3 , $ T0 # transpose input
vpunpckhqdq $ T4 , $ T3 , $ T4
vpmuludq $ H3 , $ R0 , $ M3
vpmuludq $ H3 , $ R1 , $ M4
vpmuludq $ H1 , $ R0 , $ M1
vpmuludq $ H1 , $ R1 , $ M2
vpaddq $ M3 , $ D3 , $ D3 # d3 += h3*r0
vpaddq $ M4 , $ D4 , $ D4 # d4 += h3*r1
vpaddq $ M1 , $ D1 , $ D1 # d1 += h1*r0
vpaddq $ M2 , $ D2 , $ D2 # d2 += h1*r1
vpmuludq $ H4 , $ S4 , $ M3
vpmuludq $ H4 , $ R0 , $ M4
vpmuludq $ H3 , $ S2 , $ M0
vpmuludq $ H3 , $ S3 , $ M1
vpaddq $ M3 , $ D3 , $ D3 # d3 += h4*s4
vpmuludq $ H3 , $ S4 , $ M2
vpaddq $ M4 , $ D4 , $ D4 # d4 += h4*r0
vpaddq $ M0 , $ D0 , $ D0 # d0 += h3*s2
vpaddq $ M1 , $ D1 , $ D1 # d1 += h3*s3
vpaddq $ M2 , $ D2 , $ D2 # d2 += h3*s4
vpmuludq $ H4 , $ S1 , $ M0
vpmuludq $ H4 , $ S2 , $ M1
vpmuludq $ H4 , $ S3 , $ M2
vpaddq $ M0 , $ D0 , $ H0 # h0 = d0 + h4*s1
vpaddq $ M1 , $ D1 , $ H1 # h1 = d2 + h4*s2
vpaddq $ M2 , $ D2 , $ H2 # h2 = d3 + h4*s3
################################################################
# lazy reduction (interleaved with input splat)
vpsrlq \ $ 52 , $ T0 , $ T2 # splat input
vpsllq \ $ 12 , $ T4 , $ T3
vpsrlq \ $ 26 , $ D3 , $ H3
vpandq $ MASK , $ D3 , $ D3
vpaddq $ H3 , $ D4 , $ H4 # h3 -> h4
vporq $ T3 , $ T2 , $ T2
vpsrlq \ $ 26 , $ H0 , $ D0
vpandq $ MASK , $ H0 , $ H0
vpaddq $ D0 , $ H1 , $ H1 # h0 -> h1
vpandq $ MASK , $ T2 , $ T2 # 2
vpsrlq \ $ 26 , $ H4 , $ D4
vpandq $ MASK , $ H4 , $ H4
vpsrlq \ $ 26 , $ H1 , $ D1
vpandq $ MASK , $ H1 , $ H1
vpaddq $ D1 , $ H2 , $ H2 # h1 -> h2
vpaddq $ D4 , $ H0 , $ H0
vpsllq \ $ 2 , $ D4 , $ D4
vpaddq $ D4 , $ H0 , $ H0 # h4 -> h0
vpaddq $ T2 , $ H2 , $ H2 # modulo-scheduled
vpsrlq \ $ 26 , $ T0 , $ T1
vpsrlq \ $ 26 , $ H2 , $ D2
vpandq $ MASK , $ H2 , $ H2
vpaddq $ D2 , $ D3 , $ H3 # h2 -> h3
vpsrlq \ $ 14 , $ T4 , $ T3
vpsrlq \ $ 26 , $ H0 , $ D0
vpandq $ MASK , $ H0 , $ H0
vpaddq $ D0 , $ H1 , $ H1 # h0 -> h1
vpsrlq \ $ 40 , $ T4 , $ T4 # 4
vpsrlq \ $ 26 , $ H3 , $ D3
vpandq $ MASK , $ H3 , $ H3
vpaddq $ D3 , $ H4 , $ H4 # h3 -> h4
vpandq $ MASK , $ T0 , $ T0 # 0
#vpandq $MASK,$T1,$T1 # 1
#vpandq $MASK,$T3,$T3 # 3
#vporq $PADBIT,$T4,$T4 # padbit, yes, always
sub \ $ 1 2 8 , $ len
ja . Loop_avx512
. Ltail_avx512:
################################################################
# while above multiplications were by r^8 in all lanes, in last
# iteration we multiply least significant lane by r^8 and most
# significant one by r, that's why table gets shifted...
vpsrlq \ $ 32 , $ R0 , $ R0 # 0105020603070408
vpsrlq \ $ 32 , $ R1 , $ R1
vpsrlq \ $ 32 , $ R2 , $ R2
vpsrlq \ $ 32 , $ S3 , $ S3
vpsrlq \ $ 32 , $ S4 , $ S4
vpsrlq \ $ 32 , $ R3 , $ R3
vpsrlq \ $ 32 , $ R4 , $ R4
vpsrlq \ $ 32 , $ S1 , $ S1
vpsrlq \ $ 32 , $ S2 , $ S2
################################################################
# load either next or last 64 byte of input
lea ( $ inp , $ len ) , $ inp
#vpaddq $H2,$T2,$H2 # accumulate input
vpaddq $ H0 , $ T0 , $ H0
vpmuludq $ H2 , $ R1 , $ D3 # d3 = h2*r1
vpmuludq $ H2 , $ R2 , $ D4 # d4 = h2*r2
vpmuludq $ H2 , $ S3 , $ D0 # d0 = h2*s3
vpandq $ MASK , $ T1 , $ T1 # 1
vpmuludq $ H2 , $ S4 , $ D1 # d1 = h2*s4
vpandq $ MASK , $ T3 , $ T3 # 3
vpmuludq $ H2 , $ R0 , $ D2 # d2 = h2*r0
vporq $ PADBIT , $ T4 , $ T4 # padbit, yes, always
vpaddq $ H1 , $ T1 , $ H1 # accumulate input
vpaddq $ H3 , $ T3 , $ H3
vpaddq $ H4 , $ T4 , $ H4
vmovdqu 16 * 0 ( $ inp ) , % x #$T0
vpmuludq $ H0 , $ R3 , $ M3
vpmuludq $ H0 , $ R4 , $ M4
vpmuludq $ H0 , $ R0 , $ M0
vpmuludq $ H0 , $ R1 , $ M1
vpaddq $ M3 , $ D3 , $ D3 # d3 += h0*r3
vpaddq $ M4 , $ D4 , $ D4 # d4 += h0*r4
vpaddq $ M0 , $ D0 , $ D0 # d0 += h0*r0
vpaddq $ M1 , $ D1 , $ D1 # d1 += h0*r1
vmovdqu 16 * 1 ( $ inp ) , % x #$T1
vpmuludq $ H1 , $ R2 , $ M3
vpmuludq $ H1 , $ R3 , $ M4
vpmuludq $ H1 , $ S4 , $ M0
vpmuludq $ H0 , $ R2 , $ M2
vpaddq $ M3 , $ D3 , $ D3 # d3 += h1*r2
vpaddq $ M4 , $ D4 , $ D4 # d4 += h1*r3
vpaddq $ M0 , $ D0 , $ D0 # d0 += h1*s4
vpaddq $ M2 , $ D2 , $ D2 # d2 += h0*r2
vinserti128 \ $ 1 , 16 * 2 ( $ inp ) , % y #$T0,%y#$T0
vpmuludq $ H3 , $ R0 , $ M3
vpmuludq $ H3 , $ R1 , $ M4
vpmuludq $ H1 , $ R0 , $ M1
vpmuludq $ H1 , $ R1 , $ M2
vpaddq $ M3 , $ D3 , $ D3 # d3 += h3*r0
vpaddq $ M4 , $ D4 , $ D4 # d4 += h3*r1
vpaddq $ M1 , $ D1 , $ D1 # d1 += h1*r0
vpaddq $ M2 , $ D2 , $ D2 # d2 += h1*r1
vinserti128 \ $ 1 , 16 * 3 ( $ inp ) , % y #$T1,%y#$T1
vpmuludq $ H4 , $ S4 , $ M3
vpmuludq $ H4 , $ R0 , $ M4
vpmuludq $ H3 , $ S2 , $ M0
vpmuludq $ H3 , $ S3 , $ M1
vpmuludq $ H3 , $ S4 , $ M2
vpaddq $ M3 , $ D3 , $ H3 # h3 = d3 + h4*s4
vpaddq $ M4 , $ D4 , $ D4 # d4 += h4*r0
vpaddq $ M0 , $ D0 , $ D0 # d0 += h3*s2
vpaddq $ M1 , $ D1 , $ D1 # d1 += h3*s3
vpaddq $ M2 , $ D2 , $ D2 # d2 += h3*s4
vpmuludq $ H4 , $ S1 , $ M0
vpmuludq $ H4 , $ S2 , $ M1
vpmuludq $ H4 , $ S3 , $ M2
vpaddq $ M0 , $ D0 , $ H0 # h0 = d0 + h4*s1
vpaddq $ M1 , $ D1 , $ H1 # h1 = d2 + h4*s2
vpaddq $ M2 , $ D2 , $ H2 # h2 = d3 + h4*s3
################################################################
# horizontal addition
mov \ $ 1 , % eax
vpermq \ $ 0xb1 , $ H3 , $ D3
vpermq \ $ 0xb1 , $ D4 , $ H4
vpermq \ $ 0xb1 , $ H0 , $ D0
vpermq \ $ 0xb1 , $ H1 , $ D1
vpermq \ $ 0xb1 , $ H2 , $ D2
vpaddq $ D3 , $ H3 , $ H3
vpaddq $ D4 , $ H4 , $ H4
vpaddq $ D0 , $ H0 , $ H0
vpaddq $ D1 , $ H1 , $ H1
vpaddq $ D2 , $ H2 , $ H2
kmovw % eax , % k3
vpermq \ $ 0x2 , $ H3 , $ D3
vpermq \ $ 0x2 , $ H4 , $ D4
vpermq \ $ 0x2 , $ H0 , $ D0
vpermq \ $ 0x2 , $ H1 , $ D1
vpermq \ $ 0x2 , $ H2 , $ D2
vpaddq $ D3 , $ H3 , $ H3
vpaddq $ D4 , $ H4 , $ H4
vpaddq $ D0 , $ H0 , $ H0
vpaddq $ D1 , $ H1 , $ H1
vpaddq $ D2 , $ H2 , $ H2
vextracti64x4 \ $ 0x1 , $ H3 , % y #$D3
vextracti64x4 \ $ 0x1 , $ H4 , % y #$D4
vextracti64x4 \ $ 0x1 , $ H0 , % y #$D0
vextracti64x4 \ $ 0x1 , $ H1 , % y #$D1
vextracti64x4 \ $ 0x1 , $ H2 , % y #$D2
vpaddq $ D3 , $ H3 , $ { H3 } { % k3 } { z } # keep single qword in case
vpaddq $ D4 , $ H4 , $ { H4 } { % k3 } { z } # it's passed to .Ltail_avx2
vpaddq $ D0 , $ H0 , $ { H0 } { % k3 } { z }
vpaddq $ D1 , $ H1 , $ { H1 } { % k3 } { z }
vpaddq $ D2 , $ H2 , $ { H2 } { % k3 } { z }
___
map ( s/%z/%y/ , ( $ T0 , $ T1 , $ T2 , $ T3 , $ T4 , $ PADBIT ) ) ;
map ( s/%z/%y/ , ( $ H0 , $ H1 , $ H2 , $ H3 , $ H4 , $ D0 , $ D1 , $ D2 , $ D3 , $ D4 , $ MASK ) ) ;
$ code . = << ___ ;
################################################################
# lazy reduction (interleaved with input splat)
vpsrlq \ $ 26 , $ H3 , $ D3
vpand $ MASK , $ H3 , $ H3
vpsrldq \ $ 6 , $ T0 , $ T2 # splat input
vpsrldq \ $ 6 , $ T1 , $ T3
vpunpckhqdq $ T1 , $ T0 , $ T4 # 4
vpaddq $ D3 , $ H4 , $ H4 # h3 -> h4
vpsrlq \ $ 26 , $ H0 , $ D0
vpand $ MASK , $ H0 , $ H0
vpunpcklqdq $ T3 , $ T2 , $ T2 # 2:3
vpunpcklqdq $ T1 , $ T0 , $ T0 # 0:1
vpaddq $ D0 , $ H1 , $ H1 # h0 -> h1
vpsrlq \ $ 26 , $ H4 , $ D4
vpand $ MASK , $ H4 , $ H4
vpsrlq \ $ 26 , $ H1 , $ D1
vpand $ MASK , $ H1 , $ H1
vpsrlq \ $ 30 , $ T2 , $ T3
vpsrlq \ $ 4 , $ T2 , $ T2
vpaddq $ D1 , $ H2 , $ H2 # h1 -> h2
vpaddq $ D4 , $ H0 , $ H0
vpsllq \ $ 2 , $ D4 , $ D4
vpsrlq \ $ 26 , $ T0 , $ T1
vpsrlq \ $ 40 , $ T4 , $ T4 # 4
vpaddq $ D4 , $ H0 , $ H0 # h4 -> h0
vpsrlq \ $ 26 , $ H2 , $ D2
vpand $ MASK , $ H2 , $ H2
vpand $ MASK , $ T2 , $ T2 # 2
vpand $ MASK , $ T0 , $ T0 # 0
vpaddq $ D2 , $ H3 , $ H3 # h2 -> h3
vpsrlq \ $ 26 , $ H0 , $ D0
vpand $ MASK , $ H0 , $ H0
vpaddq $ H2 , $ T2 , $ H2 # accumulate input for .Ltail_avx2
vpand $ MASK , $ T1 , $ T1 # 1
vpaddq $ D0 , $ H1 , $ H1 # h0 -> h1
vpsrlq \ $ 26 , $ H3 , $ D3
vpand $ MASK , $ H3 , $ H3
vpand $ MASK , $ T3 , $ T3 # 3
vpor 32 ( % rcx ) , $ T4 , $ T4 # padbit, yes, always
vpaddq $ D3 , $ H4 , $ H4 # h3 -> h4
lea 0x90 ( % rsp ) , % rax # size optimization for .Ltail_avx2
add \ $ 64 , $ len
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
jnz . Ltail_avx2 $ suffix
2020-01-05 22:40:47 -05:00
vpsubq $ T2 , $ H2 , $ H2 # undo input accumulation
vmovd % x #$H0,`4*0-48-64`($ctx)# save partially reduced
vmovd % x #$H1,`4*1-48-64`($ctx)
vmovd % x #$H2,`4*2-48-64`($ctx)
vmovd % x #$H3,`4*3-48-64`($ctx)
vmovd % x #$H4,`4*4-48-64`($ctx)
vzeroall
___
$ code . = << ___ if ( $ win64 ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
movdqa - 0xb0 ( % r10 ) , % xmm6
movdqa - 0xa0 ( % r10 ) , % xmm7
movdqa - 0x90 ( % r10 ) , % xmm8
movdqa - 0x80 ( % r10 ) , % xmm9
movdqa - 0x70 ( % r10 ) , % xmm10
movdqa - 0x60 ( % r10 ) , % xmm11
movdqa - 0x50 ( % r10 ) , % xmm12
movdqa - 0x40 ( % r10 ) , % xmm13
movdqa - 0x30 ( % r10 ) , % xmm14
movdqa - 0x20 ( % r10 ) , % xmm15
lea - 8 ( % r10 ) , % rsp
2020-01-05 22:40:47 -05:00
. Ldo_avx512_epilogue:
___
$ code . = << ___ if ( ! $ win64 ) ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
lea - 8 ( % r10 ) , % rsp
. cfi_def_cfa_register % rsp
2020-01-05 22:40:47 -05:00
___
$ code . = << ___ ;
ret
. cfi_endproc
___
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
}
}
& declare_function ( "poly1305_blocks_avx2" , 32 , 4 ) ;
poly1305_blocks_avxN ( 0 ) ;
& end_function ( "poly1305_blocks_avx2" ) ;
#######################################################################
if ( $ avx > 2 ) {
# On entry we have input length divisible by 64. But since inner loop
# processes 128 bytes per iteration, cases when length is not divisible
# by 128 are handled by passing tail 64 bytes to .Ltail_avx2. For this
# reason stack layout is kept identical to poly1305_blocks_avx2. If not
# for this tail, we wouldn't have to even allocate stack frame...
if ( $ kernel ) {
$ code . = "#ifdef CONFIG_AS_AVX512\n" ;
}
& declare_function ( "poly1305_blocks_avx512" , 32 , 4 ) ;
poly1305_blocks_avxN ( 1 ) ;
& end_function ( "poly1305_blocks_avx512" ) ;
if ( $ kernel ) {
$ code . = "#endif\n" ;
}
if ( ! $ kernel && $ avx > 3 ) {
2020-01-05 22:40:47 -05:00
########################################################################
# VPMADD52 version using 2^44 radix.
#
# One can argue that base 2^52 would be more natural. Well, even though
# some operations would be more natural, one has to recognize couple of
# things. Base 2^52 doesn't provide advantage over base 2^44 if you look
# at amount of multiply-n-accumulate operations. Secondly, it makes it
# impossible to pre-compute multiples of 5 [referred to as s[]/sN in
# reference implementations], which means that more such operations
# would have to be performed in inner loop, which in turn makes critical
# path longer. In other words, even though base 2^44 reduction might
# look less elegant, overall critical path is actually shorter...
########################################################################
# Layout of opaque area is following.
#
# unsigned __int64 h[3]; # current hash value base 2^44
# unsigned __int64 s[2]; # key value*20 base 2^44
# unsigned __int64 r[3]; # key value base 2^44
# struct { unsigned __int64 r^1, r^3, r^2, r^4; } R[4];
# # r^n positions reflect
# # placement in register, not
# # memory, R[3] is R[1]*20
$ code . = << ___ ;
. type poly1305_init_base2_44 , \ @ function , 3
. align 32
poly1305_init_base2_44:
2020-08-27 19:38:31 +02:00
xor % eax , % eax
2020-01-05 22:40:47 -05:00
mov % rax , 0 ( $ ctx ) # initialize hash value
mov % rax , 8 ( $ ctx )
mov % rax , 16 ( $ ctx )
. Linit_base2_44:
lea poly1305_blocks_vpmadd52 ( % rip ) , % r10
lea poly1305_emit_base2_44 ( % rip ) , % r11
mov \ $ 0x0ffffffc0fffffff , % rax
mov \ $ 0x0ffffffc0ffffffc , % rcx
and 0 ( $ inp ) , % rax
mov \ $ 0x00000fffffffffff , % r8
and 8 ( $ inp ) , % rcx
mov \ $ 0x00000fffffffffff , % r9
and % rax , % r8
shrd \ $ 44 , % rcx , % rax
mov % r8 , 40 ( $ ctx ) # r0
and % r9 , % rax
shr \ $ 24 , % rcx
mov % rax , 48 ( $ ctx ) # r1
lea ( % rax , % rax , 4 ) , % rax # *5
mov % rcx , 56 ( $ ctx ) # r2
shl \ $ 2 , % rax # magic <<2
lea ( % rcx , % rcx , 4 ) , % rcx # *5
shl \ $ 2 , % rcx # magic <<2
mov % rax , 24 ( $ ctx ) # s1
mov % rcx , 32 ( $ ctx ) # s2
movq \ $ - 1 , 64 ( $ ctx ) # write impossible value
___
$ code . = << ___ if ( $ flavour !~ /elf32/ ) ;
mov % r10 , 0 ( % rdx )
mov % r11 , 8 ( % rdx )
___
$ code . = << ___ if ( $ flavour =~ /elf32/ ) ;
mov % r10d , 0 ( % rdx )
mov % r11d , 4 ( % rdx )
___
$ code . = << ___ ;
mov \ $ 1 , % eax
ret
. size poly1305_init_base2_44 , . - poly1305_init_base2_44
___
{
my ( $ H0 , $ H1 , $ H2 , $ r2r1r0 , $ r1r0s2 , $ r0s2s1 , $ Dlo , $ Dhi ) = map ( "%ymm$_" , ( 0 .. 5 , 16 , 17 ) ) ;
my ( $ T0 , $ inp_permd , $ inp_shift , $ PAD ) = map ( "%ymm$_" , ( 18 .. 21 ) ) ;
my ( $ reduc_mask , $ reduc_rght , $ reduc_left ) = map ( "%ymm$_" , ( 22 .. 25 ) ) ;
$ code . = << ___ ;
. type poly1305_blocks_vpmadd52 , \ @ function , 4
. align 32
poly1305_blocks_vpmadd52:
shr \ $ 4 , $ len
jz . Lno_data_vpmadd52 # too short
shl \ $ 40 , $ padbit
mov 64 ( $ ctx ) , % r8 # peek on power of the key
# if powers of the key are not calculated yet, process up to 3
# blocks with this single-block subroutine, otherwise ensure that
# length is divisible by 2 blocks and pass the rest down to next
# subroutine...
mov \ $ 3 , % rax
mov \ $ 1 , % r10
cmp \ $ 4 , $ len # is input long
cmovae % r10 , % rax
test % r8 , % r8 # is power value impossible?
cmovns % r10 , % rax
and $ len , % rax # is input of favourable length?
jz . Lblocks_vpmadd52_4x
sub % rax , $ len
mov \ $ 7 , % r10d
mov \ $ 1 , % r11d
kmovw % r10d , % k7
lea . L2_44_inp_permd ( % rip ) , % r10
kmovw % r11d , % k1
vmovq $ padbit , % x #$PAD
vmovdqa64 0 ( % r10 ) , $ inp_permd # .L2_44_inp_permd
vmovdqa64 32 ( % r10 ) , $ inp_shift # .L2_44_inp_shift
vpermq \ $ 0xcf , $ PAD , $ PAD
vmovdqa64 64 ( % r10 ) , $ reduc_mask # .L2_44_mask
vmovdqu64 0 ( $ ctx ) , $ { Dlo } { % k7 } { z } # load hash value
vmovdqu64 40 ( $ ctx ) , $ { r2r1r0 } { % k7 } { z } # load keys
vmovdqu64 32 ( $ ctx ) , $ { r1r0s2 } { % k7 } { z }
vmovdqu64 24 ( $ ctx ) , $ { r0s2s1 } { % k7 } { z }
vmovdqa64 96 ( % r10 ) , $ reduc_rght # .L2_44_shift_rgt
vmovdqa64 128 ( % r10 ) , $ reduc_left # .L2_44_shift_lft
jmp . Loop_vpmadd52
. align 32
. Loop_vpmadd52:
vmovdqu32 0 ( $ inp ) , % x #$T0 # load input as ----3210
lea 16 ( $ inp ) , $ inp
vpermd $ T0 , $ inp_permd , $ T0 # ----3210 -> --322110
vpsrlvq $ inp_shift , $ T0 , $ T0
vpandq $ reduc_mask , $ T0 , $ T0
vporq $ PAD , $ T0 , $ T0
vpaddq $ T0 , $ Dlo , $ Dlo # accumulate input
vpermq \ $ 0 , $ Dlo , $ { H0 } { % k7 } { z } # smash hash value
vpermq \ $ 0b01010101 , $ Dlo , $ { H1 } { % k7 } { z }
vpermq \ $ 0b10101010 , $ Dlo , $ { H2 } { % k7 } { z }
vpxord $ Dlo , $ Dlo , $ Dlo
vpxord $ Dhi , $ Dhi , $ Dhi
vpmadd52luq $ r2r1r0 , $ H0 , $ Dlo
vpmadd52huq $ r2r1r0 , $ H0 , $ Dhi
vpmadd52luq $ r1r0s2 , $ H1 , $ Dlo
vpmadd52huq $ r1r0s2 , $ H1 , $ Dhi
vpmadd52luq $ r0s2s1 , $ H2 , $ Dlo
vpmadd52huq $ r0s2s1 , $ H2 , $ Dhi
vpsrlvq $ reduc_rght , $ Dlo , $ T0 # 0 in topmost qword
vpsllvq $ reduc_left , $ Dhi , $ Dhi # 0 in topmost qword
vpandq $ reduc_mask , $ Dlo , $ Dlo
vpaddq $ T0 , $ Dhi , $ Dhi
vpermq \ $ 0b10010011 , $ Dhi , $ Dhi # 0 in lowest qword
vpaddq $ Dhi , $ Dlo , $ Dlo # note topmost qword :-)
vpsrlvq $ reduc_rght , $ Dlo , $ T0 # 0 in topmost word
vpandq $ reduc_mask , $ Dlo , $ Dlo
vpermq \ $ 0b10010011 , $ T0 , $ T0
vpaddq $ T0 , $ Dlo , $ Dlo
vpermq \ $ 0b10010011 , $ Dlo , $ { T0 } { % k1 } { z }
vpaddq $ T0 , $ Dlo , $ Dlo
vpsllq \ $ 2 , $ T0 , $ T0
vpaddq $ T0 , $ Dlo , $ Dlo
dec % rax # len-=16
jnz . Loop_vpmadd52
vmovdqu64 $ Dlo , 0 ( $ ctx ) { % k7 } # store hash value
test $ len , $ len
jnz . Lblocks_vpmadd52_4x
. Lno_data_vpmadd52:
ret
. size poly1305_blocks_vpmadd52 , . - poly1305_blocks_vpmadd52
___
}
{
########################################################################
# As implied by its name 4x subroutine processes 4 blocks in parallel
# (but handles even 4*n+2 blocks lengths). It takes up to 4th key power
# and is handled in 256-bit %ymm registers.
my ( $ H0 , $ H1 , $ H2 , $ R0 , $ R1 , $ R2 , $ S1 , $ S2 ) = map ( "%ymm$_" , ( 0 .. 5 , 16 , 17 ) ) ;
my ( $ D0lo , $ D0hi , $ D1lo , $ D1hi , $ D2lo , $ D2hi ) = map ( "%ymm$_" , ( 18 .. 23 ) ) ;
my ( $ T0 , $ T1 , $ T2 , $ T3 , $ mask44 , $ mask42 , $ tmp , $ PAD ) = map ( "%ymm$_" , ( 24 .. 31 ) ) ;
$ code . = << ___ ;
. type poly1305_blocks_vpmadd52_4x , \ @ function , 4
. align 32
poly1305_blocks_vpmadd52_4x:
shr \ $ 4 , $ len
jz . Lno_data_vpmadd52_4x # too short
shl \ $ 40 , $ padbit
mov 64 ( $ ctx ) , % r8 # peek on power of the key
. Lblocks_vpmadd52_4x:
vpbroadcastq $ padbit , $ PAD
vmovdqa64 . Lx_mask44 ( % rip ) , $ mask44
mov \ $ 5 , % eax
vmovdqa64 . Lx_mask42 ( % rip ) , $ mask42
kmovw % eax , % k1 # used in 2x path
test % r8 , % r8 # is power value impossible?
js . Linit_vpmadd52 # if it is, then init R[4]
vmovq 0 ( $ ctx ) , % x #$H0 # load current hash value
vmovq 8 ( $ ctx ) , % x #$H1
vmovq 16 ( $ ctx ) , % x #$H2
test \ $ 3 , $ len # is length 4*n+2?
jnz . Lblocks_vpmadd52_2x_do
. Lblocks_vpmadd52_4x_do:
vpbroadcastq 64 ( $ ctx ) , $ R0 # load 4th power of the key
vpbroadcastq 96 ( $ ctx ) , $ R1
vpbroadcastq 128 ( $ ctx ) , $ R2
vpbroadcastq 160 ( $ ctx ) , $ S1
. Lblocks_vpmadd52_4x_key_loaded:
vpsllq \ $ 2 , $ R2 , $ S2 # S2 = R2*5*4
vpaddq $ R2 , $ S2 , $ S2
vpsllq \ $ 2 , $ S2 , $ S2
test \ $ 7 , $ len # is len 8*n?
jz . Lblocks_vpmadd52_8x
vmovdqu64 16 * 0 ( $ inp ) , $ T2 # load data
vmovdqu64 16 * 2 ( $ inp ) , $ T3
lea 16 * 4 ( $ inp ) , $ inp
vpunpcklqdq $ T3 , $ T2 , $ T1 # transpose data
vpunpckhqdq $ T3 , $ T2 , $ T3
# at this point 64-bit lanes are ordered as 3-1-2-0
vpsrlq \ $ 24 , $ T3 , $ T2 # splat the data
vporq $ PAD , $ T2 , $ T2
vpaddq $ T2 , $ H2 , $ H2 # accumulate input
vpandq $ mask44 , $ T1 , $ T0
vpsrlq \ $ 44 , $ T1 , $ T1
vpsllq \ $ 20 , $ T3 , $ T3
vporq $ T3 , $ T1 , $ T1
vpandq $ mask44 , $ T1 , $ T1
sub \ $ 4 , $ len
jz . Ltail_vpmadd52_4x
jmp . Loop_vpmadd52_4x
ud2
. align 32
. Linit_vpmadd52:
vmovq 24 ( $ ctx ) , % x #$S1 # load key
vmovq 56 ( $ ctx ) , % x #$H2
vmovq 32 ( $ ctx ) , % x #$S2
vmovq 40 ( $ ctx ) , % x #$R0
vmovq 48 ( $ ctx ) , % x #$R1
vmovdqa $ R0 , $ H0
vmovdqa $ R1 , $ H1
vmovdqa $ H2 , $ R2
mov \ $ 2 , % eax
. Lmul_init_vpmadd52:
vpxorq $ D0lo , $ D0lo , $ D0lo
vpmadd52luq $ H2 , $ S1 , $ D0lo
vpxorq $ D0hi , $ D0hi , $ D0hi
vpmadd52huq $ H2 , $ S1 , $ D0hi
vpxorq $ D1lo , $ D1lo , $ D1lo
vpmadd52luq $ H2 , $ S2 , $ D1lo
vpxorq $ D1hi , $ D1hi , $ D1hi
vpmadd52huq $ H2 , $ S2 , $ D1hi
vpxorq $ D2lo , $ D2lo , $ D2lo
vpmadd52luq $ H2 , $ R0 , $ D2lo
vpxorq $ D2hi , $ D2hi , $ D2hi
vpmadd52huq $ H2 , $ R0 , $ D2hi
vpmadd52luq $ H0 , $ R0 , $ D0lo
vpmadd52huq $ H0 , $ R0 , $ D0hi
vpmadd52luq $ H0 , $ R1 , $ D1lo
vpmadd52huq $ H0 , $ R1 , $ D1hi
vpmadd52luq $ H0 , $ R2 , $ D2lo
vpmadd52huq $ H0 , $ R2 , $ D2hi
vpmadd52luq $ H1 , $ S2 , $ D0lo
vpmadd52huq $ H1 , $ S2 , $ D0hi
vpmadd52luq $ H1 , $ R0 , $ D1lo
vpmadd52huq $ H1 , $ R0 , $ D1hi
vpmadd52luq $ H1 , $ R1 , $ D2lo
vpmadd52huq $ H1 , $ R1 , $ D2hi
################################################################
# partial reduction
vpsrlq \ $ 44 , $ D0lo , $ tmp
vpsllq \ $ 8 , $ D0hi , $ D0hi
vpandq $ mask44 , $ D0lo , $ H0
vpaddq $ tmp , $ D0hi , $ D0hi
vpaddq $ D0hi , $ D1lo , $ D1lo
vpsrlq \ $ 44 , $ D1lo , $ tmp
vpsllq \ $ 8 , $ D1hi , $ D1hi
vpandq $ mask44 , $ D1lo , $ H1
vpaddq $ tmp , $ D1hi , $ D1hi
vpaddq $ D1hi , $ D2lo , $ D2lo
vpsrlq \ $ 42 , $ D2lo , $ tmp
vpsllq \ $ 10 , $ D2hi , $ D2hi
vpandq $ mask42 , $ D2lo , $ H2
vpaddq $ tmp , $ D2hi , $ D2hi
vpaddq $ D2hi , $ H0 , $ H0
vpsllq \ $ 2 , $ D2hi , $ D2hi
vpaddq $ D2hi , $ H0 , $ H0
vpsrlq \ $ 44 , $ H0 , $ tmp # additional step
vpandq $ mask44 , $ H0 , $ H0
vpaddq $ tmp , $ H1 , $ H1
dec % eax
jz . Ldone_init_vpmadd52
vpunpcklqdq $ R1 , $ H1 , $ R1 # 1,2
vpbroadcastq % x #$H1,%x#$H1 # 2,2
vpunpcklqdq $ R2 , $ H2 , $ R2
vpbroadcastq % x #$H2,%x#$H2
vpunpcklqdq $ R0 , $ H0 , $ R0
vpbroadcastq % x #$H0,%x#$H0
vpsllq \ $ 2 , $ R1 , $ S1 # S1 = R1*5*4
vpsllq \ $ 2 , $ R2 , $ S2 # S2 = R2*5*4
vpaddq $ R1 , $ S1 , $ S1
vpaddq $ R2 , $ S2 , $ S2
vpsllq \ $ 2 , $ S1 , $ S1
vpsllq \ $ 2 , $ S2 , $ S2
jmp . Lmul_init_vpmadd52
ud2
. align 32
. Ldone_init_vpmadd52:
vinserti128 \ $ 1 , % x #$R1,$H1,$R1 # 1,2,3,4
vinserti128 \ $ 1 , % x #$R2,$H2,$R2
vinserti128 \ $ 1 , % x #$R0,$H0,$R0
vpermq \ $ 0b11011000 , $ R1 , $ R1 # 1,3,2,4
vpermq \ $ 0b11011000 , $ R2 , $ R2
vpermq \ $ 0b11011000 , $ R0 , $ R0
vpsllq \ $ 2 , $ R1 , $ S1 # S1 = R1*5*4
vpaddq $ R1 , $ S1 , $ S1
vpsllq \ $ 2 , $ S1 , $ S1
vmovq 0 ( $ ctx ) , % x #$H0 # load current hash value
vmovq 8 ( $ ctx ) , % x #$H1
vmovq 16 ( $ ctx ) , % x #$H2
test \ $ 3 , $ len # is length 4*n+2?
jnz . Ldone_init_vpmadd52_2x
vmovdqu64 $ R0 , 64 ( $ ctx ) # save key powers
vpbroadcastq % x #$R0,$R0 # broadcast 4th power
vmovdqu64 $ R1 , 96 ( $ ctx )
vpbroadcastq % x #$R1,$R1
vmovdqu64 $ R2 , 128 ( $ ctx )
vpbroadcastq % x #$R2,$R2
vmovdqu64 $ S1 , 160 ( $ ctx )
vpbroadcastq % x #$S1,$S1
jmp . Lblocks_vpmadd52_4x_key_loaded
ud2
. align 32
. Ldone_init_vpmadd52_2x:
vmovdqu64 $ R0 , 64 ( $ ctx ) # save key powers
vpsrldq \ $ 8 , $ R0 , $ R0 # 0-1-0-2
vmovdqu64 $ R1 , 96 ( $ ctx )
vpsrldq \ $ 8 , $ R1 , $ R1
vmovdqu64 $ R2 , 128 ( $ ctx )
vpsrldq \ $ 8 , $ R2 , $ R2
vmovdqu64 $ S1 , 160 ( $ ctx )
vpsrldq \ $ 8 , $ S1 , $ S1
jmp . Lblocks_vpmadd52_2x_key_loaded
ud2
. align 32
. Lblocks_vpmadd52_2x_do:
vmovdqu64 128 + 8 ( $ ctx ) , $ { R2 } { % k1 } { z } # load 2nd and 1st key powers
vmovdqu64 160 + 8 ( $ ctx ) , $ { S1 } { % k1 } { z }
vmovdqu64 64 + 8 ( $ ctx ) , $ { R0 } { % k1 } { z }
vmovdqu64 96 + 8 ( $ ctx ) , $ { R1 } { % k1 } { z }
. Lblocks_vpmadd52_2x_key_loaded:
vmovdqu64 16 * 0 ( $ inp ) , $ T2 # load data
vpxorq $ T3 , $ T3 , $ T3
lea 16 * 2 ( $ inp ) , $ inp
vpunpcklqdq $ T3 , $ T2 , $ T1 # transpose data
vpunpckhqdq $ T3 , $ T2 , $ T3
# at this point 64-bit lanes are ordered as x-1-x-0
vpsrlq \ $ 24 , $ T3 , $ T2 # splat the data
vporq $ PAD , $ T2 , $ T2
vpaddq $ T2 , $ H2 , $ H2 # accumulate input
vpandq $ mask44 , $ T1 , $ T0
vpsrlq \ $ 44 , $ T1 , $ T1
vpsllq \ $ 20 , $ T3 , $ T3
vporq $ T3 , $ T1 , $ T1
vpandq $ mask44 , $ T1 , $ T1
jmp . Ltail_vpmadd52_2x
ud2
. align 32
. Loop_vpmadd52_4x:
#vpaddq $T2,$H2,$H2 # accumulate input
vpaddq $ T0 , $ H0 , $ H0
vpaddq $ T1 , $ H1 , $ H1
vpxorq $ D0lo , $ D0lo , $ D0lo
vpmadd52luq $ H2 , $ S1 , $ D0lo
vpxorq $ D0hi , $ D0hi , $ D0hi
vpmadd52huq $ H2 , $ S1 , $ D0hi
vpxorq $ D1lo , $ D1lo , $ D1lo
vpmadd52luq $ H2 , $ S2 , $ D1lo
vpxorq $ D1hi , $ D1hi , $ D1hi
vpmadd52huq $ H2 , $ S2 , $ D1hi
vpxorq $ D2lo , $ D2lo , $ D2lo
vpmadd52luq $ H2 , $ R0 , $ D2lo
vpxorq $ D2hi , $ D2hi , $ D2hi
vpmadd52huq $ H2 , $ R0 , $ D2hi
vmovdqu64 16 * 0 ( $ inp ) , $ T2 # load data
vmovdqu64 16 * 2 ( $ inp ) , $ T3
lea 16 * 4 ( $ inp ) , $ inp
vpmadd52luq $ H0 , $ R0 , $ D0lo
vpmadd52huq $ H0 , $ R0 , $ D0hi
vpmadd52luq $ H0 , $ R1 , $ D1lo
vpmadd52huq $ H0 , $ R1 , $ D1hi
vpmadd52luq $ H0 , $ R2 , $ D2lo
vpmadd52huq $ H0 , $ R2 , $ D2hi
vpunpcklqdq $ T3 , $ T2 , $ T1 # transpose data
vpunpckhqdq $ T3 , $ T2 , $ T3
vpmadd52luq $ H1 , $ S2 , $ D0lo
vpmadd52huq $ H1 , $ S2 , $ D0hi
vpmadd52luq $ H1 , $ R0 , $ D1lo
vpmadd52huq $ H1 , $ R0 , $ D1hi
vpmadd52luq $ H1 , $ R1 , $ D2lo
vpmadd52huq $ H1 , $ R1 , $ D2hi
################################################################
# partial reduction (interleaved with data splat)
vpsrlq \ $ 44 , $ D0lo , $ tmp
vpsllq \ $ 8 , $ D0hi , $ D0hi
vpandq $ mask44 , $ D0lo , $ H0
vpaddq $ tmp , $ D0hi , $ D0hi
vpsrlq \ $ 24 , $ T3 , $ T2
vporq $ PAD , $ T2 , $ T2
vpaddq $ D0hi , $ D1lo , $ D1lo
vpsrlq \ $ 44 , $ D1lo , $ tmp
vpsllq \ $ 8 , $ D1hi , $ D1hi
vpandq $ mask44 , $ D1lo , $ H1
vpaddq $ tmp , $ D1hi , $ D1hi
vpandq $ mask44 , $ T1 , $ T0
vpsrlq \ $ 44 , $ T1 , $ T1
vpsllq \ $ 20 , $ T3 , $ T3
vpaddq $ D1hi , $ D2lo , $ D2lo
vpsrlq \ $ 42 , $ D2lo , $ tmp
vpsllq \ $ 10 , $ D2hi , $ D2hi
vpandq $ mask42 , $ D2lo , $ H2
vpaddq $ tmp , $ D2hi , $ D2hi
vpaddq $ T2 , $ H2 , $ H2 # accumulate input
vpaddq $ D2hi , $ H0 , $ H0
vpsllq \ $ 2 , $ D2hi , $ D2hi
vpaddq $ D2hi , $ H0 , $ H0
vporq $ T3 , $ T1 , $ T1
vpandq $ mask44 , $ T1 , $ T1
vpsrlq \ $ 44 , $ H0 , $ tmp # additional step
vpandq $ mask44 , $ H0 , $ H0
vpaddq $ tmp , $ H1 , $ H1
sub \ $ 4 , $ len # len - = 6 4
jnz . Loop_vpmadd52_4x
. Ltail_vpmadd52_4x:
vmovdqu64 128 ( $ ctx ) , $ R2 # load all key powers
vmovdqu64 160 ( $ ctx ) , $ S1
vmovdqu64 64 ( $ ctx ) , $ R0
vmovdqu64 96 ( $ ctx ) , $ R1
. Ltail_vpmadd52_2x:
vpsllq \ $ 2 , $ R2 , $ S2 # S2 = R2*5*4
vpaddq $ R2 , $ S2 , $ S2
vpsllq \ $ 2 , $ S2 , $ S2
#vpaddq $T2,$H2,$H2 # accumulate input
vpaddq $ T0 , $ H0 , $ H0
vpaddq $ T1 , $ H1 , $ H1
vpxorq $ D0lo , $ D0lo , $ D0lo
vpmadd52luq $ H2 , $ S1 , $ D0lo
vpxorq $ D0hi , $ D0hi , $ D0hi
vpmadd52huq $ H2 , $ S1 , $ D0hi
vpxorq $ D1lo , $ D1lo , $ D1lo
vpmadd52luq $ H2 , $ S2 , $ D1lo
vpxorq $ D1hi , $ D1hi , $ D1hi
vpmadd52huq $ H2 , $ S2 , $ D1hi
vpxorq $ D2lo , $ D2lo , $ D2lo
vpmadd52luq $ H2 , $ R0 , $ D2lo
vpxorq $ D2hi , $ D2hi , $ D2hi
vpmadd52huq $ H2 , $ R0 , $ D2hi
vpmadd52luq $ H0 , $ R0 , $ D0lo
vpmadd52huq $ H0 , $ R0 , $ D0hi
vpmadd52luq $ H0 , $ R1 , $ D1lo
vpmadd52huq $ H0 , $ R1 , $ D1hi
vpmadd52luq $ H0 , $ R2 , $ D2lo
vpmadd52huq $ H0 , $ R2 , $ D2hi
vpmadd52luq $ H1 , $ S2 , $ D0lo
vpmadd52huq $ H1 , $ S2 , $ D0hi
vpmadd52luq $ H1 , $ R0 , $ D1lo
vpmadd52huq $ H1 , $ R0 , $ D1hi
vpmadd52luq $ H1 , $ R1 , $ D2lo
vpmadd52huq $ H1 , $ R1 , $ D2hi
################################################################
# horizontal addition
mov \ $ 1 , % eax
kmovw % eax , % k1
vpsrldq \ $ 8 , $ D0lo , $ T0
vpsrldq \ $ 8 , $ D0hi , $ H0
vpsrldq \ $ 8 , $ D1lo , $ T1
vpsrldq \ $ 8 , $ D1hi , $ H1
vpaddq $ T0 , $ D0lo , $ D0lo
vpaddq $ H0 , $ D0hi , $ D0hi
vpsrldq \ $ 8 , $ D2lo , $ T2
vpsrldq \ $ 8 , $ D2hi , $ H2
vpaddq $ T1 , $ D1lo , $ D1lo
vpaddq $ H1 , $ D1hi , $ D1hi
vpermq \ $ 0x2 , $ D0lo , $ T0
vpermq \ $ 0x2 , $ D0hi , $ H0
vpaddq $ T2 , $ D2lo , $ D2lo
vpaddq $ H2 , $ D2hi , $ D2hi
vpermq \ $ 0x2 , $ D1lo , $ T1
vpermq \ $ 0x2 , $ D1hi , $ H1
vpaddq $ T0 , $ D0lo , $ { D0lo } { % k1 } { z }
vpaddq $ H0 , $ D0hi , $ { D0hi } { % k1 } { z }
vpermq \ $ 0x2 , $ D2lo , $ T2
vpermq \ $ 0x2 , $ D2hi , $ H2
vpaddq $ T1 , $ D1lo , $ { D1lo } { % k1 } { z }
vpaddq $ H1 , $ D1hi , $ { D1hi } { % k1 } { z }
vpaddq $ T2 , $ D2lo , $ { D2lo } { % k1 } { z }
vpaddq $ H2 , $ D2hi , $ { D2hi } { % k1 } { z }
################################################################
# partial reduction
vpsrlq \ $ 44 , $ D0lo , $ tmp
vpsllq \ $ 8 , $ D0hi , $ D0hi
vpandq $ mask44 , $ D0lo , $ H0
vpaddq $ tmp , $ D0hi , $ D0hi
vpaddq $ D0hi , $ D1lo , $ D1lo
vpsrlq \ $ 44 , $ D1lo , $ tmp
vpsllq \ $ 8 , $ D1hi , $ D1hi
vpandq $ mask44 , $ D1lo , $ H1
vpaddq $ tmp , $ D1hi , $ D1hi
vpaddq $ D1hi , $ D2lo , $ D2lo
vpsrlq \ $ 42 , $ D2lo , $ tmp
vpsllq \ $ 10 , $ D2hi , $ D2hi
vpandq $ mask42 , $ D2lo , $ H2
vpaddq $ tmp , $ D2hi , $ D2hi
vpaddq $ D2hi , $ H0 , $ H0
vpsllq \ $ 2 , $ D2hi , $ D2hi
vpaddq $ D2hi , $ H0 , $ H0
vpsrlq \ $ 44 , $ H0 , $ tmp # additional step
vpandq $ mask44 , $ H0 , $ H0
vpaddq $ tmp , $ H1 , $ H1
# at this point $len is
# either 4*n+2 or 0...
sub \ $ 2 , $ len # len - = 3 2
ja . Lblocks_vpmadd52_4x_do
vmovq % x #$H0,0($ctx)
vmovq % x #$H1,8($ctx)
vmovq % x #$H2,16($ctx)
vzeroall
. Lno_data_vpmadd52_4x:
ret
. size poly1305_blocks_vpmadd52_4x , . - poly1305_blocks_vpmadd52_4x
___
}
{
########################################################################
# As implied by its name 8x subroutine processes 8 blocks in parallel...
# This is intermediate version, as it's used only in cases when input
# length is either 8*n, 8*n+1 or 8*n+2...
my ( $ H0 , $ H1 , $ H2 , $ R0 , $ R1 , $ R2 , $ S1 , $ S2 ) = map ( "%ymm$_" , ( 0 .. 5 , 16 , 17 ) ) ;
my ( $ D0lo , $ D0hi , $ D1lo , $ D1hi , $ D2lo , $ D2hi ) = map ( "%ymm$_" , ( 18 .. 23 ) ) ;
my ( $ T0 , $ T1 , $ T2 , $ T3 , $ mask44 , $ mask42 , $ tmp , $ PAD ) = map ( "%ymm$_" , ( 24 .. 31 ) ) ;
my ( $ RR0 , $ RR1 , $ RR2 , $ SS1 , $ SS2 ) = map ( "%ymm$_" , ( 6 .. 10 ) ) ;
$ code . = << ___ ;
. type poly1305_blocks_vpmadd52_8x , \ @ function , 4
. align 32
poly1305_blocks_vpmadd52_8x:
shr \ $ 4 , $ len
jz . Lno_data_vpmadd52_8x # too short
shl \ $ 40 , $ padbit
mov 64 ( $ ctx ) , % r8 # peek on power of the key
vmovdqa64 . Lx_mask44 ( % rip ) , $ mask44
vmovdqa64 . Lx_mask42 ( % rip ) , $ mask42
test % r8 , % r8 # is power value impossible?
js . Linit_vpmadd52 # if it is, then init R[4]
vmovq 0 ( $ ctx ) , % x #$H0 # load current hash value
vmovq 8 ( $ ctx ) , % x #$H1
vmovq 16 ( $ ctx ) , % x #$H2
. Lblocks_vpmadd52_8x:
################################################################
# fist we calculate more key powers
vmovdqu64 128 ( $ ctx ) , $ R2 # load 1-3-2-4 powers
vmovdqu64 160 ( $ ctx ) , $ S1
vmovdqu64 64 ( $ ctx ) , $ R0
vmovdqu64 96 ( $ ctx ) , $ R1
vpsllq \ $ 2 , $ R2 , $ S2 # S2 = R2*5*4
vpaddq $ R2 , $ S2 , $ S2
vpsllq \ $ 2 , $ S2 , $ S2
vpbroadcastq % x #$R2,$RR2 # broadcast 4th power
vpbroadcastq % x #$R0,$RR0
vpbroadcastq % x #$R1,$RR1
vpxorq $ D0lo , $ D0lo , $ D0lo
vpmadd52luq $ RR2 , $ S1 , $ D0lo
vpxorq $ D0hi , $ D0hi , $ D0hi
vpmadd52huq $ RR2 , $ S1 , $ D0hi
vpxorq $ D1lo , $ D1lo , $ D1lo
vpmadd52luq $ RR2 , $ S2 , $ D1lo
vpxorq $ D1hi , $ D1hi , $ D1hi
vpmadd52huq $ RR2 , $ S2 , $ D1hi
vpxorq $ D2lo , $ D2lo , $ D2lo
vpmadd52luq $ RR2 , $ R0 , $ D2lo
vpxorq $ D2hi , $ D2hi , $ D2hi
vpmadd52huq $ RR2 , $ R0 , $ D2hi
vpmadd52luq $ RR0 , $ R0 , $ D0lo
vpmadd52huq $ RR0 , $ R0 , $ D0hi
vpmadd52luq $ RR0 , $ R1 , $ D1lo
vpmadd52huq $ RR0 , $ R1 , $ D1hi
vpmadd52luq $ RR0 , $ R2 , $ D2lo
vpmadd52huq $ RR0 , $ R2 , $ D2hi
vpmadd52luq $ RR1 , $ S2 , $ D0lo
vpmadd52huq $ RR1 , $ S2 , $ D0hi
vpmadd52luq $ RR1 , $ R0 , $ D1lo
vpmadd52huq $ RR1 , $ R0 , $ D1hi
vpmadd52luq $ RR1 , $ R1 , $ D2lo
vpmadd52huq $ RR1 , $ R1 , $ D2hi
################################################################
# partial reduction
vpsrlq \ $ 44 , $ D0lo , $ tmp
vpsllq \ $ 8 , $ D0hi , $ D0hi
vpandq $ mask44 , $ D0lo , $ RR0
vpaddq $ tmp , $ D0hi , $ D0hi
vpaddq $ D0hi , $ D1lo , $ D1lo
vpsrlq \ $ 44 , $ D1lo , $ tmp
vpsllq \ $ 8 , $ D1hi , $ D1hi
vpandq $ mask44 , $ D1lo , $ RR1
vpaddq $ tmp , $ D1hi , $ D1hi
vpaddq $ D1hi , $ D2lo , $ D2lo
vpsrlq \ $ 42 , $ D2lo , $ tmp
vpsllq \ $ 10 , $ D2hi , $ D2hi
vpandq $ mask42 , $ D2lo , $ RR2
vpaddq $ tmp , $ D2hi , $ D2hi
vpaddq $ D2hi , $ RR0 , $ RR0
vpsllq \ $ 2 , $ D2hi , $ D2hi
vpaddq $ D2hi , $ RR0 , $ RR0
vpsrlq \ $ 44 , $ RR0 , $ tmp # additional step
vpandq $ mask44 , $ RR0 , $ RR0
vpaddq $ tmp , $ RR1 , $ RR1
################################################################
# At this point Rx holds 1324 powers, RRx - 5768, and the goal
# is 15263748, which reflects how data is loaded...
vpunpcklqdq $ R2 , $ RR2 , $ T2 # 3748
vpunpckhqdq $ R2 , $ RR2 , $ R2 # 1526
vpunpcklqdq $ R0 , $ RR0 , $ T0
vpunpckhqdq $ R0 , $ RR0 , $ R0
vpunpcklqdq $ R1 , $ RR1 , $ T1
vpunpckhqdq $ R1 , $ RR1 , $ R1
___
######## switch to %zmm
map ( s/%y/%z/ , $ H0 , $ H1 , $ H2 , $ R0 , $ R1 , $ R2 , $ S1 , $ S2 ) ;
map ( s/%y/%z/ , $ D0lo , $ D0hi , $ D1lo , $ D1hi , $ D2lo , $ D2hi ) ;
map ( s/%y/%z/ , $ T0 , $ T1 , $ T2 , $ T3 , $ mask44 , $ mask42 , $ tmp , $ PAD ) ;
map ( s/%y/%z/ , $ RR0 , $ RR1 , $ RR2 , $ SS1 , $ SS2 ) ;
$ code . = << ___ ;
vshufi64x2 \ $ 0x44 , $ R2 , $ T2 , $ RR2 # 15263748
vshufi64x2 \ $ 0x44 , $ R0 , $ T0 , $ RR0
vshufi64x2 \ $ 0x44 , $ R1 , $ T1 , $ RR1
vmovdqu64 16 * 0 ( $ inp ) , $ T2 # load data
vmovdqu64 16 * 4 ( $ inp ) , $ T3
lea 16 * 8 ( $ inp ) , $ inp
vpsllq \ $ 2 , $ RR2 , $ SS2 # S2 = R2*5*4
vpsllq \ $ 2 , $ RR1 , $ SS1 # S1 = R1*5*4
vpaddq $ RR2 , $ SS2 , $ SS2
vpaddq $ RR1 , $ SS1 , $ SS1
vpsllq \ $ 2 , $ SS2 , $ SS2
vpsllq \ $ 2 , $ SS1 , $ SS1
vpbroadcastq $ padbit , $ PAD
vpbroadcastq % x #$mask44,$mask44
vpbroadcastq % x #$mask42,$mask42
vpbroadcastq % x #$SS1,$S1 # broadcast 8th power
vpbroadcastq % x #$SS2,$S2
vpbroadcastq % x #$RR0,$R0
vpbroadcastq % x #$RR1,$R1
vpbroadcastq % x #$RR2,$R2
vpunpcklqdq $ T3 , $ T2 , $ T1 # transpose data
vpunpckhqdq $ T3 , $ T2 , $ T3
# at this point 64-bit lanes are ordered as 73625140
vpsrlq \ $ 24 , $ T3 , $ T2 # splat the data
vporq $ PAD , $ T2 , $ T2
vpaddq $ T2 , $ H2 , $ H2 # accumulate input
vpandq $ mask44 , $ T1 , $ T0
vpsrlq \ $ 44 , $ T1 , $ T1
vpsllq \ $ 20 , $ T3 , $ T3
vporq $ T3 , $ T1 , $ T1
vpandq $ mask44 , $ T1 , $ T1
sub \ $ 8 , $ len
jz . Ltail_vpmadd52_8x
jmp . Loop_vpmadd52_8x
. align 32
. Loop_vpmadd52_8x:
#vpaddq $T2,$H2,$H2 # accumulate input
vpaddq $ T0 , $ H0 , $ H0
vpaddq $ T1 , $ H1 , $ H1
vpxorq $ D0lo , $ D0lo , $ D0lo
vpmadd52luq $ H2 , $ S1 , $ D0lo
vpxorq $ D0hi , $ D0hi , $ D0hi
vpmadd52huq $ H2 , $ S1 , $ D0hi
vpxorq $ D1lo , $ D1lo , $ D1lo
vpmadd52luq $ H2 , $ S2 , $ D1lo
vpxorq $ D1hi , $ D1hi , $ D1hi
vpmadd52huq $ H2 , $ S2 , $ D1hi
vpxorq $ D2lo , $ D2lo , $ D2lo
vpmadd52luq $ H2 , $ R0 , $ D2lo
vpxorq $ D2hi , $ D2hi , $ D2hi
vpmadd52huq $ H2 , $ R0 , $ D2hi
vmovdqu64 16 * 0 ( $ inp ) , $ T2 # load data
vmovdqu64 16 * 4 ( $ inp ) , $ T3
lea 16 * 8 ( $ inp ) , $ inp
vpmadd52luq $ H0 , $ R0 , $ D0lo
vpmadd52huq $ H0 , $ R0 , $ D0hi
vpmadd52luq $ H0 , $ R1 , $ D1lo
vpmadd52huq $ H0 , $ R1 , $ D1hi
vpmadd52luq $ H0 , $ R2 , $ D2lo
vpmadd52huq $ H0 , $ R2 , $ D2hi
vpunpcklqdq $ T3 , $ T2 , $ T1 # transpose data
vpunpckhqdq $ T3 , $ T2 , $ T3
vpmadd52luq $ H1 , $ S2 , $ D0lo
vpmadd52huq $ H1 , $ S2 , $ D0hi
vpmadd52luq $ H1 , $ R0 , $ D1lo
vpmadd52huq $ H1 , $ R0 , $ D1hi
vpmadd52luq $ H1 , $ R1 , $ D2lo
vpmadd52huq $ H1 , $ R1 , $ D2hi
################################################################
# partial reduction (interleaved with data splat)
vpsrlq \ $ 44 , $ D0lo , $ tmp
vpsllq \ $ 8 , $ D0hi , $ D0hi
vpandq $ mask44 , $ D0lo , $ H0
vpaddq $ tmp , $ D0hi , $ D0hi
vpsrlq \ $ 24 , $ T3 , $ T2
vporq $ PAD , $ T2 , $ T2
vpaddq $ D0hi , $ D1lo , $ D1lo
vpsrlq \ $ 44 , $ D1lo , $ tmp
vpsllq \ $ 8 , $ D1hi , $ D1hi
vpandq $ mask44 , $ D1lo , $ H1
vpaddq $ tmp , $ D1hi , $ D1hi
vpandq $ mask44 , $ T1 , $ T0
vpsrlq \ $ 44 , $ T1 , $ T1
vpsllq \ $ 20 , $ T3 , $ T3
vpaddq $ D1hi , $ D2lo , $ D2lo
vpsrlq \ $ 42 , $ D2lo , $ tmp
vpsllq \ $ 10 , $ D2hi , $ D2hi
vpandq $ mask42 , $ D2lo , $ H2
vpaddq $ tmp , $ D2hi , $ D2hi
vpaddq $ T2 , $ H2 , $ H2 # accumulate input
vpaddq $ D2hi , $ H0 , $ H0
vpsllq \ $ 2 , $ D2hi , $ D2hi
vpaddq $ D2hi , $ H0 , $ H0
vporq $ T3 , $ T1 , $ T1
vpandq $ mask44 , $ T1 , $ T1
vpsrlq \ $ 44 , $ H0 , $ tmp # additional step
vpandq $ mask44 , $ H0 , $ H0
vpaddq $ tmp , $ H1 , $ H1
sub \ $ 8 , $ len # len - = 1 2 8
jnz . Loop_vpmadd52_8x
. Ltail_vpmadd52_8x:
#vpaddq $T2,$H2,$H2 # accumulate input
vpaddq $ T0 , $ H0 , $ H0
vpaddq $ T1 , $ H1 , $ H1
vpxorq $ D0lo , $ D0lo , $ D0lo
vpmadd52luq $ H2 , $ SS1 , $ D0lo
vpxorq $ D0hi , $ D0hi , $ D0hi
vpmadd52huq $ H2 , $ SS1 , $ D0hi
vpxorq $ D1lo , $ D1lo , $ D1lo
vpmadd52luq $ H2 , $ SS2 , $ D1lo
vpxorq $ D1hi , $ D1hi , $ D1hi
vpmadd52huq $ H2 , $ SS2 , $ D1hi
vpxorq $ D2lo , $ D2lo , $ D2lo
vpmadd52luq $ H2 , $ RR0 , $ D2lo
vpxorq $ D2hi , $ D2hi , $ D2hi
vpmadd52huq $ H2 , $ RR0 , $ D2hi
vpmadd52luq $ H0 , $ RR0 , $ D0lo
vpmadd52huq $ H0 , $ RR0 , $ D0hi
vpmadd52luq $ H0 , $ RR1 , $ D1lo
vpmadd52huq $ H0 , $ RR1 , $ D1hi
vpmadd52luq $ H0 , $ RR2 , $ D2lo
vpmadd52huq $ H0 , $ RR2 , $ D2hi
vpmadd52luq $ H1 , $ SS2 , $ D0lo
vpmadd52huq $ H1 , $ SS2 , $ D0hi
vpmadd52luq $ H1 , $ RR0 , $ D1lo
vpmadd52huq $ H1 , $ RR0 , $ D1hi
vpmadd52luq $ H1 , $ RR1 , $ D2lo
vpmadd52huq $ H1 , $ RR1 , $ D2hi
################################################################
# horizontal addition
mov \ $ 1 , % eax
kmovw % eax , % k1
vpsrldq \ $ 8 , $ D0lo , $ T0
vpsrldq \ $ 8 , $ D0hi , $ H0
vpsrldq \ $ 8 , $ D1lo , $ T1
vpsrldq \ $ 8 , $ D1hi , $ H1
vpaddq $ T0 , $ D0lo , $ D0lo
vpaddq $ H0 , $ D0hi , $ D0hi
vpsrldq \ $ 8 , $ D2lo , $ T2
vpsrldq \ $ 8 , $ D2hi , $ H2
vpaddq $ T1 , $ D1lo , $ D1lo
vpaddq $ H1 , $ D1hi , $ D1hi
vpermq \ $ 0x2 , $ D0lo , $ T0
vpermq \ $ 0x2 , $ D0hi , $ H0
vpaddq $ T2 , $ D2lo , $ D2lo
vpaddq $ H2 , $ D2hi , $ D2hi
vpermq \ $ 0x2 , $ D1lo , $ T1
vpermq \ $ 0x2 , $ D1hi , $ H1
vpaddq $ T0 , $ D0lo , $ D0lo
vpaddq $ H0 , $ D0hi , $ D0hi
vpermq \ $ 0x2 , $ D2lo , $ T2
vpermq \ $ 0x2 , $ D2hi , $ H2
vpaddq $ T1 , $ D1lo , $ D1lo
vpaddq $ H1 , $ D1hi , $ D1hi
vextracti64x4 \ $ 1 , $ D0lo , % y #$T0
vextracti64x4 \ $ 1 , $ D0hi , % y #$H0
vpaddq $ T2 , $ D2lo , $ D2lo
vpaddq $ H2 , $ D2hi , $ D2hi
vextracti64x4 \ $ 1 , $ D1lo , % y #$T1
vextracti64x4 \ $ 1 , $ D1hi , % y #$H1
vextracti64x4 \ $ 1 , $ D2lo , % y #$T2
vextracti64x4 \ $ 1 , $ D2hi , % y #$H2
___
######## switch back to %ymm
map ( s/%z/%y/ , $ H0 , $ H1 , $ H2 , $ R0 , $ R1 , $ R2 , $ S1 , $ S2 ) ;
map ( s/%z/%y/ , $ D0lo , $ D0hi , $ D1lo , $ D1hi , $ D2lo , $ D2hi ) ;
map ( s/%z/%y/ , $ T0 , $ T1 , $ T2 , $ T3 , $ mask44 , $ mask42 , $ tmp , $ PAD ) ;
$ code . = << ___ ;
vpaddq $ T0 , $ D0lo , $ { D0lo } { % k1 } { z }
vpaddq $ H0 , $ D0hi , $ { D0hi } { % k1 } { z }
vpaddq $ T1 , $ D1lo , $ { D1lo } { % k1 } { z }
vpaddq $ H1 , $ D1hi , $ { D1hi } { % k1 } { z }
vpaddq $ T2 , $ D2lo , $ { D2lo } { % k1 } { z }
vpaddq $ H2 , $ D2hi , $ { D2hi } { % k1 } { z }
################################################################
# partial reduction
vpsrlq \ $ 44 , $ D0lo , $ tmp
vpsllq \ $ 8 , $ D0hi , $ D0hi
vpandq $ mask44 , $ D0lo , $ H0
vpaddq $ tmp , $ D0hi , $ D0hi
vpaddq $ D0hi , $ D1lo , $ D1lo
vpsrlq \ $ 44 , $ D1lo , $ tmp
vpsllq \ $ 8 , $ D1hi , $ D1hi
vpandq $ mask44 , $ D1lo , $ H1
vpaddq $ tmp , $ D1hi , $ D1hi
vpaddq $ D1hi , $ D2lo , $ D2lo
vpsrlq \ $ 42 , $ D2lo , $ tmp
vpsllq \ $ 10 , $ D2hi , $ D2hi
vpandq $ mask42 , $ D2lo , $ H2
vpaddq $ tmp , $ D2hi , $ D2hi
vpaddq $ D2hi , $ H0 , $ H0
vpsllq \ $ 2 , $ D2hi , $ D2hi
vpaddq $ D2hi , $ H0 , $ H0
vpsrlq \ $ 44 , $ H0 , $ tmp # additional step
vpandq $ mask44 , $ H0 , $ H0
vpaddq $ tmp , $ H1 , $ H1
################################################################
vmovq % x #$H0,0($ctx)
vmovq % x #$H1,8($ctx)
vmovq % x #$H2,16($ctx)
vzeroall
. Lno_data_vpmadd52_8x:
ret
. size poly1305_blocks_vpmadd52_8x , . - poly1305_blocks_vpmadd52_8x
___
}
$ code . = << ___ ;
. type poly1305_emit_base2_44 , \ @ function , 3
. align 32
poly1305_emit_base2_44:
mov 0 ( $ ctx ) , % r8 # load hash value
mov 8 ( $ ctx ) , % r9
mov 16 ( $ ctx ) , % r10
mov % r9 , % rax
shr \ $ 20 , % r9
shl \ $ 44 , % rax
mov % r10 , % rcx
shr \ $ 40 , % r10
shl \ $ 24 , % rcx
add % rax , % r8
adc % rcx , % r9
adc \ $ 0 , % r10
mov % r8 , % rax
add \ $ 5 , % r8 # compare to modulus
mov % r9 , % rcx
adc \ $ 0 , % r9
adc \ $ 0 , % r10
shr \ $ 2 , % r10 # did 130-bit value overflow?
cmovnz % r8 , % rax
cmovnz % r9 , % rcx
add 0 ( $ nonce ) , % rax # accumulate nonce
adc 8 ( $ nonce ) , % rcx
mov % rax , 0 ( $ mac ) # write result
mov % rcx , 8 ( $ mac )
ret
. size poly1305_emit_base2_44 , . - poly1305_emit_base2_44
___
} } }
}
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
if ( ! $ kernel )
2020-01-05 22:40:47 -05:00
{ # chacha20-poly1305 helpers
my ( $ out , $ inp , $ otp , $ len ) = $ win64 ? ( "%rcx" , "%rdx" , "%r8" , "%r9" ) : # Win64 order
( "%rdi" , "%rsi" , "%rdx" , "%rcx" ) ; # Unix order
$ code . = << ___ ;
. globl xor128_encrypt_n_pad
. type xor128_encrypt_n_pad , \ @ abi - omnipotent
. align 16
xor128_encrypt_n_pad:
sub $ otp , $ inp
sub $ otp , $ out
mov $ len , % r10 # put len aside
shr \ $ 4 , $ len # len / 16
jz . Ltail_enc
nop
. Loop_enc_xmm:
movdqu ( $ inp , $ otp ) , % xmm0
pxor ( $ otp ) , % xmm0
movdqu % xmm0 , ( $ out , $ otp )
movdqa % xmm0 , ( $ otp )
lea 16 ( $ otp ) , $ otp
dec $ len
jnz . Loop_enc_xmm
and \ $ 15 , % r10 # len % 16
jz . Ldone_enc
. Ltail_enc:
mov \ $ 16 , $ len
sub % r10 , $ len
xor % eax , % eax
. Loop_enc_byte:
mov ( $ inp , $ otp ) , % al
xor ( $ otp ) , % al
mov % al , ( $ out , $ otp )
mov % al , ( $ otp )
lea 1 ( $ otp ) , $ otp
dec % r10
jnz . Loop_enc_byte
xor % eax , % eax
. Loop_enc_pad:
mov % al , ( $ otp )
lea 1 ( $ otp ) , $ otp
dec $ len
jnz . Loop_enc_pad
. Ldone_enc:
mov $ otp , % rax
ret
. size xor128_encrypt_n_pad , . - xor128_encrypt_n_pad
. globl xor128_decrypt_n_pad
. type xor128_decrypt_n_pad , \ @ abi - omnipotent
. align 16
xor128_decrypt_n_pad:
sub $ otp , $ inp
sub $ otp , $ out
mov $ len , % r10 # put len aside
shr \ $ 4 , $ len # len / 16
jz . Ltail_dec
nop
. Loop_dec_xmm:
movdqu ( $ inp , $ otp ) , % xmm0
movdqa ( $ otp ) , % xmm1
pxor % xmm0 , % xmm1
movdqu % xmm1 , ( $ out , $ otp )
movdqa % xmm0 , ( $ otp )
lea 16 ( $ otp ) , $ otp
dec $ len
jnz . Loop_dec_xmm
pxor % xmm1 , % xmm1
and \ $ 15 , % r10 # len % 16
jz . Ldone_dec
. Ltail_dec:
mov \ $ 16 , $ len
sub % r10 , $ len
xor % eax , % eax
2020-08-27 19:38:31 +02:00
xor % r11d , % r11d
2020-01-05 22:40:47 -05:00
. Loop_dec_byte:
mov ( $ inp , $ otp ) , % r11b
mov ( $ otp ) , % al
xor % r11b , % al
mov % al , ( $ out , $ otp )
mov % r11b , ( $ otp )
lea 1 ( $ otp ) , $ otp
dec % r10
jnz . Loop_dec_byte
xor % eax , % eax
. Loop_dec_pad:
mov % al , ( $ otp )
lea 1 ( $ otp ) , $ otp
dec $ len
jnz . Loop_dec_pad
. Ldone_dec:
mov $ otp , % rax
ret
. size xor128_decrypt_n_pad , . - xor128_decrypt_n_pad
___
}
# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
# CONTEXT *context,DISPATCHER_CONTEXT *disp)
if ( $ win64 ) {
$ rec = "%rcx" ;
$ frame = "%rdx" ;
$ context = "%r8" ;
$ disp = "%r9" ;
$ code . = << ___ ;
. extern __imp_RtlVirtualUnwind
. type se_handler , \ @ abi - omnipotent
. align 16
se_handler:
push % rsi
push % rdi
push % rbx
push % rbp
push % r12
push % r13
push % r14
push % r15
pushfq
sub \ $ 6 4 , % rsp
mov 120 ( $ context ) , % rax # pull context->Rax
mov 248 ( $ context ) , % rbx # pull context->Rip
mov 8 ( $ disp ) , % rsi # disp->ImageBase
mov 56 ( $ disp ) , % r11 # disp->HandlerData
mov 0 ( % r11 ) , % r10d # HandlerData[0]
lea ( % rsi , % r10 ) , % r10 # prologue label
cmp % r10 , % rbx # context->Rip<.Lprologue
jb . Lcommon_seh_tail
mov 152 ( $ context ) , % rax # pull context->Rsp
mov 4 ( % r11 ) , % r10d # HandlerData[1]
lea ( % rsi , % r10 ) , % r10 # epilogue label
cmp % r10 , % rbx # context->Rip>=.Lepilogue
jae . Lcommon_seh_tail
lea 48 ( % rax ) , % rax
mov - 8 ( % rax ) , % rbx
mov - 16 ( % rax ) , % rbp
mov - 24 ( % rax ) , % r12
mov - 32 ( % rax ) , % r13
mov - 40 ( % rax ) , % r14
mov - 48 ( % rax ) , % r15
mov % rbx , 144 ( $ context ) # restore context->Rbx
mov % rbp , 160 ( $ context ) # restore context->Rbp
mov % r12 , 216 ( $ context ) # restore context->R12
mov % r13 , 224 ( $ context ) # restore context->R13
mov % r14 , 232 ( $ context ) # restore context->R14
mov % r15 , 240 ( $ context ) # restore context->R14
jmp . Lcommon_seh_tail
. size se_handler , . - se_handler
. type avx_handler , \ @ abi - omnipotent
. align 16
avx_handler:
push % rsi
push % rdi
push % rbx
push % rbp
push % r12
push % r13
push % r14
push % r15
pushfq
sub \ $ 6 4 , % rsp
mov 120 ( $ context ) , % rax # pull context->Rax
mov 248 ( $ context ) , % rbx # pull context->Rip
mov 8 ( $ disp ) , % rsi # disp->ImageBase
mov 56 ( $ disp ) , % r11 # disp->HandlerData
mov 0 ( % r11 ) , % r10d # HandlerData[0]
lea ( % rsi , % r10 ) , % r10 # prologue label
cmp % r10 , % rbx # context->Rip<prologue label
jb . Lcommon_seh_tail
mov 152 ( $ context ) , % rax # pull context->Rsp
mov 4 ( % r11 ) , % r10d # HandlerData[1]
lea ( % rsi , % r10 ) , % r10 # epilogue label
cmp % r10 , % rbx # context->Rip>=epilogue label
jae . Lcommon_seh_tail
mov 208 ( $ context ) , % rax # pull context->R11
lea 0x50 ( % rax ) , % rsi
lea 0xf8 ( % rax ) , % rax
lea 512 ( $ context ) , % rdi # &context.Xmm6
mov \ $ 20 , % ecx
. long 0xa548f3fc # cld; rep movsq
. Lcommon_seh_tail:
mov 8 ( % rax ) , % rdi
mov 16 ( % rax ) , % rsi
mov % rax , 152 ( $ context ) # restore context->Rsp
mov % rsi , 168 ( $ context ) # restore context->Rsi
mov % rdi , 176 ( $ context ) # restore context->Rdi
mov 40 ( $ disp ) , % rdi # disp->ContextRecord
mov $ context , % rsi # context
mov \ $ 154 , % ecx # sizeof(CONTEXT)
. long 0xa548f3fc # cld; rep movsq
mov $ disp , % rsi
2020-08-27 19:38:31 +02:00
xor % ecx , % ecx # arg1, UNW_FLAG_NHANDLER
2020-01-05 22:40:47 -05:00
mov 8 ( % rsi ) , % rdx # arg2, disp->ImageBase
mov 0 ( % rsi ) , % r8 # arg3, disp->ControlPc
mov 16 ( % rsi ) , % r9 # arg4, disp->FunctionEntry
mov 40 ( % rsi ) , % r10 # disp->ContextRecord
lea 56 ( % rsi ) , % r11 # &disp->HandlerData
lea 24 ( % rsi ) , % r12 # &disp->EstablisherFrame
mov % r10 , 32 ( % rsp ) # arg5
mov % r11 , 40 ( % rsp ) # arg6
mov % r12 , 48 ( % rsp ) # arg7
mov % rcx , 56 ( % rsp ) # arg8, (NULL)
call * __imp_RtlVirtualUnwind ( % rip )
mov \ $ 1 , % eax # ExceptionContinueSearch
add \ $ 64 , % rsp
popfq
pop % r15
pop % r14
pop % r13
pop % r12
pop % rbp
pop % rbx
pop % rdi
pop % rsi
ret
. size avx_handler , . - avx_handler
. section . pdata
. align 4
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. rva . LSEH_begin_poly1305_init_x86_64
. rva . LSEH_end_poly1305_init_x86_64
. rva . LSEH_info_poly1305_init_x86_64
2020-01-05 22:40:47 -05:00
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. rva . LSEH_begin_poly1305_blocks_x86_64
. rva . LSEH_end_poly1305_blocks_x86_64
. rva . LSEH_info_poly1305_blocks_x86_64
2020-01-05 22:40:47 -05:00
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. rva . LSEH_begin_poly1305_emit_x86_64
. rva . LSEH_end_poly1305_emit_x86_64
. rva . LSEH_info_poly1305_emit_x86_64
2020-01-05 22:40:47 -05:00
___
$ code . = << ___ if ( $ avx ) ;
. rva . LSEH_begin_poly1305_blocks_avx
. rva . Lbase2_64_avx
. rva . LSEH_info_poly1305_blocks_avx_1
. rva . Lbase2_64_avx
. rva . Leven_avx
. rva . LSEH_info_poly1305_blocks_avx_2
. rva . Leven_avx
. rva . LSEH_end_poly1305_blocks_avx
. rva . LSEH_info_poly1305_blocks_avx_3
. rva . LSEH_begin_poly1305_emit_avx
. rva . LSEH_end_poly1305_emit_avx
. rva . LSEH_info_poly1305_emit_avx
___
$ code . = << ___ if ( $ avx > 1 ) ;
. rva . LSEH_begin_poly1305_blocks_avx2
. rva . Lbase2_64_avx2
. rva . LSEH_info_poly1305_blocks_avx2_1
. rva . Lbase2_64_avx2
. rva . Leven_avx2
. rva . LSEH_info_poly1305_blocks_avx2_2
. rva . Leven_avx2
. rva . LSEH_end_poly1305_blocks_avx2
. rva . LSEH_info_poly1305_blocks_avx2_3
___
$ code . = << ___ if ( $ avx > 2 ) ;
. rva . LSEH_begin_poly1305_blocks_avx512
. rva . LSEH_end_poly1305_blocks_avx512
. rva . LSEH_info_poly1305_blocks_avx512
___
$ code . = << ___ ;
. section . xdata
. align 8
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. LSEH_info_poly1305_init_x86_64:
2020-01-05 22:40:47 -05:00
. byte 9 , 0 , 0 , 0
. rva se_handler
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. rva . LSEH_begin_poly1305_init_x86_64 , . LSEH_begin_poly1305_init_x86_64
2020-01-05 22:40:47 -05:00
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. LSEH_info_poly1305_blocks_x86_64:
2020-01-05 22:40:47 -05:00
. byte 9 , 0 , 0 , 0
. rva se_handler
. rva . Lblocks_body , . Lblocks_epilogue
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. LSEH_info_poly1305_emit_x86_64:
2020-01-05 22:40:47 -05:00
. byte 9 , 0 , 0 , 0
. rva se_handler
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
. rva . LSEH_begin_poly1305_emit_x86_64 , . LSEH_begin_poly1305_emit_x86_64
2020-01-05 22:40:47 -05:00
___
$ code . = << ___ if ( $ avx ) ;
. LSEH_info_poly1305_blocks_avx_1:
. byte 9 , 0 , 0 , 0
. rva se_handler
. rva . Lblocks_avx_body , . Lblocks_avx_epilogue # HandlerData[]
. LSEH_info_poly1305_blocks_avx_2:
. byte 9 , 0 , 0 , 0
. rva se_handler
. rva . Lbase2_64_avx_body , . Lbase2_64_avx_epilogue # HandlerData[]
. LSEH_info_poly1305_blocks_avx_3:
. byte 9 , 0 , 0 , 0
. rva avx_handler
. rva . Ldo_avx_body , . Ldo_avx_epilogue # HandlerData[]
. LSEH_info_poly1305_emit_avx:
. byte 9 , 0 , 0 , 0
. rva se_handler
. rva . LSEH_begin_poly1305_emit_avx , . LSEH_begin_poly1305_emit_avx
___
$ code . = << ___ if ( $ avx > 1 ) ;
. LSEH_info_poly1305_blocks_avx2_1:
. byte 9 , 0 , 0 , 0
. rva se_handler
. rva . Lblocks_avx2_body , . Lblocks_avx2_epilogue # HandlerData[]
. LSEH_info_poly1305_blocks_avx2_2:
. byte 9 , 0 , 0 , 0
. rva se_handler
. rva . Lbase2_64_avx2_body , . Lbase2_64_avx2_epilogue # HandlerData[]
. LSEH_info_poly1305_blocks_avx2_3:
. byte 9 , 0 , 0 , 0
. rva avx_handler
. rva . Ldo_avx2_body , . Ldo_avx2_epilogue # HandlerData[]
___
$ code . = << ___ if ( $ avx > 2 ) ;
. LSEH_info_poly1305_blocks_avx512:
. byte 9 , 0 , 0 , 0
. rva avx_handler
. rva . Ldo_avx512_body , . Ldo_avx512_epilogue # HandlerData[]
___
}
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
open SELF , $ 0 ;
while ( <SELF> ) {
next if ( /^#!/ ) ;
last if ( ! s/^#/\/\// and ! /^$/ ) ;
print ;
}
close SELF ;
2020-01-05 22:40:47 -05:00
foreach ( split ( '\n' , $ code ) ) {
s/\`([^\`]*)\`/eval($1)/ge ;
s/%r([a-z]+)#d/%e$1/g ;
s/%r([0-9]+)#d/%r$1d/g ;
s/%x#%[yz]/%x/g or s/%y#%z/%y/g or s/%z#%[yz]/%z/g ;
crypto: x86/poly1305 - wire up faster implementations for kernel
These x86_64 vectorized implementations support AVX, AVX-2, and AVX512F.
The AVX-512F implementation is disabled on Skylake, due to throttling,
but it is quite fast on >= Cannonlake.
On the left is cycle counts on a Core i7 6700HQ using the AVX-2
codepath, comparing this implementation ("new") to the implementation in
the current crypto api ("old"). On the right are benchmarks on a Xeon
Gold 5120 using the AVX-512 codepath. The new implementation is faster
on all benchmarks.
AVX-2 AVX-512
--------- -----------
size old new size old new
---- ---- ---- ---- ---- ----
0 70 68 0 74 70
16 92 90 16 96 92
32 134 104 32 136 106
48 172 120 48 184 124
64 218 136 64 218 138
80 254 158 80 260 160
96 298 174 96 300 176
112 342 192 112 342 194
128 388 212 128 384 212
144 428 228 144 420 226
160 466 246 160 464 248
176 510 264 176 504 264
192 550 282 192 544 282
208 594 302 208 582 300
224 628 316 224 624 318
240 676 334 240 662 338
256 716 354 256 708 358
272 764 374 272 748 372
288 802 352 288 788 358
304 420 366 304 422 370
320 428 360 320 432 364
336 484 378 336 486 380
352 426 384 352 434 390
368 478 400 368 480 408
384 488 394 384 490 398
400 542 408 400 542 412
416 486 416 416 492 426
432 534 430 432 538 436
448 544 422 448 546 432
464 600 438 464 600 448
480 540 448 480 548 456
496 594 464 496 594 476
512 602 456 512 606 470
528 656 476 528 656 480
544 600 480 544 606 498
560 650 494 560 652 512
576 664 490 576 662 508
592 714 508 592 716 522
608 656 514 608 664 538
624 708 532 624 710 552
640 716 524 640 720 516
656 770 536 656 772 526
672 716 548 672 722 544
688 770 562 688 768 556
704 774 552 704 778 556
720 826 568 720 832 568
736 768 574 736 780 584
752 822 592 752 826 600
768 830 584 768 836 560
784 884 602 784 888 572
800 828 610 800 838 588
816 884 628 816 884 604
832 888 618 832 894 598
848 942 632 848 946 612
864 884 644 864 896 628
880 936 660 880 942 644
896 948 652 896 952 608
912 1000 664 912 1004 616
928 942 676 928 954 634
944 994 690 944 1000 646
960 1002 680 960 1008 646
976 1054 694 976 1062 658
992 1002 706 992 1012 674
1008 1052 720 1008 1058 690
This commit wires in the prior implementation from Andy, and makes the
following changes to be suitable for kernel land.
- Some cosmetic and structural changes, like renaming labels to
.Lname, constants, and other Linux conventions, as well as making
the code easy for us to maintain moving forward.
- CPU feature checking is done in C by the glue code.
- We avoid jumping into the middle of functions, to appease objtool,
and instead parameterize shared code.
- We maintain frame pointers so that stack traces make sense.
- We remove the dependency on the perl xlate code, which transforms
the output into things that assemblers we don't care about use.
Importantly, none of our changes affect the arithmetic or core code, but
just involve the differing environment of kernel space.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-01-05 22:40:48 -05:00
if ( $ kernel ) {
s/(^\.type.*),[0-9]+$/\1/ ;
s/(^\.type.*),\@abi-omnipotent+$/\1,\@function/ ;
next if /^\.cfi.*/ ;
}
2020-01-05 22:40:47 -05:00
print $ _ , "\n" ;
}
close STDOUT ;