Skip to content

Conversation

@aqjune
Copy link
Contributor

@aqjune aqjune commented Jul 20, 2023

Description of changes:

This patch adopts s2n-bignum's bignum functions to aws-lc's montgomery multiplication. Additionally, if the __ARM_NEON macro is defined, it invokes the functions that use NEON instructions (https://developer.arm.com/documentation/ihi0053/d/).

The scalar bignum functions are fully verified versions (verified by @jargh). The NEON instructions are still under verification by myself. I will mark this PR as draft, and mark it as ready once it is done.

If the NEON is used, the performance numbers of RSA signing are as follows. The processor is Graviton 2, and tool/bssl speed -filter RSA has been used. (Unit: ops/sec)

Bits Operation baseline AWS-LC s2n-bignum (+ NEON) speedup vs baseline
2048 RSA sign 299.3 495.8 65.65%
verify (fresh key) 10736.3 18836.6 75.45%
3072 RSA sign 95.4 126.4 32.49%
verify (fresh key) 4917.7 6579.1 33.78%
4096 RSA sign 41.7 78.3 87.77%
verify (fresh key) 2781.6 3800.3 36.62%

Without NEON, there is a milder but still observable speedup.

Bits Operation baseline AWS-LC s2n-bignum speedup vs baseline
2048 RSA sign 299.3 399 33.31%
verify (fresh key) 10736.3 15491 44.29%
3072 RSA sign 95.4 113.2 18.66%
verify (fresh key) 4917.7 6001.7 22.04%
4096 RSA sign 41.7 63.2 51.56%
verify (fresh key) 2781.6 3451 24.07%

This is only adopted for AArch64. For x86-64 the performance benefit isn't clear. For other architectures, s2n-bignum does not have implementation yet.

Currently, the s2n-bignum alternatives are always used, but I expect this must be conditionally used in some cases. I will be happy to receive feedbacks about this part.

Call-outs:

This PR contains addition of multiple files from awslabs/s2n-bignum.

Testing:

Tested via bssl speed -filter RSA

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

aqjune-aws and others added 6 commits July 20, 2023 14:24
…tion

This patch adopts s2n-bignum's bignum functions to aws-lc's montgomery multiplication.
Additionally, if the `__ARM_NEON` macro is defined, it invokes the functions that use NEON instructions
(https://developer.arm.com/documentation/ihi0053/d/).

The scalar bignum functions are fully verified versions (verified by @jargh).
The NEON instructions are still under verification by myself. I will mark this PR as draft, and mark it as ready once it is done.

If the NEON is used, the performance numbers are as follows. The processor is Graviton 2,
and `tool/bssl speed -filter RSA` has been used.

```
		Unit: ops/sec
Bits	Operation	baseline AWS-LC	s2n-bignum (+ NEON)	speedup vs baseline
2048	RSA sign	299.3	495.8	65.65%
	verify (fresh key)	10736.3	18836.6	75.45%
3072	RSA sign	95.4	126.4	32.49%
	verify (fresh key)	4917.7	6579.1	33.78%
4096	RSA sign	41.7	78.3	87.77%
	verify (fresh key)	2781.6	3800.3	36.62%
```

Without NEON, there is a milder but still observable speedup.

```
Bits	Operation	baseline AWS-LC	s2n-bignum	speedup vs baseline
2048	RSA sign	299.3	399	33.31%
	verify (fresh key)	10736.3	15491	44.29%
3072	RSA sign	95.4	113.2	18.66%
	verify (fresh key)	4917.7	6001.7	22.04%
4096	RSA sign	41.7	63.2	51.56%
	verify (fresh key)	2781.6	3451	24.07%
```

Currently, the s2n-bignum alternatives are always used, but I expect this must be conditionally used in some cases.
I will be happy to receive feedbacks about this part.
This commit
- Removes unused assembly files in s2n-bignum/arm/generic
- Add x86 scalar s2n-bignum ops
- add BN_MONTGOMERY_USE_S2N_BIGNUM macro
- Fix compilation errors
@aqjune aqjune changed the title Adopt s2n-bignum's bignum functions to aws-lc's montgomery multiplican Adopt s2n-bignum's AArch64 bignum functions to aws-lc's montgomery multiplication Jul 20, 2023
@aqjune
Copy link
Contributor Author

aqjune commented Jul 21, 2023

From the latest CI failure I could see this error message, but couldn't figure out its solution:

FAILED: crypto/fipsmodule/bcm-delocated.S 
cd /codebuild/output/src3947276767/src/github.com/aws/aws-lc/test_build_dir/crypto/fipsmodule && ./delocate -a /codebuild/output/src3947276767/src/github.com/aws/aws-lc/test_build_dir/crypto/fipsmodule/libbcm_c_generated_asm.a -o bcm-delocated.S aesv8-armx.S.s aesv8-gcm-armv8.S.s aesv8-gcm-armv8-unroll8.S.s armv8-mont.S.s bn-armv8.S.s ghash-neon-armv8.S.s ghashv8-armx.S.s keccak1600-armv8.S.s md5-armv8.S.s p256-armv8-asm.S.s p256_beeu-armv8-asm.S.s sha1-armv8.S.s sha256-armv8.S.s sha512-armv8.S.s vpaes-armv8.S.s p384/bignum_add_p384.S.S.s p384/bignum_sub_p384.S.S.s p384/bignum_neg_p384.S.S.s p384/bignum_tomont_p384.S.S.s p384/bignum_deamont_p384.S.S.s p384/bignum_montmul_p384.S.S.s p384/bignum_montmul_p384_alt.S.S.s p384/bignum_montsqr_p384.S.S.s p384/bignum_montsqr_p384_alt.S.S.s p384/bignum_nonzero_6.S.S.s p384/bignum_littleendian_6.S.S.s p521/bignum_add_p521.S.S.s p521/bignum_sub_p521.S.S.s p521/bignum_neg_p521.S.S.s p521/bignum_mul_p521.S.S.s p521/bignum_mul_p521_alt.S.S.s p521/bignum_sqr_p521.S.S.s p521/bignum_sqr_p521_alt.S.S.s p521/bignum_tolebytes_p521.S.S.s p521/bignum_fromlebytes_p521.S.S.s curve25519/curve25519_x25519_byte.S.S.s curve25519/curve25519_x25519_byte_alt.S.S.s curve25519/curve25519_x25519base_byte.S.S.s curve25519/curve25519_x25519base_byte_alt.S.S.s fastmul/bignum_kmul_16_32.S.S.s fastmul/bignum_kmul_32_64.S.S.s fastmul/bignum_ksqr_16_32.S.S.s fastmul/bignum_ksqr_32_64.S.S.s fastmul/bignum_emontredc_8n.S.S.s generic/bignum_ge.S.S.s generic/bignum_mul.S.S.s generic/bignum_optsub.S.S.s generic/bignum_sqr.S.S.s fastmul/bignum_kmul_16_32_neon.S.S.s fastmul/bignum_kmul_32_64_neon.S.S.s fastmul/bignum_ksqr_16_32_neon.S.S.s fastmul/bignum_ksqr_32_64_neon.S.S.s fastmul/bignum_emontredc_8n_neon.S.S.s
error while parsing "fastmul/bignum_kmul_16_32.S.S.s": 
parse error near OffsetOperator (line 142 symbol 33 - line 142 symbol 34):
"+"

I would appreciate very much if someone could give advice about this failure, thanks.

@aqjune
Copy link
Contributor Author

aqjune commented Jul 21, 2023

I could locally reproduce the failure after building boringssl's delocate using cmake .. -DFIPS_DELOCATE=On in aws-lc.
(https://github.com/google/boringssl/tree/master/util/fipstools/delocate)
It looks like the delocate tool cannot parser ARM assembly like this:

139         .set I, 0
140
141         ldp x10, x11, [x25, #128+8*I] << The '+' is here
142         ldp x12, x13, [x25, #64+8*I]   << .. and here as well. These `+` cannot be parsed by `delocate`
143         adds x10, x10, x12

For this case I can manually calculate #128+8*I (which is simply #128) and fix the assembly to use the number, but there are cases that are hard to fix because a loop is involved:

 265         .set    I, 1
 266         .rep (L-1)
 267         ldp     x0, x1, [z, #16*(K+I)]
 268         ldp     x2, x3, [z, #16*(L+I)]
 269         adcs    x0, x0, x2
 270         adcs    x1, x1, x3
 271         stp     x0, x1, [z, #16*(K+I)]
 272         .set    I, (I+1)
 273         .endr

I can preprocess these macros by simply compiling & disassembling s2n-bignum assemblies - this will remove these uses of complicated syntax.
Does it sound like a reasonable option?

@dkostic
Copy link
Contributor

dkostic commented Jul 21, 2023

I could locally reproduce the failure after building boringssl's delocate using cmake .. -DFIPS_DELOCATE=On in aws-lc. (https://github.com/google/boringssl/tree/master/util/fipstools/delocate) It looks like the delocate tool cannot parser ARM assembly like this:

139         .set I, 0
140
141         ldp x10, x11, [x25, #128+8*I] << The '+' is here
142         ldp x12, x13, [x25, #64+8*I]   << .. and here as well. These `+` cannot be parsed by `delocate`
143         adds x10, x10, x12

For this case I can manually calculate #128+8*I (which is simply #128) and fix the assembly to use the number, but there are cases that are hard to fix because a loop is involved:

 265         .set    I, 1
 266         .rep (L-1)
 267         ldp     x0, x1, [z, #16*(K+I)]
 268         ldp     x2, x3, [z, #16*(L+I)]
 269         adcs    x0, x0, x2
 270         adcs    x1, x1, x3
 271         stp     x0, x1, [z, #16*(K+I)]
 272         .set    I, (I+1)
 273         .endr

I can preprocess these macros by simply compiling & disassembling s2n-bignum assemblies - this will remove these uses of complicated syntax. Does it sound like a reasonable option?

The offset form has to fit into one of the regex here: https://github.com/aws/aws-lc/blob/main/util/fipstools/delocate/delocate.peg#L114-L123. So ideally we should add an expression to that list that covers your particular case.

@aqjune
Copy link
Contributor Author

aqjune commented Jul 21, 2023

Hello @dkostic , thanks for the reference! I could fix the case by adding rules specific to the patterns to the Offset rule on my local machine.

However, I am afraid that I have a few concerns about fixing delocate.peg...
delocate.peg seems to require more updates than I had expected, and I am afraid that my updates can inadvertently introduce buggy behavior.
For example, s2n-bignum uses .rep directive in ARM assembly to repeat a code snippet for a constant amount of time:

.set I, 1
.rep ((16/2)-1)
ldp x0, x1, [x19, #16*(16+I)]
ldp x2, x3, [x19, #16*((16/2)+I)]
adcs x0, x0, x2
adcs x1, x1, x3
stp x0, x1, [x19, #16*(16 +I)]
.set I, (I+1)
.endr

It seems the PEG syntax isn't considering macro directives in ARM such as .set or .rep. Adding .set and .rep to LabelContainingDirectiveName would work, but then SymbolArg must be also expanded to describe arbitrarily complicated expressions as well. I think the script is targetting both x86 and arm assemblies, so this can affect x86 assemblies.

Another concern is that extending delocate.peg doesn't seem straightforward.
This might sound ridiculous a bit, but I will try to explain why.
delocate.peg contains multiple parsing rules that are overlapping (e.g., they can parse the same string), which is bad...
In order to do expand this I need to find the 'first' rule that applies to the string input (https://en.wikipedia.org/wiki/Parsing_expression_grammar).
I tried finding the first rules & minimally extending these to support new cases (e.g., the [x25, #128+8*I] case), but it sometimes didn't work; I received an impression that PEG does not try all parsing rules and give up at some point if parsing the string is considered too costly.
Maybe this is why the Offset rule is having multiple right-hands that are tailored to very specific assembly patterns.

I think help is needed in this direction.

Or, things will be simpler if we simply put macro-expanded assemblies into third_party/s2n-bignum directory.
Expanding macros can be done by compiling the assembly & objdump-ing it.
s2n-bignum's proof also verifies the assembly code that has all macros already expanded.

@aqjune
Copy link
Contributor Author

aqjune commented Aug 17, 2023

Closing this PR, as this will be finally superceded by #1164

@aqjune aqjune closed this Aug 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants