Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: improve ManagedAuthenticatedEncryptor Decrypt() and Encrypt() flow #59424

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

DeagleGross
Copy link
Contributor

@DeagleGross DeagleGross commented Dec 10, 2024

The goal of PR is to bring linux performance closer to windows performance for DataProtection scenario. Below is the picture of Antiforgery benchmarks on win vs lin machines.
image

Results: DataProtection Benchmark

The benchmark I am relying on to show the result numbers is here, which is basically building default ServiceProvider, adding DataProtection via .AddDataProtection() and calling IDataProtector.Protect() or IDataProtector.Unprotect().

net Code Method Mean Error StdDev Gen0 Allocated
net10 main branch Unprotect 9.706 µs 0.0143 µs 0.0134 µs 0.1068 3.34 KB
net10 changed Unprotect 7.936 µs 0.0137 µs 0.0122 µs 0.2441 1.56 KB
net10 main branch Protect 10.88 µs 0.015 µs 0.014 µs 0.6256 3.91 KB
net10 changed Protect 9.248 µs 0.0096 µs 0.0090 µs 0.3357 2.14 KB

Results: Antiforgery Benchmark

note: results below dont have the latest numbers, I will update them once we change to the new API

However, since we are originally looking at improving the Antiforgery performance on linux, I ran the Antiforgery benchmark including locally built dll's from this PR.

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/refs/heads/main/scenarios/antiforgery.benchmarks.yml --scenario antiforgery-validation --profile aspnet-perf-lin --application.framework net10.0 --application.options.outputFiles "D:\code\aspnetcore\artifacts\bin\Microsoft.AspNetCore.DataProtection\Release\net10.0\Microsoft.AspNetCore.*.dll"

current aspnet core gives these stats on the benchmark:

Stat Value
Requests/sec 111,835
Max Working Set (MB) 140
Max Private Memory (MB) 491

RPS of run with changed dll's varies from run to run, therefore I ran it 10 times

# Stat Value
1 Requests/sec 115,082
2 Requests/sec 114,498
3 Requests/sec 115,924
4 Requests/sec 122,823
5 Requests/sec 117,024
6 Requests/sec 116,474
7 Requests/sec 117,056
8 Requests/sec 115,497
9 Requests/sec 116,082
10 Requests/sec 114,606
Mean: 116506 Mean Diff: 4.2% Min/Max Diff: [2.38% - 9.8%]

But memory usage is stable with such values:

Stat Value diff
Max Working Set (MB) 127 9.3%
Max Private Memory (MB) 442 10%

Which provide an evidence of ~10% of max app allocation size and ~5% RPS improvement.

Optimization details

I looked into Unprotect method for ManagedAuthenticatedEncryptor and spotted MemoryStream usage and multiple Buffer.BlockCopy usages. Also I saw that there is some shuffling of byte[] data, which I think can be skipped and performed in such a way, that some allocations are skipped.

In order to be as safe as possible, I created a separate DataProtectionPool which provides API to rent and return byte arrays. It is not intersecting with ArrayPool<byte>.Shared.

  1. ManagedSP800_108_CTR_HMACSHA512.DeriveKeys is changed to explicit usage ManagedSP800_108_CTR_HMACSHA512.DeriveKeysHMACSHA512, because _kdkPrfFactory is anyway hardcoded to use HMACSHA512. There is a static API allowing to hash without allocating kdk byte[] which is rented from the buffer: HMACSHA512.TryHashData(kdk, prfInput, prfOutput, out _);

  2. Avoided usage of DeriveKeysWithContextHeader which allocates a separate intermediate array for contextHeader and context. Instead passing the spans operationSubkey and validationSubkey directly into ManagedSP800_108_CTR_HMACSHA512.DeriveKeys

  3. ManagedSP800_108_CTR_HMACSHA512.DeriveKeysHMACSHA512 had 2 more arrays (prfInput and prfOutput), which now I am renting (via DataProtectionPool) or even stackalloc'ing. They are returned to the pool with clearArray: true flag to make sure key material is removed from the memory after usage.

  4. In Decrypt() flow I am again using HashAlgorithm.TryComputeHash overload, which works based on the Span<byte> types, compared to previously used HashAlgorithm.ComputeHash

  5. In Decrypt() flow changed usage to SymmetricAlgorithm.DecryptCbc() instead of CryptoTransform.TransformBlock() with same idea to use Span<byte> API instead of another byte[] allocation.

  6. Encrypt() flow is reusing №1, №2 and №3 optimizations as well

  7. Encrypt() before was relying on the MemoryStream and CryptoStream to write data in the result buffer, but I am pre-calculating the length, and then doing a single allocation of result array: var outputArray = new byte[keyModifierLength + ivLength + cipherTextLength + macLength]; All required data is copied into the outputArray via APIs supporting Span<byte>.

All listed optimizations are included in the net10 TFM, but only some (№ 2, №3 and №6) are used in netstandard2.0 and netFx TFMs which DataProtection also targets.

Related to #59287

@DeagleGross DeagleGross changed the title [DRAFT] perf: improve ManagedAuthenticatedEncryptor.Decrypt flow [DRAFT] perf: improve ManagedAuthenticatedEncryptor Decrypt() and Encrypt() flow Dec 17, 2024
@DeagleGross DeagleGross marked this pull request as ready for review December 19, 2024 14:31
@DeagleGross DeagleGross changed the title [DRAFT] perf: improve ManagedAuthenticatedEncryptor Decrypt() and Encrypt() flow perf: improve ManagedAuthenticatedEncryptor Decrypt() and Encrypt() flow Dec 19, 2024
@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Dec 19, 2024

Code review notes

  1. It's generally not advisable to pool buffers for sensitive cryptographic operations, such as those which perform key storage or manipulation. This tends to increase the attack surface of the application and should only be performed if it's absolutely required to meet some performance goal. If you absolutely must use pooled buffers, use different pooled buffers for key material specifically vs. (all other data). In dataprotection, key material would be the KDK, KEK, and individual decryption + validation keys.

  2. This PR introduces a call to the SymmetricAlgorithm.Key property setter. The original code intentionally avoided calling this property setter because it duplicates the sensitive key material in such a way that the caller has no control over the new lifetime, and this undermines other protections present in the system (the use of the Secret type and the widespread use of pinning within the core crypto logic). Of course, the EG can always say that the new behavior is preferred over the old behavior, but there needs to be an explicit acknowledgement that there is a security tradeoff here. The tradeoff shouldn't be a mere side effect that is likely to go unnoticed.

    It's possible we can make changes to the underlying SymmetricAlgorithm type to improve the perf without reducing the security stance, but this would require new API to be exposed within corelib. Since you're targeting these changes for net10 that's probably acceptable? Work with Jeff's team to give them your requirements and they can add the work to the backlog.

  3. The pattern if (len < CONST) { foo = stackalloc[len]; } else { foo = Rent(len).Slice(len); } is typically considered an antipattern. Prefer a pattern like if (len < CONST) { foo = stackalloc[CONST]; } else { foo = Rent(len); } foo = foo.Slice(len); instead. (Basically, the stackalloc should be a const, not variable-length.)

Scenario notes

Is the goal to improve the performance of the [real-world?] AntiForgery benchmark or to improve the performance of DataProtection in a standalone benchmark? The PR description (and attached graph) make it sound like improving the performance of the crank-based benchmark is the goal, but no throughput measurement is provided for the changes in this PR. Please provide that graph. It would supply evidence that these changes have real-world impact and aren't just microbenchmark improvements.

@DeagleGross
Copy link
Contributor Author

Code review notes

  1. It's generally not advisable to pool buffers for sensitive cryptographic operations, such as those which perform key storage or manipulation. This tends to increase the attack surface of the application and should only be performed if it's absolutely required to meet some performance goal. If you absolutely must use pooled buffers, use different pooled buffers for key material specifically vs. (all other data). In dataprotection, key material would be the KDK, KEK, and individual decryption + validation keys.
  2. This PR introduces a call to the SymmetricAlgorithm.Key property setter. The original code intentionally avoided calling this property setter because it duplicates the sensitive key material in such a way that the caller has no control over the new lifetime, and this undermines other protections present in the system (the use of the Secret type and the widespread use of pinning within the core crypto logic). Of course, the EG can always say that the new behavior is preferred over the old behavior, but there needs to be an explicit acknowledgement that there is a security tradeoff here. The tradeoff shouldn't be a mere side effect that is likely to go unnoticed.
    It's possible we can make changes to the underlying SymmetricAlgorithm type to improve the perf without reducing the security stance, but this would require new API to be exposed within corelib. Since you're targeting these changes for net10 that's probably acceptable? Work with Jeff's team to give them your requirements and they can add the work to the backlog.
  3. The pattern if (len < CONST) { foo = stackalloc[len]; } else { foo = Rent(len).Slice(len); } is typically considered an antipattern. Prefer a pattern like if (len < CONST) { foo = stackalloc[CONST]; } else { foo = Rent(len); } foo = foo.Slice(len); instead. (Basically, the stackalloc should be a const, not variable-length.)

Scenario notes

Is the goal to improve the performance of the [real-world?] AntiForgery benchmark or to improve the performance of DataProtection in a standalone benchmark? The PR description (and attached graph) make it sound like improving the performance of the crank-based benchmark is the goal, but no throughput measurement is provided for the changes in this PR. Please provide that graph. It would supply evidence that these changes have real-world impact and aren't just microbenchmark improvements.

Thanks for detailed answer @GrabYourPitchforks! Firstly, I ran the Antiforgery benchmark multiple times, and I provided the results in the PR description.

Re №3: I ran a BenchmarkDotNet for stackalloc with dynamic \ constant length of stackalloc (also with or without [SkipLocalsInit]): results are that without [SkipLocalsInit] dynamic length is even faster, but with [SkipLocalsInit] you are correct. I will probably enable the [SkipLocalsInit] on the method and then we can use constant length. What do you think?

Re №2: Thanks for clarifying it, I will create issues on the dotnet/runtime explaining what API I would like to have to make DataProtection's flow dont use byte[] directly.

Re №1: Could you please describe how attack surface of the application is increased if pool buffers are used? Does that mean that pooling is easier to inject into via reflection for example? Actually, even if we will not be using pooling byte arrays, if I work with corelib to introduce APIs supporting Span<byte>, we can choose to stackalloc or allocate a new byte via new byte[], meaning that we will not ever touch pool buffers. Does it make sense and is that more secure? I think most of the code is not using "big" arrays (say with length more than 128), and we will hit stackalloc in most cases.

Span<byte> arr = length <= 128
    ? stackalloc byte[128]
    : new byte[length];

…r comments + distinguish .net10 and .netFx impls
@DeagleGross
Copy link
Contributor Author

Here is the progress:

  1. changed the code to not rent buffers, but either stackalloc or allocate a new array
  2. changed to proposed stackalloc pattern by Levi
  3. distinguished net10 and netFx flows for clarity
  4. created an API proposal for SymmetricAlgorithm using Span<byte> as a key: [API Proposal]: add Span<byte> key parameter to SymmetricAlgorithm.EncryptCbc and SymmetricAlgorithm.DecryptCbc runtime#111154
  5. found out that validation algorithms can be improved to use Span<byte> as well if we use a specific implementation reference (like HMACSHA512). If we support enough of known types, we would cover most of the cases IMO.

@GrabYourPitchforks let me know what you think about №1, 4, 5 please
cc @mgravell

@@ -55,6 +57,217 @@ public static void DeriveKeys(byte[] kdk, ArraySegment<byte> label, ArraySegment
}
}

#if NET10_0_OR_GREATER
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.NET already has SP800108 in the box. We should probably just use the one that is built-in to .NET if it is available.

https://learn.microsoft.com/en-us/dotnet/api/system.security.cryptography.sp800108hmaccounterkdf

@danmoseley
Copy link
Member

Re №1: Could you please describe how attack surface of the application is increased if pool buffers are used? Does that mean that pooling is easier to inject into via reflection for example?

I'm interested in the answer to this, my assumption was it's just more copies in the heap, so higher probability of appearing in a dump of the service

@GrabYourPitchforks
Copy link
Member

Could you please describe how attack surface of the application is increased if pool buffers are used?

Using pooled arrays increases the attack surface by allowing other components of the system to have managed references to sensitive key material. It is unfortunately quite common to see buggy components misuse the array pool. And to be fair, the bug would be in those components, not within ASP.NET. And the consequences of such bugs span the gamut of issues - data corruption, information disclosure, etc.

If we allow the pool to contain sensitive data, then the stakes are raised, as it were. These buggy components might now inadvertently disclose sensitive plaintext data - or even active key material. Leaking the latter especially would cause the collapse of other security features within ASP.NET applications, including allowing attackers to carry our CSRF attacks or even mint their own administrative auth tokens.

Pinning is similar. By keeping the sensitive data contained within a pinned array, we ensure that no copies of it are made before we have a chance to clear it, which limits the chances of buggy code (either through direct memory access or by leaking uninitialized memory from the array pool) to disclose it.

To be sure, this is all defense-in-depth. In an otherwise properly functioning application, the security of DataProtection does not rely on any of this. These behaviors are solely to minimize the chances of other bugs in the system from manifesting in such a way that it undermines DataProtection. If you'd like to think of it another way: the information processed by DataProtection is so sensitive and failure would be so wildly catastrophic that this is an attempt to isolate the component from other failures within the same process.

Any perf-vs-security tradeoff is always at the discretion of the feature team. And the metrics used to judge these tradeoffs are obviously subjective! But it is nevertheless a tradeoff, and the product team needs to stand behind a statement akin to:

We do not believe that these protections provide significant enough value to our customers when weighed against the xx% performance gain (or other tangible benefit) that foregoing these protections would offer.

--

Fun fact! The original design of this component didn't even store the keys in-proc. It actually stored the keys out-of-proc to further reduce the risk of a buggy component within the process leaking keys. The only reason we didn't move forward with it at GA is that it didn't work in Azure Web Sites. At the time, their sandbox saw our IPC mechanism as a potential threat, and it protected the stability of the machine by throttling our IPC channel to 100 transactions per second. That's 100 Protect or Unprotect operations per second -- max.

The design was very secure, but it was unworkably slow to the point where no real-world customer could ever depend on it. So for the sake of practicality we ended up with the design we have today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-dataprotection Includes: DataProtection pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants