Suggestion: use AVX2 in FastParseCommand #813
Comments
(borked the test; closing while I fix that, in case this is a nothing-burger)
@badrishc I closed this while I fixed the test/data (now done); feel free to reopen if it is of interest (I don't have access for that).
This is a very cool idea! We have actually been thinking about doing this for a while, but we never got around to it. The numbers look promising; I assume the speedup will be much greater for commands at the bottom of the switch statement. The bigger question: what is the expected end-to-end gain? For example, what is the speedup if we process a batch of commands with parameters instead of individual command headers? We can use BDN for this.
@mgravell - this is indeed interesting but we need to see if this helps end-to-end. There are BDN benchmarks for Garnet operations at https://github.com/microsoft/garnet/tree/main/benchmark/BDN.benchmark/Operations -- you might want to prototype against that to check that there is an opportunity here.
I did some more digging this morning; it turns out that if you're testing multiple values,
The key thing here being: a |
All of this sounds good and is potentially a huge win. But it does seem like a non-trivial amount of dev effort to translate the idea into Garnet's parser. Would this be of interest to you or anyone else in the community to actually try out in Garnet code? The starting point would likely be https://github.com/microsoft/garnet/blob/main/libs/server/Resp/Parser/RespCommand.cs#L594
I will see what I can do with this shortly - hopefully I should have some useful numbers based on the real code (where "useful" can include "no meaningful impact, let's do nothing") this side of 2025.
Hey @mgravell - any update on this? Thanks!
Feature request type
Performance enhancement in command identification
Is your feature request related to a problem? Please describe
The current implementation: `garnet/libs/server/Resp/Parser/RespCommand.cs`, line 594 in ec54e3e.
Describe the solution you'd like
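It turns out that AVX2 is good at this, at least if we limit ourselves to 4-byte chunks, which is a pretty good filter, especially if we ignore the trailing `\r\n`. The basic approach is discussed here, and is fundamentally:
- `_mm256_set1_epi32` to broadcast the 4-byte value under test into all 8 lanes of a vector
- `_mm256_cmpeq_epi32` to compare it against 8 candidate values at once
- `_mm256_movemask_ps` to collapse the per-lane results into a single value (8 effective bits)
- `tzcnt` to find the index of the first matching lane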
It would still need a final "were we right?" check, like the existing code does if it doesn't fit in just 8 bytes.
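To make those steps concrete, here is a minimal C sketch of the idea, not the demo code and not Garnet's parser. The 8-entry `kCommands` table and the `load4` / `find_command` helpers are hypothetical names made up for illustration, and it assumes GCC or Clang on x86-64, with `len` already excluding the trailing `\r\n`:

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-entry table for illustration; not Garnet's command set. */
static const char *const kCommands[8] = {
    "PING", "INCR", "DECR", "QUIT", "ECHO", "LPOP", "RPOP", "SADD"
};

/* Read the first four bytes of s as one 32-bit value (memcpy avoids
   alignment and aliasing problems). */
static uint32_t load4(const char *s) {
    uint32_t v;
    memcpy(&v, s, sizeof v);
    return v;
}

/* Returns the index of the command in kCommands, or -1 if there is no match.
   The caller is responsible for checking that the CPU supports AVX2. */
__attribute__((target("avx2")))
static int find_command(const char *input, size_t len) {
    if (len < 4) return -1;

    /* A real implementation would precompute this table once. */
    uint32_t keys[8];
    for (int i = 0; i < 8; i++) keys[i] = load4(kCommands[i]);

    __m256i table  = _mm256_loadu_si256((const __m256i *)keys);
    __m256i needle = _mm256_set1_epi32((int)load4(input));   /* broadcast */
    __m256i eq     = _mm256_cmpeq_epi32(needle, table);      /* 8 compares */
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));  /* 8 effective bits */
    if (mask == 0) return -1;                                /* no lane matched */

    int idx = __builtin_ctz((unsigned)mask);                 /* tzcnt */
    assert(idx >= 0 && idx < 8);

    /* The final "were we right?" check: the 4-byte filter alone cannot
       tell INCR from INCRBY. */
    size_t n = strlen(kCommands[idx]);
    return (len == n && memcmp(input, kCommands[idx], n) == 0) ? idx : -1;
}
```

With this table, `find_command("INCR", 4)` yields index 1, while `find_command("INCRBY", 6)` passes the 4-byte filter and is then rejected by the final check.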
I did a quick demo, here, which so far only handles 8 commands. Obviously it is very incomplete, just to indicate "this is what is available". I also did a test using `static readonly uint` values instead of `Read<uint>` each time, to look at all options (that didn't move the needle any, or is possibly worse somehow, or is just jitter).
Tests here include:
- `PING`
- `INCR`
- `JUNK`
Results:
The data shows that the time for the existing `switch` code grows with the number of tests performed, whereas the AVX2 approach is consistent and fast, giving basically constant time rather than time proportional to the number of missed cases. I would expect this to grow in proportion to the number of AVX2 tests performed, but it should still be faster.
Describe alternatives you've considered
I have not considered AVX512; in theory this would allow either 16 32-bit tests, or 8 64-bit tests, at a time - but AVX512 is still pretty under-served. It may not be worth the additional effort.
Additional context
The biggest question I can think of here is: which 4 bytes to test? The first 4, or the last 4? Someone could probably do a collision analysis to see which narrows it down more. There are collisions either way (`LPUSH` vs `BRPOPLPUSH`, `INCR` vs `INCRBY`, `FLUSHDB` vs `FLUSHALL`, `SSCAN` vs `SCAN` vs `HSCAN`, etc). My guess is that the first four bytes give more entropy, but heck: if you `switch` on `length` first, you could always do 2 AVX2 tests for collision scenarios, i.e. test the first four and the last four. Or something.
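A collision analysis of that kind could be sketched roughly as follows. This is only an illustration: `kSample` holds just the commands named above (not the full command set), and `chunk` and `count_collisions` are hypothetical helper names. The `same_length_only` flag models the "switch on length first" variant.

```c
#include <assert.h>
#include <string.h>

/* Sample taken from the collisions named above; NOT the full command set. */
static const char *const kSample[] = {
    "LPUSH", "BRPOPLPUSH", "INCR", "INCRBY", "FLUSHDB",
    "FLUSHALL", "SSCAN", "SCAN", "HSCAN"
};
enum { kSampleCount = sizeof kSample / sizeof kSample[0] };

/* Pointer to the 4-byte chunk used as the key: first or last four bytes. */
static const char *chunk(const char *cmd, int use_last) {
    size_t n = strlen(cmd);
    assert(n >= 4); /* every sample command has at least four bytes */
    return use_last ? cmd + n - 4 : cmd;
}

/* Count pairs of distinct commands whose chosen 4-byte chunks are equal.
   If same_length_only is nonzero, only pairs of equal length are counted. */
static int count_collisions(int use_last, int same_length_only) {
    int pairs = 0;
    for (int i = 0; i < kSampleCount; i++) {
        for (int j = i + 1; j < kSampleCount; j++) {
            if (same_length_only &&
                strlen(kSample[i]) != strlen(kSample[j])) continue;
            if (memcmp(chunk(kSample[i], use_last),
                       chunk(kSample[j], use_last), 4) == 0)
                pairs++;
        }
    }
    return pairs;
}
```

On this small sample the first four bytes collide for 2 pairs and the last four for 4 pairs; restricted to equal lengths, that drops to 0 and 1 respectively. The sample is made up of known collisions, so it says nothing about the full command list; the same loop would need to run over the real set to answer the question.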