Optimize run_spin_precession! for GPU #459
Conversation
Very nice results. Should I wait for the oneAPI problem to be solved?
Yeah, the cumsum kernel implementation for oneAPI was wrong since it is no longer being computed in-place, but it should be fixed now. If the oneAPI tests pass, this should be fine to merge, and I can work on run_spin_excitation! in a separate pull request.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #459 +/- ##
==========================================
+ Coverage 90.75% 90.91% +0.15%
==========================================
Files 53 53
Lines 2855 2916 +61
==========================================
+ Hits 2591 2651 +60
- Misses 264 265 +1
KomaMRI Benchmarks
| Benchmark suite | Current: 21e2645 | Previous: 24b21f4 | Ratio |
|---|---|---|---|
| MRI Lab/Bloch/CPU/2 thread(s) | 226903941 ns | 235229966.5 ns | 0.96 |
| MRI Lab/Bloch/CPU/4 thread(s) | 192380031.5 ns | 140495905 ns | 1.37 |
| MRI Lab/Bloch/CPU/8 thread(s) | 91018417 ns | 169591756.5 ns | 0.54 |
| MRI Lab/Bloch/CPU/1 thread(s) | 404831123 ns | 419227547 ns | 0.97 |
| MRI Lab/Bloch/GPU/CUDA | 138891590 ns | 135837984 ns | 1.02 |
| MRI Lab/Bloch/GPU/oneAPI | 13970023686.5 ns | 18356788557 ns | 0.76 |
| MRI Lab/Bloch/GPU/Metal | 3152951458 ns | 2931106125 ns | 1.08 |
| MRI Lab/Bloch/GPU/AMDGPU | 75260880 ns | 1750964243 ns | 0.04 |
| Slice Selection 3D/Bloch/CPU/2 thread(s) | 1170444194 ns | 1174040352 ns | 1.00 |
| Slice Selection 3D/Bloch/CPU/4 thread(s) | 686878305.5 ns | 622515059.5 ns | 1.10 |
| Slice Selection 3D/Bloch/CPU/8 thread(s) | 342777785 ns | 492840880 ns | 0.70 |
| Slice Selection 3D/Bloch/CPU/1 thread(s) | 2229170773 ns | 2264093136 ns | 0.98 |
| Slice Selection 3D/Bloch/GPU/CUDA | 108678209.5 ns | 257306603 ns | 0.42 |
| Slice Selection 3D/Bloch/GPU/oneAPI | 777830059 ns | 1678945735.5 ns | 0.46 |
| Slice Selection 3D/Bloch/GPU/Metal | 760369666 ns | 1129875875 ns | 0.67 |
| Slice Selection 3D/Bloch/GPU/AMDGPU | 63844786.5 ns | 679066674 ns | 0.09 |
This comment was automatically generated by workflow using github-action-benchmark.
Hi, it looks good! Most of the comments are about code organization and the naming of variables/functions.
One "major" change would be to move the "views" from precession and excitation outside, so that the user doesn't have to worry about passing arrays with the correct dimensions.
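As a rough sketch of that suggestion, assuming hypothetical names and a hypothetical call signature (not the actual KomaMRI API): the caller takes the views once per block, so the inner functions always receive arrays of the expected size.

```julia
# Hypothetical illustration (not the actual KomaMRI API): the caller slices the
# magnetization once per simulation block, so run_spin_precession! and
# run_spin_excitation! never have to reason about dimensions themselves.
# `spin_range` and the argument list below are placeholders.
function simulate_block!(M, phantom_part, seq_block, sig_block, spin_range, sim_method, backend)
    M_xy = @view M.xy[spin_range]   # views created here, outside the inner function
    M_z  = @view M.z[spin_range]
    return run_spin_precession!(phantom_part, seq_block, sig_block, M_xy, M_z, sim_method, backend)
end
```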
Great job! :D
I've started working on an optimized implementation of run_spin_precession! for the GPU. With Metal on my own computer, I was able to reduce the time for the precession-heavy 3D Slice benchmark by around 40%. Similar to the optimized CPU implementation, the arrays used inside the function are now pre-allocated. I also moved some of the computations for the sequence block properties to be done on the CPU before the simulation is run, since these are operations for smallish 1D arrays that don't really make sense to do on the GPU.
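For illustration, a minimal sketch of the pre-allocation idea, using a hypothetical helper and placeholder field names rather than the actual KomaMRI internals:

```julia
# Hypothetical sketch of pre-allocating the scratch arrays used by run_spin_precession!.
# `similar` keeps the buffers on whatever device `x` lives on (CPU, CUDA, Metal, ...).
function precession_prealloc(x::AbstractVector{T}, max_block_length::Integer) where {T<:Real}
    Nspins = length(x)
    return (
        Bz  = similar(x, Nspins, max_block_length),              # effective field per spin and time step
        ϕ   = similar(x, Nspins, max_block_length),              # accumulated phase
        Mxy = similar(x, Complex{T}, Nspins, max_block_length),  # transverse magnetization at ADC samples
    )
end

# Inside the simulation loop the buffers are then written in place (broadcasting
# assignments with `.=`, `cumsum!`, etc.) instead of allocating new GPU arrays for
# every sequence block.
```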
Interestingly, this line is now what takes most of the time in run_spin_precession!:
ϕ_ADC = @view ϕ[:,seq_block.ϕ_indices]
I think non-uniform dynamic memory access is difficult for GPUs to deal with. I thought a kernel implementation might fix this, but the kernel I wrote ended up being slower than creating the view above and then doing this computation:
Mxy .= M.xy .* exp.(-seq_block.tp_ADC' ./ p.T2) .* _cis.(ϕ_ADC)
The problem seems to be indexing into GPU arrays dynamically, in this case with the integer array seq_block.ϕ_indices, rather than with a thread id that is known at compile time. This post: https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/#:~:text=Dynamic%20indexing%20with%20Uniform%20Access&text=Logical%20local%20memory%20actually%20resides,array%20reads%20and%20writes%2C%20respectively. was interesting to look at, but its solution of dynamically indexing from shared memory would be very tricky for us to implement: shared memory size per block is limited, and it is not necessarily known beforehand how large a chunk of ϕ would need to be loaded into shared memory to index from. Nevertheless, this could be interesting to look into at a later point.

I'll leave the kernel implementation I abandoned below:
ComputeSignalMagnetizationKernel.pdf
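For context, a rough sketch of the shape such a kernel could take, written here with KernelAbstractions.jl. This is a reconstruction for illustration only (the abandoned kernel itself is in the attached PDF); it mirrors the broadcast expression earlier in this comment, and the names are taken from that snippet.

```julia
using KernelAbstractions

# Illustrative sketch, not the abandoned kernel from the PDF: compute the signal
# magnetization at the ADC samples. The dynamic index ϕ_indices[t] is only known at
# run time, which is what makes this memory-access pattern slow on the GPU.
@kernel function signal_at_adc_kernel!(Mxy, @Const(Mxy0), @Const(ϕ), @Const(ϕ_indices), @Const(tp_ADC), @Const(T2))
    i, t = @index(Global, NTuple)   # spin index, ADC sample index
    @inbounds Mxy[i, t] = Mxy0[i] * exp(-tp_ADC[t] / T2[i]) * cis(ϕ[i, ϕ_indices[t]])
end

# Hypothetical launch (argument names mirror the snippet above):
# backend = get_backend(Mxy)
# signal_at_adc_kernel!(backend)(Mxy, M.xy, ϕ, seq_block.ϕ_indices, seq_block.tp_ADC, p.T2; ndrange=size(Mxy))
# KernelAbstractions.synchronize(backend)
```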
It's possible this could be optimized a bit more, but I do want to start working on run_spin_excitation! soon.