Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize run_spin_precession! for GPU #459

Merged
merged 13 commits into from
Jul 31, 2024
Merged

Optimize run_spin_precession! for GPU #459

merged 13 commits into from
Jul 31, 2024

Conversation

rkierulf
Copy link
Collaborator

I've started working on an optimized implementation of run_spin_precession! for the GPU. With Metal on my own computer, I was able to reduce the time for the precession-heavy 3D Slice benchmark by around 40%. Similar to the optimized CPU implementation, the arrays used inside the function are now pre-allocated. I also moved some of the computations for the sequence block properties to be done on the CPU before the simulation is run, since these are operations for smallish 1D arrays that don't really make sense to do on the GPU.

Interestingly, this line is now what is taking the most of the time for run_spin_precession!:

ϕ_ADC = @view ϕ[:,seq_block.ϕ_indices]

I think non-uniform dynamic memory access is difficult for GPUs to deal with. I thought that maybe a kernel implementation would fix this, but the kernel I wrote ended up being slower compared with creating the view above and then doing this computation:

Mxy .= M.xy .* exp.(-seq_block.tp_ADC' ./ p.T2) .* _cis.(ϕ_ADC)

The problem seems to be indexing into GPU arrays dynamically, so in this case with the array of ints seq_block.ϕ_indices, rather than with a thread id which is known at compile time. This post: https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/#:~:text=Dynamic%20indexing%20with%20Uniform%20Access&text=Logical%20local%20memory%20actually%20resides,array%20reads%20and%20writes%2C%20respectively. was interesting to look at, but their solution of dynamically indexing from shared memory would be very tricky for us to implement, since shared memory size per block is limited and it is not necessarily known beforehand how large a chunk of ϕ would need to be loaded into shared memory to index from. Nevertheless, this could be interesting to look into at a later point. I'll leave my kernel implementation which I abandoned below:

ComputeSignalMagnetizationKernel.pdf

It's possible this could be optimized a bit more, but I do want to start working on run_spin_excitation! soon.

@rkierulf rkierulf requested a review from cncastillo as a code owner July 30, 2024 03:53
@cncastillo
Copy link
Member

Very nice results. Should I wait for the oneAPI problem to be solved?

@rkierulf rkierulf changed the title Optimize run_spin_precession! and run_spin_excitation! for GPU Optimize run_spin_precession! for GPU Jul 30, 2024
@rkierulf
Copy link
Collaborator Author

Yeah, the cumsum kernel implementation for oneAPI was wrong since it is no longer being computed in-place, but it should be fixed now. If the oneAPI tests pass, this should be fine to merge and I can work on run_spin_excitation! in a separate pull request.

Copy link

codecov bot commented Jul 30, 2024

Codecov Report

Attention: Patch coverage is 95.18072% with 4 lines in your changes missing coverage. Please review.

Project coverage is 90.91%. Comparing base (b74a9ff) to head (21e2645).

Files Patch % Lines
KomaMRIBase/src/timing/TrapezoidalIntegration.jl 50.00% 1 Missing ⚠️
KomaMRICore/ext/KomaAMDGPUExt.jl 0.00% 1 Missing ⚠️
KomaMRICore/ext/KomaCUDAExt.jl 0.00% 1 Missing ⚠️
KomaMRICore/ext/KomaoneAPIExt.jl 93.33% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #459      +/-   ##
==========================================
+ Coverage   90.75%   90.91%   +0.15%     
==========================================
  Files          53       53              
  Lines        2855     2916      +61     
==========================================
+ Hits         2591     2651      +60     
- Misses        264      265       +1     
Flag Coverage Δ
base 88.20% <50.00%> (ø)
core 92.55% <96.29%> (+0.76%) ⬆️
files 93.55% <ø> (ø)
komamri 93.98% <ø> (ø)
plots 89.30% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
KomaMRIBase/src/KomaMRIBase.jl 100.00% <ø> (ø)
KomaMRICore/src/simulation/GPUFunctions.jl 62.16% <ø> (ø)
...RICore/src/simulation/SimMethods/Bloch/BlochCPU.jl 100.00% <100.00%> (ø)
...RICore/src/simulation/SimMethods/Bloch/BlochGPU.jl 100.00% <100.00%> (ø)
...e/src/simulation/SimMethods/BlochDict/BlochDict.jl 87.50% <100.00%> (ø)
...c/simulation/SimMethods/BlochSimple/BlochSimple.jl 100.00% <100.00%> (ø)
...Core/src/simulation/SimMethods/SimulationMethod.jl 100.00% <100.00%> (ø)
KomaMRICore/src/simulation/SimulatorCore.jl 94.83% <100.00%> (+0.06%) ⬆️
KomaMRIBase/src/timing/TrapezoidalIntegration.jl 75.00% <50.00%> (ø)
KomaMRICore/ext/KomaAMDGPUExt.jl 72.72% <0.00%> (-7.28%) ⬇️
... and 2 more

... and 1 file with indirect coverage changes

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KomaMRI Benchmarks

Benchmark suite Current: 21e2645 Previous: 24b21f4 Ratio
MRI Lab/Bloch/CPU/2 thread(s) 226903941 ns 235229966.5 ns 0.96
MRI Lab/Bloch/CPU/4 thread(s) 192380031.5 ns 140495905 ns 1.37
MRI Lab/Bloch/CPU/8 thread(s) 91018417 ns 169591756.5 ns 0.54
MRI Lab/Bloch/CPU/1 thread(s) 404831123 ns 419227547 ns 0.97
MRI Lab/Bloch/GPU/CUDA 138891590 ns 135837984 ns 1.02
MRI Lab/Bloch/GPU/oneAPI 13970023686.5 ns 18356788557 ns 0.76
MRI Lab/Bloch/GPU/Metal 3152951458 ns 2931106125 ns 1.08
MRI Lab/Bloch/GPU/AMDGPU 75260880 ns 1750964243 ns 0.042982533938587114
Slice Selection 3D/Bloch/CPU/2 thread(s) 1170444194 ns 1174040352 ns 1.00
Slice Selection 3D/Bloch/CPU/4 thread(s) 686878305.5 ns 622515059.5 ns 1.10
Slice Selection 3D/Bloch/CPU/8 thread(s) 342777785 ns 492840880 ns 0.70
Slice Selection 3D/Bloch/CPU/1 thread(s) 2229170773 ns 2264093136 ns 0.98
Slice Selection 3D/Bloch/GPU/CUDA 108678209.5 ns 257306603 ns 0.42
Slice Selection 3D/Bloch/GPU/oneAPI 777830059 ns 1678945735.5 ns 0.46
Slice Selection 3D/Bloch/GPU/Metal 760369666 ns 1129875875 ns 0.67
Slice Selection 3D/Bloch/GPU/AMDGPU 63844786.5 ns 679066674 ns 0.09401843580973611

This comment was automatically generated by workflow using github-action-benchmark.

Copy link
Member

@cncastillo cncastillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, it looks good! Most of the comments are about code organization and naming of variable/functions.

One "major" change would be too move the "views" from precession and excitation outside, to not bother the user with having the correct dimensions.

KomaMRIBase/src/timing/TrapezoidalIntegration.jl Outdated Show resolved Hide resolved
KomaMRICore/src/simulation/SimMethods/Bloch/BlochCPU.jl Outdated Show resolved Hide resolved
KomaMRICore/src/simulation/SimMethods/Bloch/BlochGPU.jl Outdated Show resolved Hide resolved
KomaMRICore/src/simulation/SimMethods/Bloch/BlochGPU.jl Outdated Show resolved Hide resolved
KomaMRICore/src/simulation/SimMethods/Bloch/BlochGPU.jl Outdated Show resolved Hide resolved
KomaMRICore/src/simulation/SimMethods/Bloch/BlochGPU.jl Outdated Show resolved Hide resolved
KomaMRICore/src/simulation/SimMethods/Bloch/BlochGPU.jl Outdated Show resolved Hide resolved
KomaMRIBase/src/timing/TrapezoidalIntegration.jl Outdated Show resolved Hide resolved
benchmarks/runbenchmarks.jl Show resolved Hide resolved
@rkierulf rkierulf requested a review from cncastillo July 31, 2024 18:18
Copy link
Member

@cncastillo cncastillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! :D

@rkierulf rkierulf merged commit 1457a4c into master Jul 31, 2024
19 checks passed
@rkierulf rkierulf deleted the bloch-gpu branch July 31, 2024 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants