Optimize run_spin_precession! for GPU #459
Conversation
Very nice results. Should I wait for the oneAPI problem to be solved?
Yeah, the cumsum kernel implementation for oneAPI was wrong since it is no longer being computed in-place, but it should be fixed now. If the oneAPI tests pass, this should be fine to merge, and I can work on run_spin_excitation! in a separate pull request.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #459 +/- ##
==========================================
+ Coverage 90.75% 90.91% +0.15%
==========================================
Files 53 53
Lines 2855 2916 +61
==========================================
+ Hits 2591 2651 +60
- Misses 264 265 +1
KomaMRI Benchmarks
| Benchmark suite | Current: 21e2645 | Previous: 24b21f4 | Ratio |
|---|---|---|---|
| MRI Lab/Bloch/CPU/2 thread(s) | 226903941 ns | 235229966.5 ns | 0.96 |
| MRI Lab/Bloch/CPU/4 thread(s) | 192380031.5 ns | 140495905 ns | 1.37 |
| MRI Lab/Bloch/CPU/8 thread(s) | 91018417 ns | 169591756.5 ns | 0.54 |
| MRI Lab/Bloch/CPU/1 thread(s) | 404831123 ns | 419227547 ns | 0.97 |
| MRI Lab/Bloch/GPU/CUDA | 138891590 ns | 135837984 ns | 1.02 |
| MRI Lab/Bloch/GPU/oneAPI | 13970023686.5 ns | 18356788557 ns | 0.76 |
| MRI Lab/Bloch/GPU/Metal | 3152951458 ns | 2931106125 ns | 1.08 |
| MRI Lab/Bloch/GPU/AMDGPU | 75260880 ns | 1750964243 ns | 0.04 |
| Slice Selection 3D/Bloch/CPU/2 thread(s) | 1170444194 ns | 1174040352 ns | 1.00 |
| Slice Selection 3D/Bloch/CPU/4 thread(s) | 686878305.5 ns | 622515059.5 ns | 1.10 |
| Slice Selection 3D/Bloch/CPU/8 thread(s) | 342777785 ns | 492840880 ns | 0.70 |
| Slice Selection 3D/Bloch/CPU/1 thread(s) | 2229170773 ns | 2264093136 ns | 0.98 |
| Slice Selection 3D/Bloch/GPU/CUDA | 108678209.5 ns | 257306603 ns | 0.42 |
| Slice Selection 3D/Bloch/GPU/oneAPI | 777830059 ns | 1678945735.5 ns | 0.46 |
| Slice Selection 3D/Bloch/GPU/Metal | 760369666 ns | 1129875875 ns | 0.67 |
| Slice Selection 3D/Bloch/GPU/AMDGPU | 63844786.5 ns | 679066674 ns | 0.09 |
This comment was automatically generated by workflow using github-action-benchmark.
Hi, it looks good! Most of the comments are about code organization and the naming of variables/functions.
One "major" change would be to move the "views" from precession and excitation outside, so that the user doesn't have to worry about passing arrays with the correct dimensions.
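As a rough sketch of that suggestion, assuming hypothetical names and a hypothetical call signature (not the actual KomaMRI API): the caller takes the views once per block, so the inner functions always receive arrays of the expected size.

```julia
# Hypothetical illustration (not the actual KomaMRI API): the caller slices the
# magnetization once per simulation block, so run_spin_precession! and
# run_spin_excitation! never have to reason about dimensions themselves.
# `spin_range` and the argument list below are placeholders.
function simulate_block!(M, phantom_part, seq_block, sig_block, spin_range, sim_method, backend)
    M_xy = @view M.xy[spin_range]   # views created here, outside the inner function
    M_z  = @view M.z[spin_range]
    return run_spin_precession!(phantom_part, seq_block, sig_block, M_xy, M_z, sim_method, backend)
end
```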
Great job! :D
I've started working on an optimized implementation of run_spin_precession! for the GPU. With Metal on my own computer, I was able to reduce the time for the precession-heavy 3D Slice benchmark by around 40%. Similar to the optimized CPU implementation, the arrays used inside the function are now pre-allocated. I also moved some of the computations for the sequence block properties to be done on the CPU before the simulation is run, since these are operations for smallish 1D arrays that don't really make sense to do on the GPU.
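For illustration, a minimal sketch of the pre-allocation idea, using a hypothetical helper and placeholder field names rather than the actual KomaMRI internals:

```julia
# Hypothetical sketch of pre-allocating the scratch arrays used by run_spin_precession!.
# `similar` keeps the buffers on whatever device `x` lives on (CPU, CUDA, Metal, ...).
function precession_prealloc(x::AbstractVector{T}, max_block_length::Integer) where {T<:Real}
    Nspins = length(x)
    return (
        Bz  = similar(x, Nspins, max_block_length),              # effective field per spin and time step
        ϕ   = similar(x, Nspins, max_block_length),              # accumulated phase
        Mxy = similar(x, Complex{T}, Nspins, max_block_length),  # transverse magnetization at ADC samples
    )
end

# Inside the simulation loop the buffers are then written in place (broadcasting
# assignments with `.=`, `cumsum!`, etc.) instead of allocating new GPU arrays for
# every sequence block.
```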
Interestingly, this line is now what takes most of the time in run_spin_precession!:
ϕ_ADC = @view ϕ[:,seq_block.ϕ_indices]
I think non-uniform dynamic memory access is difficult for GPUs to deal with. I thought a kernel implementation might fix this, but the kernel I wrote ended up being slower than creating the view above and then doing this computation:
Mxy .= M.xy .* exp.(-seq_block.tp_ADC' ./ p.T2) .* _cis.(ϕ_ADC)
The problem seems to be indexing into GPU arrays dynamically, in this case with the integer array seq_block.ϕ_indices, rather than with a thread id that is known at compile time. This post: https://developer.nvidia.com/blog/fast-dynamic-indexing-private-arrays-cuda/#:~:text=Dynamic%20indexing%20with%20Uniform%20Access&text=Logical%20local%20memory%20actually%20resides,array%20reads%20and%20writes%2C%20respectively. was interesting to look at, but its solution of dynamically indexing from shared memory would be very tricky for us to implement: shared memory size per block is limited, and it is not necessarily known beforehand how large a chunk of ϕ would need to be loaded into shared memory to index from. Nevertheless, this could be interesting to look into at a later point.

I'll leave the kernel implementation I abandoned below:
ComputeSignalMagnetizationKernel.pdf
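For context, a rough sketch of the shape such a kernel could take, written here with KernelAbstractions.jl. This is a reconstruction for illustration only (the abandoned kernel itself is in the attached PDF); it mirrors the broadcast expression earlier in this comment, and the names are taken from that snippet.

```julia
using KernelAbstractions

# Illustrative sketch, not the abandoned kernel from the PDF: compute the signal
# magnetization at the ADC samples. The dynamic index ϕ_indices[t] is only known at
# run time, which is what makes this memory-access pattern slow on the GPU.
@kernel function signal_at_adc_kernel!(Mxy, @Const(Mxy0), @Const(ϕ), @Const(ϕ_indices), @Const(tp_ADC), @Const(T2))
    i, t = @index(Global, NTuple)   # spin index, ADC sample index
    @inbounds Mxy[i, t] = Mxy0[i] * exp(-tp_ADC[t] / T2[i]) * cis(ϕ[i, ϕ_indices[t]])
end

# Hypothetical launch (argument names mirror the snippet above):
# backend = get_backend(Mxy)
# signal_at_adc_kernel!(backend)(Mxy, M.xy, ϕ, seq_block.ϕ_indices, seq_block.tp_ADC, p.T2; ndrange=size(Mxy))
# KernelAbstractions.synchronize(backend)
```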
It's possible this could be optimized a bit more, but I do want to start working on run_spin_excitation! soon.