-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A specialized Winograd Conv2d op #971
base: master
Are you sure you want to change the base?
Conversation
… if not satisfied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is based on openCNN project which uses Apache-2.0 license.
Did you get permission from the authors to re-license their code as MIT?
src/ggml-cuda/conv-winograd.cu
Outdated
typedef float(*pointFunction_t)(float *, int); | ||
|
||
template<typename T> | ||
__global__ void FX(const T *pInputs, float *pOutputs, int filt_k, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__global__ void FX(const T *pInputs, float *pOutputs, int filt_k, | |
__global__ void FX(const T * __restrict__ pInputs, float * __restrict__ pOutputs, int filt_k, |
On Pascal this can be a 5x speedup.
src/ggml-cuda/conv-winograd.cu
Outdated
|
||
} | ||
|
||
__device__ __forceinline__ void prefetch_filter_tile(const float *pInputs, float *tiles, int filt_k){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compiler will rearrange these instructions as it sees fit so there will in effect not be any actual prefetching. For that you need to use asnychronous memcpys (Ampere or newer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure here, as this is done by openCNN.
__constant__ int access_f_s[2][32]; | ||
__constant__ int access_s[2][32]; | ||
__constant__ int tileid[2][32]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens in the case of multiple GPUs? Is the constant memory duplicated across GPUs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty ignorant about multi-gpu. I guess they will be duplicated. I don't have a setup to test. Plus, this kernel only works for single GPU, I think.
static void ggml_compute_forward_winograd_stage0( | ||
const struct ggml_compute_params * params, | ||
struct ggml_tensor * dst) { | ||
|
||
GGML_ASSERT(false && " CPU backend not implemented!"); | ||
return; | ||
} | ||
|
||
static void ggml_compute_forward_winograd_stage1( | ||
const struct ggml_compute_params * params, | ||
struct ggml_tensor * dst) { | ||
|
||
GGML_ASSERT(false && " CPU backend not implemented!"); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If at all possible a CPU implementation should always be done since it serves both as a fallback and as a reference implementation to test other backends against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A CPU backend should be done, but I am not sure the benefit of it compared to the current im2col+gemm version.
bool is_node = false; | ||
|
||
if (a->grad) { | ||
is_node = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If #966 is merged first this will need to be removed (should be very straightforward).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look forward to it...
Have you done any tests regarding performance? This code does not use tensor cores at all so intuitively I would expect it to be slower than im2col + GEMM with tensor cores. |
Thank you for your review, @JohannesGaessler. I leaned a lot from your PRs and comments. First, I have asked openCNN's author for license issue. As to performance, I only tested it in SD.cpp as it is developed for it. It is not faster (surprised) than im2col+GEMM with tensor cores (my gpu has them so assuming being used) but definitely not slower. It reduces memory used by VAE quite a lot while increasing UNET param buffer. There is room to further improve its performance as I see several places are not working in an optimal way. I'll add test cases in test-backend-ops to more rigorously measure performance. I addressed your other comments above. |
Have you tried NPP? It is a library bundled with the CUDA toolkit that has all kinds of kernels for image processing. I don't think this can be merged unless the license situation is resolved. |
Generally speaking my stance regarding this PR would be as follows: I think it's good to have convolution operations instead of having to rely on And as slaren said, the licensing issue must be resolved or this is a total non-starter anyways.
From what I can tell, there is convolution support. |
I'm already tired so maybe I'm just misreading the docs, but I get the impression that NPP convolutions only support 1-4 input channels. |
Thanks to both of you for reviewing. I am not familiar with the license. In case it is not resolvable, I'll ditch this PR. |
Also one important question that I forgot to ask: are you going to be available long-term to maintain this code? |
If this PR makes into the main, I intend to maintain it long term and improve its performance. |
I am getting some puzzling benchmark results. @JohannesGaessler or @slaren, could you help? I added a conv2d and a winograd case in test-backend-ops, using its eval_perf function to gauge the performance. Winograd is consistently 10x slower than conv2d(IM2COL++GEMM). However, the same winograd kernel in stable-diffusion improves the running time by 7-8% for UNET and almost 50% for VAE. So I am confused. What's even more interesting is the same input size for IM2COL itself runs slower than conv2d which has a GEMM after IM2COL. Maybe the way I tested is not proper for conv2d since it is a composite operator unlike others? |
Yes, this is the reason. The perf mode works by running multiple times the last node of the graph only, everything before that is assumed to be setup for the op and only run once. |
Wow, now I see. Turned out that the conv2d case is just testing the cuda_dup op which is the last node. I need to find another way to compare. Thanks. |
@bssrdf just took your sd.cpp fork for a spin, and I like the numbers. with sd2 the unet changes seem to make it use almost double the vram for the model? images where 512x512. |
@Green-Sky, thanks for trying out. Right now, winograd is not ready for SD.cpp due to duplicated UNET and VAE parameter buffer. My winograd implementation has filter weight transform in a separate kernel (inherited from openCNN) so needs a buffer for that. I added filter transform in initialization stage so during unet denoising this step can be skipped (just use the buffer). This is one of the benefits of using winograd (a small time saving though). However, the original/untransformed filter weight still occupies VRAM as they are allocated earlier and there is no mechanism in GGML to release them. I have some ideas to fix this so no VRAM duplication happens; but still the parameter buffers for UNET and VAE will increase in size because the transformed weights are In the meantime, I am adding tensor core support to make it even faster. I think it could be faster by at least 50% once the tensor cores are used. |
Hello, good luck trying to use the tensor cores to get Winograd working; I think it’s possible. I saw an implementation that used matrix multiplication in Tinygrad. I found a Winograd implementation that omitted redundant calculations, but for that same reason, it wasn't possible to use the tensor cores. In the end, I couldn't improve the performance. After all, I decided to better merge im2col and GEMM (which reduces memory consumption by 50%, but is 40% slower in the VAE phase, which is primarily the one that uses the most memory). |
Thanks for the comments. Right now, using tensor cores, my winograd kernel outperforms cudnn's winograd (not using tensor cores) by 15% for certain inputs (large input size and number of channels and filters). Well, dealing with the bank conflicts are real pain and I have to program in PTX 😄. For such large inputs, the king is cudnn's IMPLICIT_PRECOMP_GEMM using tensor cores; it is 50% faster than my winograd. I hope to get more performance boost once the final "Uncoalesced Shared Accesses" issue is resolved. |
Be aware that if you are using the Edit: judging by the edit to your previous comment you are already using PTX. |
For avoiding shared memory bank conflicts with tensor cores, consider padding the data. For the int8 tensor cores I found that a padding of 16 bytes between rows/columns avoids shared memory bank conflicts when loading data from shared memory for use with tensor cores. |
Thanks for the tips, @JohannesGaessler. Yes, I used padding in several places and they really helped with resolving conflicts. I also played around with swizzling but in the end couldn't get it working for the smem layout. |
Also take a look at the ldmatrix instruction. While it doesn't reduce shared memory bank conflicts it does help by issuing fewer instructions which helped with one of the throttle reasons (I forgot which one). But I didn't find it to be universally better than just loading the data as 4 bit values from shared memory. |
I see people used |
@JohannesGaessler I know it's out of context, but I'm compiling the latest version of stable diffusion.cpp and now it's taking more than 25 minutes to compile the CUDA code. Before (four months ago, I was busy trying to figure out what to do with my life), it took at most 5 minutes or less, and binaries now are more bigger. |
That's probably due to the MMQ changes. For development builds you can edit |
cee185e
to
864e4a7
Compare
I have come out a more sensible way for measuring the performance. Here are the results running on 4090 :
Note winograd here is not using tensor cores. |
I suppose that by im2col you mean the entire conv2d process (im2col and gemm), and what Winograd does is the complete convolution. |
Yes, I have updated the table header to clarify. |
updates on this? been testing this implementation SealAILab/stable-diffusion-cpp and it works really well. |
This PR added a new conv2d op using Winograd algorithm.
Currently ggml's conv2d operator uses im2col and GEMM. There have been efforts to speed up this process using other faster algorithms. Winograd is such a method used by many neural network libraries, e.g.
Cudnn
. For small kernels, e.g. 3x3, Winograd outperforms GEMM based methods. However, efficient implementation of Winograd on GPUs requires significant engineering efforts. This PR 's Winograd implementation specializes in several ways:Other features:
It is mainly used for applications such as stable-diffusion.cpp.
The code is based on openCNN project which uses Apache-2.0 license.
Please review and let me know any problems I'll address. Thanks.