CUDA: add conv2d #15635
Conversation
* CUDA: add conv2d
* CUDA: conv2d - correct formatting and added const
Just ran a test using sd.cpp, and for VAE, this is ~25 times slower than the im2col+mat_mul version. Some numbers from the same device:

So there is still a lot of room for improvement. (:

edit: also this code works, as far as I can tell :)
Yeah, it's definitely slower. I wasn't sure if I could actually use a memory buffer for this, or how big of one would be okay, so I just went with this approach. If it's fine to use something like the same buffer used in the CPU conv2d pull request, then I can switch over and do the im2col + mat_mul version. I'd reuse the existing im2col kernel and call cuBLAS for the mat_mul. @Green-Sky
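For reference, here is a minimal sketch of what that composition could look like, assuming FP32 data, CHW layout, and a standalone im2col kernel (all names and the kernel itself are illustrative, not this PR's or ggml's actual code):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical im2col: unrolls each KHxKW receptive field of every input
// channel into one row of `col`, so that conv2d becomes a single GEMM.
// x: [C_in, H, W] row-major, col: [C_in*KH*KW, OH*OW] row-major.
__global__ void im2col_kernel(const float * x, float * col,
                              int C_in, int H, int W,
                              int KH, int KW, int OH, int OW,
                              int stride, int pad) {
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx >= C_in*KH*KW*OH*OW) return;

    const int ow = idx % OW;
    const int oh = (idx / OW) % OH;
    const int kw = (idx / (OW*OH)) % KW;
    const int kh = (idx / (OW*OH*KW)) % KH;
    const int c  =  idx / (OW*OH*KW*KH);

    const int ih = oh*stride + kh - pad;
    const int iw = ow*stride + kw - pad;

    const bool in_bounds = ih >= 0 && ih < H && iw >= 0 && iw < W;
    col[((c*KH + kh)*KW + kw)*(OH*OW) + oh*OW + ow] =
        in_bounds ? x[(c*H + ih)*W + iw] : 0.0f;
}

// conv2d as Y[C_out, OH*OW] = K[C_out, C_in*KH*KW] * col[C_in*KH*KW, OH*OW].
void conv2d_im2col_gemm(cublasHandle_t handle,
                        const float * d_x, const float * d_kernel,
                        float * d_col, float * d_y,
                        int C_in, int C_out, int H, int W,
                        int KH, int KW, int OH, int OW,
                        int stride, int pad) {
    const int n = C_in*KH*KW*OH*OW;
    im2col_kernel<<<(n + 255)/256, 256>>>(d_x, d_col, C_in, H, W,
                                          KH, KW, OH, OW, stride, pad);

    // cuBLAS is column-major: computing col^T * K^T yields Y^T in
    // column-major, which is exactly Y in row-major.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                OH*OW, C_out, C_in*KH*KW, &alpha,
                d_col, OH*OW, d_kernel, C_in*KH*KW,
                &beta, d_y, OH*OW);
}
```

The cost of this approach is the `d_col` scratch buffer of `C_in*KH*KW*OH*OW` floats, which is what the memory-buffer question above is about.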
Yeah, doing "a tiled im2col + GEMM approach" similar to the CPU implementation should work.
working on it
You can formulate a convolution as a matrix multiplication more generally. For optimal performance (for large input tensors), what would need to be done is load the data into shared memory, then load it into registers and use tensor cores. IIRC you need a minimum number of channels to fully utilize tensor cores, so I think it will also be necessary to write variants with different memory organization patterns.
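For illustration, here is a minimal sketch of that shared memory -> registers -> tensor cores pipeline for a single 16x16x16 WMMA tile (the kernel name and the trivial tile staging are assumptions; a real conv2d kernel would tile over the whole im2col matrix and stage many tiles per block):

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Multiplies one 16x16 FP16 tile of A by one 16x16 FP16 tile of B into a
// 16x16 FP32 tile of C. Requires sm_70+; launch with one warp, e.g. <<<1, 32>>>.
__global__ void wmma_tile_demo(const half * A, const half * B, float * C) {
    // Stage the operand tiles in shared memory (in a real kernel this
    // would be a coalesced, strided copy of the im2col'd input).
    __shared__ half a_sh[16*16];
    __shared__ half b_sh[16*16];
    for (int i = threadIdx.x; i < 16*16; i += blockDim.x) {
        a_sh[i] = A[i];
        b_sh[i] = B[i];
    }
    __syncthreads();

    // Load the tiles from shared memory into per-thread register
    // fragments, then issue one tensor core matrix-multiply-accumulate.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a_sh, 16);
    wmma::load_matrix_sync(b_frag, b_sh, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

This is also where the minimum-channel point comes in: the K dimension of an FP16 WMMA tile is 16, so with only a few input channels much of each fragment is zero padding unless the data layout packs several kernel taps into the K dimension.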
Can't we just use padding to align the channel dimension with the tensor core requirements?
You can, but for e.g. an RGB image with 3 channels you would be wasting at least 5/8 of the compute: padding the channel dimension from 3 up to 8 means 5 of every 8 multiply-accumulates are spent on zeros.
According to the cuDNN documentation, the number of channels must be a multiple of 8 to use tensor cores, so they apply padding under the hood. I think there may be some scenarios where it's still worth taking the tensor core path even at half speed, rather than using regular cores.
Using tensor cores with padding will be faster than not using them, but using tensor cores with a memory access pattern that gets higher utilization for fewer than 8 channels will likely be even better.
Hi, thanks for your contribution. Have you checked the Vulkan CONV_2D implementation? I compared the perf of this code and it is 8 to 10 times slower than the Vulkan impl on my RTX 2060 device, so it heavily underutilizes the GPU for some reason. Is there any advantage of this impl over the Vulkan kernel? If not, it might be better to translate the Vulkan kernel or fix the issues with this kernel. I've already done this, but did not commit because I could not reach the perf of cuBLAS yet...

Vulkan:

CUDA (0320ac5):
No need to explicitly pad because the block size is aligned. |
Part of #14909