Cuda acceleration #29
base: ros2
Conversation
Did this ever happen? Fixed with the minor fix commit below.
@hmoyen With the build error above fixed, I have a question. I remember that the user needs to clone
@hmoyen When I tried, the sonar image was not blazing. I've modified the random noise to be calculated inside the CUDA code without using
Looks great!
About the
I will add it to the wiki.
@woensug-choi I've changed the cuRAND state type to produce 4 random numbers per call (we use 2, but it seems to accelerate the kernel a bit).
Hi @hmoyen, a fresh installation and build of our project results in the error:
This can simply be resolved by:
I think we must add this to the Dockerfile and the installation script, or add it to the documentation itself.
@hmoyen, I have tested the rest of the plugin, and it seems to be working fine 🥳🎉
This PR aims to improve the CUDA code used for the sonar calculations, focusing on reducing execution time. The following changes were implemented:
1. Optimized Matrix Multiplication
Previously, the matrix multiplication was performed using a custom kernel.
This was replaced with cublasSgemm.
The speedup is most noticeable for larger matrices (e.g., higher numbers of beams).
See commit: add sgemm acceleration
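When swapping a custom kernel for cublasSgemm, a small host-side reference can help confirm that both paths produce the same matrix. The sketch below is plain C++ and is not taken from the PR: the name sgemmReference is hypothetical, and it assumes the column-major, no-transpose case (CUBLAS_OP_N for both operands), where the operation is C = alpha * A * B + beta * C.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for checking a GEMM result.
// Computes C = alpha * A * B + beta * C with column-major storage and
// no transposes. A is m x k (leading dimension m), B is k x n
// (leading dimension k), C is m x n (leading dimension m).
void sgemmReference(int m, int n, int k, float alpha,
                    const std::vector<float>& A,
                    const std::vector<float>& B,
                    float beta, std::vector<float>& C)
{
    assert(A.size() == static_cast<std::size_t>(m) * k);
    assert(B.size() == static_cast<std::size_t>(k) * n);
    assert(C.size() == static_cast<std::size_t>(m) * n);

    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                // Column-major indexing: element (r, c) lives at r + c * ld.
                acc += A[row + p * m] * B[p + col * k];
            }
            C[row + col * m] = alpha * acc + beta * C[row + col * m];
        }
    }
}
```

Comparing the device output against such a reference on a few small matrices is one way to catch a swapped leading dimension or transpose flag, which are the usual mistakes when moving from a row-major custom kernel to a column-major BLAS call.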
2. Ray Summation Optimization
Original kernel using column-wise reduction and kernel launch (2 x nBeams launches per sonar frame):
New kernel (1 launch per sonar frame):
thrust::complex<float>. memcpy calls.
On the tested GPU (NVIDIA GeForce MX330), this part of the code provided the highest speedup (4–5× faster).
See commit: add new summation kernel
3. Reuse Global Buffers
Another change was reusing constant-size buffers (whose dimensions are known at plugin launch) instead of allocating memory for every frame. Now, the buffers are allocated at launch and freed when the plugin is destroyed.
See commits: reuse buffers and add more global buffers.
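The allocate-once pattern described above can be sketched on the host as follows. This is an illustration only, not the plugin's code: the class name SonarBuffers, its members, and the beam-major layout are assumptions, and std::vector stands in for device memory. In the CUDA version the constructor and destructor would presumably wrap cudaMalloc and cudaFree instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for buffers whose size is known at plugin launch.
class SonarBuffers {
public:
    // Allocation happens once, when the object is created.
    SonarBuffers(std::size_t nBeams, std::size_t nRays)
        : nBeams_(nBeams), nRays_(nRays),
          rayData_(nBeams * nRays, 0.0f),
          beamSums_(nBeams, 0.0f) {}

    // Per-frame work overwrites the existing storage; nothing is allocated.
    // Assumed layout: rayData_[beam * nRays_ + ray].
    void processFrame(const std::vector<float>& frame)
    {
        assert(frame.size() == rayData_.size());
        std::copy(frame.begin(), frame.end(), rayData_.begin());
        for (std::size_t b = 0; b < nBeams_; ++b) {
            float sum = 0.0f;
            for (std::size_t r = 0; r < nRays_; ++r) {
                sum += rayData_[b * nRays_ + r];
            }
            beamSums_[b] = sum;
        }
    }

    const std::vector<float>& beamSums() const { return beamSums_; }
    const float* rayData() const { return rayData_.data(); }

private:
    std::size_t nBeams_;
    std::size_t nRays_;
    std::vector<float> rayData_;   // reused every frame
    std::vector<float> beamSums_;  // reused every frame
};
```

The storage address stays the same across frames, so the per-frame cost is only the copy and the computation; the buffers are released when the object is destroyed.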
4. Replace exp() and use intrinsic functions
One change to the sonar calculation kernel provided a 9× speedup (reducing execution time from 30 ms to 3.15 ms on the GeForce MX330): switching to intrinsic math functions inside the for loop. First, exp() was replaced with __sincosf(). Since exp(i * theta) is equivalent to cos(theta) + i*sin(theta), the new code computes the real and imaginary parts directly with __sincosf, avoiding the complex exponential. Additionally, regular division was replaced with __fdividef.
See commit: change complex exp calculation
See commit: replace functions to use gpu intrinsic ones
How to test it
Follow the demos in the Wiki page: DAVE ROS 2 Multibeam Sonar Plugin Wiki. For example:
Use this branch to run the demos; a .txt file with the debug times will be generated to evaluate performance.
To compare this branch’s performance with the old CUDA code (which generates the same .txt), check out the branch cuda-performance and run the same demos.