Collaborator

@hmoyen hmoyen commented Aug 20, 2025

This PR aims to improve the CUDA code used for the sonar calculations, focusing on reducing execution time. The following changes were implemented:

1. Optimized Matrix Multiplication

Previously, matrix multiplication was performed with a custom kernel:

__global__ void gpu_matrix_mult(float *a, float *b, float *c, int m, int n, int k)

This kernel was replaced with cublasSgemm.
It speeds up the multiplication, and the gain is more significant for larger matrices (e.g., higher numbers of beams).
See commit: add sgemm acceleration
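
For readers unfamiliar with cuBLAS, here is a minimal sketch of what such a replacement can look like. It assumes row-major matrices (A is m x k, B is k x n, C is m x n); the helper names `sgemm_row_major` and `multiply_on_gpu` are hypothetical and are not taken from the commit. Because cuBLAS expects column-major storage, the sketch swaps the operands and computes C^T = B^T * A^T, which has the same memory layout as row-major C.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the PR): C = A * B for row-major device
// matrices, A is m x k, B is k x n, C is m x n.
inline cublasStatus_t sgemm_row_major(cublasHandle_t handle,
                                      const float *d_A, const float *d_B,
                                      float *d_C, int m, int n, int k)
{
  const float alpha = 1.0f, beta = 0.0f;
  // Column-major view: C^T (n x m) = B^T (n x k) * A^T (k x m).
  return cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                     n, m, k,
                     &alpha,
                     d_B, n,   // row-major B read as column-major B^T
                     d_A, k,   // row-major A read as column-major A^T
                     &beta,
                     d_C, n);  // result is row-major C
}

// Hypothetical host wrapper used only to exercise the helper above.
std::vector<float> multiply_on_gpu(const std::vector<float> &A,
                                   const std::vector<float> &B,
                                   int m, int n, int k)
{
  const size_t bytesC = static_cast<size_t>(m) * n * sizeof(float);
  float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
  cudaMalloc(&d_A, A.size() * sizeof(float));
  cudaMalloc(&d_B, B.size() * sizeof(float));
  cudaMalloc(&d_C, bytesC);
  cudaMemcpy(d_A, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  sgemm_row_major(handle, d_A, d_B, d_C, m, n, k);

  std::vector<float> C(static_cast<size_t>(m) * n);
  // Blocking copy on the default stream; it also waits for the GEMM.
  cudaMemcpy(C.data(), d_C, bytesC, cudaMemcpyDeviceToHost);

  cublasDestroy(handle);
  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);
  return C;
}
```

In a per-frame pipeline the handle would normally be created once and reused, since `cublasCreate` is comparatively expensive.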

2. Ray Summation Optimization

Original kernel, a column-wise reduction that required 2 × nBeams kernel launches per sonar frame:

template <typename T>
__global__ void column_sums_reduce(
    const T *__restrict__ in, T *__restrict__ out, size_t width, size_t height)

New kernel (1 launch per sonar frame):

__global__ void reduce_beams_kernel(
    const thrust::complex<float> *__restrict__ d_P_Beams,
    float *d_P_Beams_F_real, float *d_P_Beams_F_imag,
    int nBeams, int nFreq, int nRaysSkipped)
  • Combines the real and imaginary summations in a single kernel (reduce_beams_kernel) using thrust::complex<float>.
  • Accumulates sums in registers and reduces them in shared memory, writing once per beam-frequency pair to global memory.
  • Needs fewer __syncthreads() calls and kernel launches, reducing synchronization and stall latency.
  • Removes multiple memcpy calls.

On the tested GPU (NVIDIA GeForce MX330), this part of the code provided the highest speedup (4–5× faster).

See commit: add new summation kernel
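
The register-then-shared-memory pattern described above can be sketched roughly as follows. This is not the PR's reduce_beams_kernel: the kernel name `reduce_rays_sketch`, the block size `kThreads`, the `nRays` parameter, and the input layout (one contiguous run of rays per beam-frequency pair, one block per pair) are all assumptions made for illustration.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/complex.h>

constexpr int kThreads = 128;  // assumed block size; must be a power of two

// Sketch only: assumes `in` stores nRays contiguous values per
// (beam, frequency) pair and that one block handles one pair.
__global__ void reduce_rays_sketch(const thrust::complex<float> *__restrict__ in,
                                   float *out_real, float *out_imag, int nRays)
{
  __shared__ float s_re[kThreads];
  __shared__ float s_im[kThreads];

  const int pair = blockIdx.x;
  const thrust::complex<float> *row = in + static_cast<size_t>(pair) * nRays;

  // 1) Each thread accumulates a strided slice of the rays in registers.
  float re = 0.0f, im = 0.0f;
  for (int r = threadIdx.x; r < nRays; r += blockDim.x) {
    re += row[r].real();
    im += row[r].imag();
  }
  s_re[threadIdx.x] = re;
  s_im[threadIdx.x] = im;
  __syncthreads();

  // 2) Tree reduction of the per-thread partial sums in shared memory.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) {
      s_re[threadIdx.x] += s_re[threadIdx.x + stride];
      s_im[threadIdx.x] += s_im[threadIdx.x + stride];
    }
    __syncthreads();
  }

  // 3) A single global-memory write per (beam, frequency) pair.
  if (threadIdx.x == 0) {
    out_real[pair] = s_re[0];
    out_imag[pair] = s_im[0];
  }
}

// Hypothetical host wrapper: one launch covers every pair.
void reduce_on_gpu(const std::vector<thrust::complex<float>> &h_in,
                   int nPairs, int nRays,
                   std::vector<float> &re, std::vector<float> &im)
{
  thrust::complex<float> *d_in = nullptr;
  float *d_re = nullptr, *d_im = nullptr;
  const size_t inBytes = h_in.size() * sizeof(thrust::complex<float>);
  cudaMalloc(&d_in, inBytes);
  cudaMalloc(&d_re, nPairs * sizeof(float));
  cudaMalloc(&d_im, nPairs * sizeof(float));
  cudaMemcpy(d_in, h_in.data(), inBytes, cudaMemcpyHostToDevice);

  reduce_rays_sketch<<<nPairs, kThreads>>>(d_in, d_re, d_im, nRays);

  re.resize(nPairs);
  im.resize(nPairs);
  cudaMemcpy(re.data(), d_re, nPairs * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(im.data(), d_im, nPairs * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_re);
  cudaFree(d_im);
}
```

The real and imaginary parts share the same loop and the same synchronization points, which is where the saving over two separate reductions comes from.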

3. Reuse Global Buffers

Another change was reusing constant-size buffers (whose dimensions are known at plugin launch) instead of allocating memory for every frame. Now, the buffers are allocated at launch and freed when the plugin is destroyed.

See commits: reuse buffers and add more global buffers.
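
As an illustration of the allocate-once idea, the sketch below wraps a device buffer in a small RAII class, so `cudaMalloc` runs once at construction and `cudaFree` once at destruction. The class `DeviceBuffer` and the function `upload_frame` are hypothetical names; the PR itself uses global buffers tied to the plugin's lifetime, which may be organized differently.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical RAII holder for a fixed-size device buffer of floats.
class DeviceBuffer {
public:
  explicit DeviceBuffer(size_t count) : count_(count) {
    // Allocate once, e.g. when the plugin is loaded.
    if (cudaMalloc(&data_, count_ * sizeof(float)) != cudaSuccess) {
      data_ = nullptr;
    }
  }
  // Free once, e.g. when the plugin is destroyed.
  ~DeviceBuffer() { cudaFree(data_); }

  // Non-copyable so the allocation is never freed twice.
  DeviceBuffer(const DeviceBuffer &) = delete;
  DeviceBuffer &operator=(const DeviceBuffer &) = delete;

  float *data() const { return data_; }
  size_t size() const { return count_; }

private:
  float *data_ = nullptr;
  size_t count_ = 0;
};

// Hypothetical per-frame step: reuses the existing allocation instead of
// calling cudaMalloc/cudaFree on every frame.
inline cudaError_t upload_frame(DeviceBuffer &buf, const float *h_frame)
{
  return cudaMemcpy(buf.data(), h_frame, buf.size() * sizeof(float),
                    cudaMemcpyHostToDevice);
}
```

This only works for buffers whose size is fixed at launch, which matches the constant-size buffers described above.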

4. Replace exp() and use intrinsic functions

One change to the sonar calculation kernel provided a 9× speedup (reducing execution time from 30 ms to 3.15 ms on the GeForce MX330): switching to intrinsic math functions inside the for loop. First, exp() was replaced with __sincosf(). Since exp(i * theta) is equivalent to cos(theta) + i*sin(theta), the new code computes the real and imaginary parts directly with __sincosf, avoiding the complex exponential. Additionally, regular division was replaced with __fdividef.

See commit: change complex exp calculation
See commit: replace functions to use gpu intrinsic ones
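
To make the substitution concrete, the sketch below computes exp(i * theta) / r both ways in one kernel so the results can be compared. The kernel `phase_kernel`, the wrapper `max_abs_difference`, and the inputs `theta` and `dist` are illustrative assumptions, not the actual expressions from the sonar kernel.

```cuda
#include <algorithm>
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/complex.h>

// Sketch only: evaluates exp(i * theta) / r with the generic complex
// exponential and with the fast intrinsics.
__global__ void phase_kernel(const float *theta, const float *dist,
                             thrust::complex<float> *out_ref,
                             thrust::complex<float> *out_fast, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  // Reference path: complex exponential followed by a regular division.
  out_ref[i] = thrust::exp(thrust::complex<float>(0.0f, theta[i])) / dist[i];

  // Fast path: Euler's formula with __sincosf, division with __fdividef.
  float s, c;
  __sincosf(theta[i], &s, &c);
  const float inv = __fdividef(1.0f, dist[i]);
  out_fast[i] = thrust::complex<float>(c * inv, s * inv);
}

// Hypothetical host wrapper: returns the largest |ref - fast| over all inputs.
float max_abs_difference(const std::vector<float> &theta,
                         const std::vector<float> &dist)
{
  const int n = static_cast<int>(theta.size());
  float *d_theta = nullptr, *d_dist = nullptr;
  thrust::complex<float> *d_ref = nullptr, *d_fast = nullptr;
  cudaMalloc(&d_theta, n * sizeof(float));
  cudaMalloc(&d_dist, n * sizeof(float));
  cudaMalloc(&d_ref, n * sizeof(thrust::complex<float>));
  cudaMalloc(&d_fast, n * sizeof(thrust::complex<float>));
  cudaMemcpy(d_theta, theta.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_dist, dist.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  phase_kernel<<<(n + 127) / 128, 128>>>(d_theta, d_dist, d_ref, d_fast, n);

  std::vector<thrust::complex<float>> ref(n), fast(n);
  cudaMemcpy(ref.data(), d_ref, n * sizeof(thrust::complex<float>),
             cudaMemcpyDeviceToHost);
  cudaMemcpy(fast.data(), d_fast, n * sizeof(thrust::complex<float>),
             cudaMemcpyDeviceToHost);
  cudaFree(d_theta);
  cudaFree(d_dist);
  cudaFree(d_ref);
  cudaFree(d_fast);

  float worst = 0.0f;
  for (int i = 0; i < n; ++i) {
    worst = std::max(worst, thrust::abs(ref[i] - fast[i]));
  }
  return worst;
}
```

The intrinsics trade accuracy for speed, and their error grows for arguments of large magnitude, so a comparison like this is worth running over the range of phases the simulation actually produces.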

How to test it

Follow the demos in the Wiki page: DAVE ROS 2 Multibeam Sonar Plugin Wiki. For example:

ros2 launch dave_multibeam_sonar_demo multibeam_sonar_demo.launch.py

Use this branch to run the demos; a .txt file with the debug times will be generated to evaluate performance.

To compare this branch’s performance with the old CUDA code (which generates the same .txt), check out the branch cuda-performance and run the same demos.

@woensug-choi

woensug-choi commented Aug 25, 2025

--- stderr: multibeam_sonar_system                                                                     
In file included from /home/ioes/dave_ws/src/dave/gazebo/dave_gz_multibeam_sonar/multibeam_sonar_system/MultibeamSonarSystem.cc:54:
/home/ioes/dave_ws/src/dave/gazebo/dave_gz_multibeam_sonar/multibeam_sonar_system/./../multibeam_sonar/MultibeamSonarSensor.hh:20:10: fatal error: marine_acoustic_msgs/msg/projected_sonar_image.hpp: No such file or directory
   20 | #include <marine_acoustic_msgs/msg/projected_sonar_image.hpp>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
gmake[2]: *** [CMakeFiles/multibeam_sonar_system.dir/build.make:76: CMakeFiles/multibeam_sonar_system.dir/MultibeamSonarSystem.cc.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:154: CMakeFiles/multibeam_sonar_system.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2
---
Failed   <<< multibeam_sonar_system [3.37s, exited with code 2]
Aborted  <<< marine_sensor_msgs [3.91s]                                          
Aborted  <<< dave_interfaces [4.56s]                             

Summary: 9 packages finished [4.99s]
  1 package failed: multibeam_sonar_system
  2 packages aborted: dave_interfaces marine_sensor_msgs
  1 package had stderr output: multibeam_sonar_system
  5 packages not processed

Did this ever happen?

Fixed with the minor fix commit below.

@woensug-choi

@hmoyen After the build error fix above, I have a question.

I remember that users previously needed to clone marine_msgs into /src to build marine_acoustic_msgs. It seems that is not required anymore? Also, some packages that need to be installed with apt (sudo apt install libpcl-dev ros-jazzy-pcl-ros libpcap-dev) are not on the wiki page.

@woensug-choi

@hmoyen When I tried it, the sonar image was not blazing. I've modified the random noise so that it is calculated inside the CUDA code without using rand_image. I also added handling for <blazingSonarImage>, which can be set to true or false to toggle the blazing image noise.

@hmoyen
Collaborator Author

hmoyen commented Aug 25, 2025

> @hmoyen When I tried it, the sonar image was not blazing. I've modified the random noise so that it is calculated inside the CUDA code without using rand_image. I also added handling for <blazingSonarImage>, which can be set to true or false to toggle the blazing image noise.

Looks great!

@hmoyen
Collaborator Author

hmoyen commented Aug 25, 2025

> @hmoyen After the build error fix above, I have a question.
>
> I remember that users previously needed to clone marine_msgs into /src to build marine_acoustic_msgs. It seems that is not required anymore? Also, some packages that need to be installed with apt (sudo apt install libpcl-dev ros-jazzy-pcl-ros libpcap-dev) are not on the wiki page.

As for marine_acoustic_msgs, I installed it with:

sudo apt install ros-jazzy-marine-acoustic-msgs

I will add this to the wiki.
I just noticed that I kept the PCL headers, but we don't actually use this package anymore.

@hmoyen
Collaborator Author

hmoyen commented Aug 25, 2025

@woensug-choi I've changed the cuRAND state type to one that produces 4 random numbers per call (we only use 2, but it seems to accelerate the kernel a bit).
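
For context, one cuRAND state type that returns four values per call is curandStatePhilox4_32_10_t. The comment above does not name the type or the distribution, so the choice of Philox, the use of curand_uniform4, and the names `noise_kernel` and `generate_noise` in this sketch are assumptions.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Sketch only: each thread draws four uniform values in (0, 1] with a single
// call and keeps two of them.
__global__ void noise_kernel(float2 *out, int n, unsigned long long seed)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  curandStatePhilox4_32_10_t state;
  // Same seed, one subsequence per thread, no offset.
  curand_init(seed, static_cast<unsigned long long>(i), 0ULL, &state);

  const float4 r = curand_uniform4(&state);  // four numbers per call
  out[i] = make_float2(r.x, r.y);            // only two are used
}

// Hypothetical host wrapper that returns the generated pairs.
std::vector<float2> generate_noise(int n, unsigned long long seed)
{
  float2 *d_out = nullptr;
  cudaMalloc(&d_out, n * sizeof(float2));
  noise_kernel<<<(n + 127) / 128, 128>>>(d_out, n, seed);

  std::vector<float2> h_out(n);
  cudaMemcpy(h_out.data(), d_out, n * sizeof(float2), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  return h_out;
}
```

Initializing the state inside the kernel keeps the sketch short; a real plugin might instead keep per-thread states in a reused buffer.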

@GauravKumar9920
Member

Hi @hmoyen, a fresh installation and build of our project results in this error:

--- stderr: multibeam_sonar_system
In file included from /root/HOST/dave_ws/src/dave/gazebo/dave_gz_multibeam_sonar/multibeam_sonar/MultibeamSonarSensor.cc:49:
/root/HOST/dave_ws/src/dave/gazebo/dave_gz_multibeam_sonar/multibeam_sonar/sonar_calculation_cuda.cuh:20:10: fatal error: thrust/complex.h: No such file or directory
   20 | #include <thrust/complex.h>
      |          ^~~~~~~~~~~~~~~~~~
compilation terminated.

This can be resolved simply with:
sudo apt-get install -y libthrust-dev

I think we should add this to the Dockerfile and the installation script, or at least to the documentation.

@GauravKumar9920
Member

@hmoyen, I have tested the rest of the plugin, and it seems to be working fine 🥳🎉
You can hit the merge button!
