Cuda acceleration #29

hmoyen · 2025-08-20T07:39:29Z

This PR aims to improve the CUDA code used for the sonar calculations, focusing on reducing execution time. The following changes were implemented:

1. Optimized Matrix Multiplication

Previously, the matrix multiplication was performed using a custom kernel:

__global__ void gpu_matrix_mult(float *a, float *b, float *c, int m, int n, int k)

This was replaced with cublasSgemm.
It speeds up the multiplication in all cases, and the gain is larger for larger matrices (e.g., higher numbers of beams).
See commit: add sgemm acceleration
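For readers unfamiliar with cuBLAS, the replacement can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper name sgemm_row_major is hypothetical, and it assumes the matrices are stored row-major with the same (m, n, k) meaning as in gpu_matrix_mult. Because cuBLAS expects column-major storage, the sketch swaps the operands and computes C^T = B^T * A^T.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the PR): C (m x k) = A (m x n) * B (n x k).
// Assumption: all three matrices are row-major on the host.
std::vector<float> sgemm_row_major(const std::vector<float> &A,
                                   const std::vector<float> &B,
                                   int m, int n, int k)
{
  assert(A.size() == static_cast<size_t>(m) * n);
  assert(B.size() == static_cast<size_t>(n) * k);
  std::vector<float> C(static_cast<size_t>(m) * k);

  float *dA = nullptr, *dB = nullptr, *dC = nullptr;
  cudaMalloc(&dA, A.size() * sizeof(float));
  cudaMalloc(&dB, B.size() * sizeof(float));
  cudaMalloc(&dC, C.size() * sizeof(float));
  cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;

  // cuBLAS is column-major. A row-major matrix read as column-major is its
  // transpose, so we request C^T (k x m) = B^T (k x n) * A^T (n x m).
  cublasStatus_t status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                      k, m, n,
                                      &alpha, dB, k, dA, n,
                                      &beta, dC, k);
  assert(status == CUBLAS_STATUS_SUCCESS);
  (void)status;

  // cudaMemcpy on the default stream waits for the GEMM to finish.
  cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);

  cublasDestroy(handle);
  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
  return C;
}
```

In a per-frame pipeline the handle and device buffers would normally be created once and reused rather than rebuilt on every call as this sketch does.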

2. Ray Summation Optimization

Original kernel, using a column-wise reduction (2 x nBeams kernel launches per sonar frame):

template <typename T>
__global__ void column_sums_reduce(
    const T *__restrict__ in, T *__restrict__ out, size_t width, size_t height)

New kernel (1 launch per sonar frame):

__global__ void reduce_beams_kernel(
    const thrust::complex<float> *__restrict__ d_P_Beams,
    float *d_P_Beams_F_real, float *d_P_Beams_F_imag,
    int nBeams, int nFreq, int nRaysSkipped)

Combines the real and imaginary summations in a single kernel (reduce_beams_kernel) using thrust::complex<float>.
Accumulates sums in registers and reduces them in shared memory, writing once per beam-frequency pair to global memory.
Uses fewer __syncthreads() calls and kernel launches, reducing synchronization and stall latency.
Removes multiple memcpy calls.

On the tested GPU (NVIDIA GeForce MX330), this part of the code provided the highest speedup (4–5× faster).

See commit: add new summation kernel
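The register-then-shared-memory pattern described above can be illustrated with a simplified kernel. This is a sketch under stated assumptions, not the actual reduce_beams_kernel: the names sum_rays_kernel, sum_rays and BeamSums are hypothetical, the input layout in[ray * nOut + o] is assumed (o indexing one beam-frequency pair), and the nRaysSkipped parameter is left out.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/complex.h>

constexpr int kThreads = 128;  // must be a power of two for the tree reduction

// One block per output element; each thread strides over the rays.
__global__ void sum_rays_kernel(const thrust::complex<float> *__restrict__ in,
                                float *out_real, float *out_imag,
                                int nRays, int nOut)
{
  __shared__ float s_re[kThreads];
  __shared__ float s_im[kThreads];
  const int o = blockIdx.x;
  const int tid = threadIdx.x;

  // Step 1: accumulate real and imaginary parts together in registers.
  float re = 0.0f, im = 0.0f;
  for (int r = tid; r < nRays; r += blockDim.x) {
    const thrust::complex<float> v = in[static_cast<size_t>(r) * nOut + o];
    re += v.real();
    im += v.imag();
  }
  s_re[tid] = re;
  s_im[tid] = im;
  __syncthreads();

  // Step 2: tree reduction in shared memory.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
      s_re[tid] += s_re[tid + s];
      s_im[tid] += s_im[tid + s];
    }
    __syncthreads();
  }

  // Step 3: a single global write per output element.
  if (tid == 0) {
    out_real[o] = s_re[0];
    out_imag[o] = s_im[0];
  }
}

struct BeamSums { std::vector<float> real, imag; };

// Host wrapper: one kernel launch covers every output element.
BeamSums sum_rays(const std::vector<thrust::complex<float>> &h_in,
                  int nRays, int nOut)
{
  assert(h_in.size() == static_cast<size_t>(nRays) * nOut);
  thrust::complex<float> *d_in = nullptr;
  float *d_re = nullptr, *d_im = nullptr;
  cudaMalloc(&d_in, h_in.size() * sizeof(thrust::complex<float>));
  cudaMalloc(&d_re, nOut * sizeof(float));
  cudaMalloc(&d_im, nOut * sizeof(float));
  cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(thrust::complex<float>),
             cudaMemcpyHostToDevice);

  sum_rays_kernel<<<nOut, kThreads>>>(d_in, d_re, d_im, nRays, nOut);

  BeamSums sums{std::vector<float>(nOut), std::vector<float>(nOut)};
  cudaMemcpy(sums.real.data(), d_re, nOut * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(sums.imag.data(), d_im, nOut * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_re);
  cudaFree(d_im);
  return sums;
}
```

With the layout assumed here, threads in a block read with a stride of nOut elements, so whether loads coalesce depends on how the real buffer is laid out.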

3. Reuse Global Buffers

Another change was reusing constant-size buffers (whose dimensions are known at plugin launch) instead of allocating memory for every frame. Now, the buffers are allocated at launch and freed when the plugin is destroyed.

See commits: reuse buffers and add more global buffers.
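The allocate-once idea can be sketched with a small RAII wrapper. This is only an illustration of the pattern: DeviceBuffer and process_frame are hypothetical names, and the PR itself may manage its global buffers differently (for example with explicit allocate and free functions).

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical wrapper: the allocation happens once in the constructor
// (plugin launch) and is released in the destructor (plugin destruction).
class DeviceBuffer
{
public:
  explicit DeviceBuffer(size_t count) : count_(count)
  {
    cudaError_t err = cudaMalloc(&ptr_, count_ * sizeof(float));
    assert(err == cudaSuccess);
    (void)err;
  }
  ~DeviceBuffer() { cudaFree(ptr_); }

  // Non-copyable so the device pointer is never freed twice.
  DeviceBuffer(const DeviceBuffer &) = delete;
  DeviceBuffer &operator=(const DeviceBuffer &) = delete;

  float *data() const { return ptr_; }
  size_t size() const { return count_; }

private:
  float *ptr_ = nullptr;
  size_t count_ = 0;
};

// Stand-in for per-frame work: it only touches the existing allocation,
// so no cudaMalloc/cudaFree happens on the per-frame path.
cudaError_t process_frame(DeviceBuffer &buf)
{
  return cudaMemset(buf.data(), 0, buf.size() * sizeof(float));
}
```

A caller would construct the buffer once, call process_frame for every sonar frame, and let the destructor release the memory at shutdown.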

4. Replace exp() and use intrinsic functions

One change to the sonar calculation kernel gave roughly a 9× speedup (execution time dropped from 30 ms to 3.15 ms on the GeForce MX330): switching to intrinsic math functions inside the for loop. First, the complex exp() was replaced with __sincosf(). Since exp(i * theta) is equivalent to cos(theta) + i*sin(theta), the new code computes the real and imaginary parts directly with __sincosf, avoiding the complex exponential. Additionally, regular division was replaced with __fdividef.

See commit: change complex exp calculation
See commit: replace functions to use gpu intrinsic ones
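As a rough illustration of this substitution, the sketch below evaluates (amp / dist) * exp(i * theta) per element using only the two intrinsics. The formula, the kernel name phasor_kernel and the host wrapper phasors are hypothetical and are not taken from the sonar kernel. Note that these intrinsics trade accuracy for speed, and __sincosf in particular loses precision for arguments of large magnitude.

```cuda
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-element computation: out = (amp / dist) * exp(i * theta).
__global__ void phasor_kernel(const float *amp, const float *dist,
                              const float *theta,
                              float *out_re, float *out_im, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  // exp(i * theta) = cos(theta) + i * sin(theta): one intrinsic call
  // yields both parts, so no complex exponential is needed.
  float s, c;
  __sincosf(theta[i], &s, &c);

  // Fast approximate division instead of the '/' operator.
  const float a = __fdividef(amp[i], dist[i]);

  out_re[i] = a * c;
  out_im[i] = a * s;
}

struct Phasors { std::vector<float> re, im; };

Phasors phasors(const std::vector<float> &amp, const std::vector<float> &dist,
                const std::vector<float> &theta)
{
  const int n = static_cast<int>(theta.size());
  assert(amp.size() == theta.size() && dist.size() == theta.size());
  const size_t bytes = n * sizeof(float);

  float *d_amp, *d_dist, *d_theta, *d_re, *d_im;
  cudaMalloc(&d_amp, bytes);
  cudaMalloc(&d_dist, bytes);
  cudaMalloc(&d_theta, bytes);
  cudaMalloc(&d_re, bytes);
  cudaMalloc(&d_im, bytes);
  cudaMemcpy(d_amp, amp.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_dist, dist.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_theta, theta.data(), bytes, cudaMemcpyHostToDevice);

  const int threads = 128;
  phasor_kernel<<<(n + threads - 1) / threads, threads>>>(
      d_amp, d_dist, d_theta, d_re, d_im, n);

  Phasors out{std::vector<float>(n), std::vector<float>(n)};
  cudaMemcpy(out.re.data(), d_re, bytes, cudaMemcpyDeviceToHost);
  cudaMemcpy(out.im.data(), d_im, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_amp); cudaFree(d_dist); cudaFree(d_theta);
  cudaFree(d_re); cudaFree(d_im);
  return out;
}
```

Comparing the output against a host-side std::cos/std::sin reference for representative angles is a reasonable way to check that the reduced precision is acceptable for the sonar image.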

How to test it

Follow the demos on the Wiki page: DAVE ROS 2 Multibeam Sonar Plugin Wiki. For example:

ros2 launch dave_multibeam_sonar_demo multibeam_sonar_demo.launch.py

Use this branch to run the demos; a .txt file with the debug times will be generated to evaluate performance.

To compare this branch’s performance with the old CUDA code (which generates the same .txt), check out the branch cuda-performance and run the same demos.

woensug-choi · 2025-08-25T04:33:58Z

--- stderr: multibeam_sonar_system                                                                     
In file included from /home/ioes/dave_ws/src/dave/gazebo/dave_gz_multibeam_sonar/multibeam_sonar_system/MultibeamSonarSystem.cc:54:
/home/ioes/dave_ws/src/dave/gazebo/dave_gz_multibeam_sonar/multibeam_sonar_system/./../multibeam_sonar/MultibeamSonarSensor.hh:20:10: fatal error: marine_acoustic_msgs/msg/projected_sonar_image.hpp: No such file or directory
   20 | #include <marine_acoustic_msgs/msg/projected_sonar_image.hpp>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
gmake[2]: *** [CMakeFiles/multibeam_sonar_system.dir/build.make:76: CMakeFiles/multibeam_sonar_system.dir/MultibeamSonarSystem.cc.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:154: CMakeFiles/multibeam_sonar_system.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2
---
Failed   <<< multibeam_sonar_system [3.37s, exited with code 2]
Aborted  <<< marine_sensor_msgs [3.91s]                                          
Aborted  <<< dave_interfaces [4.56s]                             

Summary: 9 packages finished [4.99s]
  1 package failed: multibeam_sonar_system
  2 packages aborted: dave_interfaces marine_sensor_msgs
  1 package had stderr output: multibeam_sonar_system
  5 packages not processed

Did this ever happen?

Fixed with the "minor fix" commit below.

woensug-choi · 2025-08-25T05:34:07Z

@hmoyen After fixing the build error above, I have a question:

I remember that users previously needed to clone marine_msgs into /src to build marine_acoustic_msgs. It seems that's not required anymore? Also, some packages that need to be installed with apt (sudo apt install libpcl-dev ros-jazzy-pcl-ros libpcap-dev) are not listed on the wiki page.

woensug-choi · 2025-08-25T06:23:25Z

@hmoyen When I tried it, the sonar image was not blazing. I've changed the random noise to be computed inside the CUDA code without using rand_image. I also added a <blazingSonarImage> handle to turn the blazing image noise on or off.

hmoyen · 2025-08-25T07:12:46Z

> @hmoyen When I tried it, the sonar image was not blazing. I've changed the random noise to be computed inside the CUDA code without using rand_image. I also added a <blazingSonarImage> handle to turn the blazing image noise on or off.

Looks great!

hmoyen · 2025-08-25T07:26:52Z

> @hmoyen After fixing the build error above, I have a question:
>
> I remember that users previously needed to clone marine_msgs into /src to build marine_acoustic_msgs. It seems that's not required anymore? Also, some packages that need to be installed with apt (sudo apt install libpcl-dev ros-jazzy-pcl-ros libpcap-dev) are not listed on the wiki page.

As for marine_acoustic_msgs, I installed it with:

sudo apt install ros-jazzy-marine-acoustic-msgs

I will add this to the wiki.
Just noticed that I kept the PCL headers, but we don't actually use this package anymore.

hmoyen · 2025-08-25T08:00:30Z

@woensug-choi I've changed the curand state type to one that produces 4 random numbers per call. We only use 2 of them, but it seems to speed up the kernel a bit.
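The comment does not name the new state type. One cuRAND generator that returns four values per call is Philox (curandStatePhilox4_32_10_t), so the sketch below assumes that one and also assumes normally distributed noise; the kernel name noise_kernel and the wrapper make_noise are hypothetical.

```cuda
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Assumption: Philox state and normal noise. Neither is confirmed by the PR.
__global__ void noise_kernel(float2 *out, int n, unsigned long long seed)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  // Same seed, distinct subsequence per thread. A long-running plugin would
  // usually keep the state between frames instead of re-initialising it.
  curandStatePhilox4_32_10_t state;
  curand_init(seed, i, 0, &state);

  // One call yields four normal samples; only two are used here.
  const float4 r = curand_normal4(&state);
  out[i] = make_float2(r.x, r.y);
}

std::vector<float2> make_noise(int n, unsigned long long seed)
{
  float2 *d_out = nullptr;
  cudaMalloc(&d_out, n * sizeof(float2));
  const int threads = 128;
  noise_kernel<<<(n + threads - 1) / threads, threads>>>(d_out, n, seed);

  std::vector<float2> h_out(n);
  cudaMemcpy(h_out.data(), d_out, n * sizeof(float2), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  return h_out;
}
```

Because the generator is counter-based, calling make_noise twice with the same seed and size reproduces the same samples, which can help when comparing sonar images between runs.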

GauravKumar9920 · 2025-09-05T10:19:48Z

Hi @hmoyen, a fresh installation and build of the project fails with this error:

--- stderr: multibeam_sonar_system
In file included from /root/HOST/dave_ws/src/dave/gazebo/dave_gz_multibeam_sonar/multibeam_sonar/MultibeamSonarSensor.cc:49:
/root/HOST/dave_ws/src/dave/gazebo/dave_gz_multibeam_sonar/multibeam_sonar/sonar_calculation_cuda.cuh:20:10: fatal error: thrust/complex.h: No such file or directory
   20 | #include <thrust/complex.h>
      |          ^~~~~~~~~~~~~~~~~~
compilation terminated.

This can be resolved simply with:
sudo apt-get install -y libthrust-dev

I think we should add this to the Dockerfile and the installation script, or at least to the documentation.

GauravKumar9920 · 2025-09-05T11:15:43Z

@hmoyen, I have tested the rest of the plugin and it seems to be working fine 🥳🎉
You can hit the merge button!

hmoyen added 5 commits July 24, 2025 15:19

add sgemm acceleration

50cded8

add new summation kernel and reuse buffers

5aea3e1

add free_memory to header

e92156f

change complex exp calculation

edd14e0

cleanup and add debug

cb1f251

hmoyen requested review from woensug-choi and rakeshv24 August 20, 2025 07:40

hmoyen and others added 5 commits August 20, 2025 04:47

Merge branch 'ros2' into cuda-acceleration

ddce8ec

clear debug file

aaf4a93

correct debug

64096c6

remove memcpy for pbeams

3aaf0d0

cleanup and remove redundancy

b678bbf

minor fix

629271e

fix blazing sonar image and add bool handle

e86b7e0

clean up not used vectors

2394ee3

hmoyen added 2 commits August 25, 2025 04:39

remove pcl dependencies

4ff3de5

change curand state type

b62d18a

hmoyen and others added 2 commits August 26, 2025 19:00

replace functions to use gpu intrinsic ones

0f20acd

add ros-jazzy-marine-acoustic-msgs

3df3fbe
