
Conversation

@wweic commented Sep 17, 2025

@kthui kthui added the PR: perf A code change that improves performance label Sep 18, 2025
@kthui (Contributor) commented Sep 18, 2025

Pipeline # 35130400

@whoisj whoisj added the enhancement New feature or request label Sep 18, 2025
@wweic wweic force-pushed the wweic/optimize-string-tensor-pr branch from 9fbfc13 to cf1a489 Compare September 18, 2025 20:45
@wweic wweic marked this pull request as ready for review September 18, 2025 20:46
@wweic (Author) commented Sep 18, 2025

@kthui I addressed your comments from the PR and the issue. I ran a local benchmark (spun up a Triton container and sent 5000 requests to it); the p99, p80, and p50 latencies remain the same as before the changes.
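
The percentile comparison above can be computed from raw latency samples with something like the following (a sketch only; the actual benchmark client is not shown in this thread, and the samples here are simulated):

```python
import random
import statistics

# Simulated latency samples in seconds; in the real benchmark these would be
# the measured round-trip times of the 5000 requests sent to the container.
latencies = [random.gauss(0.010, 0.002) for _ in range(5000)]

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
pcts = statistics.quantiles(latencies, n=100)
p50, p80, p99 = pcts[49], pcts[79], pcts[98]
print(f"p50={p50 * 1e3:.2f} ms  p80={p80 * 1e3:.2f} ms  p99={p99 * 1e3:.2f} ms")
```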

@kthui (Contributor) commented Sep 20, 2025

Could it be that the bottleneck is not the string deserialization but the SHM read/write speed? SHM is how the different Python processes communicate with the Triton process.

@kthui kthui requested a review from whoisj September 20, 2025 02:14
@kthui (Contributor) commented Sep 20, 2025

New pipeline # 35224390

@wweic (Author) commented Sep 20, 2025

@kthui SHM is the same for both the C++ implementation and the Python implementation, I think, because both are invoked from the stub process. I suspect struct.unpack_from is not well optimized, which leads to the slowdown. I wrote a benchmark for strings of various lengths, and the output indicates this is likely the reason:
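
For context, the loop being benchmarked roughly follows this shape (a minimal sketch of a length-prefixed string decoder, assuming the uint32 little-endian length prefix implied by the `"<I"` and `"<{}s"` formats below; names are illustrative, not the exact source):

```python
import struct

def deserialize_bytes_tensor(val_buf: bytes) -> list:
    """Decode a buffer of length-prefixed strings into a list of bytes.

    Each element is a uint32 little-endian length followed by that many
    payload bytes, matching the struct formats exercised in the benchmark.
    """
    strings = []
    offset = 0
    while offset < len(val_buf):
        # Read the 4-byte little-endian length prefix.
        l = struct.unpack_from("<I", val_buf, offset)[0]
        offset += 4
        # Read the string payload of that length.
        sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]
        offset += l
        strings.append(sb)
    return strings

# Example: encode two strings and round-trip them.
buf = b"".join(struct.pack("<I", len(s)) + s for s in (b"hello", b"world!"))
assert deserialize_bytes_tensor(buf) == [b"hello", b"world!"]
```

Each element costs two struct.unpack_from calls, which is where the per-call overhead measured below accumulates.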

=== PERFORMANCE BENCHMARK: struct.unpack_from vs C++ ===                                                                                                                                                                                                                                                    
=========================================================                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                            
Testing single string of size: 10 bytes                                                                                                                                                                                                                                                                     
  Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]                                                                                                                                                                                                                                    
    Python struct.unpack_from: 930.5 ns per call                                                                                                                                                                                                                                                            
    C++ direct access:         56.0 ns per call                                                                                                                                                                                                                                                             
    Overhead per call:         874.5 ns                                                                                                                                                                                                                                                                     
    Speedup:                   16.63x                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                            
  Testing: l = struct.unpack_from("<I", val_buf, offset)[0]                
    Python struct.unpack_from: 405.94 ns per call                          
    C++ direct access:         0.34 ns per call                            
    Overhead per call:         405.60 ns                                   
    Speedup:                   1192.20x                                    

Testing single string of size: 100 bytes                                   
  Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]                                                                              
    Python struct.unpack_from: 938.49 ns per call                          
    C++ direct access:         54.00 ns per call                           
    Overhead per call:         884.48 ns                                   
    Speedup:                   17.38x                                      

  Testing: l = struct.unpack_from("<I", val_buf, offset)[0]                
    Python struct.unpack_from: 403.81 ns per call                          
    C++ direct access:         0.34 ns per call                            
    Overhead per call:         403.47 ns                                   
    Speedup:                   1188.37x                                    

Testing single string of size: 1000 bytes                                  
  Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]                                                                              
    Python struct.unpack_from: 1059.43 ns per call                         
    C++ direct access:         125.74 ns per call                          
    Overhead per call:         933.69 ns                                   
    Speedup:                   8.43x                                       

  Testing: l = struct.unpack_from("<I", val_buf, offset)[0]                
    Python struct.unpack_from: 469.23 ns per call                          
    C++ direct access:         0.36 ns per call                            
    Overhead per call:         468.87 ns                                   
    Speedup:                   1312.16x                                    

Testing single string of size: 10000 bytes                                 
  Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]                                                                              
    Python struct.unpack_from: 1250.60 ns per call                         
    C++ direct access:         216.47 ns per call                          
    Overhead per call:         1034.13 ns                                  
    Speedup:                   5.78x                                       

  Testing: l = struct.unpack_from("<I", val_buf, offset)[0]                
    Python struct.unpack_from: 479.62 ns per call                          
    C++ direct access:         0.38 ns per call                            
    Overhead per call:         479.24 ns                                   
    Speedup:                   1277.27x                                    

Testing single string of size: 100000 bytes                                
  Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]                                                                              
    Python struct.unpack_from: 5002.57 ns per call                         
    C++ direct access:         3485.09 ns per call                         
    Overhead per call:         1517.48 ns                                  
    Speedup:                   1.44x                                       

  Testing: l = struct.unpack_from("<I", val_buf, offset)[0]                
    Python struct.unpack_from: 509.01 ns per call                          
    C++ direct access:         0.40 ns per call                            
    Overhead per call:         508.61 ns                                   
    Speedup:                   1287.65x                                    


Realistic Workload Test (15000 strings)                                    
========================================                                   
  Python deserialize_bytes_tensor: 19074 μs per call                       
  C++ deserialize_bytes_tensor:    673 μs per call                         
  Speedup:                          28.34x                                 

  Estimated struct.unpack_from overhead: ~15000.00 μs                      
  Actual performance difference:         18401 μs                          

=== END OF PERFORMANCE BENCHMARK ===       
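
For comparison, a struct-free pure-Python variant (using int.from_bytes for the length prefix and a plain slice for the payload) already avoids much of the per-call overhead measured above. This is only an illustrative sketch, not the PR's approach; the PR moves the loop to C++ instead:

```python
import struct
import timeit

def deserialize_no_struct(val_buf: bytes) -> list:
    """Sketch of decoding length-prefixed strings without struct.unpack_from:
    int.from_bytes reads the uint32 little-endian length, and a plain slice
    copies the payload."""
    out = []
    offset = 0
    n = len(val_buf)
    while offset < n:
        length = int.from_bytes(val_buf[offset:offset + 4], "little")
        offset += 4
        out.append(val_buf[offset:offset + length])
        offset += length
    return out

# Time the decode of 1000 strings of 100 bytes each.
buf = b"".join(struct.pack("<I", len(s)) + s for s in [b"x" * 100] * 1000)
t = timeit.timeit(lambda: deserialize_no_struct(buf), number=100)
print(f"{t / 100 * 1e6:.1f} us per call for 1000 strings")
```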

@kthui (Contributor) commented Sep 22, 2025

If the string deserialization can be moved to C++, this should give the Python model more useful CPU cycles, because the deserialization is taken off Python; that should allow slightly higher throughput when the Python model is busy (using >= 100% CPU).

@whoisj commented Sep 22, 2025

@wweic given your relationship with Unity, should I assume you're consuming Triton via libtriton as opposed to as a web service via tritonserver?

@wweic (Author) commented Sep 22, 2025

> @wweic given your relationship with Unity, should I assume you're consuming Triton via libtriton as opposed to as a web service via tritonserver?

@whoisj We actually deploy tritonserver as-is as an internal service.

@whoisj commented Sep 24, 2025

@kthui what is your assessment? If there's even a modest performance improvement with minimal risk, I'd like to accept the contribution.

@wweic (Author) commented Sep 24, 2025

@kthui @whoisj feel free to share any testing you would like me to do, or unit tests to add; happy to do either. I believe the speedup will help many Python backend users.

Successfully merging this pull request may close these issues.

Contribution to accelerate python backend latency