
Conversation

@wweic commented Sep 17, 2025

@kthui kthui added the PR: perf A code change that improves performance label Sep 18, 2025
@kthui (Contributor) commented Sep 18, 2025

Pipeline # 35130400

@whoisj whoisj added the enhancement New feature or request label Sep 18, 2025
@wweic wweic force-pushed the wweic/optimize-string-tensor-pr branch from 9fbfc13 to cf1a489 Compare September 18, 2025 20:45
@wweic wweic marked this pull request as ready for review September 18, 2025 20:46
@wweic (Author) commented Sep 18, 2025

@kthui I addressed your comments from the PR and the issue. I ran a local benchmark (spun up a Triton container and sent 5000 requests to it); the p99, p80, and p50 latencies remain the same as before the changes.
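
The percentile comparison above can be computed from raw latency samples with something like the following (a sketch only; the actual benchmark client is not shown in this thread, and the samples here are simulated):

```python
import random
import statistics

# Simulated latency samples in seconds; in the real benchmark these would be
# the measured round-trip times of the 5000 requests sent to the container.
latencies = [random.gauss(0.010, 0.002) for _ in range(5000)]

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
pcts = statistics.quantiles(latencies, n=100)
p50, p80, p99 = pcts[49], pcts[79], pcts[98]
print(f"p50={p50 * 1e3:.2f} ms  p80={p80 * 1e3:.2f} ms  p99={p99 * 1e3:.2f} ms")
```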

@kthui (Contributor) commented Sep 20, 2025

Could it be that the bottleneck is not the string deserialization but the SHM read/write speed? SHM is how the different Python processes communicate with the Triton process.

@kthui kthui requested a review from whoisj September 20, 2025 02:14
@kthui (Contributor) commented Sep 20, 2025

New pipeline # 35224390

@wweic (Author) commented Sep 20, 2025

@kthui SHM is the same for both the C++ implementation and the Python implementation, I think, because both are invoked from the stub process. I suspect struct.unpack_from is not well optimized, which leads to the slowdown. I wrote a benchmark for strings of various lengths, and the output indicates this is likely the reason:
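
For context, the loop being benchmarked roughly follows this shape (a minimal sketch of a length-prefixed string decoder, assuming the uint32 little-endian length prefix implied by the `"<I"` and `"<{}s"` formats below; names are illustrative, not the exact source):

```python
import struct

def deserialize_bytes_tensor(val_buf: bytes) -> list:
    """Decode a buffer of length-prefixed strings into a list of bytes.

    Each element is a uint32 little-endian length followed by that many
    payload bytes, matching the struct formats exercised in the benchmark.
    """
    strings = []
    offset = 0
    while offset < len(val_buf):
        # Read the 4-byte little-endian length prefix.
        l = struct.unpack_from("<I", val_buf, offset)[0]
        offset += 4
        # Read the string payload of that length.
        sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]
        offset += l
        strings.append(sb)
    return strings

# Example: encode two strings and round-trip them.
buf = b"".join(struct.pack("<I", len(s)) + s for s in (b"hello", b"world!"))
assert deserialize_bytes_tensor(buf) == [b"hello", b"world!"]
```

Each element costs two struct.unpack_from calls, which is where the per-call overhead measured below accumulates.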

=== PERFORMANCE BENCHMARK: struct.unpack_from vs C++ ===                                                                                                                                                                                                                                                    
=========================================================                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                            
Testing single string of size: 10 bytes                                                                                                                                                                                                                                                                     
  Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]                                                                                                                                                                                                                                    
    Python struct.unpack_from: 930.5 ns per call                                                                                                                                                                                                                                                            
    C++ direct access:         56.0 ns per call                                                                                                                                                                                                                                                             
    Overhead per call:         874.5 ns                                                                                                                                                                                                                                                                     
    Speedup:                   16.63x                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                            
  Testing: l = struct.unpack_from("<I", val_buf, offset)[0]                
    Python struct.unpack_from: 405.94 ns per call                          
    C++ direct access:         0.34 ns per call                            
    Overhead per call:         405.60 ns                                   
    Speedup:                   1192.20x                                    

Testing single string of size: 100 bytes                                   
  Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]                                                                              
    Python struct.unpack_from: 938.49 ns per call                          
    C++ direct access:         54.00 ns per call                           
    Overhead per call:         884.48 ns                                   
    Speedup:                   17.38x                                      

  Testing: l = struct.unpack_from("<I", val_buf, offset)[0]                
    Python struct.unpack_from: 403.81 ns per call                          
    C++ direct access:         0.34 ns per call                            
    Overhead per call:         403.47 ns                                   
    Speedup:                   1188.37x                                    

Testing single string of size: 1000 bytes                                  
  Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]                                                                              
    Python struct.unpack_from: 1059.43 ns per call                         
    C++ direct access:         125.74 ns per call                          
    Overhead per call:         933.69 ns                                   
    Speedup:                   8.43x                                       

  Testing: l = struct.unpack_from("<I", val_buf, offset)[0]                
    Python struct.unpack_from: 469.23 ns per call                          
    C++ direct access:         0.36 ns per call                            
    Overhead per call:         468.87 ns                                   
    Speedup:                   1312.16x                                    

Testing single string of size: 10000 bytes                                 
  Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]                                                                              
    Python struct.unpack_from: 1250.60 ns per call                         
    C++ direct access:         216.47 ns per call                          
    Overhead per call:         1034.13 ns                                  
    Speedup:                   5.78x                                       

  Testing: l = struct.unpack_from("<I", val_buf, offset)[0]                
    Python struct.unpack_from: 479.62 ns per call                          
    C++ direct access:         0.38 ns per call                            
    Overhead per call:         479.24 ns                                   
    Speedup:                   1277.27x                                    

Testing single string of size: 100000 bytes                                
  Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]                                                                              
    Python struct.unpack_from: 5002.57 ns per call                         
    C++ direct access:         3485.09 ns per call                         
    Overhead per call:         1517.48 ns                                  
    Speedup:                   1.44x                                       

  Testing: l = struct.unpack_from("<I", val_buf, offset)[0]                
    Python struct.unpack_from: 509.01 ns per call                          
    C++ direct access:         0.40 ns per call                            
    Overhead per call:         508.61 ns                                   
    Speedup:                   1287.65x                                    


Realistic Workload Test (15000 strings)                                    
========================================                                   
  Python deserialize_bytes_tensor: 19074 μs per call                       
  C++ deserialize_bytes_tensor:    673 μs per call                         
  Speedup:                          28.34x                                 

  Estimated struct.unpack_from overhead: ~15000.00 μs                      
  Actual performance difference:         18401 μs                          

=== END OF PERFORMANCE BENCHMARK ===       
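
For comparison, a struct-free pure-Python variant (using int.from_bytes for the length prefix and a plain slice for the payload) already avoids much of the per-call overhead measured above. This is only an illustrative sketch, not the PR's approach; the PR moves the loop to C++ instead:

```python
import struct
import timeit

def deserialize_no_struct(val_buf: bytes) -> list:
    """Sketch of decoding length-prefixed strings without struct.unpack_from:
    int.from_bytes reads the uint32 little-endian length, and a plain slice
    copies the payload."""
    out = []
    offset = 0
    n = len(val_buf)
    while offset < n:
        length = int.from_bytes(val_buf[offset:offset + 4], "little")
        offset += 4
        out.append(val_buf[offset:offset + length])
        offset += length
    return out

# Time the decode of 1000 strings of 100 bytes each.
buf = b"".join(struct.pack("<I", len(s)) + s for s in [b"x" * 100] * 1000)
t = timeit.timeit(lambda: deserialize_no_struct(buf), number=100)
print(f"{t / 100 * 1e6:.1f} us per call for 1000 strings")
```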

@kthui (Contributor) commented Sep 22, 2025

If the string deserialization can be moved to C++, this should give the Python model more useful CPU cycles, because the deserialization is taken off Python; that should allow slightly higher throughput when the Python model is busy (using >= 100% CPU).

@whoisj commented Sep 22, 2025

@wweic given your relationship with Unity, should I assume you're consuming Triton via libtriton as opposed to as a web service via tritonserver?

@wweic (Author) commented Sep 22, 2025

> @wweic given your relationship with Unity, should I assume you're consuming Triton via libtriton as opposed to as a web service via tritonserver?

@whoisj We actually deploy tritonserver as-is as an internal service.

@whoisj commented Sep 24, 2025

@kthui what is your assessment? If there's even a modest performance improvement with minimal risk, I'd like to accept the contribution.

@wweic (Author) commented Sep 24, 2025

@kthui @whoisj feel free to share any testing you would like me to do, or unit tests to add; happy to do either. I believe the speedup will help many Python backend users.

Successfully merging this pull request may close these issues.

Contribution to accelerate python backend latency