perf: optimize string tensor deserialization with high performance c++ implementation #416
Conversation
Pipeline #35130400
Force-pushed from 9fbfc13 to cf1a489
@kthui I addressed your comments in the PR and from the issue. I ran a local benchmark (spun up a Triton container and sent it 5,000 requests); the p99, p80, and p50 latencies are the same before and after the change.
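A minimal sketch of that kind of latency measurement, assuming the standard tritonclient HTTP API; the model name, input name, and shape are placeholders, not the PR's actual benchmark script:

```python
import time

import numpy as np
import tritonclient.http as httpclient

# Placeholder model/input names -- the PR's actual benchmark script is not shown.
client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.array([b"x" * 100] * 16, dtype=np.object_).reshape(1, 16)
inp = httpclient.InferInput("INPUT0", list(data.shape), "BYTES")
inp.set_data_from_numpy(data)

latencies = []
for _ in range(5000):
    start = time.perf_counter()
    client.infer("string_model", inputs=[inp])
    latencies.append(time.perf_counter() - start)

for p in (50, 80, 99):
    print(f"p{p}: {np.percentile(latencies, p) * 1e3:.2f} ms")
```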
Could it be that the bottleneck is not the string deserialization but the SHM read/write speed? SHM enables the different Python processes to communicate with the Triton process.
New pipeline #35224390
@kthui SHM is the same for both the C++ implementation and the Python implementation, I think, because both are invoked from the stub process. I think the difference is the per-call overhead of struct.unpack_from itself:

=== PERFORMANCE BENCHMARK: struct.unpack_from vs C++ ===
=========================================================
Testing single string of size: 10 bytes
Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]
Python struct.unpack_from: 930.5 ns per call
C++ direct access: 56.0 ns per call
Overhead per call: 874.5 ns
Speedup: 16.63x
Testing: l = struct.unpack_from("<I", val_buf, offset)[0]
Python struct.unpack_from: 405.94 ns per call
C++ direct access: 0.34 ns per call
Overhead per call: 405.60 ns
Speedup: 1192.20x
Testing single string of size: 100 bytes
Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]
Python struct.unpack_from: 938.49 ns per call
C++ direct access: 54.00 ns per call
Overhead per call: 884.48 ns
Speedup: 17.38x
Testing: l = struct.unpack_from("<I", val_buf, offset)[0]
Python struct.unpack_from: 403.81 ns per call
C++ direct access: 0.34 ns per call
Overhead per call: 403.47 ns
Speedup: 1188.37x
Testing single string of size: 1000 bytes
Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]
Python struct.unpack_from: 1059.43 ns per call
C++ direct access: 125.74 ns per call
Overhead per call: 933.69 ns
Speedup: 8.43x
Testing: l = struct.unpack_from("<I", val_buf, offset)[0]
Python struct.unpack_from: 469.23 ns per call
C++ direct access: 0.36 ns per call
Overhead per call: 468.87 ns
Speedup: 1312.16x
Testing single string of size: 10000 bytes
Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]
Python struct.unpack_from: 1250.60 ns per call
C++ direct access: 216.47 ns per call
Overhead per call: 1034.13 ns
Speedup: 5.78x
Testing: l = struct.unpack_from("<I", val_buf, offset)[0]
Python struct.unpack_from: 479.62 ns per call
C++ direct access: 0.38 ns per call
Overhead per call: 479.24 ns
Speedup: 1277.27x
Testing single string of size: 100000 bytes
Testing: sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]
Python struct.unpack_from: 5002.57 ns per call
C++ direct access: 3485.09 ns per call
Overhead per call: 1517.48 ns
Speedup: 1.44x
Testing: l = struct.unpack_from("<I", val_buf, offset)[0]
Python struct.unpack_from: 509.01 ns per call
C++ direct access: 0.40 ns per call
Overhead per call: 508.61 ns
Speedup: 1287.65x
Realistic Workload Test (15000 strings)
========================================
Python deserialize_bytes_tensor: 19074 μs per call
C++ deserialize_bytes_tensor: 673 μs per call
Speedup: 28.34x
Estimated struct.unpack_from overhead: ~15000.00 μs
Actual performance difference: 18401 μs
=== END OF PERFORMANCE BENCHMARK ===
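For reference, the pure-Python loop those timings refer to looks roughly like this (a sketch reconstructed from the two struct.unpack_from calls quoted in the benchmark):

```python
import struct

import numpy as np

def deserialize_bytes_tensor(val_buf):
    # Each element is a 4-byte little-endian length prefix followed by
    # that many raw bytes; walk the buffer and collect the payloads.
    strs = []
    offset = 0
    while offset < len(val_buf):
        l = struct.unpack_from("<I", val_buf, offset)[0]  # length prefix
        offset += 4
        sb = struct.unpack_from("<{}s".format(l), val_buf, offset)[0]
        offset += l
        strs.append(sb)
    return np.array(strs, dtype=np.object_)
```

Both unpack_from calls pay the Python-level overhead measured above on every element; the C++ version replaces them with a pointer read and a memcpy, which is presumably where the 28x realistic-workload speedup comes from.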
If the string deserialization can be moved to C++, this should give the Python model more useful CPU cycles: the deserialization is taken off Python, which should allow slightly higher throughput when the Python model is busy (using >= 100% CPU).
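Purely as an illustration of that hand-off (the module and function names below are hypothetical, not the PR's actual wiring), the stub could prefer the compiled path and keep the Python loop as a fallback:

```python
# Hypothetical dispatch: use the compiled deserializer when the C++
# extension is importable, else fall back to the pure-Python loop above.
try:
    from _string_deser_ext import deserialize_bytes_tensor as _deserialize  # hypothetical extension
except ImportError:
    _deserialize = deserialize_bytes_tensor  # pure-Python sketch from above

def bytes_tensor_to_ndarray(val_buf):
    return _deserialize(val_buf)
```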
@wweic Given your relationship with Unity, should I assume you're consuming Triton via libtriton rather than as a web service via tritonserver?
@kthui What is your assessment? If there's even a modest performance improvement with minimal risk, I'd like to accept the contribution.
Resolves triton-inference-server/server#8348