fix: [nvbugs/5066257] serialization improvments #3869

coldwaterq · 2025-04-25T16:47:14Z

Description

Add an approved list to the pickle deserialization process, reducing the attack surface of pickle to a subset of supported objects.
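For readers unfamiliar with the technique, here is a minimal sketch of an approved-list unpickler built on the standard `pickle.Unpickler.find_class` hook. The names `APPROVED`, `RestrictedUnpickler`, and `restricted_loads` are hypothetical and only illustrate the idea; they are not taken from this PR's `serialization.py`.

```python
import io
import pickle

# Hypothetical approved list: module name -> class names allowed to be loaded.
APPROVED = {
    "collections": {"OrderedDict"},
}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that only resolves globals found on the approved list."""

    def find_class(self, module, name):
        # find_class is called whenever the pickle stream references a global
        # (a class or function); anything not approved is rejected.
        if name in APPROVED.get(module, ()):
            return super().find_class(module, name)
        raise ValueError(f"{module}.{name} is not on the approved list")


def restricted_loads(data: bytes):
    """Drop-in replacement for pickle.loads that enforces the approved list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers such as `dict`, `list`, and `str` are encoded with dedicated opcodes and never reach `find_class`, so only class and function references need to be listed.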

Test Coverage

If a new object is added, this PR causes a ValueError to be raised that clearly states which module being serialized/deserialized is not supported.
Fully exercising the zmq interfaces would test this code.
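The note above mentions that the check also applies when serializing. One way to sketch the write side, assuming Python 3.8+ and its `Pickler.reducer_override` hook, is shown below. `_APPROVED`, `register_approved_class`, `CheckedPickler`, and `checked_dumps` are hypothetical names chosen for illustration, and the PR's actual implementation may work differently.

```python
import io
import pickle

# Hypothetical registry: module name -> approved class names.
_APPROVED = {}


def register_approved_class(cls):
    """Add a class to the approved list at runtime."""
    _APPROVED.setdefault(cls.__module__, set()).add(cls.__qualname__)


class CheckedPickler(pickle.Pickler):
    """Pickler that refuses classes and instances not on the approved list."""

    def reducer_override(self, obj):
        # Check the class itself when a class is pickled, else the instance's type.
        cls = obj if isinstance(obj, type) else type(obj)
        if cls.__module__ != "builtins":
            if cls.__qualname__ not in _APPROVED.get(cls.__module__, ()):
                raise ValueError(
                    f"{cls.__module__}.{cls.__qualname__} is not on the approved list"
                )
        # NotImplemented tells pickle to continue with its normal behaviour.
        return NotImplemented


def checked_dumps(obj) -> bytes:
    buf = io.BytesIO()
    CheckedPickler(buf).dump(obj)
    return buf.getvalue()
```

With this sketch, serializing an unregistered class fails with a message naming the module and class, and the failure disappears once `register_approved_class` has been called for it. Objects whose type lives in `builtins` (including plain functions) are not checked here, so this alone does not replace a restricted unpickler.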

yibinl-nvidia · 2025-04-25T20:49:47Z

Thanks a lot for making this PR! Several general comments:

Could you please follow https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md to sign your commits and run the pre-commit hook?
Please add the complicated dataclass serialization unit test we discussed offline to test_executor.py. It would also be good to have additional tests for serialization.py. I can trigger a pipeline test after you have added the tests to see whether there are any test failures.

tensorrt_llm/executor/ipc.py

yibinl-nvidia · 2025-04-25T21:03:12Z

@kaiyux This is the "approved list approach" I discussed with you offline. Let me know your thoughts.

cc @Superjomn @litaotju @juney-nvidia

tensorrt_llm/executor/ipc.py

tensorrt_llm/executor/serialization.py

coldwaterq · 2025-04-28T20:10:12Z

@yibinl-nvidia I fixed the DCO, ran the pre-commit hooks, added the tests, and made some improvements based on the comments. Let me know if there is anything else I should address.

yibinl-nvidia · 2025-04-28T22:49:37Z

/bot run

tensorrt-cicd · 2025-04-28T22:55:03Z

PR_Github #3654 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-28T23:06:06Z

PR_Github #3654 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #2585 completed with status: 'FAILURE'

yibinl-nvidia · 2025-04-29T03:25:44Z

I ran a micro-benchmark comparing the custom serialization and deserialization against Pickle. The first result is based on a simple GenerationResult object (the same one as in test_executor.py), approximately 2 KB in size. The second result benchmarks messages of various sizes. This shows that the new approach introduces only a trivial, μs-level performance degradation.

Serialization and Deserialization Performance Results for GenerationRequest with 10000 iterations:
Pickle (avg/min/max): 199.038μs / 186.559μs / 1271.870μs
Custom (avg/min/max): 204.601μs / 192.718μs / 363.867μs
Average time difference: 5.56 μs

Size Comparison:
Pickle size: 1920 bytes
Custom size: 1920 bytes
Size ratio (custom/pickle): 1.00x

Serialization Performance Across Different Message Sizes:
Size       Pickle Avg (μs)   Custom Avg (μs)   Diff (μs)   Size Ratio
---------------------------------------------------------------------
1.0 KB             200.348           205.710           5        1.00x
10.0 KB            279.806           285.365           5        1.00x
100.0 KB          1044.078          1051.731           7        1.00x
1.0 MB            8487.678          8505.488          17        1.00x
10.0 MB          92523.012         93069.062         546        1.00x
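A comparison of this kind can be reproduced roughly as follows. This is only a sketch, not the script used for the numbers above: `PassThroughUnpickler` is a hypothetical stand-in for the custom deserializer (it approves everything, so it measures only the overhead of routing through a Python-level `find_class`), and the payloads are synthetic dictionaries rather than real request or result objects.

```python
import io
import pickle
import time


class PassThroughUnpickler(pickle.Unpickler):
    """Hypothetical stand-in for an approved-list unpickler; permits everything."""

    def find_class(self, module, name):
        return super().find_class(module, name)


def plain_roundtrip(obj):
    return pickle.loads(pickle.dumps(obj))


def custom_roundtrip(obj):
    data = pickle.dumps(obj)
    return PassThroughUnpickler(io.BytesIO(data)).load()


def bench(fn, obj, iterations=200):
    """Return (avg, min, max) round-trip time in microseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn(obj)
        samples.append((time.perf_counter() - start) * 1e6)
    return sum(samples) / len(samples), min(samples), max(samples)


if __name__ == "__main__":
    for size in (1 << 10, 10 << 10, 100 << 10):
        # Synthetic payload; its pickled size is only roughly `size` bytes.
        payload = {"token_ids": list(range(size // 8)), "text": "x" * size}
        pickled_len = len(pickle.dumps(payload))
        p_avg, _, _ = bench(plain_roundtrip, payload)
        c_avg, _, _ = bench(custom_roundtrip, payload)
        print(f"{pickled_len:>8} bytes  pickle {p_avg:9.3f} us  "
              f"custom {c_avg:9.3f} us  diff {c_avg - p_avg:7.3f} us")
```

Absolute timings depend heavily on the machine and on the object graph being serialized, so only the difference between the two columns is meaningful.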

coldwaterq · 2025-04-29T16:26:12Z

The last commit should resolve the issue reported by blossom.

yibinl-nvidia · 2025-04-29T16:28:02Z

/bot run

tensorrt-cicd · 2025-04-29T16:33:56Z

PR_Github #3744 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-29T18:17:19Z

PR_Github #3744 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2650 completed with status: 'FAILURE'

yibinl-nvidia · 2025-04-29T21:46:05Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2025-04-29T21:52:12Z

PR_Github #3766 [ run ] triggered by Bot

coldwaterq · 2025-04-30T02:24:34Z

The tests that failed in the basic bot run should be fixed by the new commits.

tensorrt-cicd · 2025-04-30T04:50:25Z

PR_Github #3766 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2667 completed with status: 'FAILURE'

yibinl-nvidia · 2025-04-30T16:25:27Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2025-04-30T16:31:04Z

PR_Github #3873 [ run ] triggered by Bot

tensorrt-cicd · 2025-04-30T23:31:10Z

PR_Github #3873 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2746 completed with status: 'FAILURE'

tensorrt-cicd · 2025-05-22T04:38:06Z

PR_Github #6102 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-22T06:07:35Z

PR_Github #6102 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4461 completed with status: 'FAILURE'

yibinl-nvidia · 2025-05-22T07:02:14Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2025-05-22T07:07:50Z

PR_Github #6123 [ run ] triggered by Bot

tensorrt-cicd · 2025-05-22T18:54:19Z

PR_Github #6123 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4471 completed with status: 'SUCCESS'

kaiyux · 2025-05-23T02:50:22Z

/bot reuse-pipeline

tensorrt-cicd · 2025-05-23T02:55:26Z

PR_Github #6217 [ reuse-pipeline ] triggered by Bot

tensorrt-cicd · 2025-05-23T03:01:30Z

PR_Github #6217 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #6123 for commit 9b9df25

* added a restricted pcikler and depickler in a sepparate serialization function.
  Signed-off-by: [email protected] <[email protected]>
* updated IPC to remove approved classes, removed the serialization function because it didn't work for all objects that made debugging harder, added tests.
  Signed-off-by: [email protected] <[email protected]>
* removed LLM arg and moved class registration to a serialization module function. Also added missing classes to approved list.
  Signed-off-by: coldwaterq <[email protected]>
* cleaned up a couple files to reduce conflicts with main.
  Signed-off-by: coldwaterq <[email protected]>
* fix unit tests
  Signed-off-by: Yibin Li <[email protected]>
* reorder BASE_ZMQ_CLASSES list alphabetically
  Signed-off-by: Yibin Li <[email protected]>
* fix tests and move LogitsProcessor registration to base class
  Signed-off-by: Yibin Li <[email protected]>
* revert changes to import log of tensorrt_llm._torch.models
  Signed-off-by: Yibin Li <[email protected]>
* added comments to explain why BASE_ZMQ_CLASSES has to be passed into spawned child processes
  Signed-off-by: coldwaterq <[email protected]>
* fix tests and move LogitsProcessor registration to base class
  Signed-off-by: Yibin Li <[email protected]>
* additional comments for multiprocess approved list sync
  Signed-off-by: Yibin Li <[email protected]>
* add dataclass from tests
  Signed-off-by: Yibin Li <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: coldwaterq <[email protected]>
Signed-off-by: Yibin Li <[email protected]>
Co-authored-by: Yibin Li <[email protected]>

yibinl-nvidia self-requested a review April 25, 2025 16:52

yibinl-nvidia reviewed Apr 25, 2025

tensorrt_llm/executor/ipc.py (outdated, resolved)

tensorrt_llm/executor/ipc.py (outdated, resolved)

yibinl-nvidia requested review from Superjomn and kaiyux April 25, 2025 21:05

Superjomn reviewed Apr 25, 2025

tensorrt_llm/executor/ipc.py (outdated, resolved)

tensorrt_llm/executor/serialization.py (outdated, resolved)

juney-nvidia changed the title ~~[nvbugs/5066257] serialization improvments~~ fix: [nvbugs/5066257] serialization improvments Apr 26, 2025

coldwaterq force-pushed the restricted-pickler branch from 90a7915 to 743fa7b Compare April 28, 2025 20:07

coldwaterq requested review from Superjomn and yibinl-nvidia April 28, 2025 20:11

coldwaterq requested a review from a team as a code owner April 29, 2025 16:23

coldwaterq force-pushed the restricted-pickler branch from 9653b63 to fa5a912 Compare April 29, 2025 20:39

yibinl-nvidia force-pushed the restricted-pickler branch from b422fec to dbbce8c Compare April 30, 2025 16:24

yibinl-nvidia approved these changes May 22, 2025

coldwaterq and others added 12 commits May 22, 2025 16:25

added a restricted pcikler and depickler in a sepparate serialization…

dce7e98

… function. Signed-off-by: [email protected] <[email protected]>

updated IPC to remove approved classes, removed the serialization fun…

691ef66

…ction because it didn't work for all objects that made debugging harder, added tests. Signed-off-by: [email protected] <[email protected]>

removed LLM arg and moved class registration to a serialization modul…

29233e0

…e function. Also added missing classes to approved list. Signed-off-by: coldwaterq <[email protected]>

cleaned up a couple files to reduce conflicts with main.

f43c65f

Signed-off-by: coldwaterq <[email protected]>

fix unit tests

93658f2

Signed-off-by: Yibin Li <[email protected]>

reorder BASE_ZMQ_CLASSES list alphabetically

d3712b9

Signed-off-by: Yibin Li <[email protected]>

fix tests and move LogitsProcessor registration to base class

685ed95

Signed-off-by: Yibin Li <[email protected]>

revert changes to import log of tensorrt_llm._torch.models

2b24390

Signed-off-by: Yibin Li <[email protected]>

added comments to explain why BASE_ZMQ_CLASSES has to be passed into …

5caac89

…spawned child processes Signed-off-by: coldwaterq <[email protected]>

fix tests and move LogitsProcessor registration to base class

e21e326

Signed-off-by: Yibin Li <[email protected]>

additional comments for multiprocess approved list sync

4241f97

Signed-off-by: Yibin Li <[email protected]>

add dataclass from tests

9b9df25

Signed-off-by: Yibin Li <[email protected]>

yibinl-nvidia force-pushed the restricted-pickler branch from 567c059 to 9b9df25 Compare May 22, 2025 23:25

kaiyux enabled auto-merge (squash) May 23, 2025 02:50

yibinl-nvidia approved these changes May 23, 2025

kaiyux merged commit 1cf0e67 into NVIDIA:main May 23, 2025
2 checks passed

yibinl-nvidia deleted the restricted-pickler branch May 23, 2025 19:01

yuxianq mentioned this pull request May 25, 2025

feat: Skip sampler for intermediate pp stages. #4514

Merged

6 participants