Skip to content

Conversation

@coldwaterq
Copy link
Contributor

Description

Add an approve list into the pickle deserialization process to reduce the attack surface of using pickle to a subset of supported objects.

Test Coverage

if a new object is added this PR will cause a ValueError to be raised that clearly states what module is not supported that is being serialized/deserialized.
Fully utilizing the zmq interfaces would test this code.

@yibinl-nvidia yibinl-nvidia self-requested a review April 25, 2025 16:52
@yibinl-nvidia
Copy link
Collaborator

Thanks a lot for making this PR! Several general comments:

  1. Could you please follow https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md to sign your commits and run pre-commit hook.
  2. Please add the complicated dataclass serialization unit test we discussed offline to test_executor.py. Also would be good to have additional tests for serailization.py. I can trigger a pipeline test after you added the tests to see if there is any test failure.

@yibinl-nvidia
Copy link
Collaborator

@kaiyux This is the "approved list approach" I discussed with you offline. Let me know your thoughts.

cc @Superjomn @litaotju @juney-nvidia

@juney-nvidia juney-nvidia changed the title [nvbugs/5066257] serialization improvments fix: [nvbugs/5066257] serialization improvments Apr 26, 2025
@coldwaterq
Copy link
Contributor Author

@yibinl-nvidia I fixed the DCO and ran the pre-commit hooks, added the tests, and made some improvements based on the comments. let me know if there is anything else I should address.

@yibinl-nvidia
Copy link
Collaborator

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #3654 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #3654 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #2585 completed with status: 'FAILURE'

@yibinl-nvidia
Copy link
Collaborator

I did a micro-benchmark comparing the custom serialization and deserialization against Pickle. The first result is based on a simple GenerationResult object (same one as in the test_executor.py) approximately 2 KB in size. The second result focused on benchmarking message with various size. This shows that the new approach introduces trivial μs-level of perf degradation.

Serialization and Deserialization Performance Results for GenerationRequest with 10000 iterations:
Pickle (avg/min/max): 199.038μs / 186.559μs / 1271.870μs
Custom (avg/min/max): 204.601μs / 192.718μs / 363.867μs
Average time difference: 5.56 μs

Size Comparison:
Pickle size: 1920 bytes
Custom size: 1920 bytes
Size ratio (custom/pickle): 1.00x

Serialization Performance Across Different Message Sizes:
Size            Pickle Avg (μs)         Custom Avg (μs)         Diff (μs)       Size Ratio
--------------------------------------------------------------------------------
1.0 KB               200.348                 205.710               5                    1.00x
10.0 KB              279.806                 285.365               5                    1.00x
100.0 KB            1044.078                1051.731               7                    1.00x
1.0 MB              8487.678                8505.488              17                    1.00x
10.0 MB            92523.012               93069.062             546                    1.00x

@coldwaterq coldwaterq requested a review from a team as a code owner April 29, 2025 16:23
@coldwaterq
Copy link
Contributor Author

The last commit should resolve the issue reported by blossom.

@yibinl-nvidia
Copy link
Collaborator

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #3744 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #3744 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2650 completed with status: 'FAILURE'

@yibinl-nvidia
Copy link
Collaborator

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #3766 [ run ] triggered by Bot

@coldwaterq
Copy link
Contributor Author

the tests that were failed in the basic bot run should be fixed with those new commits.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #3766 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2667 completed with status: 'FAILURE'

@yibinl-nvidia
Copy link
Collaborator

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #3873 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #3873 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2746 completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6102 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6102 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4461 completed with status: 'FAILURE'

@yibinl-nvidia
Copy link
Collaborator

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6123 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6123 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4471 completed with status: 'SUCCESS'

coldwaterq and others added 12 commits May 22, 2025 16:25
…ction because it didn't work for all objects that made debugging harder, added tests.

Signed-off-by: [email protected] <[email protected]>
…e function. Also added missing classes to approved list.

Signed-off-by: coldwaterq <[email protected]>
Signed-off-by: Yibin Li <[email protected]>
@kaiyux kaiyux enabled auto-merge (squash) May 23, 2025 02:50
@kaiyux
Copy link
Member

kaiyux commented May 23, 2025

/bot reuse-pipeline

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6217 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6217 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #6123 for commit 9b9df25

@kaiyux kaiyux merged commit 1cf0e67 into NVIDIA:main May 23, 2025
2 checks passed
@yibinl-nvidia yibinl-nvidia deleted the restricted-pickler branch May 23, 2025 19:01
Shang-Pin pushed a commit to deepinfra/TensorRT-LLM that referenced this pull request May 27, 2025
* added a restricted pcikler and depickler in a sepparate serialization function.

Signed-off-by: [email protected] <[email protected]>

* updated IPC to remove approved classes, removed the serialization function because it didn't work for all objects that made debugging harder, added tests.

Signed-off-by: [email protected] <[email protected]>

* removed LLM arg and moved class registration to a serialization module function. Also added missing classes to approved list.

Signed-off-by: coldwaterq <[email protected]>

* cleaned up a couple files to reduce conflicts with main.

Signed-off-by: coldwaterq <[email protected]>

* fix unit tests

Signed-off-by: Yibin Li <[email protected]>

* reorder BASE_ZMQ_CLASSES list alphabetically

Signed-off-by: Yibin Li <[email protected]>

* fix tests and move LogitsProcessor registration to base class

Signed-off-by: Yibin Li <[email protected]>

* revert changes to import log of tensorrt_llm._torch.models

Signed-off-by: Yibin Li <[email protected]>

* added comments to explain why BASE_ZMQ_CLASSES has to be passed into spawned child processes

Signed-off-by: coldwaterq <[email protected]>

* fix tests and move LogitsProcessor registration to base class

Signed-off-by: Yibin Li <[email protected]>

* additional comments for multiprocess approved list sync

Signed-off-by: Yibin Li <[email protected]>

* add dataclass from tests

Signed-off-by: Yibin Li <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: coldwaterq <[email protected]>
Signed-off-by: Yibin Li <[email protected]>
Co-authored-by: Yibin Li <[email protected]>
darraghdog pushed a commit to darraghdog/TensorRT-LLM that referenced this pull request Jun 3, 2025
* added a restricted pcikler and depickler in a sepparate serialization function.

Signed-off-by: [email protected] <[email protected]>

* updated IPC to remove approved classes, removed the serialization function because it didn't work for all objects that made debugging harder, added tests.

Signed-off-by: [email protected] <[email protected]>

* removed LLM arg and moved class registration to a serialization module function. Also added missing classes to approved list.

Signed-off-by: coldwaterq <[email protected]>

* cleaned up a couple files to reduce conflicts with main.

Signed-off-by: coldwaterq <[email protected]>

* fix unit tests

Signed-off-by: Yibin Li <[email protected]>

* reorder BASE_ZMQ_CLASSES list alphabetically

Signed-off-by: Yibin Li <[email protected]>

* fix tests and move LogitsProcessor registration to base class

Signed-off-by: Yibin Li <[email protected]>

* revert changes to import log of tensorrt_llm._torch.models

Signed-off-by: Yibin Li <[email protected]>

* added comments to explain why BASE_ZMQ_CLASSES has to be passed into spawned child processes

Signed-off-by: coldwaterq <[email protected]>

* fix tests and move LogitsProcessor registration to base class

Signed-off-by: Yibin Li <[email protected]>

* additional comments for multiprocess approved list sync

Signed-off-by: Yibin Li <[email protected]>

* add dataclass from tests

Signed-off-by: Yibin Li <[email protected]>

---------

Signed-off-by: [email protected] <[email protected]>
Signed-off-by: coldwaterq <[email protected]>
Signed-off-by: Yibin Li <[email protected]>
Co-authored-by: Yibin Li <[email protected]>
Signed-off-by: darraghdog <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants