fix(profiler): reduce memory usage for compression #4058
Conversation
Benchmarks
Benchmark execution time: 2025-10-28 12:14:12. Comparing candidate commit a37f42c in PR branch. Found 0 performance improvements and 0 performance regressions! Performance is the same for 24 metrics, 0 unstable metrics.
The zstd compression library uses ~8MiB per compressor by default, primarily for the back-reference window. See this parameter: https://pkg.go.dev/github.com/klauspost/compress/zstd#WithWindowSize

Since we have an encoder per profile type, this leads to a noticeable increase in memory usage after switching to zstd by default. We could make the window smaller, but that can hurt the compression ratio. Instead, we can use a single encoder and share it between the profile types.

This PR does the bare minimum to implement a single encoder. It's a bit kludgy to use a separate global lock to guard access to the encoder, but plumbing the synchronization around in a more encapsulated way would require a bigger refactor.
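As a rough sketch of the shared-encoder idea (the names and structure here are illustrative, not the actual dd-trace-go internals), a single package-level encoder guarded by a mutex might look like this:

```go
package profiler

import (
	"io"
	"sync"

	"github.com/klauspost/compress/zstd"
)

// One shared zstd encoder for all profile types, guarded by a package-level
// mutex so only one profile compresses at a time.
var (
	zstdEncoderMu sync.Mutex
	zstdEncoder   *zstd.Encoder
)

// compressZstd streams src into dst through the shared encoder.
func compressZstd(dst io.Writer, src io.Reader) error {
	zstdEncoderMu.Lock()
	defer zstdEncoderMu.Unlock()

	if zstdEncoder == nil {
		// Lazily create the encoder once; it owns the ~8MiB window.
		enc, err := zstd.NewWriter(nil)
		if err != nil {
			return err
		}
		zstdEncoder = enc
	}

	// Reset points the long-lived encoder at the new destination without
	// reallocating its internal buffers.
	zstdEncoder.Reset(dst)
	if _, err := io.Copy(zstdEncoder, src); err != nil {
		return err
	}
	// Close flushes the frame; the encoder can be Reset and reused afterwards.
	return zstdEncoder.Close()
}
```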
This will probably make our cycle time slightly longer, since we now wait
for all the processing to complete serially before advancing to the next
profile cycle. It's hard to quantify exactly how much since it depends on
how much profiling data the program produces.
Also worth noting: the execution tracer and CPU profile APIs take a writer when they start, rather than when we read the data. The tracer in particular periodically writes out data as it's running; the CPU profiler technically only writes data to the writer when it's stopped. Either way, we don't want to hold the global lock in a way that would block either of these from completing. So this PR collects the data in a separate buffer and (re)compresses it with the lock held after collection stops.
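For illustration, here is a minimal sketch of that buffering pattern for the execution tracer (the function and duration handling are hypothetical, not the PR's actual code; compressZstd is the shared-encoder helper sketched above):

```go
package profiler

import (
	"bytes"
	"runtime/trace"
	"time"
)

// collectTrace buffers tracer output in memory and only compresses it after
// collection has stopped.
func collectTrace(d time.Duration) ([]byte, error) {
	var raw bytes.Buffer
	// The tracer writes into a plain in-memory buffer while it runs, so its
	// periodic writes never contend on the shared compression lock.
	if err := trace.Start(&raw); err != nil {
		return nil, err
	}
	time.Sleep(d) // stand-in for the profiling period
	trace.Stop()

	// Only now, with collection finished, compress the buffered bytes; the
	// global encoder lock is held just for this final step.
	var out bytes.Buffer
	if err := compressZstd(&out, &raw); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}
```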
We should still come out ahead on memory usage by not paying ~8MiB per profile type for compression.