-
Notifications
You must be signed in to change notification settings - Fork 10
Fix calltracestorage concurrent and reentrant issues #280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CLAUDE
ef69524 to
8bc6da3
Compare
Introduces race-free critical section using atomic compare-and-swap instead of expensive signal blocking syscalls. Uses thread-local atomic flags with CAS to prevent signal handler reentrancy while maintaining high performance in hot paths. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
3a43f71 to
404d5a0
Compare
Upgrades from double-buffered to triple-buffered architecture with hazard pointer-based memory reclamation. Replaces SpinLock with atomic pointers, adds generation counter for ABA protection, and improves thread safety for high-frequency profiling scenarios. Key improvements: - Triple buffering: active, standby, and cleanup storage rotation - Per-instance hazard pointer system for lock-free memory reclamation - Instance-based 64-bit trace IDs for collision-free management - Enhanced liveness preservation across storage rotations 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Introduces comprehensive testing improvements including detailed crash reporting with backtraces and register state, stress testing framework for call trace storage under high contention, and platform-specific debugging support using async-signal-safe functions. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> FIX 2
Integrates critical section management into core profiling components for improved signal handler safety and TLS priming. Ensures proper coordination between profiling events and storage operations. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Updates all native unit tests to use improved critical section management and testing infrastructure. Ensures compatibility with new triple-buffered storage and hazard pointer systems. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Updates Gradle configuration and dependencies to support new critical section management and triple-buffered storage architecture. Includes dependency locks for reproducible builds. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Adds comprehensive 'Architectural Tidbits' section documenting key architectural decisions including critical section management, triple-buffered storage, enhanced testing infrastructure, and TLS priming improvements made in 2025. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Replace alignas(DEFAULT_CACHE_LINE_SIZE) with alignas(alignof(SpinLock)) for classes that contain SpinLock members. The DEFAULT_CACHE_LINE_SIZE was semantically incorrect - these classes need alignment to satisfy SpinLock requirements (64 bytes), not cache line optimization. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Enhance alignment comments to clarify cache line alignment usage in SpinLock and ThreadFilter components for performance optimization and false sharing prevention. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Replace stack allocation of highly-aligned CallTraceHashTable instances with heap allocation using std::aligned_alloc(). Stack allocation of 64-byte aligned objects causes ASAN runtime errors. Use RAII pattern with unique_ptr and custom deleter for proper memory management. Fixes alignment issues in: - ConcurrentTableExpansionRegression test - HashTableContentionTest - TCMallocABRunner test - HashTableSpinWaitEdgeCasesTest - HashTableAllocationFailureStressTest 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
c541ddb to
d9dd7ad
Compare
Benchmarks [aarch64 wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [x86_64 cpu]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [x86_64 cpu,wall,alloc,memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [x86_64 memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 14 metrics, 24 unstable metrics. |
Benchmarks [x86_64 memleak,alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics. |
Benchmarks [x86_64 cpu,wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 19 metrics, 0 unstable metrics. |
Benchmarks [aarch64 cpu]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [aarch64 alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [aarch64 memleak,alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [aarch64 cpu,wall,alloc,memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics. |
Benchmarks [aarch64 cpu,wall]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
Benchmarks [x86_64 alloc]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 14 metrics, 24 unstable metrics. |
Benchmarks [aarch64 memleak]Parameters
See matching parameters
SummaryFound 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics. |
zhengyu123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have yet gone through tests.
…ace_storage_fix # Conflicts: # ddprof-lib/src/main/cpp/callTraceStorage.cpp # ddprof-lib/src/main/cpp/criticalSection.cpp # ddprof-lib/src/main/cpp/criticalSection.h
zhengyu123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What does this PR do?:
The PR is improving the robustness of the calltrace storage implementation such that it does not get tripped by eg. profiling signals coming in the middle of the code which makes sure the concurrent access is safe.
The detailed description is in the arch doc
Motivation:
The memory corruption is pretty bad and we need to prevent it.
Additional Notes:
I have added a bunch of special stress tests for the calltrace storage with the help of Claude Code.
Also, I had to somehow placate ASAN so instead of some pseudo issues it actually started showing real issues in the stresstests.
As a side-effect, I have realized that the calltrace storage is using malloc, potentially in signal handler, when the underlying LongHashTable gets resized - and that can happen when we are trying to put a calltrace from eg. signal handler. And allocating from signal handler is never good ...
How to test the change?:
Added a bunch of focused tests for the calltrace storage functionality. All the existing tests are still passing.
For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.Unsure? Have a question? Request a review!