Skip to content

Conversation

@jbachorik
Copy link
Collaborator

@jbachorik jbachorik commented Oct 20, 2025

What does this PR do?:
The PR is improving the robustness of the calltrace storage implementation such that it does not get tripped by eg. profiling signals coming in the middle of the code which makes sure the concurrent access is safe.

The detailed description is in the arch doc

Motivation:
The memory corruption is pretty bad and we need to prevent it.

Additional Notes:
I have added a bunch of special stress tests for the calltrace storage with the help of Claude Code.
Also, I had to somehow placate ASAN so instead of some pseudo issues it actually started showing real issues in the stresstests.

As a side-effect, I have realized that the calltrace storage is using malloc, potentially in signal handler, when the underlying LongHashTable gets resized - and that can happen when we are trying to put a calltrace from eg. signal handler. And allocating from signal handler is never good ...

How to test the change?:
Added a bunch of focused tests for the calltrace storage functionality. All the existing tests are still passing.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-12778

Unsure? Have a question? Request a review!

Introduces race-free critical section using atomic compare-and-swap
instead of expensive signal blocking syscalls. Uses thread-local
atomic flags with CAS to prevent signal handler reentrancy while
maintaining high performance in hot paths.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@jbachorik jbachorik force-pushed the jb/trace_storage_fix branch 9 times, most recently from 3a43f71 to 404d5a0 Compare October 20, 2025 17:20
jbachorik and others added 12 commits October 20, 2025 19:52
Upgrades from double-buffered to triple-buffered architecture with
hazard pointer-based memory reclamation. Replaces SpinLock with atomic
pointers, adds generation counter for ABA protection, and improves
thread safety for high-frequency profiling scenarios.

Key improvements:
- Triple buffering: active, standby, and cleanup storage rotation
- Per-instance hazard pointer system for lock-free memory reclamation
- Instance-based 64-bit trace IDs for collision-free management
- Enhanced liveness preservation across storage rotations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Introduces comprehensive testing improvements including detailed crash
reporting with backtraces and register state, stress testing framework
for call trace storage under high contention, and platform-specific
debugging support using async-signal-safe functions.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

FIX 2
Integrates critical section management into core profiling components
for improved signal handler safety and TLS priming. Ensures proper
coordination between profiling events and storage operations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Updates all native unit tests to use improved critical section
management and testing infrastructure. Ensures compatibility with
new triple-buffered storage and hazard pointer systems.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Updates Gradle configuration and dependencies to support new critical
section management and triple-buffered storage architecture. Includes
dependency locks for reproducible builds.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Adds comprehensive 'Architectural Tidbits' section documenting key
architectural decisions including critical section management,
triple-buffered storage, enhanced testing infrastructure, and TLS
priming improvements made in 2025.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Replace alignas(DEFAULT_CACHE_LINE_SIZE) with alignas(alignof(SpinLock))
for classes that contain SpinLock members. The DEFAULT_CACHE_LINE_SIZE
was semantically incorrect - these classes need alignment to satisfy
SpinLock requirements (64 bytes), not cache line optimization.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Enhance alignment comments to clarify cache line alignment usage
in SpinLock and ThreadFilter components for performance optimization
and false sharing prevention.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Replace stack allocation of highly-aligned CallTraceHashTable instances
with heap allocation using std::aligned_alloc(). Stack allocation of
64-byte aligned objects causes ASAN runtime errors. Use RAII pattern
with unique_ptr and custom deleter for proper memory management.

Fixes alignment issues in:
- ConcurrentTableExpansionRegression test
- HashTableContentionTest
- TCMallocABRunner test
- HashTableSpinWaitEdgeCasesTest
- HashTableAllocationFailureStressTest

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@jbachorik jbachorik force-pushed the jb/trace_storage_fix branch from c541ddb to d9dd7ad Compare October 20, 2025 17:53
@jbachorik jbachorik marked this pull request as ready for review October 20, 2025 17:57
@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [aarch64 wall]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes wall wall
wall on on

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [x86_64 cpu]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes cpu cpu
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [x86_64 cpu,wall,alloc,memleak]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes cpu,wall,alloc,memleak cpu,wall,alloc,memleak
wall on on

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [x86_64 memleak]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes memleak memleak
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 14 metrics, 24 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [x86_64 memleak,alloc]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes memleak,alloc memleak,alloc
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 15 metrics, 23 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [x86_64 cpu,wall]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes cpu,wall cpu,wall
wall on on

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 19 metrics, 0 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [aarch64 cpu]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes cpu cpu
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [aarch64 alloc]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes alloc alloc
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [aarch64 memleak,alloc]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes memleak,alloc memleak,alloc
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [aarch64 cpu,wall,alloc,memleak]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes cpu,wall,alloc,memleak cpu,wall,alloc,memleak
wall on on

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 16 metrics, 22 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [aarch64 cpu,wall]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu on on
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes cpu,wall cpu,wall
wall on on

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [x86_64 alloc]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc on on
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak off off
modes alloc alloc
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 14 metrics, 24 unstable metrics.

@pr-commenter
Copy link

pr-commenter bot commented Oct 21, 2025

Benchmarks [aarch64 memleak]

Parameters

Baseline Candidate
config baseline candidate
ddprof 1.33.0 1.34.0-jb_trace_storage_fix-SNAPSHOT
See matching parameters
Baseline Candidate
alloc off off
cpu off off
iterations 5 5
java "11.0.28" "11.0.28"
memleak on on
modes memleak memleak
wall off off

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 17 metrics, 21 unstable metrics.

Copy link
Contributor

@zhengyu123 zhengyu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have yet gone through tests.

Copy link
Contributor

@zhengyu123 zhengyu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jbachorik jbachorik merged commit ae61732 into main Oct 23, 2025
95 checks passed
@jbachorik jbachorik deleted the jb/trace_storage_fix branch October 23, 2025 14:30
@github-actions github-actions bot added this to the 1.34.0 milestone Oct 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants