CUTracer is an NVBit-based CUDA binary instrumentation tool. It cleanly separates lightweight data collection (instrumentation) from host-side processing (analysis). Typical workflows include per-warp instruction histograms (delimited by GPU clock reads) and kernel hang detection.
- NVBit-powered, runtime attach via CUDA_INJECTION64_PATH (no app rebuild needed)
- Multiple instrumentation modes: opcode-only, register trace, memory trace
- Built-in analyses:
  - Instruction Histogram (for Proton/Triton workflows)
  - Deadlock/Hang Detection
- CUDA Graph and stream-capture aware flows
- Deterministic kernel log file naming and CSV outputs
- Install third-party dependency (NVBit):
git clone git@github.com:facebookresearch/CUTracer.git
cd CUTracer
./install_third_party.sh
- Build the tool:
make -j$(nproc)
- Run your CUDA app with CUTracer (example: No instrumentation):
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
./your_app

- CUTRACER_INSTRUMENT: comma-separated modes: opcode_only, reg_trace, mem_trace
- CUTRACER_ANALYSIS: comma-separated analyses: proton_instr_histogram, deadlock_detection
  - Enabling proton_instr_histogram auto-enables opcode_only
  - Enabling deadlock_detection auto-enables reg_trace
- KERNEL_FILTERS: comma-separated substrings matching unmangled or mangled kernel names
- INSTR_BEGIN, INSTR_END: static instruction index gate during instrumentation
- TOOL_VERBOSE: 0/1/2
Note: The tool sets CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 to simplify channel memory handling.
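
As a rough illustration of how the settings above interact, the following Python sketch resolves the effective instrumentation modes and applies the kernel filter. It is not CUTracer's implementation: the helper names (`split_csv`, `resolve_modes`, `kernel_selected`) are hypothetical, and the behavior for an empty `KERNEL_FILTERS` (select every kernel) is an assumption.

```python
# Auto-enable rules as described above: each analysis implies one mode.
ANALYSIS_IMPLIES = {
    "proton_instr_histogram": "opcode_only",
    "deadlock_detection": "reg_trace",
}


def split_csv(value):
    """Split a comma-separated setting into non-empty, stripped items."""
    return [item.strip() for item in (value or "").split(",") if item.strip()]


def resolve_modes(env):
    """Return the instrumentation modes after applying auto-enable rules."""
    modes = set(split_csv(env.get("CUTRACER_INSTRUMENT")))
    for analysis in split_csv(env.get("CUTRACER_ANALYSIS")):
        implied = ANALYSIS_IMPLIES.get(analysis)
        if implied:
            modes.add(implied)
    return modes


def kernel_selected(env, mangled, unmangled):
    """True if any filter substring occurs in either form of the kernel name.

    Assumption: an unset or empty filter list selects every kernel.
    """
    filters = split_csv(env.get("KERNEL_FILTERS"))
    if not filters:
        return True
    return any(f in mangled or f in unmangled for f in filters)


env = {"CUTRACER_ANALYSIS": "deadlock_detection", "KERNEL_FILTERS": "add_kernel"}
print(sorted(resolve_modes(env)))  # ['reg_trace']
print(kernel_selected(env, "_Z10add_kernelPfS_S_i", "add_kernel(float*, float*, float*, int)"))  # True
```

Passing a plain dictionary keeps the sketch testable; `os.environ` could be passed in the same way.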
- Counts SASS instruction mnemonics per warp within regions delimited by clock reads (start/stop model; nested regions not supported)
- Output: one CSV per kernel launch with columns warp_id, region_id, instruction, count
- Detects sustained hangs by identifying warps stuck in stable PC loops; logs and issues SIGTERM→SIGKILL if sustained
- Requires reg_trace (auto-enabled)
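
To make the idea of a "stable PC loop" concrete, here is a minimal Python sketch that checks whether the tail of a warp's sampled program counters repeats a short pattern. This is only an illustration of the concept. The actual heuristic, thresholds, and data layout used by CUTracer are not described here, and the names `loop_period`, `stuck_warps`, `max_period`, and `min_repeats` are hypothetical.

```python
def loop_period(pcs, max_period=8, min_repeats=4):
    """Return the smallest period p such that the last p * min_repeats
    samples in `pcs` repeat the same p-long pattern, or None."""
    for p in range(1, max_period + 1):
        span = p * min_repeats
        if len(pcs) < span:
            break  # not enough samples to judge this or any longer period
        tail = pcs[-span:]
        if all(tail[i] == tail[i % p] for i in range(span)):
            return p
    return None


def stuck_warps(traces, **kwargs):
    """Return the sorted warp ids whose PC trace ends in a repeating loop."""
    return sorted(w for w, pcs in traces.items()
                  if loop_period(pcs, **kwargs) is not None)


# Made-up traces: warp 0 spins over three PCs, warp 1 keeps advancing.
traces = {
    0: [0x10, 0x20, 0x30] * 5,
    1: list(range(0, 160, 16)),
}
print(stuck_warps(traces))  # [0]
```

A short repeating pattern alone does not prove a hang, since a healthy spin-wait looks the same for a while. That is presumably why the detection above only acts when the condition is sustained.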
cd ~/CUTracer/tests/proton_tests
# 1) Collect histogram with CUTracer
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py
# 2) Run without CUTracer to generate a clean Chrome trace
python ./vector-add-instrumented.py
# 3) Merge and compute IPC
python ~/CUTracer/scripts/parse_instr_hist_trace.py \
--chrome-trace ./vector.chrome_trace \
--cutracer-trace ./kernel_*_add_kernel_hist.csv \
--cutracer-log ./cutracer_main_*.log \
--output vectoradd_ipc.csv

cd ~/CUTracer/tests/hang_test
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py

- No CSV/log: check CUDA_INJECTION64_PATH, KERNEL_FILTERS, and write permissions
- Empty histogram: ensure kernels emit clock instructions (e.g., Triton pl.scope)
- High overhead: prefer opcode-only; narrow filters; use INSTR_BEGIN/INSTR_END
- CUDA Graph/stream capture: data is flushed at cuGraphLaunch exit; ensure stream sync
- IPC merge issues: resolve warp mismatches and kernel hash ambiguity with parser flags
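
When chasing an empty histogram or a warp mismatch, it can help to inspect a histogram CSV directly before running the merge script. The sketch below totals instruction counts and lists the warps present. It assumes the file has a header row with the documented column names (warp_id, region_id, instruction, count); the function name `summarize_histogram` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarize_histogram(csv_file):
    """Aggregate instruction counts across all warps and regions.

    Assumes a header row: warp_id,region_id,instruction,count.
    Returns (Counter of instruction -> total count, set of warp ids).
    """
    per_instruction = Counter()
    warps = set()
    for row in csv.DictReader(csv_file):
        per_instruction[row["instruction"]] += int(row["count"])
        warps.add(int(row["warp_id"]))
    return per_instruction, warps


# Made-up sample rows standing in for a real kernel_*_hist.csv file.
sample = io.StringIO(
    "warp_id,region_id,instruction,count\n"
    "0,0,LDG.E,4\n"
    "0,0,FADD,2\n"
    "1,0,LDG.E,4\n"
)
totals, warps = summarize_histogram(sample)
print(totals.most_common())  # [('LDG.E', 8), ('FADD', 2)]
print(sorted(warps))         # [0, 1]
```

For a real file, pass `open(path, newline="")` instead of the in-memory sample. An empty `totals` suggests no clock-delimited region was recorded for that kernel.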
This repository contains code under the MIT license (Meta) and the BSD-3-Clause license (NVIDIA). See LICENSE and LICENSE-BSD for details.
The full documentation lives in the Wiki. Key topics include Quickstart, Analyses, Post-processing, Configuration, Outputs, API & Data Structures, Developer Guide, and Troubleshooting.