CUTracer is an NVBit-based CUDA binary instrumentation tool. It cleanly separates lightweight data collection (instrumentation) from host-side processing (analysis). Typical workflows include per-warp instruction histograms (delimited by GPU clock reads) and kernel hang detection.
- NVBit-powered, runtime attach via CUDA_INJECTION64_PATH (no app rebuild needed)
- Multiple instrumentation modes: opcode-only, register trace, memory trace
- Built-in analyses:
- Instruction Histogram (for Proton/Triton workflows)
- Deadlock/Hang Detection
- CUDA Graph and stream-capture aware flows
- Deterministic kernel log file naming and CSV outputs
- Install third-party dependency (NVBit):
git clone git@github.com:facebookresearch/CUTracer.git
cd CUTracer
./install_third_party.sh
- Build the tool:
make -j$(nproc)
- Run your CUDA app with CUTracer (example: No instrumentation):
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
./your_app
- CUTRACER_INSTRUMENT: comma-separated modes: opcode_only, reg_trace, mem_trace
- CUTRACER_ANALYSIS: comma-separated analyses: proton_instr_histogram, deadlock_detection
  - Enabling proton_instr_histogram auto-enables opcode_only
  - Enabling deadlock_detection auto-enables reg_trace
- KERNEL_FILTERS: comma-separated substrings matching unmangled or mangled kernel names
- INSTR_BEGIN, INSTR_END: static instruction index gate during instrumentation
- TOOL_VERBOSE: 0/1/2
Note: The tool sets CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 to simplify channel memory handling.
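The variables above can also be assembled programmatically when launching an app from a script. The following is a minimal sketch, not part of CUTracer: the helper names `build_cutracer_env` and `effective_modes` are hypothetical, and the auto-enable table simply restates the two rules listed above.

```python
import os

# Restates the README rules: enabling an analysis auto-enables a mode.
AUTO_ENABLED_MODE = {
    "proton_instr_histogram": "opcode_only",
    "deadlock_detection": "reg_trace",
}


def build_cutracer_env(lib_path, analyses=(), modes=(), kernel_filters=(),
                       base_env=None):
    """Return a copy of the environment with the CUTracer variables set.

    Hypothetical helper: only variables named in the README are written,
    and a variable is left unset when its list is empty.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_INJECTION64_PATH"] = os.path.expanduser(lib_path)
    if modes:
        env["CUTRACER_INSTRUMENT"] = ",".join(modes)
    if analyses:
        env["CUTRACER_ANALYSIS"] = ",".join(analyses)
    if kernel_filters:
        env["KERNEL_FILTERS"] = ",".join(kernel_filters)
    return env


def effective_modes(analyses, modes=()):
    """List the modes that end up active once auto-enabling is applied."""
    result = list(modes)
    for analysis in analyses:
        mode = AUTO_ENABLED_MODE.get(analysis)
        if mode and mode not in result:
            result.append(mode)
    return result
```

The resulting dictionary can be passed to `subprocess.run(["./your_app"], env=env, check=True)`, which is equivalent to prefixing the command with the variables in a shell.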
- Counts SASS instruction mnemonics per warp within regions delimited by clock reads (start/stop model; nested regions not supported)
- Output: one CSV per kernel launch with columns warp_id,region_id,instruction,count
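A histogram CSV with these four columns can be summarized with the standard library alone. The sketch below assumes the file starts with a header row naming the columns exactly as listed above; that is an assumption about the file layout, and the sample rows are made up for illustration rather than taken from a real run.

```python
import csv
import io
from collections import Counter

# Illustrative data only; real files are produced per kernel launch.
SAMPLE = (
    "warp_id,region_id,instruction,count\n"
    "0,0,IADD3,4\n"
    "0,0,LDG.E,2\n"
    "1,0,IADD3,4\n"
    "1,1,STG.E,1\n"
)


def total_by_instruction(csv_file):
    """Sum counts per instruction mnemonic across all warps and regions."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row["instruction"]] += int(row["count"])
    return totals


def total_by_region(csv_file):
    """Sum counts per (warp_id, region_id) pair."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        key = (int(row["warp_id"]), int(row["region_id"]))
        totals[key] += int(row["count"])
    return totals


by_instruction = total_by_instruction(io.StringIO(SAMPLE))
by_region = total_by_region(io.StringIO(SAMPLE))
```

For a real file, replace `io.StringIO(SAMPLE)` with `open(path, newline="")`.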
- Detects hangs by identifying warps stuck in stable PC loops; logs the hang and, if it persists, sends SIGTERM followed by SIGKILL
- Requires reg_trace (auto-enabled)
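To illustrate the idea of a "stable PC loop", the sketch below flags a warp when its most recent program-counter samples stay within a very small set of addresses. This is not CUTracer's implementation: the class name, the window size, and the distinct-PC threshold are all hypothetical, chosen only to show the general technique.

```python
from collections import defaultdict, deque


class StablePcLoopDetector:
    """Flag warps whose recent PCs cycle over a small, fixed set of addresses."""

    def __init__(self, window=64, max_loop_pcs=4):
        # window and max_loop_pcs are illustrative thresholds, not tool defaults.
        self.window = window
        self.max_loop_pcs = max_loop_pcs
        # Per-warp ring buffer holding the last `window` PC samples.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, warp_id, pc):
        """Record one PC sample for a warp."""
        self._history[warp_id].append(pc)

    def stuck_warps(self):
        """Return warps with a full window that spans few distinct PCs."""
        stuck = []
        for warp_id, pcs in self._history.items():
            if len(pcs) == self.window and len(set(pcs)) <= self.max_loop_pcs:
                stuck.append(warp_id)
        return sorted(stuck)


detector = StablePcLoopDetector(window=6, max_loop_pcs=2)
for step in range(6):
    detector.observe(0, 0x100 if step % 2 == 0 else 0x110)  # spins on two PCs
    detector.observe(1, 0x200 + 0x10 * step)                 # makes progress
stuck = detector.stuck_warps()
```

A real detector would additionally need to require that the condition holds over time before escalating, since short, legitimate loops also revisit a few PCs.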
cd ~/CUTracer/tests/proton_tests
# 1) Collect histogram with CUTracer
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py
# 2) Run without CUTracer to generate a clean Chrome trace
python ./vector-add-instrumented.py
# 3) Merge and compute IPC
python ~/CUTracer/scripts/parse_instr_hist_trace.py \
--chrome-trace ./vector.chrome_trace \
--cutracer-trace ./kernel_*_add_kernel_hist.csv \
--cutracer-log ./cutracer_main_*.log \
--output vectoradd_ipc.csv
cd ~/CUTracer/tests/hang_test
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py
- No CSV/log: check CUDA_INJECTION64_PATH, KERNEL_FILTERS, and write permissions
- Empty histogram: ensure kernels emit clock instructions (e.g., Triton pl.scope)
- High overhead: prefer opcode-only; narrow filters; use INSTR_BEGIN/INSTR_END
- CUDA Graph/stream capture: data is flushed at cuGraphLaunch exit; ensure stream sync
- IPC merge issues: resolve warp mismatches and kernel hash ambiguity with parser flags
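The first item in the list above (no CSV or log produced) can be checked before launching. The following is a minimal sketch with hypothetical helper names; it assumes outputs are written to the directory passed as `output_dir` (the current directory by default), which is an assumption rather than documented behavior. The filter check mirrors the substring matching described for KERNEL_FILTERS.

```python
import os


def preflight(env, output_dir="."):
    """Return a list of human-readable problems found in the setup."""
    problems = []
    lib = env.get("CUDA_INJECTION64_PATH")
    if not lib:
        problems.append("CUDA_INJECTION64_PATH is not set")
    elif not os.path.isfile(os.path.expanduser(lib)):
        problems.append(
            f"CUDA_INJECTION64_PATH does not point to a file: {lib}")
    if not os.access(output_dir, os.W_OK):
        problems.append(f"no write permission for: {output_dir}")
    return problems


def matching_filters(filters_value, kernel_name):
    """Return the KERNEL_FILTERS entries that occur in a kernel name."""
    return [f for f in filters_value.split(",") if f and f in kernel_name]
```

For example, `preflight(os.environ)` returns an empty list when nothing looks wrong, and `matching_filters(os.environ.get("KERNEL_FILTERS", ""), name)` shows which filter entries, if any, would select a given kernel name.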
This repository contains code under the MIT license (Meta) and the BSD-3-Clause license (NVIDIA). See LICENSE and LICENSE-BSD for details.
The full documentation lives in the Wiki. Key topics include Quickstart, Analyses, Post-processing, Configuration, Outputs, API & Data Structures, Developer Guide, and Troubleshooting.