Enterprise-grade error handling and logging #45

tzanko-matev · 2025-10-03T12:41:47Z

Tried to centralize and manage error-handling and logging. Here's the status of the task. See the README.md for better description of the new features.

WS1 – Foundations & Inventory

State: In progress
Tooling: just errors-audit (finds PyRuntimeError::new_err, unwrap/expect/panic!, Python RuntimeError/ValueError).
What we saw:

Rust modules now emit RecorderError; raw Python exceptions survive in codetracer_python_recorder/session.py and tests (ISSUE-014).
src/monitoring/tracer.rs still uses lock().unwrap() and lacks error reporting for callback failures (ISSUE-013).
Python glue keeps legacy assertions/unwraps (ISSUE-012).
Next moves:
Land ISSUE-013 to sort the locking story.
Plan the Python facade cleanup (ISSUE-014) once WS4 is steady.

WS2 – `recorder-errors` Crate

State: Done (2025-10-02)
Highlights:

Added crates/recorder-errors with RecorderError, enums, context helpers, macros (usage!, enverr!, target!, bug!, ensure_*), plus tests and optional serde support.
cargo test -p recorder-errors + workspace cargo check stay green.
Next moves: Use this crate everywhere in WS3/WS4 work.

WS3 – Retrofit Rust Modules

State: Done (2025-10-02)
Highlights:

session/*, runtime/*, and monitoring/tracer.rs now return RecorderError via the shared macros.
Python exposure happens through one errors mapper; IO errors now carry context.
No stray PyRuntimeError::new_err left outside that mapper.
Next moves: Feed findings into WS4 and loop back to WS1 issues.

WS4 – FFI Wrapper & Python Exception Hierarchy

State: Done (2025-10-02)
Highlights:

Added ffi guard around each PyO3 entry point to map RecorderError plus panic safety.
Exposed Python classes RecorderError, UsageError, EnvironmentError, TargetError, InternalError.
Rust and Python tests cover the new flow (uv run cargo nextest run ...; .venv/bin/python -m pytest ...).
Next moves: Hold for WS5 until ISSUES 013/014 close.

WS5 – Policy Switches & Runtime Configuration

State: Done (2025-10-03)
Highlights:

TraceSession.start() and trace() now refresh policy from env vars and accept override mappings so embeds wire recorder switches without manual plumbing.
Rust exports expose configure_policy/configure_policy_from_env under the expected Python names; unit tests cover env-driven and explicit override flows.
Runtime tracer finish path honours RecorderPolicy: callback errors respect on_recorder_error (disable detaches without surfacing exceptions), require_trace now fails cleanly when no events land, and partial traces are deleted or retained based on keep_partial_trace.
Python CLI integration tests exercise disable vs abort paths and require-trace enforcement using the new failure-injection toggles; CLI now propagates runtime shutdown errors so exit codes reflect policy outcomes while partial traces are cleaned per configuration.
Next moves: Kick off WS6 once upstream WS1 cleanups land.

WS6 – Logging, Metrics, and Diagnostics

State: Done (2025-10-03)
Highlights:

Replaced the env_logger helper with a structured JSON logger that always emits run_id, active trace_id, and error_code fields while honouring policy-driven log level and log file overrides.
Introduced a pluggable RecorderMetrics sink and instrumented dropped locations, policy-triggered detachments, and caught panics across the monitoring/runtime paths; Rust unit tests exercise the metrics capture.
Enabled the --json-errors policy path so runtime shutdown emits a single-line JSON trailer on stderr; CLI integration tests now assert the abort flow surfaces the trailer alongside existing stack traces.
Next moves: Wire the metrics sink into the chosen exporter and align the log schema with Observability consumption before rolling out to downstream tooling.

WS7 – Test Coverage & Tooling Enforcement

State: Done (2025-10-04)
Highlights:

Expanded recorder-errors and policy unit tests covering every macro (usage/target/internal ensures) plus invalid boolean parsing.
Added FFI unit tests for dispatch/wrap_pyfunction, panic containment, and Python exception attribute propagation.
Introduced integration coverage for environment permission failures, injected target argument capture errors, and synthetic callback panics (verifying JSON trailers and error classes).
Implemented just lint orchestration running cargo clippy -D clippy::panic and a repository script that blocks unchecked .unwrap( usage outside the legacy allowlist.
Next moves: Monitor unwrap allowlist shrinkage once WS1 follow-ups land; evaluate extending the lint to .expect( once monitoring refactor closes.

WS8 – Documentation & Rollout

State: Done (2025-10-05)
Highlights:

README now covers the recorder error policy, JSON trailers, exit codes, and a short Python RecorderError catch example.
Added docs/onboarding/error-handling.md with migration steps, policy wiring tips, and assertion rules for contributors.
Started codetracer-python-recorder/CHANGELOG.md to brief downstream tools on consuming structured errors.
Next moves:
Share the onboarding doc with downstream maintainers and collect gaps before promoting ADR 0004 to Accepted.
Fold feedback into the change log before the next release tag.

When I use JJ workspaces I don't have a .git folder in the workspace. This makes flake.nix misbehave - I copy the whole directory contents in the /nix/store including the ./target directory which is 1.1GB. If this were a Git repo I would have copied only the tracked files. The restructuring fixes this issue.

Signed-off-by: Tzanko Matev <[email protected]>

We no longer emit DEBUG logs by default. This makes our tests less noisy. DEBUG logs can be turned on using RUST_LOG Also we tried to rephrase the error status document as an experiment to make Codex use a simpler language

…able or abort)

…strings Signed-off-by: Tzanko Matev <[email protected]>

… what docs need to be produced Signed-off-by: Tzanko Matev <[email protected]>

subfolder

github-actions · 2025-10-03T13:21:04Z

Coverage Summary

Rust (lines)
78.9% covered (3,201 / 4,058 | 857 missed)

File	Lines	Miss	Cover
`codetracer-python-recorder/crates/recorder-errors/src/lib.rs`	157	7	95.5%
`codetracer-python-recorder/src/code_object.rs`	117	16	86.3%
`codetracer-python-recorder/src/ffi.rs`	196	19	90.3%
`codetracer-python-recorder/src/lib.rs`	15	15	0.0%
`codetracer-python-recorder/src/logging.rs`	478	80	83.3%
`codetracer-python-recorder/src/monitoring/mod.rs`	69	3	95.7%
`codetracer-python-recorder/src/monitoring/tracer.rs`	794	221	72.2%
`codetracer-python-recorder/src/policy.rs`	240	48	80.0%
`codetracer-python-recorder/src/runtime/activation.rs`	117	9	92.3%
`codetracer-python-recorder/src/runtime/frame_inspector.rs`	145	63	56.6%
`codetracer-python-recorder/src/runtime/logging.rs`	15	4	73.3%
`codetracer-python-recorder/src/runtime/mod.rs`	1,158	191	83.5%
`codetracer-python-recorder/src/runtime/output_paths.rs`	98	16	83.7%
`codetracer-python-recorder/src/runtime/value_capture.rs`	102	66	35.3%
`codetracer-python-recorder/src/runtime/value_encoder.rs`	88	6	93.2%
`codetracer-python-recorder/src/session.rs`	71	71	0.0%
`codetracer-python-recorder/src/session/bootstrap.rs`	183	14	92.3%

Python (statements)
61.5% covered (112 / 182 | 70 missed)

File	Stmts	Miss	Cover
`codetracer-python-recorder/codetracer_python_recorder/__init__.py`	7	0	100.0%
`codetracer-python-recorder/codetracer_python_recorder/__main__.py`	59	59	0.0%
`codetracer-python-recorder/codetracer_python_recorder/api.py`	5	0	100.0%
`codetracer-python-recorder/codetracer_python_recorder/auto_start.py`	22	9	59.1%
`codetracer-python-recorder/codetracer_python_recorder/formats.py`	13	1	92.3%
`codetracer-python-recorder/codetracer_python_recorder/session.py`	76	1	98.7%

Generated automatically via generate_coverage_comment.py.

alehander92

this seems great, but it's so complex: i admit ~4k lines for error handling/logging seem a lot to me, i'd expect, maybe naively from the outside, that a whole recorder is ~4k lines; maybe if there were suitable ready libs for some things like policy/error facades it would decrease it. Some of it: maybe a lot of this are touched/changed lines though: not all is new code as well, so maybe it makes sense and it's just hard for me to quickly track it

on the other hand a lot of it makes sense, and there are many cases where custom logic/flexibility is needed, so it's probably very useful

tzanko-matev · 2025-10-03T22:05:55Z

A lot of it are just tests. Basically I told codex that our error handling was ad-hoc and asked it to fix it

alehander92 · 2025-10-03T22:30:21Z

tests are maybe ~500 lines, maybe ~500 lines documents as well: there are a lot of new abstractions/helpers and places where code is wrapped with them, i guess they are useful, i just haven't went deep into them

tzanko-matev added 21 commits October 3, 2025 09:08

errors:WS1

fbc80d2

errors: WS2

1465113

errors: WS3

0c9b392

errors: WS4

3ddf6b6

.gitignore: Remove .coverage file tracked by mistake

5ce6eb4

Signed-off-by: Tzanko Matev <[email protected]>

fix: Improve logging

a57b18b

We no longer emit DEBUG logs by default. This makes our tests less noisy. DEBUG logs can be turned on using RUST_LOG Also we tried to rephrase the error status document as an experiment to make Codex use a simpler language

WS5

37e476d

WS5-2

43c5e06

WS5-3

45f4b51

WS6

a49c8bf

fix: Make sure that panic! is handled according to policy (either dis…

d0dee6c

…able or abort)

codetracer-python-recorder/crates/recorder-errors/src/lib.rs: Add doc…

edc22b1

…strings Signed-off-by: Tzanko Matev <[email protected]>

test: Remove integration-test logic from production build

a9ab079

design-docs/error-handling-implementation-plan.md: More details about…

003935f

… what docs need to be produced Signed-off-by: Tzanko Matev <[email protected]>

WS7 - Test coverage & Tooling Enforcement

305c18e

WS8 - Docs

aa81052

fix(ci): Fix regression of CI after I had moved flake.nix to ./nix

b9c61f0

subfolder

Initial spec for capture-output

ae31932

ADR made to conform to current codebase

e7635e1

fix(ci): Fix regression in coverage script

16e2f63

alehander92 approved these changes Oct 3, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Enterprise-grade error handling and logging #45

Enterprise-grade error handling and logging #45

Uh oh!

tzanko-matev commented Oct 3, 2025

Uh oh!

github-actions bot commented Oct 3, 2025

Uh oh!

alehander92 left a comment •

edited

Loading

Uh oh!

tzanko-matev commented Oct 3, 2025

Uh oh!

alehander92 commented Oct 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Enterprise-grade error handling and logging #45

Are you sure you want to change the base?

Enterprise-grade error handling and logging #45

Uh oh!

Conversation

tzanko-matev commented Oct 3, 2025

WS1 – Foundations & Inventory

WS2 – recorder-errors Crate

WS3 – Retrofit Rust Modules

WS4 – FFI Wrapper & Python Exception Hierarchy

WS5 – Policy Switches & Runtime Configuration

WS6 – Logging, Metrics, and Diagnostics

WS7 – Test Coverage & Tooling Enforcement

WS8 – Documentation & Rollout

Uh oh!

github-actions bot commented Oct 3, 2025

Coverage Summary

Uh oh!

alehander92 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tzanko-matev commented Oct 3, 2025

Uh oh!

alehander92 commented Oct 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

WS2 – `recorder-errors` Crate

alehander92 left a comment •

edited

Loading