Skip to content

Conversation

@endigma
Copy link
Member

@endigma endigma commented Nov 14, 2025

Summary by CodeRabbit

  • New Features
    • Introduced FlightRecorder, a new module that automatically captures detailed flight traces whenever request latency exceeds a configured threshold. This enables efficient performance analysis and debugging of slow-running requests with actionable trace data.
    • Fully customizable configuration options: set output directory location, adjust latency thresholds, and choose between single or multiple trace capture modes

Checklist

@coderabbitai
Copy link

coderabbitai bot commented Nov 14, 2025

Walkthrough

Introduces a new FlightRecorder module for the router using Go 1.25's flight recording feature. Includes documentation describing configuration and usage, a main.go entrypoint that imports the module, and a module.go implementation that records traces when request latency exceeds a configurable threshold.

Changes

Cohort / File(s) Summary
Documentation
router/cmd/flightrecorder/README.md
Adds comprehensive README documenting the FlightRecorder module, including minimal example main, configuration fields (outputPath, requestLatencyRecordThreshold, recordMultiple), field explanations, setup instructions, and build/run commands.
Entrypoint
router/cmd/flightrecorder/main.go
Adds Go executable entrypoint that imports the flightrecorder module for side effects and delegates execution to routercmd.Main().
Module Implementation
router/cmd/flightrecorder/module/module.go
Implements FlightRecorder module with configuration validation, request latency measurement, threshold-based trace recording, trace file persistence, and cleanup logic. Registers with core framework and implements RouterOnRequestHandler, Provisioner, and Cleaner interfaces.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • RouterOnRequest method includes a 150ms post-processing sleep — verify if this timing is intentional and necessary
  • RecordTrace error handling when RecordMultiple is false — ensure trace recorder state is managed correctly
  • Provision method calculations for MinAge and MaxBytes when initializing trace.FlightRecorder — validate memory/performance assumptions
  • File I/O and directory creation error handling — confirm all edge cases are properly managed

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: add custom module implementing flight recorder' accurately and concisely summarizes the main change—adding a new flight recorder module to the router codebase.
✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

github-actions bot commented Nov 14, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-f98b9354ff4202e4d61733e5fbb54f44d7e7a5a9-nonroot

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
router/cmd/flightrecorder/README.md (1)

5-17: Replace hard tabs with spaces in code example.

The code example uses hard tabs on lines 9, 10, 11, and 15, which should be replaced with spaces to align with standard Go formatting conventions.

Apply this diff:

 package main
 
 import (
-	routercmd "github.com/wundergraph/cosmo/router/cmd"
-	// Import your modules here
-	_ "github.com/wundergraph/cosmo/router/cmd/flightrecorder/module"
+    routercmd "github.com/wundergraph/cosmo/router/cmd"
+    // Import your modules here
+    _ "github.com/wundergraph/cosmo/router/cmd/flightrecorder/module"
 )
 
 func main() {
-	routercmd.Main()
+    routercmd.Main()
 }
router/cmd/flightrecorder/module/module.go (3)

29-29: Remove template comment.

This comment appears to be leftover from scaffolding and should be removed.

Apply this diff:

     requestLatencyRecordThresholdDuration time.Duration
 
-    // Add a new property here
     fl *trace.FlightRecorder

100-100: Fix error message format.

The error message uses a format string with %w but should rely on zap.Error for structured logging.

Apply this diff:

-        m.Logger.Error("failed to create trace file: %w", zap.Error(err))
+        m.Logger.Error("failed to create trace file", zap.Error(err))

92-113: Add file cleanup mechanism and sanitize operation names.

Two operational concerns:

  1. No file cleanup/rotation: Trace files accumulate indefinitely in the output directory, which could eventually exhaust disk space. Consider implementing a retention policy (e.g., delete files older than N days, or keep only the last N files).

  2. Unsanitized operation names in filenames: Operation names are used directly in filenames (line 94), but may contain characters that are invalid for filenames (/, \, :, *, ?, ", <, >, |). This could cause os.Create to fail.

Example sanitization approach:

import (
    "regexp"
    // ... other imports
)

var invalidFilenameChars = regexp.MustCompile(`[<>:"/\\|?*]`)

func sanitizeOperationName(name string) string {
    // Replace invalid characters with underscores
    sanitized := invalidFilenameChars.ReplaceAllString(name, "_")
    // Limit length to avoid filesystem limits
    if len(sanitized) > 100 {
        sanitized = sanitized[:100]
    }
    return sanitized
}

func (m *FlightRecorder) RecordTrace(operationName string) {
    sanitized := sanitizeOperationName(operationName)
    filename := fmt.Sprintf("trace-%s-%s.out", sanitized, time.Now().Format("2006-01-02-15-04-05"))
    // ... rest of the method
}

For file cleanup, consider adding a cleanup routine in the Provision method or as a background goroutine.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f94a099 and 12c2f7d.

📒 Files selected for processing (3)
  • router/cmd/flightrecorder/README.md (1 hunks)
  • router/cmd/flightrecorder/main.go (1 hunks)
  • router/cmd/flightrecorder/module/module.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
router/cmd/flightrecorder/main.go (1)
router/cmd/main.go (1)
  • Main (42-277)
router/cmd/flightrecorder/module/module.go (2)
router/core/modules.go (7)
  • RegisterModule (56-73)
  • ModuleContext (160-164)
  • Module (52-54)
  • ModuleInfo (44-50)
  • RouterOnRequestHandler (113-115)
  • Provisioner (149-152)
  • Cleaner (154-157)
router/core/context.go (1)
  • RequestContext (61-142)
🪛 markdownlint-cli2 (0.18.1)
router/cmd/flightrecorder/README.md

9-9: Hard tabs
Column: 1

(MD010, no-hard-tabs)


10-10: Hard tabs
Column: 1

(MD010, no-hard-tabs)


11-11: Hard tabs
Column: 1

(MD010, no-hard-tabs)


15-15: Hard tabs
Column: 1

(MD010, no-hard-tabs)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/cmd/flightrecorder/main.go (1)

1-10: LGTM!

The entrypoint correctly delegates to the router's main function and imports the module for side-effect registration. This is the standard pattern for module initialization.

Comment on lines +56 to +61
m.fl = trace.NewFlightRecorder(trace.FlightRecorderConfig{
MinAge: m.requestLatencyRecordThresholdDuration * 2,

// 10 MB/s of MinAge
MaxBytes: 10 * 1024 * 1024 * uint64(m.requestLatencyRecordThresholdDuration*2/time.Second),
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix MaxBytes calculation that evaluates to zero.

The MaxBytes calculation has a critical flaw. When requestLatencyRecordThresholdDuration is 100ms (the example threshold), the calculation requestLatencyRecordThresholdDuration*2/time.Second evaluates to 200ms/1s = 0.2, which is then cast to uint64(0.2) = 0. This results in MaxBytes = 0, which may cause the flight recorder to not buffer data correctly.

Apply this diff to fix the calculation:

     m.fl = trace.NewFlightRecorder(trace.FlightRecorderConfig{
         MinAge: m.requestLatencyRecordThresholdDuration * 2,
 
-        // 10 MB/s of MinAge
-        MaxBytes: 10 * 1024 * 1024 * uint64(m.requestLatencyRecordThresholdDuration*2/time.Second),
+        // 10 MB per second of MinAge
+        MaxBytes: uint64(10 * 1024 * 1024 * (m.requestLatencyRecordThresholdDuration * 2).Seconds()),
     })
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
m.fl = trace.NewFlightRecorder(trace.FlightRecorderConfig{
MinAge: m.requestLatencyRecordThresholdDuration * 2,
// 10 MB/s of MinAge
MaxBytes: 10 * 1024 * 1024 * uint64(m.requestLatencyRecordThresholdDuration*2/time.Second),
})
m.fl = trace.NewFlightRecorder(trace.FlightRecorderConfig{
MinAge: m.requestLatencyRecordThresholdDuration * 2,
// 10 MB per second of MinAge
MaxBytes: uint64(10 * 1024 * 1024 * (m.requestLatencyRecordThresholdDuration * 2).Seconds()),
})
🤖 Prompt for AI Agents
In router/cmd/flightrecorder/module/module.go around lines 56 to 61, the
MaxBytes expression multiplies by
uint64(m.requestLatencyRecordThresholdDuration*2/time.Second) which truncates
fractional seconds to zero for sub-second durations; replace this with a
calculation that preserves fractions (e.g. use Seconds() as float64 or convert
durations to nanoseconds before integer math). Specifically, compute the
multiplier using float64(m.requestLatencyRecordThresholdDuration.Seconds()*2)
and then cast the final product to uint64, or compute bytes using nanoseconds:
uint64(10*1024*1024) *
uint64((m.requestLatencyRecordThresholdDuration*2).Nanoseconds()) /
uint64(time.Second.Nanoseconds()), and assign that to MaxBytes so sub-second
durations do not produce zero.

Comment on lines +74 to +90
func (m *FlightRecorder) RouterOnRequest(ctx core.RequestContext, next http.Handler) {
start := time.Now()

next.ServeHTTP(ctx.ResponseWriter(), ctx.Request())

time.Sleep(150 * time.Millisecond)

requestDuration := time.Since(start)

if m.fl.Enabled() && requestDuration > m.requestLatencyRecordThresholdDuration {
operation := ctx.Operation()

m.Logger.Warn("Request took longer than threshold", zap.Duration("duration", requestDuration), zap.String("operation", operation.Name()))

m.RecordTrace(operation.Name())
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Remove hardcoded sleep that adds latency to every request.

Line 79 contains a hardcoded time.Sleep(150 * time.Millisecond) that will add 150ms of latency to every single request, regardless of whether recording occurs. This is a critical performance issue that will severely degrade user experience.

Additionally, the sleep causes incorrect duration measurement on Line 81, as requestDuration includes the sleep time, making the threshold comparison meaningless.

If the sleep was intended to allow trace data to be captured post-request, consider these alternatives:

  1. Remove the sleep entirely and rely on the flight recorder's MinAge configuration
  2. If post-processing time is needed, only sleep when recording will occur (inside the conditional)
  3. Use goroutines to record traces asynchronously without blocking the request

Apply this diff to remove the problematic sleep:

 func (m *FlightRecorder) RouterOnRequest(ctx core.RequestContext, next http.Handler) {
     start := time.Now()
 
     next.ServeHTTP(ctx.ResponseWriter(), ctx.Request())
 
-    time.Sleep(150 * time.Millisecond)
-
     requestDuration := time.Since(start)
 
     if m.fl.Enabled() && requestDuration > m.requestLatencyRecordThresholdDuration {
         operation := ctx.Operation()
 
         m.Logger.Warn("Request took longer than threshold", zap.Duration("duration", requestDuration), zap.String("operation", operation.Name()))
 
         m.RecordTrace(operation.Name())
     }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (m *FlightRecorder) RouterOnRequest(ctx core.RequestContext, next http.Handler) {
start := time.Now()
next.ServeHTTP(ctx.ResponseWriter(), ctx.Request())
time.Sleep(150 * time.Millisecond)
requestDuration := time.Since(start)
if m.fl.Enabled() && requestDuration > m.requestLatencyRecordThresholdDuration {
operation := ctx.Operation()
m.Logger.Warn("Request took longer than threshold", zap.Duration("duration", requestDuration), zap.String("operation", operation.Name()))
m.RecordTrace(operation.Name())
}
}
func (m *FlightRecorder) RouterOnRequest(ctx core.RequestContext, next http.Handler) {
start := time.Now()
next.ServeHTTP(ctx.ResponseWriter(), ctx.Request())
requestDuration := time.Since(start)
if m.fl.Enabled() && requestDuration > m.requestLatencyRecordThresholdDuration {
operation := ctx.Operation()
m.Logger.Warn("Request took longer than threshold", zap.Duration("duration", requestDuration), zap.String("operation", operation.Name()))
m.RecordTrace(operation.Name())
}
}
🤖 Prompt for AI Agents
In router/cmd/flightrecorder/module/module.go around lines 74 to 90: remove the
hardcoded time.Sleep(150 * time.Millisecond) so request timing is not
artificially inflated and the latency check is meaningful; measure
requestDuration immediately after next.ServeHTTP(...) and then, if
m.fl.Enabled() && requestDuration > m.requestLatencyRecordThresholdDuration,
perform recording (either call RecordTrace directly or spawn a goroutine to
record asynchronously) or, if a delay is truly required for post-processing,
only apply a sleep inside that conditional so it affects only recorded requests.

Comment on lines +92 to +95
func (m *FlightRecorder) RecordTrace(operationName string) {
// Generate timestamped filename
filename := fmt.Sprintf("trace-%s-%s.out", operationName, time.Now().Format("2006-01-02-15-04-05"))
filepath := filepath.Join(m.OutputPath, filename)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Rename variable to avoid shadowing the filepath package.

Line 95 declares a variable named filepath that shadows the imported filepath package. This is confusing and error-prone.

Apply this diff:

 func (m *FlightRecorder) RecordTrace(operationName string) {
     // Generate timestamped filename
     filename := fmt.Sprintf("trace-%s-%s.out", operationName, time.Now().Format("2006-01-02-15-04-05"))
-    filepath := filepath.Join(m.OutputPath, filename)
+    filePath := filepath.Join(m.OutputPath, filename)
 
     // Create the file
-    file, err := os.Create(filepath)
+    file, err := os.Create(filePath)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In router/cmd/flightrecorder/module/module.go around lines 92 to 95, a local
variable named "filepath" is shadowing the imported filepath package; rename the
local variable to something like "outPath" or "filePath" (e.g., filenamePath)
and update all subsequent uses in this function to the new name so the package
identifier remains available and confusion is avoided.

Comment on lines +110 to +112
if !m.RecordMultiple {
m.fl.Stop()
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Stopping the recorder prevents future recordings.

When RecordMultiple is false, the flight recorder is stopped after the first trace recording. This causes fl.Enabled() to return false in subsequent requests, preventing any further recordings even if requests exceed the threshold. This behavior likely contradicts the intended functionality.

Consider one of these approaches:

  1. Continue recording: Remove the stop logic and let the recorder run continuously, relying on the existing threshold check to control when traces are written
  2. Limit write operations: Keep the recorder running but add a flag/counter to limit the number of trace files written
  3. Clarify the behavior: If stopping after one recording is intentional, document this clearly and consider renaming RecordMultiple to something like RecordOnce

Example for approach #2:

+type FlightRecorder struct {
+    OutputPath                    string `mapstructure:"outputPath"`
+    RecordMultiple                bool   `mapstructure:"recordMultiple"`
+    RequestLatencyRecordThreshold int    `mapstructure:"requestLatencyRecordThreshold"`
+
+    requestLatencyRecordThresholdDuration time.Duration
+    tracesRecorded                         int
+
     fl *trace.FlightRecorder
 
     Logger *zap.Logger
 }

 func (m *FlightRecorder) RecordTrace(operationName string) {
     // ... existing file creation code ...
 
     _, err = m.fl.WriteTo(file)
     if err != nil {
         m.Logger.Error("Failed to record request", zap.Error(err))
+        return
     }
 
+    m.tracesRecorded++
+
     if !m.RecordMultiple {
-        m.fl.Stop()
+        m.Logger.Info("RecordMultiple is false, disabling further recordings", zap.Int("traces_recorded", m.tracesRecorded))
+        // Disable further writes but keep recorder running
+        // Could use a flag like m.recordingDisabled = true and check it in RouterOnRequest
     }
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In router/cmd/flightrecorder/module/module.go around lines 110-112, stopping the
recorder when RecordMultiple is false prevents any future recordings because
fl.Enabled() will be false; instead keep the recorder running and implement a
write-limiting counter/flag on the module: add a field (e.g., writeCount or
wroteOnce bool), increment/set it when a trace is written, and before writing
check that if RecordMultiple is false the counter is zero (or wroteOnce is
false); only write the trace when allowed and do not call m.fl.Stop(), leaving
the recorder enabled for the lifetime of the process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants