-
Notifications
You must be signed in to change notification settings - Fork 192
feat: add custom module implementing flight recorder #2334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughIntroduces a new FlightRecorder module for the router using Go 1.25's flight recording feature. Includes documentation describing configuration and usage, a main.go entrypoint that imports the module, and a module.go implementation that records traces when request latency exceeds a configurable threshold. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Comment |
Router-nonroot image scan passed✅ No security vulnerabilities found in image: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (4)
router/cmd/flightrecorder/README.md (1)
5-17: Replace hard tabs with spaces in code example.The code example uses hard tabs on lines 9, 10, 11, and 15, which should be replaced with spaces to align with standard Go formatting conventions.
Apply this diff:
package main import ( - routercmd "github.com/wundergraph/cosmo/router/cmd" - // Import your modules here - _ "github.com/wundergraph/cosmo/router/cmd/flightrecorder/module" + routercmd "github.com/wundergraph/cosmo/router/cmd" + // Import your modules here + _ "github.com/wundergraph/cosmo/router/cmd/flightrecorder/module" ) func main() { - routercmd.Main() + routercmd.Main() }router/cmd/flightrecorder/module/module.go (3)
29-29: Remove template comment.This comment appears to be leftover from scaffolding and should be removed.
Apply this diff:
requestLatencyRecordThresholdDuration time.Duration - // Add a new property here fl *trace.FlightRecorder
100-100: Fix error message format.The error message uses a format string with
%wbut should rely onzap.Errorfor structured logging.Apply this diff:
- m.Logger.Error("failed to create trace file: %w", zap.Error(err)) + m.Logger.Error("failed to create trace file", zap.Error(err))
92-113: Add file cleanup mechanism and sanitize operation names.Two operational concerns:
No file cleanup/rotation: Trace files accumulate indefinitely in the output directory, which could eventually exhaust disk space. Consider implementing a retention policy (e.g., delete files older than N days, or keep only the last N files).
Unsanitized operation names in filenames: Operation names are used directly in filenames (line 94), but may contain characters that are invalid for filenames (
/,\,:,*,?,",<,>,|). This could causeos.Createto fail.Example sanitization approach:
import ( "regexp" // ... other imports ) var invalidFilenameChars = regexp.MustCompile(`[<>:"/\\|?*]`) func sanitizeOperationName(name string) string { // Replace invalid characters with underscores sanitized := invalidFilenameChars.ReplaceAllString(name, "_") // Limit length to avoid filesystem limits if len(sanitized) > 100 { sanitized = sanitized[:100] } return sanitized } func (m *FlightRecorder) RecordTrace(operationName string) { sanitized := sanitizeOperationName(operationName) filename := fmt.Sprintf("trace-%s-%s.out", sanitized, time.Now().Format("2006-01-02-15-04-05")) // ... rest of the method }For file cleanup, consider adding a cleanup routine in the Provision method or as a background goroutine.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
router/cmd/flightrecorder/README.md(1 hunks)router/cmd/flightrecorder/main.go(1 hunks)router/cmd/flightrecorder/module/module.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
router/cmd/flightrecorder/main.go (1)
router/cmd/main.go (1)
Main(42-277)
router/cmd/flightrecorder/module/module.go (2)
router/core/modules.go (7)
RegisterModule(56-73)ModuleContext(160-164)Module(52-54)ModuleInfo(44-50)RouterOnRequestHandler(113-115)Provisioner(149-152)Cleaner(154-157)router/core/context.go (1)
RequestContext(61-142)
🪛 markdownlint-cli2 (0.18.1)
router/cmd/flightrecorder/README.md
9-9: Hard tabs
Column: 1
(MD010, no-hard-tabs)
10-10: Hard tabs
Column: 1
(MD010, no-hard-tabs)
11-11: Hard tabs
Column: 1
(MD010, no-hard-tabs)
15-15: Hard tabs
Column: 1
(MD010, no-hard-tabs)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_test
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/cmd/flightrecorder/main.go (1)
1-10: LGTM!The entrypoint correctly delegates to the router's main function and imports the module for side-effect registration. This is the standard pattern for module initialization.
| m.fl = trace.NewFlightRecorder(trace.FlightRecorderConfig{ | ||
| MinAge: m.requestLatencyRecordThresholdDuration * 2, | ||
|
|
||
| // 10 MB/s of MinAge | ||
| MaxBytes: 10 * 1024 * 1024 * uint64(m.requestLatencyRecordThresholdDuration*2/time.Second), | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix MaxBytes calculation that evaluates to zero.
The MaxBytes calculation has a critical flaw. When requestLatencyRecordThresholdDuration is 100ms (the example threshold), the calculation requestLatencyRecordThresholdDuration*2/time.Second evaluates to 200ms/1s = 0.2, which is then cast to uint64(0.2) = 0. This results in MaxBytes = 0, which may cause the flight recorder to not buffer data correctly.
Apply this diff to fix the calculation:
m.fl = trace.NewFlightRecorder(trace.FlightRecorderConfig{
MinAge: m.requestLatencyRecordThresholdDuration * 2,
- // 10 MB/s of MinAge
- MaxBytes: 10 * 1024 * 1024 * uint64(m.requestLatencyRecordThresholdDuration*2/time.Second),
+ // 10 MB per second of MinAge
+ MaxBytes: uint64(10 * 1024 * 1024 * (m.requestLatencyRecordThresholdDuration * 2).Seconds()),
})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| m.fl = trace.NewFlightRecorder(trace.FlightRecorderConfig{ | |
| MinAge: m.requestLatencyRecordThresholdDuration * 2, | |
| // 10 MB/s of MinAge | |
| MaxBytes: 10 * 1024 * 1024 * uint64(m.requestLatencyRecordThresholdDuration*2/time.Second), | |
| }) | |
| m.fl = trace.NewFlightRecorder(trace.FlightRecorderConfig{ | |
| MinAge: m.requestLatencyRecordThresholdDuration * 2, | |
| // 10 MB per second of MinAge | |
| MaxBytes: uint64(10 * 1024 * 1024 * (m.requestLatencyRecordThresholdDuration * 2).Seconds()), | |
| }) |
🤖 Prompt for AI Agents
In router/cmd/flightrecorder/module/module.go around lines 56 to 61, the
MaxBytes expression multiplies by
uint64(m.requestLatencyRecordThresholdDuration*2/time.Second) which truncates
fractional seconds to zero for sub-second durations; replace this with a
calculation that preserves fractions (e.g. use Seconds() as float64 or convert
durations to nanoseconds before integer math). Specifically, compute the
multiplier using float64(m.requestLatencyRecordThresholdDuration.Seconds()*2)
and then cast the final product to uint64, or compute bytes using nanoseconds:
uint64(10*1024*1024) *
uint64((m.requestLatencyRecordThresholdDuration*2).Nanoseconds()) /
uint64(time.Second.Nanoseconds()), and assign that to MaxBytes so sub-second
durations do not produce zero.
| func (m *FlightRecorder) RouterOnRequest(ctx core.RequestContext, next http.Handler) { | ||
| start := time.Now() | ||
|
|
||
| next.ServeHTTP(ctx.ResponseWriter(), ctx.Request()) | ||
|
|
||
| time.Sleep(150 * time.Millisecond) | ||
|
|
||
| requestDuration := time.Since(start) | ||
|
|
||
| if m.fl.Enabled() && requestDuration > m.requestLatencyRecordThresholdDuration { | ||
| operation := ctx.Operation() | ||
|
|
||
| m.Logger.Warn("Request took longer than threshold", zap.Duration("duration", requestDuration), zap.String("operation", operation.Name())) | ||
|
|
||
| m.RecordTrace(operation.Name()) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove hardcoded sleep that adds latency to every request.
Line 79 contains a hardcoded time.Sleep(150 * time.Millisecond) that will add 150ms of latency to every single request, regardless of whether recording occurs. This is a critical performance issue that will severely degrade user experience.
Additionally, the sleep causes incorrect duration measurement on Line 81, as requestDuration includes the sleep time, making the threshold comparison meaningless.
If the sleep was intended to allow trace data to be captured post-request, consider these alternatives:
- Remove the sleep entirely and rely on the flight recorder's MinAge configuration
- If post-processing time is needed, only sleep when recording will occur (inside the conditional)
- Use goroutines to record traces asynchronously without blocking the request
Apply this diff to remove the problematic sleep:
func (m *FlightRecorder) RouterOnRequest(ctx core.RequestContext, next http.Handler) {
start := time.Now()
next.ServeHTTP(ctx.ResponseWriter(), ctx.Request())
- time.Sleep(150 * time.Millisecond)
-
requestDuration := time.Since(start)
if m.fl.Enabled() && requestDuration > m.requestLatencyRecordThresholdDuration {
operation := ctx.Operation()
m.Logger.Warn("Request took longer than threshold", zap.Duration("duration", requestDuration), zap.String("operation", operation.Name()))
m.RecordTrace(operation.Name())
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (m *FlightRecorder) RouterOnRequest(ctx core.RequestContext, next http.Handler) { | |
| start := time.Now() | |
| next.ServeHTTP(ctx.ResponseWriter(), ctx.Request()) | |
| time.Sleep(150 * time.Millisecond) | |
| requestDuration := time.Since(start) | |
| if m.fl.Enabled() && requestDuration > m.requestLatencyRecordThresholdDuration { | |
| operation := ctx.Operation() | |
| m.Logger.Warn("Request took longer than threshold", zap.Duration("duration", requestDuration), zap.String("operation", operation.Name())) | |
| m.RecordTrace(operation.Name()) | |
| } | |
| } | |
| func (m *FlightRecorder) RouterOnRequest(ctx core.RequestContext, next http.Handler) { | |
| start := time.Now() | |
| next.ServeHTTP(ctx.ResponseWriter(), ctx.Request()) | |
| requestDuration := time.Since(start) | |
| if m.fl.Enabled() && requestDuration > m.requestLatencyRecordThresholdDuration { | |
| operation := ctx.Operation() | |
| m.Logger.Warn("Request took longer than threshold", zap.Duration("duration", requestDuration), zap.String("operation", operation.Name())) | |
| m.RecordTrace(operation.Name()) | |
| } | |
| } |
🤖 Prompt for AI Agents
In router/cmd/flightrecorder/module/module.go around lines 74 to 90: remove the
hardcoded time.Sleep(150 * time.Millisecond) so request timing is not
artificially inflated and the latency check is meaningful; measure
requestDuration immediately after next.ServeHTTP(...) and then, if
m.fl.Enabled() && requestDuration > m.requestLatencyRecordThresholdDuration,
perform recording (either call RecordTrace directly or spawn a goroutine to
record asynchronously) or, if a delay is truly required for post-processing,
only apply a sleep inside that conditional so it affects only recorded requests.
| func (m *FlightRecorder) RecordTrace(operationName string) { | ||
| // Generate timestamped filename | ||
| filename := fmt.Sprintf("trace-%s-%s.out", operationName, time.Now().Format("2006-01-02-15-04-05")) | ||
| filepath := filepath.Join(m.OutputPath, filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Rename variable to avoid shadowing the filepath package.
Line 95 declares a variable named filepath that shadows the imported filepath package. This is confusing and error-prone.
Apply this diff:
func (m *FlightRecorder) RecordTrace(operationName string) {
// Generate timestamped filename
filename := fmt.Sprintf("trace-%s-%s.out", operationName, time.Now().Format("2006-01-02-15-04-05"))
- filepath := filepath.Join(m.OutputPath, filename)
+ filePath := filepath.Join(m.OutputPath, filename)
// Create the file
- file, err := os.Create(filepath)
+ file, err := os.Create(filePath)Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In router/cmd/flightrecorder/module/module.go around lines 92 to 95, a local
variable named "filepath" is shadowing the imported filepath package; rename the
local variable to something like "outPath" or "filePath" (e.g., filenamePath)
and update all subsequent uses in this function to the new name so the package
identifier remains available and confusion is avoided.
| if !m.RecordMultiple { | ||
| m.fl.Stop() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stopping the recorder prevents future recordings.
When RecordMultiple is false, the flight recorder is stopped after the first trace recording. This causes fl.Enabled() to return false in subsequent requests, preventing any further recordings even if requests exceed the threshold. This behavior likely contradicts the intended functionality.
Consider one of these approaches:
- Continue recording: Remove the stop logic and let the recorder run continuously, relying on the existing threshold check to control when traces are written
- Limit write operations: Keep the recorder running but add a flag/counter to limit the number of trace files written
- Clarify the behavior: If stopping after one recording is intentional, document this clearly and consider renaming
RecordMultipleto something likeRecordOnce
Example for approach #2:
+type FlightRecorder struct {
+ OutputPath string `mapstructure:"outputPath"`
+ RecordMultiple bool `mapstructure:"recordMultiple"`
+ RequestLatencyRecordThreshold int `mapstructure:"requestLatencyRecordThreshold"`
+
+ requestLatencyRecordThresholdDuration time.Duration
+ tracesRecorded int
+
fl *trace.FlightRecorder
Logger *zap.Logger
}
func (m *FlightRecorder) RecordTrace(operationName string) {
// ... existing file creation code ...
_, err = m.fl.WriteTo(file)
if err != nil {
m.Logger.Error("Failed to record request", zap.Error(err))
+ return
}
+ m.tracesRecorded++
+
if !m.RecordMultiple {
- m.fl.Stop()
+ m.Logger.Info("RecordMultiple is false, disabling further recordings", zap.Int("traces_recorded", m.tracesRecorded))
+ // Disable further writes but keep recorder running
+ // Could use a flag like m.recordingDisabled = true and check it in RouterOnRequest
}
}Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In router/cmd/flightrecorder/module/module.go around lines 110-112, stopping the
recorder when RecordMultiple is false prevents any future recordings because
fl.Enabled() will be false; instead keep the recorder running and implement a
write-limiting counter/flag on the module: add a field (e.g., writeCount or
wroteOnce bool), increment/set it when a trace is written, and before writing
check that if RecordMultiple is false the counter is zero (or wroteOnce is
false); only write the trace when allowed and do not call m.fl.Stop(), leaving
the recorder enabled for the lifetime of the process.
Summary by CodeRabbit
Checklist