feat: prometheus sink for gqlmetrics, ensure all operation are tracked efficiently #2332
base: main
Conversation
Walkthrough
Replaces probabilistic SampleRate-based schema-field sampling with a configurable exporter pipeline; introduces a generic Exporter[T] and Sink[T], GraphQL/Prometheus sinks and exporters, wires a Prometheus exporter into graph mux/router metrics, and updates configs/tests to use batching and time-based flushing.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Possibly related PRs
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Router-nonroot image scan passed ✅ No security vulnerabilities found in image:
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/core/graph_server.go (1)
758-775: Shutdown ordering: stop exporter before closing metric store. Currently metricStore is shut down before prometheusMetricsExporter. If exporter is still flushing, writes can fail. Shut down the exporter first, then the stores.
Apply:
```diff
-	if s.metricStore != nil {
-		if aErr := s.metricStore.Shutdown(ctx); aErr != nil {
-			err = errors.Join(err, aErr)
-		}
-	}
-	...
-	if s.prometheusMetricsExporter != nil {
-		if aErr := s.prometheusMetricsExporter.Shutdown(ctx); aErr != nil {
-			err = errors.Join(err, aErr)
-		}
-	}
+	if s.prometheusMetricsExporter != nil {
+		if aErr := s.prometheusMetricsExporter.Shutdown(ctx); aErr != nil {
+			err = errors.Join(err, aErr)
+		}
+	}
+	if s.metricStore != nil {
+		if aErr := s.metricStore.Shutdown(ctx); aErr != nil {
+			err = errors.Join(err, aErr)
+		}
+	}
```
🧹 Nitpick comments (14)
router/internal/graphqlmetrics/graphql_metrics_sink.go (2)
29-36: Handle potential nil logger in NewGraphQLMetricsSink
cfg.Logger.With(...) will panic if Logger is ever nil (e.g. in tests or ad-hoc usage). Consider either guarding and falling back to zap.NewNop() or documenting/enforcing a non-nil logger requirement at call sites.
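A minimal, self-contained sketch of that fallback; the config struct here is illustrative, not the actual GraphQLMetricsSinkConfig:

```go
package main

import "go.uber.org/zap"

// sinkConfig stands in for the real sink config; illustrative only.
type sinkConfig struct {
	Logger *zap.Logger
}

func newSinkLogger(cfg sinkConfig) *zap.Logger {
	logger := cfg.Logger
	if logger == nil {
		// Fall back instead of panicking on cfg.Logger.With(...) below.
		logger = zap.NewNop()
	}
	return logger.With(zap.String("component", "graphql_metrics_sink"))
}

func main() {
	_ = newSinkLogger(sinkConfig{}) // safe even when no logger is supplied
}
```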
70-91: Consider tightening retry classification in IsRetryableError
Right now any non-connect.Error (including context cancellations or local/config errors) is treated as retryable. If such errors can reach this helper, consider explicitly short-circuiting on context.Canceled / context.DeadlineExceeded or other known permanent failures in the caller to avoid pointless retries.
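A rough sketch of that short-circuit; isRetryable below is a stand-in, not the PR's IsRetryableError:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// isRetryable is a placeholder for the classification discussed above.
func isRetryable(err error) bool {
	// Never retry cancelled or deadline-exceeded work.
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return false
	}
	// ...fall through to transport/status-code based classification here.
	return true
}

func main() {
	fmt.Println(isRetryable(context.Canceled))           // false
	fmt.Println(isRetryable(errors.New("upstream 503"))) // true under this placeholder rule
}
```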
router/pkg/config/config.schema.json (1)
1249-1281: Exporter schema matches the new Prometheus usage exporter config
The schema_usage.exporter block (batch_size, queue_size, interval, export_timeout) and defaults are consistent and align with the new Go config type. If you want stricter validation, you could optionally add "additionalProperties": false to the exporter object to catch typos in config keys, but the current schema is functionally fine.
router/internal/graphqlmetrics/exporter_test.go (1)
369-375: Avoid fixed sleeps in tests; poll until metrics observed. Replace time.Sleep with polling (e.g., require.Eventually) or call Shutdown before assertions to reduce flakiness.
Example:
```diff
-	time.Sleep(200 * time.Millisecond)
+	require.Eventually(t, func() bool {
+		_, err := promRegistry.Gather()
+		return err == nil && len(c.publishedAggregations) > 0
+	}, 2*time.Second, 50*time.Millisecond)
```
Also applies to: 371-375
router-tests/testenv/testenv.go (1)
279-290: Clamp exporter settings to safe minimums in tests. If a caller supplies zeros, normalize to sane minimums (e.g., BatchSize>=1, QueueSize>=1, Interval>0) before wiring to router metric config to prevent silent stalls.
Also applies to: 1508-1552
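A small sketch of such clamping; the field names mirror the test exporter config but are assumptions here:

```go
package main

import (
	"fmt"
	"time"
)

// exporterSettings mirrors the test exporter config fields (assumed names).
type exporterSettings struct {
	BatchSize int
	QueueSize int
	Interval  time.Duration
}

// clamp normalizes zero/negative values to safe minimums before wiring them in.
func clamp(s exporterSettings) exporterSettings {
	if s.BatchSize < 1 {
		s.BatchSize = 1
	}
	if s.QueueSize < 1 {
		s.QueueSize = 1
	}
	if s.Interval <= 0 {
		s.Interval = 100 * time.Millisecond
	}
	return s
}

func main() {
	fmt.Println(clamp(exporterSettings{})) // zeros become safe minimums
}
```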
router-tests/prometheus_improved_test.go (1)
82-96: Reduce flakiness: replace fixed sleeps with Eventually/polling. Use require.Eventually to wait for metrics flush instead of fixed sleeps. Keeps tests stable on slow CI.
Also applies to: 166-171, 229-233, 269-274, 315-321, 359-365, 413-419
router/core/graph_server.go (1)
923-949: Defaults/validation pass-through for exporter settings. When wiring settings from cfg.Exporter, ensure they are validated/clamped (BatchSize, QueueSize, Interval, ExportTimeout) before use to prevent zero/negative configs from stalling the exporter.
router/core/router.go (2)
841-851: Prometheus schema usage: consider logging exporter settings as well
The enablement log for Prometheus schema field usage is clear. Since batching/queue behavior is now configurable, consider also logging the effective exporter settings (batch size, queue size, interval, export timeout) here to simplify debugging misconfigurations.
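For illustration, a hedged sketch of what such a log line could look like with zap; the values are placeholders for the resolved config:

```go
package main

import (
	"time"

	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewDevelopment()
	defer logger.Sync()

	// Values are placeholders for the resolved exporter config.
	logger.Info("Prometheus schema field usage tracking enabled",
		zap.Int("batch_size", 8192),
		zap.Int("queue_size", 16384),
		zap.Duration("interval", 10*time.Second),
		zap.Duration("export_timeout", 10*time.Second),
	)
}
```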
2324-2332: Exporter settings mapping is correct; be aware of validation constraints
The mapping from cfg.Metrics.Prometheus.SchemaFieldUsage.Exporter into rmetric.PrometheusSchemaFieldUsageExporter is straightforward and correct. Note that the generic exporter's validate() requires all these numeric fields to be strictly positive, so a YAML/env override of 0 or negative values will now fail exporter creation. If that's not desired, you may want to clamp or normalize these values earlier in config handling.
router/core/router_metrics.go (1)
135-159: Prometheus export path is lightweight; consider extra guard and hash semantics
The Prometheus-specific SchemaUsageInfo payload deliberately includes only type field metrics, operation hash/name, and schema version, which is a good fit for local metric cardinality and memory use.
Two minor points:
- To harden against future misuse, you could early-return if m.prometheusMetricsExporter is nil, even though current callers gate on PrometheusUsageInfoTrack.
- This path uses operationContext.sha256Hash directly instead of HashString(). Please confirm that sha256Hash is always initialized and matches the intended “operation SHA” semantics for all operation types.
router/internal/graphqlmetrics/exporter.go (4)
78-111: NewExporter wiring is correct; consider guarding nil settings
Constructor behavior is good: it enforces a non-nil sink, installs a default retryability handler, validates settings, and starts the batching goroutine with a component-scoped logger.
One robustness nit: settings is dereferenced without a nil check when creating the queue and during validation. Since this is an internal API, it’s probably fine, but you might want to either:
- Treat nil as “use NewDefaultExporterSettings()”, or
- Explicitly return an error like fmt.Errorf("settings cannot be nil").
This avoids surprising panics if a future caller accidentally passes nil.
113-143: Validation enforces positive values; retries disabled still require fields
validate() enforces strictly positive values for batch/queue sizes, interval, export timeout, and all retry fields. That’s good for catching misconfigurations early.
Note that it requires retry fields to be positive even when RetryOptions.Enabled is false. If you later expose Enabled as a user-facing toggle, you may want to relax validation so that only enabled retries need fully-populated options.
179-282: Batch export and retry logic are reasonable; consider respecting shutdown context for backoff
exportBatch wraps the shared export context with a per-call timeout and delegates to the sink, logging failures at debug. exportBatchWithRetry adds bounded exponential backoff with:
- Retryability decided by isRetryableError.
- Max duration, interval, and max attempts from settings.
- Clear logs on non-retryable errors and after exhausting retries.
One potential refinement: the retry loop currently sleeps with time.Sleep and doesn’t inspect any shutdown signal or context, so a stubbornly failing sink can delay shutdown up to the cumulative backoff time. If you ever need faster shutdowns under failure, you could periodically check e.exportRequestContext.Done() or a dedicated shutdown flag inside the retry loop to abort early.
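One way to do that, sketched with a hypothetical sleepCtx helper (not part of the exporter today):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepCtx waits for d, or returns false early when ctx is cancelled.
func sleepCtx(ctx context.Context, d time.Duration) bool {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-t.C:
		return true
	case <-ctx.Done():
		return false
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(50 * time.Millisecond)
		cancel() // simulate shutdown while a retry loop is backing off
	}()
	fmt.Println("slept fully:", sleepCtx(ctx, 2*time.Second))
}
```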
344-381: Shutdown sequence is mostly robust; document single-use expectation
Shutdown:
- Stops accepting traffic, signals the batching goroutine to drain and exit, and then polls inflightBatches until zero or context cancel.
- Defers cancellation of export contexts and sink closure, with logs on errors and on completion.
Two small caveats:
- Shutdown is not idempotent: calling it twice will panic on the second close of acceptTrafficSema/shutdownSignal. It would help to either document “must only be called once” or add a guard (e.g., sync.Once).
- Synchronous Record(..., true) calls are not tracked in inflightBatches, so they won’t delay shutdown waiting; they may still race with sink.Close if invoked concurrently. That’s likely acceptable for metrics but worth keeping in mind.
Overall, the shutdown strategy is consistent with a best-effort metrics pipeline.
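A compact sketch of the sync.Once guard mentioned above; the struct is a stand-in, not the real Exporter[T]:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// exporter is a stand-in for the real exporter; only the guard is shown.
type exporter struct {
	shutdownOnce   sync.Once
	shutdownSignal chan struct{}
}

func (e *exporter) Shutdown(ctx context.Context) error {
	e.shutdownOnce.Do(func() {
		// Closed exactly once; repeated Shutdown calls become no-ops instead of panicking.
		close(e.shutdownSignal)
	})
	return nil
}

func main() {
	e := &exporter{shutdownSignal: make(chan struct{})}
	fmt.Println(e.Shutdown(context.Background()))
	fmt.Println(e.Shutdown(context.Background())) // safe second call
}
```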
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- router-tests/prometheus_improved_test.go (16 hunks)
- router-tests/testenv/testenv.go (3 hunks)
- router/core/graph_server.go (4 hunks)
- router/core/operation_metrics.go (3 hunks)
- router/core/router.go (3 hunks)
- router/core/router_config.go (1 hunks)
- router/core/router_metrics.go (4 hunks)
- router/internal/graphqlmetrics/exporter.go (5 hunks)
- router/internal/graphqlmetrics/exporter_test.go (7 hunks)
- router/internal/graphqlmetrics/graphql_exporter.go (1 hunks)
- router/internal/graphqlmetrics/graphql_metrics_sink.go (1 hunks)
- router/internal/graphqlmetrics/prometheus_exporter.go (1 hunks)
- router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
- router/internal/graphqlmetrics/sink.go (1 hunks)
- router/pkg/config/config.go (1 hunks)
- router/pkg/config/config.schema.json (1 hunks)
- router/pkg/metric/config.go (2 hunks)
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
- router/internal/graphqlmetrics/prometheus_exporter.go
- router/core/router_config.go
- router/core/router_metrics.go
- router/core/graph_server.go
- router/core/router.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
- router-tests/testenv/testenv.go
- router-tests/prometheus_improved_test.go
- router/internal/graphqlmetrics/exporter_test.go
- router/core/operation_metrics.go
📚 Learning: 2025-09-29T10:28:07.122Z
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2192
File: router/pkg/config/config.go:1028-1029
Timestamp: 2025-09-29T10:28:07.122Z
Learning: The deprecation strategy for IntrospectionEnabled field in router/pkg/config/config.go is to first mark it as deprecated, then migrate all call sites to use IntrospectionConfig.Enabled, and finally remove the deprecated field. The envDefault tag should be kept during the deprecation period for backward compatibility.
Applied to files:
router-tests/testenv/testenv.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.
Applied to files:
- router/core/router_config.go
- router/core/graph_server.go
- router/core/router.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
Applied to files:
- router/core/router_config.go
- router/pkg/config/config.schema.json
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
- router/core/router_config.go
- router/core/router_metrics.go
- router/core/graph_server.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.
Applied to files:
- router-tests/prometheus_improved_test.go
- router/core/graph_server.go
- router/core/router.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/pkg/config/config.schema.json
📚 Learning: 2025-07-21T15:06:36.664Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2067
File: router/pkg/config/config.schema.json:1637-1644
Timestamp: 2025-07-21T15:06:36.664Z
Learning: In the Cosmo router project, when extending JSON schema validation for security-sensitive fields like JWKS secrets, backwards compatibility is maintained by implementing warnings in the Go code rather than hard validation constraints in the schema. This allows existing configurations to continue working while alerting users to potential security issues.
Applied to files:
router/pkg/config/config.schema.json
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
- router/core/operation_metrics.go
- router/core/router_metrics.go
- router/core/graph_server.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.
Applied to files:
- router/core/operation_metrics.go
- router/pkg/metric/config.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.
Applied to files:
router/core/graph_server.go
🧬 Code graph analysis (15)
router/internal/graphqlmetrics/graphql_exporter.go (2)
router/internal/graphqlmetrics/exporter.go (3)
Exporter (16-31), ExporterSettings (50-61), NewExporter (82-111)
router/internal/graphqlmetrics/graphql_metrics_sink.go (3)
NewGraphQLMetricsSink (30-36), GraphQLMetricsSinkConfig (23-27), IsRetryableError (72-91)
router/internal/graphqlmetrics/prometheus_exporter.go (3)
router/internal/graphqlmetrics/exporter.go (3)
Exporter (16-31), ExporterSettings (50-61), NewExporter (82-111)
router/pkg/metric/metric_store.go (1)
Store (149-162)
router/internal/graphqlmetrics/prometheus_sink.go (2)
NewPrometheusSink (41-47), PrometheusSinkConfig (25-29)
router-tests/testenv/testenv.go (2)
router/pkg/config/config.go (2)
PrometheusSchemaFieldUsageExporter (121-126), PrometheusSchemaFieldUsage (115-119)
router/pkg/metric/config.go (2)
PrometheusSchemaFieldUsageExporter (48-53), PrometheusSchemaFieldUsage (42-46)
router/pkg/config/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
Exporter (16-31)
router-tests/testenv/testenv.go (1)
PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/metric/config.go (1)
PrometheusSchemaFieldUsageExporter (48-53)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
Store (149-162)
router/pkg/otel/attributes.go (5)
WgOperationName (10-10), WgOperationType (11-11), WgOperationSha256 (59-59), WgGraphQLFieldName (60-60), WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
TypeFieldMetrics(28-28)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1)
router/internal/graphqlmetrics/aggregate.go (1)
AggregateSchemaUsageInfoBatch(9-27)
router/core/router_config.go (2)
router/internal/graphqlmetrics/graphql_exporter.go (1)
GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
PrometheusMetricsExporter(13-15)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
Sink (9-18), SinkErrorHandler (23-23)
router-tests/prometheus_improved_test.go (4)
router-tests/testenv/testenv.go (3)
Run (105-122), PrometheusSchemaFieldUsageExporter (285-290), Environment (1763-1799)
router/pkg/config/config.go (1)
PrometheusSchemaFieldUsageExporter (121-126)
router/pkg/metric/config.go (1)
PrometheusSchemaFieldUsageExporter (48-53)
router/pkg/otel/attributes.go (2)
WgOperationName (10-10), WgOperationType (11-11)
router/internal/graphqlmetrics/exporter_test.go (1)
router/internal/graphqlmetrics/graphql_exporter.go (1)
NewGraphQLMetricsExporter(19-39)
router/core/operation_metrics.go (1)
router/core/router_metrics.go (1)
RouterMetrics(14-21)
router/core/router_metrics.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
PrometheusMetricsExporter (13-15)
router/core/context.go (4)
OperationType (503-503), OperationTypeQuery (506-506), OperationTypeMutation (507-507), OperationTypeSubscription (508-508)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (16)
OperationType (26-26), OperationType (58-60), OperationType (62-64), OperationType (71-73), OperationType_QUERY (29-29), OperationType_MUTATION (30-30), OperationType_SUBSCRIPTION (31-31), SchemaUsageInfo (130-153), SchemaUsageInfo (168-168), SchemaUsageInfo (183-185), OperationInfo (362-373), OperationInfo (388-388), OperationInfo (403-405), SchemaInfo (428-435), SchemaInfo (450-450), SchemaInfo (465-467)
router/pkg/metric/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
Exporter (16-31)
router-tests/testenv/testenv.go (1)
PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/config/config.go (1)
PrometheusSchemaFieldUsageExporter(121-126)
router/core/graph_server.go (4)
router/internal/graphqlmetrics/prometheus_exporter.go (2)
PrometheusMetricsExporter (13-15), NewPrometheusMetricsExporter (19-45)
router/pkg/config/config.go (1)
Prometheus (99-113)
router/internal/graphqlmetrics/exporter.go (3)
ExporterSettings (50-61), Exporter (16-31), RetryOptions (33-38)
router/core/router_metrics.go (1)
NewRouterMetrics(43-52)
router/core/router.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
NewGraphQLMetricsExporter (19-39)
router/pkg/config/config.go (3)
Prometheus (99-113), PrometheusSchemaFieldUsageExporter (121-126), Metrics (137-142)
router/internal/graphqlmetrics/exporter.go (1)
Exporter (16-31)
router/pkg/metric/config.go (1)
PrometheusSchemaFieldUsageExporter(48-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: build_test
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: image_scan
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_test
🔇 Additional comments (19)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1)
40-61: Export path and aggregation look correct
Batch empty-check, aggregation via AggregateSchemaUsageInfoBatch, auth header setup, and debug logging are all straightforward and concurrency-safe for this sink implementation. No functional issues from this diff.
router/core/router_config.go (1)
37-38: New exporter fields fit the updated metrics wiring
Adding gqlMetricsExporter and prometheusMetricsExporter to Config keeps the router’s metrics wiring explicit and type-safe; no issues with this change.
router/internal/graphqlmetrics/sink.go (1)
7-23: Sink[T] abstraction is clear and compatible with the exporter design
The generic Sink[T] interface and SinkErrorHandler are minimal, well-documented, and match the batch-export semantics used by the generic exporter. No changes needed here.
router/pkg/metric/config.go (1)
42-53: Verification confirms that all constructions of PrometheusSchemaFieldUsage with the Exporter field properly handle initialization:
- Production (router.go:2324): Reads all Exporter fields from validated config source
- Tests (testenv.go:1509-1530): Explicitly provides test-friendly defaults when enabled (BatchSize: 100, QueueSize: 1000, Interval: 100ms, ExportTimeout set)
No manual constructions with uninitialized Exporter fields exist in the codebase. The code is correct as-is.
router/internal/graphqlmetrics/prometheus_exporter.go (1)
31-41: LGTM on exporter wrapper. Constructor and delegation are clear; disabling retries for local Prom sink is appropriate.
Also applies to: 47-58
router/internal/graphqlmetrics/graphql_exporter.go (1)
31-39: LGTM on GraphQL exporter wrapper. API is minimal and mirrors the generic exporter cleanly.
Also applies to: 41-52
router-tests/prometheus_improved_test.go (1)
351-358: Review comment is incorrect; no changes required.
The repository targets Go 1.25 (confirmed across all go.mod files), which fully supports the for range n syntax introduced in Go 1.22. The codebase consistently uses this syntax in multiple test files (prometheus_improved_test.go, authentication_test.go, plugintest/tracing_test.go). No compatibility concern exists; the code is idiomatic for the current toolchain.
Likely an incorrect or invalid review comment.
router/internal/graphqlmetrics/prometheus_sink.go (1)
79-86: Remove this review comment; the Go version compatibility concern is invalid.
The codebase targets Go 1.25, and slices.Concat was introduced in Go 1.22, so the "newer Go" compatibility argument doesn’t apply. The code is compatible with the project’s Go version.
While the nil vs. empty slice suggestion ([]attribute.KeyValue{} → nil) is a minor stylistic improvement, the original review overstates the concern about Go compatibility and allocation efficiency in this context.
router/pkg/config/config.go (1)
115-126: No changes required; validation is already in place.
The PrometheusSchemaFieldUsageExporter fields are validated both at the JSON schema level and at runtime:
- Runtime: The Exporter[T].validate() method (router/internal/graphqlmetrics/exporter.go:113–142) enforces BatchSize > 0, QueueSize > 0, Interval > 0, and ExportTimeout > 0.
- Schema: The config.schema.json (lines 1252–1278) defines minimum: 1 for batch_size/queue_size and duration: {minimum: "1s"} for interval/export_timeout.
Both layers prevent misconfiguration, so no additional validation is needed.
router/core/router.go (1)
821-839: GraphQL metrics exporter wiring looks consistent
Using NewGraphQLMetricsExporter with NewDefaultExporterSettings() cleanly adapts to the new generic exporter API; error handling and assignment into r.gqlMetricsExporter remain correct.
router/core/router_metrics.go (3)
14-21: RouterMetrics wiring for dual exporters is sound
Extending RouterMetrics and routerMetrics to carry both the GraphQL and Prometheus exporters, plus an explicit exportEnabled flag for cloud usage, cleanly separates concerns. NewRouterMetrics correctly threads these through, and StartOperation’s PrometheusUsageInfoTrack: m.prometheusMetricsExporter != nil gives a cheap, stable switch for local Prometheus tracking.
Also applies to: 26-41, 44-51
57-67: StartOperation gating correctly enables Prometheus usage tracking
Passing TrackUsageInfo and PrometheusUsageInfoTrack into OperationMetricsOptions ensures:
- Cloud GraphQL usage export is still controlled centrally via exportEnabled.
- Prometheus schema usage export is enabled solely by the presence of a Prometheus exporter.
This matches the “track all operations” goal without coupling the two sinks.
75-81: Exporter accessors are minimal and sufficient
GqlMetricsExporter() and PrometheusMetricsExporter() simply expose the underlying exporters, matching the new Prometheus path needed by the graph mux. No additional state is leaked.
router/core/operation_metrics.go (2)
32-40: OperationMetrics option plumbing is consistent
Adding trackUsageInfo and prometheusUsageInfoTrack to OperationMetrics and threading them via OperationMetricsOptions keeps the decision about which sinks to hit local to the operation instance. The constructor correctly snapshots these booleans at operation start, avoiding any mid-request config races.
Also applies to: 91-100, 105-117
77-88: Dual export in Finish aligns with tracking goals
The new export block:
- Respects
SkipLoaderandreqContext.operation == nilas before.- Uses
trackUsageInfoto guard GraphQL (cloud) schema usage export.- Uses
prometheusUsageInfoTrackto independently guard the Prometheus schema usage path.This achieves “always track operations” for Prometheus while preserving existing cloud usage behavior.
router/internal/graphqlmetrics/exporter.go (4)
13-31: Generic exporter structure and defaults look solid
The generic
Exporter[T] plus ExporterSettings and NewDefaultExporterSettings() give a reusable, thread-safe batching primitive. The fields (queue, inflightBatches, exportRequestContext) and defaults (batch size, queue size, interval, timeouts, retry) are reasonable and match the use cases for GraphQL and Prometheus sinks.
Also applies to: 50-76
145-177: Traffic gating and queue backpressure are correctly implemented
acceptTraffic() uses a closed channel to signal shutdown, and Record:
- Sends synchronously via
exportBatch when requested.
- Uses a non-blocking send to a bounded
queueotherwise, logging and dropping when full.This prevents producer goroutines from blocking indefinitely on shutdown or overload. The synchronous path intentionally ignores the error (just logs inside
exportBatch), which matches a “best-effort metrics” philosophy.
285-319: Batching loop and shutdown signal handling look correctThe
startloop:
- Maintains a preallocated buffer sized to
BatchSize.
- Flushes on timer ticks and when the buffer fills.
- On shutdown, calls
drainQueue(buffer) and exits cleanly.
This ensures no items are lost between tick and shutdown, and keeps the batching goroutine single-responsibility.
321-342: drainQueue correctly flushes remaining items during shutdown
drainQueue empties the queue into the buffer, sending full batches as they accumulate and finally sending any partial batch, with helpful debug logging. Given acceptTrafficSema is closed before shutdownSignal, no new async items should be enqueued during draining, so this is race-free.
Actionable comments posted: 0
🧹 Nitpick comments (1)
router/pkg/config/config_test.go (1)
764-787: LGTM! Precedence test correctly validates file config priority. The test correctly verifies that file-specified exporter values override environment variables, using clearly distinguishable values (1024 vs 9999, 5s vs 99s) to prove the behavior.
Optional: Consider testing the precedence behavior for all four exporter fields (BatchSize, QueueSize, Interval, ExportTimeout) for completeness, though testing two fields is sufficient to prove the precedence mechanism works.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
router/pkg/config/config_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
router/pkg/config/config_test.go
🧬 Code graph analysis (1)
router/pkg/config/config_test.go (1)
router/pkg/config/config.go (5)
LoadConfig (1120-1232), Config (1014-1089), Telemetry (157-163), Metrics (137-142), Prometheus (99-113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: build-router
- GitHub Check: image_scan
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_test
- GitHub Check: build_push_image
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (3)
router/pkg/config/config_test.go (3)
689-707: LGTM! Well-structured test for exporter defaults. The test correctly verifies that exporter configuration falls back to sensible defaults (8192/16384 for batch/queue sizes, 10s for interval/timeout) when only enabled: true is specified.
709-736: LGTM! Comprehensive test for custom file configuration. The test correctly verifies that all exporter fields (batch_size, queue_size, interval, export_timeout) can be customized via YAML configuration.
738-762: LGTM! Proper environment variable loading test. The test correctly verifies that exporter configuration can be loaded from environment variables using the PROMETHEUS_SCHEMA_FIELD_USAGE_EXPORTER_* prefix pattern.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- router/pkg/config/testdata/config_defaults.json (1 hunks)
- router/pkg/config/testdata/config_full.json (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/pkg/config/testdata/config_defaults.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_test
- GitHub Check: image_scan
- GitHub Check: integration_test (./telemetry)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (4)
router/pkg/config/testdata/config_full.json (2)
93-102: Schema change is properly tested via fixture-based tests. The config_full.json file is exercised by config parsing tests in router/pkg/config/config_test.go at lines 320 and 1171. These tests use goldie for fixture-based JSON comparison: they load the YAML configuration and validate the parsed output matches the config_full.json fixture exactly. Since the fixture already contains the new Exporter schema with BatchSize, QueueSize, Interval, and ExportTimeout fields, the goldie tests will automatically catch any schema mismatches. If tests are passing, the schema change is validated.
96-101: Configuration field names correctly match Go struct definition. Verification confirms the Exporter configuration in config_full.json (lines 96-101) is properly structured:
- JSON field names (BatchSize, QueueSize, Interval, ExportTimeout) match Go struct field names in router/pkg/config/config.go:121-126
- Field values are correct (integers for int fields, nanoseconds for time.Duration fields)
- Both config_full.json and config_defaults.json consistently reflect the schema change
- Tests and production code in router/core/router.go:2327-2330 correctly use these fields
router/pkg/config/testdata/config_defaults.json (2)
63-72: Structural change from SampleRate to Exporter is sound. The replacement of the probabilistic SampleRate sampling approach with a configurable Exporter object containing BatchSize, QueueSize, Interval, and ExportTimeout properly reflects the PR objective to track all operations deterministically using batching instead of sampling.
66-71: Testdata values are correct and consistent with config schema defaults. Verification confirms the Exporter values in the testdata file align with the actual configuration schema. The router/pkg/config/config.go struct defines PrometheusSchemaFieldUsageExporter with envDefault tags that match exactly:
- BatchSize: envDefault:"8192" (testdata: 8192) ✓
- QueueSize: envDefault:"16384" (testdata: 16384) ✓
- Interval: envDefault:"10s" (testdata: 10000000000 ns = 10s) ✓
- ExportTimeout: envDefault:"10s" (testdata: 10000000000 ns = 10s) ✓
The testdata correctly reflects the source-of-truth defaults. Any discrepancy with the PR description likely reflects earlier proposal values that were finalized to these values in the implementation.
| "SchemaFieldUsage": { | ||
| "Enabled": true, | ||
| "IncludeOperationSha": true, | ||
| "SampleRate": 1 | ||
| "Exporter": { | ||
| "BatchSize": 8192, | ||
| "QueueSize": 16384, | ||
| "Interval": 10000000000, | ||
| "ExportTimeout": 10000000000 | ||
| } | ||
| } |
Assess backward compatibility: old configs with SampleRate will now fail to parse.
Removing SampleRate and replacing it with Exporter is a breaking change. Existing user configs that specify SampleRate will fail validation/parsing. Consider documenting migration path or implementing a compatibility layer (e.g., mapping old SampleRate to new Exporter batching parameters).
If this is intentional and part of a documented breaking change, add a comment in the config or release notes. If backward compatibility is required, implement a migration/fallback mechanism.
Should we leave it in schema as a no-op with a deprecation warning or remove it?
@endigma I'd recommend removing it entirely rather than keeping it as a deprecated no-op. Here's why:
- Fundamental semantic change: SampleRate was about probabilistic sampling (whether to record an operation), while Exporter is about batch processing (how to export recorded operations). Since sampling is completely removed and all operations are now tracked, keeping SampleRate would be misleading about what the system actually does.
- No meaningful mapping: There's no sensible way to translate SampleRate: 1.0 into BatchSize, QueueSize, Interval, and ExportTimeout values. A deprecated no-op would silently ignore user intent without clear guidance.
- Clearer migration path: A clean break forces users to understand the new model, which is important since the behavior has fundamentally changed (probabilistic sampling → guaranteed tracking with batching).
Recommendation:
- Remove SampleRate from the schema
- Add validation logic that detects if SampleRate is present in the config and returns a helpful error message like: "SampleRate has been removed. All operations are now tracked. Please use Exporter configuration instead. See: [link to docs/migration guide]" (see the sketch below)
- Document the breaking change in release notes with migration examples
- Consider if this warrants a major version bump
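A rough, self-contained sketch of that detection; the YAML path and helper below are assumptions for illustration, not existing router code:

```go
package main

import (
	"errors"
	"fmt"

	"gopkg.in/yaml.v3"
)

// checkRemovedSampleRate rejects configs that still set the removed key.
// The telemetry.metrics.prometheus.schema_usage path is an assumption here.
func checkRemovedSampleRate(raw []byte) error {
	var doc map[string]any
	if err := yaml.Unmarshal(raw, &doc); err != nil {
		return err
	}
	cur := doc
	for _, key := range []string{"telemetry", "metrics", "prometheus", "schema_usage"} {
		next, ok := cur[key].(map[string]any)
		if !ok {
			return nil
		}
		cur = next
	}
	if _, found := cur["sample_rate"]; found {
		return errors.New("sample_rate has been removed; all operations are now tracked. Use the exporter settings instead")
	}
	return nil
}

func main() {
	cfg := []byte("telemetry:\n  metrics:\n    prometheus:\n      schema_usage:\n        sample_rate: 0.5\n")
	fmt.Println(checkRemovedSampleRate(cfg))
}
```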
Would you like me to search the codebase for any existing patterns for handling config migrations, or help draft the error message/migration documentation?
🧠 Learnings used
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Actionable comments posted: 0
🧹 Nitpick comments (1)
router/internal/graphqlmetrics/prometheus_sink.go (1)
81-86: Consider passing nil instead of an empty slice for the unused parameter. Since attributes are being passed via the WithAttributeSet option, the sliceAttr parameter is unused. Passing nil instead of []attribute.KeyValue{} would be more idiomatic and avoid an unnecessary empty slice allocation.
Apply this diff:
```diff
 s.metricStore.MeasureSchemaFieldUsage(
 	ctx,
 	int64(count),
-	[]attribute.KeyValue{},
+	nil,
 	otelmetric.WithAttributeSet(attribute.NewSet(allAttrs...)),
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
Store (149-162)
router/pkg/otel/attributes.go (5)
WgOperationName (10-10), WgOperationType (11-11), WgOperationSha256 (59-59), WgGraphQLFieldName (60-60), WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
TypeFieldMetrics(28-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: build-router
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan (nonroot)
- GitHub Check: image_scan
- GitHub Check: build_push_image
- GitHub Check: build_test
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/internal/graphqlmetrics/prometheus_sink.go (1)
127-130: No change needed — parent type and field-name extraction are correct. TypeFieldUsageInfo.TypeNames is populated from ParentTypeNames such that index 0 is the immediate parent type, and Path contains the path segments with the field name as the last element. Verified in router/pkg/graphqlschemausage/schemausage.go (ParentTypeNames -> TypeNames), the protobuf comment (graphqlmetrics/gen/proto/.../graphqlmetrics.pb.go), and tests (router/internal/graphqlmetrics/exporter_test.go, aggregate_test.go) showing examples like Path ["user","name"] with TypeNames ["User","String"].
Actionable comments posted: 0
♻️ Duplicate comments (1)
router/pkg/config/config.go (1)
116-119: Back-compat: removal of sample_rate needs migration story or explicit deprecation notice. Dropping SampleRate in favor of Exporter will break existing configs. Either:
- accept the breaking change and document a clear migration; or
- add a temporary compatibility path that maps sample_rate to exporter settings.
Also ensure any references to SampleRate are removed across code, tests, docs. Reusing the schema’s deprecation mechanism can help guide users.
```shell
#!/bin/bash
# Find lingering references (code, tests, docs)
rg -n -C2 -S -i '\bsample_rate\b|SampleRate|schema_usage\.sample_rate|PROMETHEUS_SCHEMA_FIELD_USAGE_SAMPLE_RATE'
```
🧹 Nitpick comments (2)
router/pkg/config/config.go (1)
121-126: Type naming duplication and defaults: consider renaming and confirm queue_size default.
- PrometheusSchemaFieldUsageExporter is also declared in router/pkg/metric/config.go. Keeping the same name in two packages is legal but confusing. Consider renaming this one to PrometheusSchemaFieldUsageExporterConfig or PrometheusSchemaFieldUsageSettings to reduce ambiguity.
- Default queue_size here is 12800, while the PR example shows 10240. Please reconcile or update examples to avoid drift.
router/pkg/config/config.schema.json (1)
1249-1281: Harden exporter schema: disallow unknown keys and (optionally) bound sizes. Add additionalProperties: false under schema_usage.exporter to catch typos (e.g., "batchsize"). Optionally add conservative maximums to bound memory use.
- "exporter": { - "type": "object", - "description": "Configuration for the schema usage exporter", + "exporter": { + "type": "object", + "description": "Configuration for the schema usage exporter", + "additionalProperties": false, "properties": { "batch_size": { "type": "integer", "default": 4096, "minimum": 1, + "maximum": 1048576, "description": "The maximum number of schema usage items to be applied in a single batch. The default value is 4096." }, "queue_size": { "type": "integer", "default": 12800, "minimum": 1, + "maximum": 10485760, "description": "The maximum number of schema usage items allowed in queue at a given time. The default value is 12800." },
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- router/pkg/config/config.go (1 hunks)
- router/pkg/config/config.schema.json (1 hunks)
- router/pkg/config/config_test.go (1 hunks)
- router/pkg/config/testdata/config_defaults.json (1 hunks)
- router/pkg/config/testdata/config_full.json (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- router/pkg/config/testdata/config_defaults.json
- router/pkg/config/config_test.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
router/pkg/config/testdata/config_full.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
- router/pkg/config/testdata/config_full.json
- router/pkg/config/config.schema.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.
Applied to files:
router/pkg/config/testdata/config_full.json
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
Applied to files:
router/pkg/config/config.schema.json
🧬 Code graph analysis (1)
router/pkg/config/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
Exporter (16-31)
router/pkg/metric/config.go (1)
PrometheusSchemaFieldUsageExporter (48-53)
router-tests/testenv/testenv.go (1)
PrometheusSchemaFieldUsageExporter(285-290)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: image_scan (nonroot)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: build_push_image
- GitHub Check: image_scan
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: build_test
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/pkg/config/testdata/config_full.json (1)
96-101: Fixture update aligns with new exporter config. Values and duration encodings look correct. LGTM.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- router/core/context.go (2 hunks)
- router/core/router_metrics.go (5 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.
Applied to files:
router/core/context.go
🧬 Code graph analysis (2)
router/core/router_metrics.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
PrometheusMetricsExporter (13-15)
router/core/context.go (4)
OperationType (503-503), OperationTypeQuery (506-506), OperationTypeMutation (507-507), OperationTypeSubscription (508-508)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (16)
OperationType (26-26), OperationType (58-60), OperationType (62-64), OperationType (71-73), OperationType_QUERY (29-29), OperationType_MUTATION (30-30), OperationType_SUBSCRIPTION (31-31), SchemaUsageInfo (130-153), SchemaUsageInfo (168-168), SchemaUsageInfo (183-185), OperationInfo (362-373), OperationInfo (388-388), OperationInfo (403-405), SchemaInfo (428-435), SchemaInfo (450-450), SchemaInfo (465-467)
router/core/context.go (2)
router/pkg/graphqlschemausage/schemausage.go (2)
TypeFieldMetrics (28-28), TypeFieldUsageInfo (40-48)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (3)
TypeFieldUsageInfo (476-493), TypeFieldUsageInfo (508-508), TypeFieldUsageInfo (523-525)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: build-router
- GitHub Check: build_push_image (nonroot)
- GitHub Check: build_push_image
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_test
- GitHub Check: image_scan
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (8)
router/core/context.go (2)
546-549: Context for new caching field. The addition of typeFieldUsageInfoMetrics as a cached conversion result aligns with the PR's goal to efficiently track all operations without sampling. This avoids repeated allocations when multiple exporters need the same data.
601-608: The review comment is incorrect and should be disregarded. The premise of a race condition is based on a misunderstanding of the codebase. Here's why:
operationContext is per-request: In graphql_prehandler.go line 229, a new operationContext instance is created fresh for each HTTP request: requestContext.operation = &operationContext{clientInfo: clientInfo}. This is a standard per-request context pattern.
No concurrent access to the same instance: Since each request gets its own operationContext instance, different requests cannot race on the same instance's fields. The lazy initialization on lines 604-605 is only accessed by the single request handler processing that specific request.
Confusion between shared data and shared context: The comment in router_metrics.go:100-102 refers to the cached plan data (typeFieldUsageInfo) being reused across requests via the planner cache. This is separate from the operationContext wrapper itself, which is always fresh per request. The two concepts are distinct.
Likely an incorrect or invalid review comment.
router/core/router_metrics.go (6)
17-19: LGTM: Clean interface extension for Prometheus support. The interface changes clearly separate GraphQL and Prometheus export paths while maintaining backward compatibility with the updated GraphQLMetricsExporter type.
26-31: LGTM: Struct fields updated to support dual export paths. The addition of prometheusMetricsExporter alongside the updated gqlMetricsExporter type enables the router to export metrics to both GraphQL and Prometheus sinks.
43-52: LGTM: Constructor properly initializes new fields. All new fields are correctly initialized from the config.
57-69: LGTM: Prometheus tracking enabled conditionally. Line 64 correctly sets PrometheusUsageInfoTrack based on exporter presence, ensuring tracking overhead is only incurred when the Prometheus exporter is configured.
75-81: LGTM: Accessor methods updated. The accessor methods correctly return the new exporter types.
83-133: LGTM: Efficient use of cached metrics conversion. Line 109 now uses GetTypeFieldUsageInfoMetrics(), which avoids repeated allocations by caching the converted metrics data. This optimization aligns with the PR's goal to track all operations efficiently.
Note: See the comment on context.go lines 601-608 regarding potential race conditions in the caching implementation.
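For reference, a minimal sketch of the lazy-caching getter pattern described above; the types and fields are illustrative, not the router's actual implementation:

```go
package main

import "fmt"

type usageMetric struct{ field string }

// operationContext here is a stand-in; only the caching pattern is shown.
type operationContext struct {
	rawUsage           []string      // source data, converted on demand
	cachedUsageMetrics []usageMetric // populated on first access
}

// GetTypeFieldUsageInfoMetrics converts once and reuses the result afterwards.
func (o *operationContext) GetTypeFieldUsageInfoMetrics() []usageMetric {
	if o.cachedUsageMetrics == nil {
		converted := make([]usageMetric, 0, len(o.rawUsage))
		for _, f := range o.rawUsage {
			converted = append(converted, usageMetric{field: f})
		}
		o.cachedUsageMetrics = converted
	}
	return o.cachedUsageMetrics
}

func main() {
	op := &operationContext{rawUsage: []string{"Query.user", "User.name"}}
	fmt.Println(op.GetTypeFieldUsageInfoMetrics())
	fmt.Println(op.GetTypeFieldUsageInfoMetrics()) // second call reuses the cached slice
}
```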
```go
	routerConfigVersion      string
	logger                   *zap.Logger
	trackUsageInfo           bool
	prometheusUsageInfoTrack bool
```
```diff
-	prometheusUsageInfoTrack bool
+	prometheusTrackUsageInfo bool
```
```go
	if m.trackUsageInfo {
		m.routerMetrics.ExportSchemaUsageInfo(reqContext.operation, statusCode, reqContext.error != nil, exportSynchronous)
	}

	opAttrs := []attribute.KeyValue{
		rotel.WgOperationName.String(reqContext.operation.name),
		rotel.WgOperationType.String(reqContext.operation.opType),

	// Prometheus metrics export (to local Prometheus metrics)
	if m.prometheusUsageInfoTrack {
		m.routerMetrics.ExportSchemaUsageInfoPrometheus(reqContext.operation, statusCode, reqContext.error != nil, exportSynchronous)
	}
```
Is there a case where we want both? Should this just be one function that has an interface implementation?
| if !m.shouldSampleOperation() { | ||
| return | ||
| // Export schema usage info to configured exporters | ||
| if reqContext.operation != nil && !reqContext.operation.executionOptions.SkipLoader { |
| if reqContext.operation != nil && !reqContext.operation.executionOptions.SkipLoader { | |
| if reqContext.operation != nil { |
This check is why we were unable to reproduce a customer issue locally without implementing their subgraph. Is it necessary?
It ensures that we don't emit schema usage data for planning-only operations, which is what happens during testing.
| RouterMetrics RouterMetrics | ||
| Logger *zap.Logger | ||
| TrackUsageInfo bool | ||
| PrometheusUsageInfoTrack bool |
| PrometheusUsageInfoTrack bool | |
| PrometheusTrackUsageInfo bool |
| ExportSchemaUsageInfo(operationContext *operationContext, statusCode int, hasError bool, exportSynchronous bool) | ||
| GqlMetricsExporter() *graphqlmetrics.Exporter | ||
| ExportSchemaUsageInfoPrometheus(operationContext *operationContext, statusCode int, hasError bool, exportSynchronous bool) | ||
| GqlMetricsExporter() *graphqlmetrics.GraphQLMetricsExporter |
| GqlMetricsExporter() *graphqlmetrics.GraphQLMetricsExporter | |
| GQLMetricsExporter() *graphqlmetrics.GraphQLMetricsExporter |
| type Sink[T any] interface { | ||
| // Export sends a batch of items to the sink destination. | ||
| // It returns an error if the export fails. | ||
| // The context can be used to cancel or timeout the export operation. | ||
| Export(ctx context.Context, batch []T) error | ||
|
|
||
| // Close performs any cleanup needed when shutting down the sink. | ||
| // It should block until all cleanup is complete or the context is cancelled. | ||
| Close(ctx context.Context) error | ||
| } |
I wonder if this should be put in pkg instead? It seems it might be useful as a generic batcher and as far as I can tell doesn't include anything graphqlmetrics specific?
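For illustration, a minimal sink satisfying this interface needs nothing graphqlmetrics-specific at all (a hypothetical test helper; only the standard library context and sync packages are assumed):

// inMemorySink collects exported batches, e.g. for tests or local debugging.
type inMemorySink[T any] struct {
    mu      sync.Mutex
    batches [][]T
}

func (s *inMemorySink[T]) Export(ctx context.Context, batch []T) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    // Copy the batch so the exporter may safely reuse its internal buffer afterwards.
    copied := make([]T, len(batch))
    copy(copied, batch)
    s.batches = append(s.batches, copied)
    return nil
}

func (s *inMemorySink[T]) Close(ctx context.Context) error { return nil }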
|
|
||
| // Close performs any cleanup needed when shutting down the sink. | ||
| // It should block until all cleanup is complete or the context is cancelled. | ||
| Close(ctx context.Context) error |
Neither current sink actually has a close method, is this necessary?
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/pkg/config/config.schema.json (1)
1236-1282: Lock down schema_usage to reject legacy SampleRate and unknown keys
Currently additionalProperties is not set on schema_usage, so legacy sample_rate may be silently accepted and ignored. Tighten it and keep exporter properties as-is.
- "schema_usage": {
-   "type": "object",
-   "description": "Configure schema field usage metrics for Prometheus",
-   "properties": {
+ "schema_usage": {
+   "type": "object",
+   "description": "Configure schema field usage metrics for Prometheus",
+   "additionalProperties": false,
+   "properties": {
    "enabled": {
Optionally, add a custom pre-validation check to emit a clearer error if sample_rate is present.
♻️ Duplicate comments (1)
router/core/router_metrics.go (1)
19-19: Consider using GQLMetricsExporter for proper Go initialism formatting.
As noted in a previous review, Go's style guide recommends using all caps for initialisms like "GQL" rather than "Gql".
🧹 Nitpick comments (10)
router/internal/graphqlmetrics/exporter.go (2)
52-63: Clarify QueueSize comment
The queue holds items (T), not batches. Update the comment to avoid confusion.
382-416: Shutdown sequencing could prolong termination
Requests are canceled only in defer, and backoff sleeps are not cancelable. The diff above cancels earlier and makes sleeps cancelable to avoid long waits on shutdown.
router/pkg/config/config.schema.json (3)
1179-1185: Default mismatch: listen_addr
Schema default is "localhost:8088" but code (config.go Line 102) uses "127.0.0.1:8088" and defaults files do too. Align schema default.
- "default": "localhost:8088"
+ "default": "127.0.0.1:8088"
462-466: Default + description mismatch: websocket.forward_upgrade_query_params.enabled
Schema sets default false, description says default true, and code default is true. Align to true.
- "default": false,
- "description": "Forward upgrade request query parameters in the extensions payload when starting a subscription on a Subgraph. The default value is true."
+ "default": true,
+ "description": "Forward upgrade request query parameters in the extensions payload when starting a subscription on a Subgraph. The default value is true."
633-637: Default mismatch: access_logs.add_stacktrace
Schema default true; code default is false (config.go Line 883). Align to false.
- "default": true
+ "default": false
router/core/router_metrics.go (2)
140-170: Document the hash field choice.
The nil check at lines 141-143 properly addresses the defensive programming concern from the previous review. However, line 159 uses operationContext.sha256Hash (original operation hash) while the GraphQL export path uses operationContext.HashString() (normalized hash with variables).
According to the previous review analysis, this difference is intentional: Prometheus uses the original hash for operation correlation. Please add a brief comment at line 159 explaining this choice to prevent future confusion.
Apply this diff to document the hash semantics:
item := &graphqlmetricsv1.SchemaUsageInfo{
    TypeFieldMetrics: operationContext.GetTypeFieldUsageInfoMetrics(),
    OperationInfo: &graphqlmetricsv1.OperationInfo{
        Type: opType,
-       Hash: operationContext.sha256Hash,
+       Hash: operationContext.sha256Hash, // Original operation hash for Prometheus correlation (vs normalized hash used in GraphQL export)
        // parsed operation names are re-used across requests
        // for that reason, we need to copy the name, or it might get corrupted
        Name: strings.Clone(operationContext.name),
    },
145-153: Consider extracting the repeated operation type mapping.
The operation type switch (lines 145-153) is duplicated from the GraphQL export method (lines 93-101). Extracting this into a helper function would improve maintainability.
Example refactor:
func operationTypeToProto(opType OperationType) graphqlmetricsv1.OperationType {
    switch opType {
    case OperationTypeQuery:
        return graphqlmetricsv1.OperationType_QUERY
    case OperationTypeMutation:
        return graphqlmetricsv1.OperationType_MUTATION
    case OperationTypeSubscription:
        return graphqlmetricsv1.OperationType_SUBSCRIPTION
    default:
        return graphqlmetricsv1.OperationType_QUERY
    }
}
Then use opType := operationTypeToProto(operationContext.opType) in both export methods.
router/pkg/graphqlschemausage/schemausage.go (3)
64-104: Type field visitor preallocation and path handling look correct
Preallocating typeFieldUsageInfo with a small fixed capacity and switching to pathCopy := make(len(path)+1) plus copy avoids the extra allocation of slices.Clone(append(...)) while still guaranteeing that:
- path == nil is handled safely (len(path) is 0, copy is a no-op).
- Each pathCopy is independent and only shared intentionally between the direct field entry and its IndirectInterfaceField variant.
- Recursion passes pathCopy down and always re-allocates for children, so parent paths are never mutated.
The constants (capacity 32 and per-field path allocations) are modest and appropriate for typical query shapes; if future workloads show much larger field counts, benchmarks in this package can guide tuning, but there's no correctness or obvious perf risk here.
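As a concrete illustration of the copy-per-level approach described above (simplified; the real visitor uses its own path element type):

// childPath builds a child's path without aliasing the parent's slice.
func childPath(path []string, field string) []string {
    pathCopy := make([]string, len(path)+1) // exact-size allocation
    copy(pathCopy, path)                    // no-op when path is nil
    pathCopy[len(path)] = field
    return pathCopy // independent of the parent; children allocate again
}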
107-131: Argument usage slice preallocation is safe and aligns with usage pattern
Initializing visitor.usage with make([]*graphqlmetrics.ArgumentUsageInfo, 0, 16) is a straightforward allocation win for common cases and doesn't change behavior; all appends still produce the same sequence of entries as before, and the walker lifecycle remains unchanged.
If production traces show higher argument counts per operation, you can revisit the 16 heuristic, but as-is this is a low-risk, localized improvement.
159-244: Input usage preallocation and enum handling changes preserve semantics
The new preallocations in input usage look good:
- usage: make([]*graphqlmetrics.InputUsageInfo, 0, 16) in GetInputUsageInfo matches the typical small number of variables and reduces slice growth.
- In traverseVariable, setting usageInfo.Path = []string{parentTypeName, fieldName} when parentTypeName != "" allocates an exact-length slice and is equivalent to the previous append-based construction, while avoiding extra capacity and potential future aliasing.
- For enum arrays, usageInfo.EnumValues = make([]string, len(arr)) followed by index assignment is correct and avoids repeated append calls.
These changes don't alter how appendUniqueUsage and infoEquals behave, so deduplication semantics remain the same while allocations are slightly reduced.
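A small sketch of the exact-length allocation patterns mentioned above (simplified, not the actual traverseVariable code):

// Two-element path built in a single, exact-length allocation.
func inputPath(parentTypeName, fieldName string) []string {
    if parentTypeName == "" {
        return nil
    }
    return []string{parentTypeName, fieldName}
}

// Pre-sized slice filled by index instead of grown with append.
func enumValueNames(arr []string) []string {
    values := make([]string, len(arr))
    for i, v := range arr {
        values[i] = v
    }
    return values
}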
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
router/core/router_metrics.go (5 hunks)
router/internal/graphqlmetrics/exporter.go (5 hunks)
router/internal/graphqlmetrics/exporter_bench_test.go (1 hunks)
router/pkg/config/config.go (1 hunks)
router/pkg/config/config.schema.json (1 hunks)
router/pkg/config/testdata/config_defaults.json (1 hunks)
router/pkg/config/testdata/config_full.json (1 hunks)
router/pkg/graphqlschemausage/schemausage.go (6 hunks)
router/pkg/graphqlschemausage/schemausage_bench_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/pkg/config/testdata/config_defaults.json
router/pkg/config/config.schema.json
router/core/router_metrics.go
router/pkg/config/testdata/config_full.json
router/pkg/graphqlschemausage/schemausage.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
Applied to files:
router/pkg/config/config.schema.json
router/core/router_metrics.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function at lines 571-578, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the computed query hash before any APQ operations occur. There's also a test case that verifies this behavior.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.
Applied to files:
router/core/router_metrics.go
router/pkg/config/testdata/config_full.json
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the query body before any APQ operations occur.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-08-12T13:50:45.964Z
Learnt from: Noroth
Repo: wundergraph/cosmo PR: 2132
File: router-plugin/plugin.go:139-146
Timestamp: 2025-08-12T13:50:45.964Z
Learning: In the Cosmo router plugin system, the plugin framework creates its own logger independently. When creating a logger in NewRouterPlugin, it's only needed for the gRPC server setup (passed via setup.GrpcServerInitOpts.Logger) and doesn't need to be assigned back to serveConfig.Logger.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.
Applied to files:
router/core/router_metrics.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
router/pkg/config/testdata/config_full.json
🧬 Code graph analysis (6)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
Sink (9-18), SinkErrorHandler (23-23)
router/pkg/config/config.go (2)
router/internal/graphqlmetrics/exporter.go (1)
Exporter (17-33)
router/pkg/metric/config.go (1)
PrometheusSchemaFieldUsageExporter (48-53)
router/core/router_metrics.go (5)
router/core/operation_metrics.go (1)
OperationMetrics (31-40)
router/internal/graphqlmetrics/graphql_exporter.go (1)
GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
PrometheusMetricsExporter (13-15)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (10)
OperationType (26-26), OperationType (58-60), OperationType (62-64), OperationType (71-73), OperationType_QUERY (29-29), OperationType_MUTATION (30-30), OperationType_SUBSCRIPTION (31-31), SchemaUsageInfo (130-153), SchemaUsageInfo (168-168), SchemaUsageInfo (183-185)
router/core/context.go (4)
OperationType (503-503), OperationTypeQuery (506-506), OperationTypeMutation (507-507), OperationTypeSubscription (508-508)
router/internal/graphqlmetrics/exporter_bench_test.go (1)
router/internal/graphqlmetrics/exporter.go (3)
ExporterSettings (52-63), RetryOptions (35-40), NewExporter (84-120)
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)
router/pkg/graphqlschemausage/schemausage_test.go (1)
FakeFactory (527-529)
router/pkg/graphqlschemausage/schemausage.go (4)
GetTypeFieldUsageInfo (14-23), GetArgumentUsageInfo (107-124), GetInputUsageInfo (159-171), TypeFieldMetrics (26-26)
router/pkg/graphqlschemausage/schemausage.go (1)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (9)
TypeFieldUsageInfo (476-493), TypeFieldUsageInfo (508-508), TypeFieldUsageInfo (523-525), ArgumentUsageInfo (569-582), ArgumentUsageInfo (597-597), ArgumentUsageInfo (612-614), InputUsageInfo (644-659), InputUsageInfo (674-674), InputUsageInfo (689-691)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: build-router
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./events)
- GitHub Check: image_scan (nonroot)
- GitHub Check: image_scan
- GitHub Check: integration_test (./telemetry)
- GitHub Check: build_test
- GitHub Check: build_push_image
- GitHub Check: build_push_image (nonroot)
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: build_test
🔇 Additional comments (10)
router/internal/graphqlmetrics/exporter_bench_test.go (1)
27-66: Benchmarks cover key hot paths; LGTM
Good coverage of buffer pooling, high-throughput, batch cycling, growth, and parallel Record. Minor: time.Sleep in benchmarks can add variance, but acceptable here.
Also applies to: 68-111, 112-156, 157-197, 198-242
router/pkg/config/testdata/config_defaults.json (1)
66-71: Defaults alignment looks good
Exporter defaults match code/env defaults (4096/12800/1s/10s). LGTM.
router/pkg/config/testdata/config_full.json (1)
96-101: Config shape migration verified
Exporter object replaces SampleRate as intended; values consistent with defaults/examples. LGTM.
router/pkg/config/config.go (1)
116-119: The two PrometheusSchemaFieldUsageExporter definitions serve distinct purposes and should not be unified:
- router/pkg/config/config.go:121 is a configuration struct with YAML and environment variable tags for unmarshaling external configuration with defaults.
- router/pkg/metric/config.go:48 is an internal runtime struct used by the metrics package to track exporter state, with no serialization concerns.
This separation of concerns is intentional and appropriate. Consolidating them would couple the metric package to configuration unmarshaling logic and add unnecessary struct tags where they don't belong. The apparent duplication is justified.
Likely an incorrect or invalid review comment.
router/core/router_metrics.go (4)
3-5: Good use of standard library.
The strings package import supports the replacement of the custom strCopy implementation with strings.Clone(), which is a cleaner approach.
24-53: LGTM!
The struct fields and constructor are updated consistently. The new prometheusMetricsExporter field is properly initialized in NewRouterMetrics.
76-82: LGTM!
The getter methods correctly expose both exporters through the RouterMetrics interface.
84-138: LGTM!
The method improvements are well-executed:
- Added defensive nil check for the exporter (lines 89-91)
- Replaced custom strCopy with standard strings.Clone (line 122)
- Updated to use GetTypeFieldUsageInfoMetrics() for cleaner data access
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)
106-194: Benchmarks are well-structured and focus on the right hot paths
Each benchmark:
- Calls setupBenchmark once, then uses b.ResetTimer() and b.ReportAllocs() to keep setup out of the measured section.
- Exercises a single concern per benchmark (GetTypeFieldUsageInfo, GetArgumentUsageInfo, GetInputUsageInfo, IntoGraphQLMetrics, and the combined end-to-end flow), with errors surfaced via b.Fatal.
- Prevents compiler optimizations by assigning results to a blank variable.
The end-to-end benchmark intentionally reuses the precomputed plan, docs, and variables to focus on schema usage extraction rather than planning, which matches the PR’s perf focus. This all looks good and should give you stable allocation/CPU signals for regression tracking.
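For reference, the benchmark shape being described is the usual Go pattern (illustrative only; setupBenchmark and extractUsage stand in for the real helpers in the test file):

func BenchmarkSchemaUsageExtraction(b *testing.B) {
    data := setupBenchmark(b) // hypothetical helper: expensive setup done once, outside the measurement
    b.ReportAllocs()
    b.ResetTimer()
    var out int
    for i := 0; i < b.N; i++ {
        out = len(extractUsage(data)) // hypothetical stand-in for the work under test
    }
    _ = out // keep the result live so the compiler cannot elide the call
}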
19-104: No issues found; original concern is unfounded.
The pattern of calling astjson.ParseBytes(op.Input.Variables) is safe and well-established. The Variables field originates as json.RawMessage (a []byte type with omitempty tag), which defaults to an empty slice rather than nil when unmarshaled. This empty slice is valid JSON input for astjson.ParseBytes. The identical pattern already exists and is tested in schemausage_test.go (line 206) with successful results, confirming the code path is sound.
| queue chan T | ||
| inflightBatches *atomic.Int64 | ||
| batchBufferPool *sync.Pool // Pool for batch slice buffers to reduce allocations | ||
|
|
Critical: batch buffer returned to pool before export completes (data race/corruption)
Buffer is enqueued for export and immediately returned to the pool (lines 332-333, 341-342, 364-366, 371-372). The goroutine then exports a slice referencing memory that may be reused/modified, leading to corrupted batches and races. Also, the pool stores *[]T and rewraps slice headers, which is error‑prone.
Fix by:
- Pooling []T values (not pointers).
- Returning buffers to the pool only after export finishes (inside the export goroutine).
- Making retry backoff sleep cancelable and cancel export requests earlier during Shutdown.
Apply this diff:
@@
- "sync"
+ "sync"
@@
- batchBufferPool *sync.Pool // Pool for batch slice buffers to reduce allocations
+ batchBufferPool *sync.Pool // Pool of []T to reduce allocations
@@
- batchBufferPool: &sync.Pool{
- New: func() any {
- // Pre-allocate slice with batch size capacity
- buffer := make([]T, 0, settings.BatchSize)
- return &buffer
- },
- },
+ batchBufferPool: &sync.Pool{
+ New: func() any {
+ // Pre-allocate slice with batch size capacity
+ return make([]T, 0, settings.BatchSize)
+ },
+ },
@@
-func (e *Exporter[T]) getBatchBuffer() []T {
- bufferPtr := e.batchBufferPool.Get().(*[]T)
- buffer := *bufferPtr
- // Ensure the buffer is empty (should already be, but be defensive)
- return buffer[:0]
-}
+func (e *Exporter[T]) getBatchBuffer() []T {
+ bufAny := e.batchBufferPool.Get()
+ if bufAny == nil {
+ return make([]T, 0, e.settings.BatchSize)
+ }
+ buf := bufAny.([]T)
+ return buf[:0]
+}
@@
-func (e *Exporter[T]) putBatchBuffer(buffer []T) {
- // Reset the slice to zero length while keeping capacity
- buffer = buffer[:0]
- e.batchBufferPool.Put(&buffer)
-}
+func (e *Exporter[T]) putBatchBuffer(buf []T) {
+ e.batchBufferPool.Put(buf[:0])
+}
@@
func (e *Exporter[T]) prepareAndSendBatch(batch []T) {
e.logger.Debug("Preparing to send batch", zap.Int("batch_size", len(batch)))
e.inflightBatches.Inc()
- go func() {
+ go func(b []T) {
defer e.inflightBatches.Dec()
- e.exportBatchWithRetry(batch)
- }()
+ defer e.putBatchBuffer(b) // return buffer only after export completes
+ e.exportBatchWithRetry(b)
+ }(batch)
}
@@
- case <-ticker.C:
+ case <-ticker.C:
e.logger.Debug("Tick: flushing buffer", zap.Int("buffer_size", len(buffer)))
if len(buffer) > 0 {
- e.prepareAndSendBatch(buffer)
- // Return buffer to pool after sending
- e.putBatchBuffer(buffer)
+ e.prepareAndSendBatch(buffer) // buffer will be returned by goroutine
buffer = nil
}
@@
if len(buffer) == e.settings.BatchSize {
e.logger.Debug("Buffer full, sending batch", zap.Int("batch_size", len(buffer)))
- e.prepareAndSendBatch(buffer)
- // Return buffer to pool after sending
- e.putBatchBuffer(buffer)
+ e.prepareAndSendBatch(buffer) // buffer will be returned by goroutine
buffer = nil
}
@@
- if len(buffer) == e.settings.BatchSize {
- e.prepareAndSendBatch(buffer)
- // Return buffer to pool and get a new one
- e.putBatchBuffer(buffer)
- buffer = e.getBatchBuffer()
- }
+ if len(buffer) == e.settings.BatchSize {
+ e.prepareAndSendBatch(buffer) // buffer returned by goroutine
+ buffer = e.getBatchBuffer()
+ }
@@
- if len(buffer) > 0 {
- e.prepareAndSendBatch(buffer)
- // Return buffer to pool before returning
- e.putBatchBuffer(buffer)
- }
+ if len(buffer) > 0 {
+ e.prepareAndSendBatch(buffer) // buffer returned by goroutine
+ }And make backoff sleeps cancelable and cancel export requests earlier during shutdown:
@@
func (e *Exporter[T]) exportBatchWithRetry(batch []T) {
@@
- // Wait for the specified backoff period
- sleepDuration := b.Duration()
- e.logger.Debug("Retrying export after backoff",
+ // Wait for the specified backoff period (cancelable)
+ sleepDuration := b.Duration()
+ e.logger.Debug("Retrying export after backoff",
zap.Int("batch_size", len(batch)),
zap.Int("retry", retry),
zap.Duration("sleep", sleepDuration),
)
-
- time.Sleep(sleepDuration)
+ select {
+ case <-e.exportRequestContext.Done():
+ e.logger.Debug("Export cancelled during backoff")
+ return
+ case <-time.After(sleepDuration):
+ }
@@
func (e *Exporter[T]) Shutdown(ctx context.Context) error {
e.logger.Debug("Shutdown started")
@@
- defer func() {
- ticker.Stop()
- // Cancel all export requests
- e.cancelAllExportRequests()
+ // Cancel export requests early to unblock any long requests/backoff
+ e.cancelAllExportRequests()
+ defer func() {
+ ticker.Stop()
// Close the sink
if err := e.sink.Close(ctx); err != nil {
e.logger.Error("Error closing sink", zap.Error(err))
}
e.logger.Debug("Shutdown complete")
}()Minor nit: the comment for QueueSize says “number of batches” but the queue holds items. Consider updating it for clarity.
Also applies to: 96-114, 319-344, 352-377
🤖 Prompt for AI Agents
router/internal/graphqlmetrics/exporter.go around lines 24-27 and the other
affected ranges (96-114, 319-344, 352-377): the current implementation returns
*[]T buffers to the sync.Pool immediately after enqueueing, which allows the
buffer memory to be reused while an export goroutine is still referencing it,
and the pool stores pointers to slices which rewraps slice headers unsafely.
Change the pool to hold []T (not *[]T), stop returning buffers to the pool
before export completes, and move the buffer Put(...) into the export goroutine
after the export attempt (including after retries) so the goroutine owns the
slice lifetime; make any retry backoff sleeps cancelable by using a select on
ctx.Done() and a time.After or time.NewTimer channel so sleeps abort immediately
on Shutdown; ensure Shutdown cancels in-flight export requests earlier by
cancelling the export context (or closing a done channel) before waiting for
goroutines, and update the QueueSize comment to accurately say it’s the queue
capacity in items (not "number of batches"). Ensure all places that currently
Put a *[]T are updated to Put a []T and that enqueued values are copies of the
slice header or newly allocated slices if needed so the pool semantics are safe.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
router/internal/graphqlmetrics/exporter_test.go (1)
70-71: Minor: prefer require.NoError for errors
Switch to require.NoError for clarity. Applies to all similar occurrences.
- require.Nil(t, err)
+ require.NoError(t, err)
Also applies to: 149-150, 188-189, 222-223, 345-346, 421-422
♻️ Duplicate comments (3)
router/internal/graphqlmetrics/exporter.go (3)
56-57: Doc nit: QueueSize counts items, not batches
The queue holds items (chan T), not batches. Update comment for accuracy.
- // QueueSize is the maximum number of batches allowed in queue at a given time.
+ // QueueSize is the maximum number of items allowed in the queue at a given time.
272-290: Backoff sleep not cancelable; goroutines may linger on shutdown
Make backoff waits respect e.exportRequestContext so Shutdown can exit promptly.
- // Wait for the specified backoff period
- sleepDuration := b.Duration()
- e.logger.Debug("Retrying export after backoff",
+ // Wait for the specified backoff period (cancelable)
+ sleepDuration := b.Duration()
+ e.logger.Debug("Retrying export after backoff",
      zap.Int("batch_size", len(batch)),
      zap.Int("retry", retry),
      zap.Duration("sleep", sleepDuration),
  )
-
- time.Sleep(sleepDuration)
+ select {
+ case <-e.exportRequestContext.Done():
+     e.logger.Debug("Export cancelled during backoff")
+     return
+ case <-time.After(sleepDuration):
+ }
381-394: Shutdown can hang; cancel exports earlier
Cancel export requests before waiting, not in defer, so sleeping/blocked exports exit promptly.
  func (e *Exporter[T]) Shutdown(ctx context.Context) error {
      e.logger.Debug("Shutdown started")
      ticker := time.NewTicker(time.Millisecond * 100)
-     defer func() {
-         ticker.Stop()
-         // Cancel all export requests
-         e.cancelAllExportRequests()
-         // Close the sink
-         if err := e.sink.Close(ctx); err != nil {
-             e.logger.Error("Error closing sink", zap.Error(err))
-         }
-         e.logger.Debug("Shutdown complete")
-     }()
+     defer func() {
+         ticker.Stop()
+         // Close the sink
+         if err := e.sink.Close(ctx); err != nil {
+             e.logger.Error("Error closing sink", zap.Error(err))
+         }
+         e.logger.Debug("Shutdown complete")
+     }()
      // First close the acceptTrafficSema to stop accepting new items
      close(e.acceptTrafficSema)
      // Then trigger the shutdown signal for the exporter goroutine to stop
      // It will drain the queue and send the remaining items
      close(e.shutdownSignal)
+     // Cancel all export requests to unblock backoff/sinks started before shutdown
+     e.cancelAllExportRequests()
🧹 Nitpick comments (5)
router/internal/graphqlmetrics/prometheus_sink.go (1)
104-106: Use int64 for counts to match metric API and avoid unnecessary casts
Store counts as int64; remove per-iteration int conversions.
-func (s *PrometheusSink) aggregateBatch(batch []*graphqlmetrics.SchemaUsageInfo) map[aggregatedUsageKey]int {
-    aggregatedCounts := make(map[aggregatedUsageKey]int)
+func (s *PrometheusSink) aggregateBatch(batch []*graphqlmetrics.SchemaUsageInfo) map[aggregatedUsageKey]int64 {
+    aggregatedCounts := make(map[aggregatedUsageKey]int64)
@@
-    // Increment count, using field.Count if available, otherwise 1
-    if field.Count > 0 {
-        aggregatedCounts[key] += int(field.Count)
-    } else {
-        aggregatedCounts[key]++
-    }
+    // Increment count, using field.Count if available, otherwise 1
+    if field.Count > 0 {
+        aggregatedCounts[key] += int64(field.Count)
+    } else {
+        aggregatedCounts[key] += 1
+    }
And in Export:
-    for key, count := range aggregatedCounts {
+    for key, count := range aggregatedCounts {
@@
-        int64(count),
+        count,
Also applies to: 140-145, 81-87
router/demo.config.yaml (1)
16-25: LGTM: concise demo of Prometheus schema_usage
Sample config is coherent. Optional: add a comment noting include_operation_sha increases label cardinality.
router/internal/graphqlmetrics/exporter_test.go (1)
385-391: Reduce flakiness: prefer Eventually over fixed Sleep
Replace fixed sleep with Eventually to wait for flush deterministically.
- time.Sleep(200 * time.Millisecond)
-
- defer require.Nil(t, e.Shutdown(context.Background()))
-
- require.Equal(t, 1, len(c.publishedAggregations))
- require.Equal(t, 5, len(c.publishedAggregations[0]))
+ require.Eventually(t, func() bool {
+     c.mu.Lock()
+     defer c.mu.Unlock()
+     return len(c.publishedAggregations) == 1 && len(c.publishedAggregations[0]) == 5
+ }, 2*time.Second, 50*time.Millisecond)
+ defer require.NoError(t, e.Shutdown(context.Background()))
router/pkg/config/config.schema.json (1)
1249-1281: Tighten exporter schema and fix unit doc
Disallow unknown keys and include 'ms' in interval docs to match usage.
- "exporter": {
-   "type": "object",
-   "description": "Configuration for the schema usage exporter",
-   "properties": {
+ "exporter": {
+   "type": "object",
+   "description": "Configuration for the schema usage exporter",
+   "additionalProperties": false,
+   "properties": {
@@
- "interval": {
+ "interval": {
    "type": "string",
    "default": "2s",
    "duration": { "minimum": "100ms" },
-   "description": "The interval at which the schema usage queue is flushed. The period is specified as a string with a number and a unit, e.g. 10s, 1m. The supported units are 's', 'm', 'h'."
+   "description": "The interval at which the schema usage queue is flushed. Specify a duration, e.g. 100ms, 10s, 1m. Supported units: 'ms', 's', 'm', 'h'."
  },
router/internal/graphqlmetrics/exporter.go (1)
107-114: Prefer pooling []T (not *[]T) to simplify and avoid pointer indirection
Safer and simpler: pool raw slices; ownership already fixed by deferring put in goroutine.
- batchBufferPool: &sync.Pool{
-     New: func() any {
-         // Pre-allocate slice with batch size capacity
-         buffer := make([]T, 0, settings.BatchSize)
-         return &buffer
-     },
- },
+ batchBufferPool: &sync.Pool{
+     New: func() any {
+         return make([]T, 0, settings.BatchSize)
+     },
+ },
@@
-func (e *Exporter[T]) getBatchBuffer() []T {
-    bufferPtr := e.batchBufferPool.Get().(*[]T)
-    buffer := *bufferPtr
-    // Ensure the buffer is empty (should already be, but be defensive)
-    return buffer[:0]
-}
+func (e *Exporter[T]) getBatchBuffer() []T {
+    bufAny := e.batchBufferPool.Get()
+    if bufAny == nil {
+        return make([]T, 0, e.settings.BatchSize)
+    }
+    buf := bufAny.([]T)
+    return buf[:0]
+}
@@
-func (e *Exporter[T]) putBatchBuffer(buffer []T) {
-    // Reset the slice to zero length while keeping capacity
-    buffer = buffer[:0]
-    e.batchBufferPool.Put(&buffer)
-}
+func (e *Exporter[T]) putBatchBuffer(buf []T) {
+    e.batchBufferPool.Put(buf[:0])
+}
Also applies to: 154-161, 163-169
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
router/demo.config.yaml(2 hunks)router/internal/graphqlmetrics/exporter.go(5 hunks)router/internal/graphqlmetrics/exporter_test.go(10 hunks)router/internal/graphqlmetrics/prometheus_sink.go(1 hunks)router/pkg/config/config.go(1 hunks)router/pkg/config/config.schema.json(1 hunks)router/pkg/config/config_test.go(1 hunks)router/pkg/config/testdata/config_defaults.json(1 hunks)router/pkg/config/testdata/config_full.json(1 hunks)router/pkg/graphqlschemausage/schemausage.go(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- router/pkg/config/config_test.go
- router/pkg/graphqlschemausage/schemausage.go
- router/pkg/config/testdata/config_full.json
- router/pkg/config/testdata/config_defaults.json
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.
Applied to files:
router/internal/graphqlmetrics/exporter_test.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
Applied to files:
router/pkg/config/config.schema.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.
Applied to files:
router/pkg/config/config.schema.json
🧬 Code graph analysis (4)
router/internal/graphqlmetrics/exporter_test.go (2)
router/internal/graphqlmetrics/graphql_exporter.go (1)
NewGraphQLMetricsExporter (19-39)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (3)
RequestInfo (75-82), RequestInfo (97-97), RequestInfo (112-114)
router/pkg/config/config.go (2)
router/pkg/metric/config.go (1)
PrometheusSchemaFieldUsageExporter (48-53)
router-tests/testenv/testenv.go (1)
PrometheusSchemaFieldUsageExporter (285-290)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
Store (149-162)
router/pkg/otel/attributes.go (5)
WgOperationName (10-10), WgOperationType (11-11), WgOperationSha256 (59-59), WgGraphQLFieldName (60-60), WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
TypeFieldMetrics (26-26)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
Sink (9-18), SinkErrorHandler (23-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: integration_test (./events)
- GitHub Check: image_scan (nonroot)
- GitHub Check: build_push_image (nonroot)
- GitHub Check: image_scan
- GitHub Check: build_test
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/pkg/config/config.go (1)
116-119: LGTM: exporter config shape and defaults
Exporter block and env tags look consistent with schema and intended wiring.
Also applies to: 121-126
router/internal/graphqlmetrics/prometheus_sink.go (1)
121-131: Avoid potential nil deref on field entries
TypeFieldMetrics is a slice of pointers; a nil element will panic on Path/TypeNames access.
- for _, field := range usageInfo.TypeFieldMetrics {
+ for _, field := range usageInfo.TypeFieldMetrics {
+     if field == nil {
+         continue
+     }
      // Skip fields without valid parent type or path
      if len(field.Path) == 0 || len(field.TypeNames) < 1 {
          continue
      }
⛔ Skipped due to learnings
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
| if e.settings.RetryOptions.MaxDuration <= 0 { | ||
| return errors.New("retry max duration must be positive") | ||
| return fmt.Errorf("retry max duration must be positive") | ||
| } | ||
|
|
||
| if e.settings.RetryOptions.Interval <= 0 { | ||
| return errors.New("retry interval must be positive") | ||
| return fmt.Errorf("retry interval must be positive") | ||
| } | ||
|
|
||
| if e.settings.RetryOptions.MaxRetry <= 0 { | ||
| return errors.New("retry max retry must be positive") | ||
| return fmt.Errorf("retry max retry must be positive") | ||
| } |
Validation rejects zero retry values even when retries are disabled
Only enforce retry field positivity if RetryOptions.Enabled is true.
- if e.settings.RetryOptions.MaxDuration <= 0 {
- return fmt.Errorf("retry max duration must be positive")
- }
- if e.settings.RetryOptions.Interval <= 0 {
- return fmt.Errorf("retry interval must be positive")
- }
- if e.settings.RetryOptions.MaxRetry <= 0 {
- return fmt.Errorf("retry max retry must be positive")
- }
+ if e.settings.RetryOptions.Enabled {
+ if e.settings.RetryOptions.MaxDuration <= 0 {
+ return fmt.Errorf("retry max duration must be positive")
+ }
+ if e.settings.RetryOptions.Interval <= 0 {
+ return fmt.Errorf("retry interval must be positive")
+ }
+ if e.settings.RetryOptions.MaxRetry <= 0 {
+ return fmt.Errorf("retry max retry must be positive")
+ }
+ }
🤖 Prompt for AI Agents
In router/internal/graphqlmetrics/exporter.go around lines 139 to 149, the
current validation unconditionally rejects zero/negative retry values; change
the checks so they only run when e.settings.RetryOptions.Enabled is true (and
ensure RetryOptions is non-nil before accessing Enabled). Concretely, wrap the
three positivity checks for MaxDuration, Interval and MaxRetry in an if
e.settings.RetryOptions != nil && e.settings.RetryOptions.Enabled { ... } block
so disabled retries may have zero values, and return the same error messages
when validation runs.
| return &PrometheusSink{ | ||
| metricStore: cfg.MetricStore, | ||
| logger: cfg.Logger.With(zap.String("component", "prometheus_sink")), | ||
| includeOpSha: cfg.IncludeOpSha, | ||
| } |
Guard against nil logger in constructor
cfg.Logger can be nil; calling .With will panic. Default to zap.NewNop().
func NewPrometheusSink(cfg PrometheusSinkConfig) *PrometheusSink {
- return &PrometheusSink{
- metricStore: cfg.MetricStore,
- logger: cfg.Logger.With(zap.String("component", "prometheus_sink")),
+ lg := cfg.Logger
+ if lg == nil {
+ lg = zap.NewNop()
+ }
+ return &PrometheusSink{
+ metricStore: cfg.MetricStore,
+ logger: lg.With(zap.String("component", "prometheus_sink")),
includeOpSha: cfg.IncludeOpSha,
}
}
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In router/internal/graphqlmetrics/prometheus_sink.go around lines 41 to 45, the
constructor uses cfg.Logger.With(...) but cfg.Logger may be nil and will panic;
guard by defaulting to zap.NewNop() when cfg.Logger is nil, assign a local
logger variable like logger := cfg.Logger; if logger == nil { logger =
zap.NewNop() } and then call
logger.With(zap.String("component","prometheus_sink")) when constructing the
PrometheusSink.
This PR ensures that we never drop operation usage data due to sampling. We eliminated sampling entirely while maintaining the same CPU utilization. An increase in total allocated memory is expected. With this change, operations that occur infrequently are always tracked.
I achieved this by making the gqlmetrics batch collector generic and using it for both implementations.
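Conceptually, the shared pipeline boils down to one batching loop parameterized by a sink; the Cosmo collector and the local Prometheus store are just two sinks plugged into the same loop. A toy version of the idea (not the exact code in this PR; only context and time from the standard library are assumed):

// Sink receives full batches; the exporter does not care where they go.
type Sink[T any] interface {
    Export(ctx context.Context, batch []T) error
    Close(ctx context.Context) error
}

// runBatcher drains items from in, flushing by size or by time.
func runBatcher[T any](ctx context.Context, in <-chan T, batchSize int, flushEvery time.Duration, sink Sink[T]) {
    batch := make([]T, 0, batchSize)
    ticker := time.NewTicker(flushEvery)
    defer ticker.Stop()
    flush := func() {
        if len(batch) == 0 {
            return
        }
        _ = sink.Export(ctx, batch) // the real exporter retries and reports errors
        batch = make([]T, 0, batchSize)
    }
    for {
        select {
        case item, ok := <-in:
            if !ok {
                flush()
                return
            }
            batch = append(batch, item)
            if len(batch) == batchSize {
                flush()
            }
        case <-ticker.C:
            flush()
        case <-ctx.Done():
            flush()
            return
        }
    }
}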
Fixes ENG-8486
CPU
Memory
Config
Summary by CodeRabbit
New Features
Tests
Checklist