
Conversation

@StarpTech StarpTech (Contributor) commented Nov 13, 2025

This PR ensures that we never drop operation usage data due to sampling. Sampling is eliminated entirely while maintaining the same CPU utilization. An increase in total allocated memory is expected. With this change, operations with low occurrence are always tracked.

I achieved this by making the gqlmetrics batch collector generic and using it for both implementations.

Fixes ENG-8486

CPU

CleanShot 2025-11-13 at 23 36 01@2x

Memory

CleanShot 2025-11-13 at 23 38 28@2x

Config

telemetry:
  metrics:
    otlp:
      enabled: false
    prometheus:
      schema_usage:
        enabled: true
        exporter:
          batch_size: 4096
          export_timeout: 10s
          interval: 10s
          queue_size: 10240

Summary by CodeRabbit

  • New Features

    • Schema field usage now tracks all GraphQL requests (sampling removed) and adds configurable exporter settings for batching, queueing, flush interval, and export timeout.
    • New pluggable metrics exporters and sinks for Prometheus and GraphQL metrics with synchronous/asynchronous recording and graceful shutdown.
    • Added cached per-operation field metrics to reduce repeated work.
  • Tests

    • Improved reliability with time-based flush waits, exporter-driven scenarios, dynamic assertions, aggregation-focused expectations, and new benchmarks for exporter throughput and buffering.


@coderabbitai coderabbitai bot commented Nov 13, 2025

Walkthrough

Replaces probabilistic SampleRate-based schema-field sampling with a configurable exporter pipeline: introduces a generic Exporter[T] and Sink[T] plus GraphQL and Prometheus sinks and exporters, wires a Prometheus exporter into the graph mux and router metrics, and updates configs and tests to use batching and time-based flushing.

Changes

• Config & JSON schema (router/pkg/config/config.go, router/pkg/config/config.schema.json, router/pkg/metric/config.go, router/pkg/config/testdata/config_defaults.json, router/pkg/config/testdata/config_full.json, router/pkg/config/config_test.go): Removed SampleRate from PrometheusSchemaFieldUsage; added a nested Exporter (PrometheusSchemaFieldUsageExporter) with BatchSize, QueueSize, Interval, and ExportTimeout; updated the JSON schema, defaults, testdata, and tests to validate exporter settings and precedence.
• Router core — wiring & lifecycle (router/core/router.go, router/core/router_config.go, router/core/graph_server.go): Replaced the metrics exporter constructor usage with the new GraphQL exporter constructor/signature; added an internal prometheusMetricsExporter on graphMux; created and wired the Prometheus exporter when PromSchemaFieldUsage is enabled; passed the exporter into RouterMetrics; and ensured Shutdown on graph mux shutdown.
• Router core — metrics & operation tracking (router/core/router_metrics.go, router/core/operation_metrics.go, router/core/context.go): Removed the sampling machinery and usageKey; replaced sampling flags with PrometheusUsageInfoTrack; added ExportSchemaUsageInfoPrometheus and a PrometheusMetricsExporter() accessor; added operationContext.GetTypeFieldUsageInfoMetrics() caching; simplified the OperationMetrics finish path to use exporter-driven export.
• Generic exporter & sink abstractions (router/internal/graphqlmetrics/exporter.go, router/internal/graphqlmetrics/sink.go): Introduced a generic Exporter[T any] with queueing, batching, retry/backoff, synchronous flush/drain, and shutdown semantics; added the Sink[T any] and SinkErrorHandler abstractions.
• GraphQL metrics exporter & sink (router/internal/graphqlmetrics/graphql_exporter.go, router/internal/graphqlmetrics/graphql_metrics_sink.go): Added GraphQLMetricsExporter, wrapping the generic exporter, and GraphQLMetricsSink, which aggregates SchemaUsageInfo and publishes via Connect RPC; constructors, Export/Close, and retry classification implemented.
• Prometheus metrics exporter & sink (router/internal/graphqlmetrics/prometheus_exporter.go, router/internal/graphqlmetrics/prometheus_sink.go): Added PrometheusMetricsExporter, wrapping the generic exporter, and PrometheusSink, which aggregates schema-usage batches and emits OpenTelemetry/Prometheus metrics; aggregation and attribute composition implemented.
• Exporter tests & benchmarks (router/internal/graphqlmetrics/exporter_test.go, router/internal/graphqlmetrics/exporter_bench_test.go): Updated tests to call NewGraphQLMetricsExporter and adapted expectations to aggregation-based publishing; added new exporter benchmarks and adjusted tests to the new RequestInfo shape and aggregation semantics.
• Prometheus integration tests & testenv (router-tests/testenv/testenv.go, router-tests/prometheus_improved_test.go): Replaced the inline SampleRate with Exporter config and test-friendly defaults; tests use custom exporter settings, add time.Sleep waits for flushes, and verify metrics by aggregated presence checks instead of fixed index assertions.
• Schema usage traversal & benchmarks (router/pkg/graphqlschemausage/schemausage.go, router/pkg/graphqlschemausage/schemausage_bench_test.go): Allocation optimizations (preallocations, manual slice copies) in the schema usage traversal; new schema-usage benchmarks added.
• Module manifest (go.mod): Dependency/import updates to accommodate the new internal/graphqlmetrics types and sinks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Focus review on:
    • Generic Exporter[T] concurrency: queue sizing, drain/shutdown ordering, synchronous flush semantics and context/cancel handling.
    • PrometheusSink aggregation correctness: grouping key, attribute composition, and accurate counts per operation/field.
    • Router/graphMux wiring and lifecycle: creation order, nil-checks, and Shutdown sequencing for the new prometheus exporter.
    • Tests using time.Sleep: potential flakiness and whether synchronous flush paths guarantee delivery during teardown.


Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 61.29%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the PR title accurately describes the main change, introducing a Prometheus sink for gqlmetrics with comprehensive operation tracking, and directly reflects the core objective of removing sampling to track all operations efficiently.

@github-actions github-actions bot commented Nov 13, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-080c190f4ab7b9bece2cdad63a810c99627074ec-nonroot

@github-actions github-actions bot commented

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-ced880bbbe6f9957e486f7deb9115c6b40f2a1bb-nonroot

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/core/graph_server.go (1)

758-775: Shutdown ordering: stop exporter before closing metric store.

Currently metricStore is shut down before prometheusMetricsExporter. If the exporter is still flushing, its writes to the metric store can fail. Shut down the exporter first, then the stores.

Apply:

- if s.metricStore != nil {
-   if aErr := s.metricStore.Shutdown(ctx); aErr != nil {
-     err = errors.Join(err, aErr)
-   }
- }
- ...
- if s.prometheusMetricsExporter != nil {
-   if aErr := s.prometheusMetricsExporter.Shutdown(ctx); aErr != nil {
-     err = errors.Join(err, aErr)
-   }
- }
+ if s.prometheusMetricsExporter != nil {
+   if aErr := s.prometheusMetricsExporter.Shutdown(ctx); aErr != nil {
+     err = errors.Join(err, aErr)
+   }
+ }
+ if s.metricStore != nil {
+   if aErr := s.metricStore.Shutdown(ctx); aErr != nil {
+     err = errors.Join(err, aErr)
+   }
+ }
🧹 Nitpick comments (14)
router/internal/graphqlmetrics/graphql_metrics_sink.go (2)

29-36: Handle potential nil logger in NewGraphQLMetricsSink

cfg.Logger.With(...) will panic if Logger is ever nil (e.g. in tests or ad-hoc usage). Consider either guarding and falling back to zap.NewNop() or documenting/enforcing a non-nil logger requirement at call sites.
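
A minimal sketch of the suggested guard, assuming the config carries a *zap.Logger (type and field names are illustrative, not the actual constructor):

package graphqlmetrics

import "go.uber.org/zap"

// Illustrative config shape; the real GraphQLMetricsSinkConfig may differ.
type sinkConfig struct {
	Logger *zap.Logger
}

type metricsSink struct {
	logger *zap.Logger
}

// newSink falls back to a no-op logger so logger.With(...) can never panic
// when callers leave Logger unset.
func newSink(cfg sinkConfig) *metricsSink {
	logger := cfg.Logger
	if logger == nil {
		logger = zap.NewNop()
	}
	return &metricsSink{
		logger: logger.With(zap.String("component", "graphql_metrics_sink")),
	}
}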


70-91: Consider tightening retry classification in IsRetryableError

Right now any non-connect.Error (including context cancellations or local/config errors) is treated as retryable. If such errors can reach this helper, consider explicitly short‑circuiting on context.Canceled / context.DeadlineExceeded or other known permanent failures in the caller to avoid pointless retries.
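
A hedged sketch of the tightened classification using connect-go error semantics; the exact set of retryable codes here is an assumption, not the sink's actual policy:

package graphqlmetrics

import (
	"context"
	"errors"

	"connectrpc.com/connect"
)

// isRetryable short-circuits on context cancellation/deadline errors before
// falling back to connect error-code classification.
func isRetryable(err error) bool {
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return false // local/permanent: retrying cannot help
	}
	var connectErr *connect.Error
	if errors.As(err, &connectErr) {
		switch connectErr.Code() {
		case connect.CodeUnavailable, connect.CodeDeadlineExceeded, connect.CodeResourceExhausted:
			return true // transient server-side conditions
		default:
			return false
		}
	}
	// Unknown error types: treated as retryable today; see note above.
	return true
}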

router/pkg/config/config.schema.json (1)

1249-1281: Exporter schema matches the new Prometheus usage exporter config

The schema_usage.exporter block (batch_size, queue_size, interval, export_timeout) and defaults are consistent and align with the new Go config type. If you want stricter validation, you could optionally add "additionalProperties": false to the exporter object to catch typos in config keys, but the current schema is functionally fine.

router/internal/graphqlmetrics/exporter_test.go (1)

369-375: Avoid fixed sleeps in tests; poll until metrics observed.

Replace time.Sleep with polling (e.g., require.Eventually) or call Shutdown before assertions to reduce flakiness.

Example:

- time.Sleep(200 * time.Millisecond)
+ require.Eventually(t, func() bool {
+   _, err := promRegistry.Gather()
+   return err == nil && len(c.publishedAggregations) > 0
+ }, 2*time.Second, 50*time.Millisecond)

Also applies to: 371-375

router-tests/testenv/testenv.go (1)

279-290: Clamp exporter settings to safe minimums in tests.

If a caller supplies zeros, normalize to sane minimums (e.g., BatchSize>=1, QueueSize>=1, Interval>0) before wiring to router metric config to prevent silent stalls.

Also applies to: 1508-1552
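
A small sketch of such a clamp, with a struct that mirrors the PrometheusSchemaFieldUsageExporter fields referenced above (the helper itself is hypothetical):

package testenv

import "time"

// exporterSettings mirrors the exporter config fields for illustration.
type exporterSettings struct {
	BatchSize     int
	QueueSize     int
	Interval      time.Duration
	ExportTimeout time.Duration
}

// clampExporterSettings normalizes zero/negative values to safe minimums so
// misconfigured tests cannot stall the exporter silently.
func clampExporterSettings(s exporterSettings) exporterSettings {
	if s.BatchSize < 1 {
		s.BatchSize = 1
	}
	if s.QueueSize < 1 {
		s.QueueSize = 1
	}
	if s.Interval <= 0 {
		s.Interval = 100 * time.Millisecond
	}
	if s.ExportTimeout <= 0 {
		s.ExportTimeout = time.Second
	}
	return s
}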

router-tests/prometheus_improved_test.go (1)

82-96: Reduce flakiness: replace fixed sleeps with Eventually/polling.

Use require.Eventually to wait for metrics flush instead of fixed sleeps. Keeps tests stable on slow CI.

Also applies to: 166-171, 229-233, 269-274, 315-321, 359-365, 413-419

router/core/graph_server.go (1)

923-949: Defaults/validation pass-through for exporter settings.

When wiring settings from cfg.Exporter, ensure they are validated/clamped (BatchSize, QueueSize, Interval, ExportTimeout) before use to prevent zero/negative configs from stalling the exporter.

router/core/router.go (2)

841-851: Prometheus schema usage: consider logging exporter settings as well

The enablement log for Prometheus schema field usage is clear. Since batching/queue behavior is now configurable, consider also logging the effective exporter settings (batch size, queue size, interval, export timeout) here to simplify debugging misconfigurations.
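
For illustration, such a log line might look like the following sketch; the parameter plumbing is assumed:

package core

import (
	"time"

	"go.uber.org/zap"
)

// logSchemaUsageEnabled logs the effective exporter settings alongside the
// enablement message, making misconfigurations visible at startup.
func logSchemaUsageEnabled(logger *zap.Logger, batchSize, queueSize int, interval, exportTimeout time.Duration) {
	logger.Info("Prometheus schema field usage tracking enabled",
		zap.Int("batch_size", batchSize),
		zap.Int("queue_size", queueSize),
		zap.Duration("interval", interval),
		zap.Duration("export_timeout", exportTimeout),
	)
}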


2324-2332: Exporter settings mapping is correct; be aware of validation constraints

The mapping from cfg.Metrics.Prometheus.SchemaFieldUsage.Exporter into rmetric.PrometheusSchemaFieldUsageExporter is straightforward and correct. Note that the generic exporter’s validate() requires all these numeric fields to be strictly positive, so a YAML/env override of 0 or negative values will now fail exporter creation. If that’s not desired, you may want to clamp or normalize these values earlier in config handling.

router/core/router_metrics.go (1)

135-159: Prometheus export path is lightweight; consider extra guard and hash semantics

The Prometheus-specific SchemaUsageInfo payload deliberately includes only type field metrics, operation hash/name, and schema version, which is a good fit for local metric cardinality and memory use.

Two minor points:

  • To harden against future misuse, you could early-return if m.prometheusMetricsExporter is nil, even though current callers gate on PrometheusUsageInfoTrack.
  • This path uses operationContext.sha256Hash directly instead of HashString(). Please confirm that sha256Hash is always initialized and matches the intended “operation SHA” semantics for all operation types.
router/internal/graphqlmetrics/exporter.go (4)

78-111: NewExporter wiring is correct; consider guarding nil settings

Constructor behavior is good: it enforces a non-nil sink, installs a default retryability handler, validates settings, and starts the batching goroutine with a component-scoped logger.

One robustness nit: settings is dereferenced without a nil check when creating the queue and during validation. Since this is an internal API, it’s probably fine, but you might want to either:

  • Treat nil as “use NewDefaultExporterSettings()”, or
  • Explicitly return an error like fmt.Errorf("settings cannot be nil").

This avoids surprising panics if a future caller accidentally passes nil.


113-143: Validation enforces positive values; retries disabled still require fields

validate() enforces strictly positive values for batch/queue sizes, interval, export timeout, and all retry fields. That’s good for catching misconfigurations early.

Note that it requires retry fields to be positive even when RetryOptions.Enabled is false. If you later expose Enabled as a user-facing toggle, you may want to relax validation so that only enabled retries need fully-populated options.
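
A sketch of the relaxed rule, with illustrative settings types (the real ExporterSettings/RetryOptions may differ):

package graphqlmetrics

import (
	"fmt"
	"time"
)

type retryOptions struct {
	Enabled     bool
	MaxDuration time.Duration
	Interval    time.Duration
	MaxAttempts int
}

type settings struct {
	BatchSize     int
	QueueSize     int
	Interval      time.Duration
	ExportTimeout time.Duration
	Retry         retryOptions
}

// validate only requires retry fields to be positive when retries are
// actually enabled, per the suggestion above.
func (s *settings) validate() error {
	if s.BatchSize <= 0 || s.QueueSize <= 0 || s.Interval <= 0 || s.ExportTimeout <= 0 {
		return fmt.Errorf("batch size, queue size, interval, and export timeout must be positive")
	}
	if s.Retry.Enabled && (s.Retry.MaxDuration <= 0 || s.Retry.Interval <= 0 || s.Retry.MaxAttempts <= 0) {
		return fmt.Errorf("retry options must be positive when retries are enabled")
	}
	return nil
}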


179-282: Batch export and retry logic are reasonable; consider respecting shutdown context for backoff

exportBatch wraps the shared export context with a per-call timeout and delegates to the sink, logging failures at debug. exportBatchWithRetry adds bounded exponential backoff with:

  • Retryability decided by isRetryableError.
  • Max duration, interval, and max attempts from settings.
  • Clear logs on non-retryable errors and after exhausting retries.

One potential refinement: the retry loop currently sleeps with time.Sleep and doesn’t inspect any shutdown signal or context, so a stubbornly failing sink can delay shutdown up to the cumulative backoff time. If you ever need faster shutdowns under failure, you could periodically check e.exportRequestContext.Done() or a dedicated shutdown flag inside the retry loop to abort early.
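
A minimal sketch of a context-aware backoff wait that could replace the bare time.Sleep (names are illustrative); the retry loop would call it with the exporter's context and stop retrying when it returns false:

package graphqlmetrics

import (
	"context"
	"time"
)

// sleepOrDone waits for the backoff interval but aborts immediately if the
// exporter's shutdown/export context is cancelled. Returns false on abort.
func sleepOrDone(ctx context.Context, d time.Duration) bool {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-timer.C:
		return true
	case <-ctx.Done():
		return false
	}
}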


344-381: Shutdown sequence is mostly robust; document single-use expectation

Shutdown:

  • Stops accepting traffic, signals the batching goroutine to drain and exit, and then polls inflightBatches until zero or context cancel.
  • Defers cancellation of export contexts and sink closure, with logs on errors and on completion.

Two small caveats:

  • Shutdown is not idempotent: calling it twice will panic on the second close of acceptTrafficSema/shutdownSignal. It would help to either document “must only be called once” or add a guard (e.g., sync.Once).
  • Synchronous Record(..., true) calls are not tracked in inflightBatches, so they won’t delay shutdown waiting; they may still race with sink.Close if invoked concurrently. That’s likely acceptable for metrics but worth keeping in mind.

Overall, the shutdown strategy is consistent with a best-effort metrics pipeline.
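
A sketch of the sync.Once guard (field names are illustrative):

package graphqlmetrics

import "sync"

// exporterShutdown shows how a sync.Once makes Shutdown idempotent: only the
// first call closes the channels; subsequent calls are no-ops.
type exporterShutdown struct {
	once              sync.Once
	acceptTrafficSema chan struct{}
	shutdownSignal    chan struct{}
}

func (e *exporterShutdown) shutdown() {
	e.once.Do(func() {
		close(e.acceptTrafficSema)
		close(e.shutdownSignal)
	})
}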

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac085ab and bcddb08.

📒 Files selected for processing (17)
  • router-tests/prometheus_improved_test.go (16 hunks)
  • router-tests/testenv/testenv.go (3 hunks)
  • router/core/graph_server.go (4 hunks)
  • router/core/operation_metrics.go (3 hunks)
  • router/core/router.go (3 hunks)
  • router/core/router_config.go (1 hunks)
  • router/core/router_metrics.go (4 hunks)
  • router/internal/graphqlmetrics/exporter.go (5 hunks)
  • router/internal/graphqlmetrics/exporter_test.go (7 hunks)
  • router/internal/graphqlmetrics/graphql_exporter.go (1 hunks)
  • router/internal/graphqlmetrics/graphql_metrics_sink.go (1 hunks)
  • router/internal/graphqlmetrics/prometheus_exporter.go (1 hunks)
  • router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
  • router/internal/graphqlmetrics/sink.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/metric/config.go (2 hunks)
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/internal/graphqlmetrics/prometheus_exporter.go
  • router/core/router_config.go
  • router/core/router_metrics.go
  • router/core/graph_server.go
  • router/core/router.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router-tests/testenv/testenv.go
  • router-tests/prometheus_improved_test.go
  • router/internal/graphqlmetrics/exporter_test.go
  • router/core/operation_metrics.go
📚 Learning: 2025-09-29T10:28:07.122Z
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2192
File: router/pkg/config/config.go:1028-1029
Timestamp: 2025-09-29T10:28:07.122Z
Learning: The deprecation strategy for IntrospectionEnabled field in router/pkg/config/config.go is to first mark it as deprecated, then migrate all call sites to use IntrospectionConfig.Enabled, and finally remove the deprecated field. The envDefault tag should be kept during the deprecation period for backward compatibility.

Applied to files:

  • router-tests/testenv/testenv.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/router_config.go
  • router/core/graph_server.go
  • router/core/router.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/core/router_config.go
  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/core/router_config.go
  • router/core/router_metrics.go
  • router/core/graph_server.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/prometheus_improved_test.go
  • router/core/graph_server.go
  • router/core/router.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/config.schema.json
📚 Learning: 2025-07-21T15:06:36.664Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2067
File: router/pkg/config/config.schema.json:1637-1644
Timestamp: 2025-07-21T15:06:36.664Z
Learning: In the Cosmo router project, when extending JSON schema validation for security-sensitive fields like JWKS secrets, backwards compatibility is maintained by implementing warnings in the Go code rather than hard validation constraints in the schema. This allows existing configurations to continue working while alerting users to potential security issues.

Applied to files:

  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/operation_metrics.go
  • router/core/router_metrics.go
  • router/core/graph_server.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.

Applied to files:

  • router/core/operation_metrics.go
  • router/pkg/metric/config.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • router/core/graph_server.go
🧬 Code graph analysis (15)
router/internal/graphqlmetrics/graphql_exporter.go (2)
router/internal/graphqlmetrics/exporter.go (3)
  • Exporter (16-31)
  • ExporterSettings (50-61)
  • NewExporter (82-111)
router/internal/graphqlmetrics/graphql_metrics_sink.go (3)
  • NewGraphQLMetricsSink (30-36)
  • GraphQLMetricsSinkConfig (23-27)
  • IsRetryableError (72-91)
router/internal/graphqlmetrics/prometheus_exporter.go (3)
router/internal/graphqlmetrics/exporter.go (3)
  • Exporter (16-31)
  • ExporterSettings (50-61)
  • NewExporter (82-111)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/internal/graphqlmetrics/prometheus_sink.go (2)
  • NewPrometheusSink (41-47)
  • PrometheusSinkConfig (25-29)
router-tests/testenv/testenv.go (2)
router/pkg/config/config.go (2)
  • PrometheusSchemaFieldUsageExporter (121-126)
  • PrometheusSchemaFieldUsage (115-119)
router/pkg/metric/config.go (2)
  • PrometheusSchemaFieldUsageExporter (48-53)
  • PrometheusSchemaFieldUsage (42-46)
router/pkg/config/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
  • TypeFieldMetrics (28-28)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1)
router/internal/graphqlmetrics/aggregate.go (1)
  • AggregateSchemaUsageInfoBatch (9-27)
router/core/router_config.go (2)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
  • Sink (9-18)
  • SinkErrorHandler (23-23)
router-tests/prometheus_improved_test.go (4)
router-tests/testenv/testenv.go (3)
  • Run (105-122)
  • PrometheusSchemaFieldUsageExporter (285-290)
  • Environment (1763-1799)
router/pkg/config/config.go (1)
  • PrometheusSchemaFieldUsageExporter (121-126)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/pkg/otel/attributes.go (2)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
router/internal/graphqlmetrics/exporter_test.go (1)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (19-39)
router/core/operation_metrics.go (1)
router/core/router_metrics.go (1)
  • RouterMetrics (14-21)
router/core/router_metrics.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/core/context.go (4)
  • OperationType (503-503)
  • OperationTypeQuery (506-506)
  • OperationTypeMutation (507-507)
  • OperationTypeSubscription (508-508)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (16)
  • OperationType (26-26)
  • OperationType (58-60)
  • OperationType (62-64)
  • OperationType (71-73)
  • OperationType_QUERY (29-29)
  • OperationType_MUTATION (30-30)
  • OperationType_SUBSCRIPTION (31-31)
  • SchemaUsageInfo (130-153)
  • SchemaUsageInfo (168-168)
  • SchemaUsageInfo (183-185)
  • OperationInfo (362-373)
  • OperationInfo (388-388)
  • OperationInfo (403-405)
  • SchemaInfo (428-435)
  • SchemaInfo (450-450)
  • SchemaInfo (465-467)
router/pkg/metric/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/config/config.go (1)
  • PrometheusSchemaFieldUsageExporter (121-126)
router/core/graph_server.go (4)
router/internal/graphqlmetrics/prometheus_exporter.go (2)
  • PrometheusMetricsExporter (13-15)
  • NewPrometheusMetricsExporter (19-45)
router/pkg/config/config.go (1)
  • Prometheus (99-113)
router/internal/graphqlmetrics/exporter.go (3)
  • ExporterSettings (50-61)
  • Exporter (16-31)
  • RetryOptions (33-38)
router/core/router_metrics.go (1)
  • NewRouterMetrics (43-52)
router/core/router.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (19-39)
router/pkg/config/config.go (3)
  • Prometheus (99-113)
  • PrometheusSchemaFieldUsageExporter (121-126)
  • Metrics (137-142)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: image_scan
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
🔇 Additional comments (19)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1)

40-61: Export path and aggregation look correct

Batch empty-check, aggregation via AggregateSchemaUsageInfoBatch, auth header setup, and debug logging are all straightforward and concurrency-safe for this sink implementation. No functional issues from this diff.

router/core/router_config.go (1)

37-38: New exporter fields fit the updated metrics wiring

Adding gqlMetricsExporter and prometheusMetricsExporter to Config keeps the router’s metrics wiring explicit and type‑safe; no issues with this change.

router/internal/graphqlmetrics/sink.go (1)

7-23: Sink[T] abstraction is clear and compatible with the exporter design

The generic Sink[T] interface and SinkErrorHandler are minimal, well‑documented, and match the batch‑export semantics used by the generic exporter. No changes needed here.
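
For orientation, a plausible shape of that contract; the real definitions live in router/internal/graphqlmetrics/sink.go and may differ in detail, so treat this as a sketch:

package graphqlmetrics

import "context"

// Sink consumes batches produced by the generic exporter; the exporter owns
// batching, retries, and timeouts around Export.
type Sink[T any] interface {
	Export(ctx context.Context, batch []T) error
	Close() error
}

// SinkErrorHandler classifies export errors, e.g. deciding whether a failed
// batch should be retried. The exact signature is an assumption.
type SinkErrorHandler func(err error) bool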

router/pkg/metric/config.go (1)

42-53: Exporter initialization verified; no changes needed.

Verification confirms that all constructions of PrometheusSchemaFieldUsage with the Exporter field properly handle initialization:

  1. Production (router.go:2324): Reads all Exporter fields from validated config source
  2. Tests (testenv.go:1509-1530): Explicitly provides test-friendly defaults when enabled (BatchSize: 100, QueueSize: 1000, Interval: 100ms, ExportTimeout set)

No manual constructions with uninitialized Exporter fields exist in the codebase. The code is correct as-is.

router/internal/graphqlmetrics/prometheus_exporter.go (1)

31-41: LGTM on exporter wrapper.

Constructor and delegation are clear; disabling retries for local Prom sink is appropriate.

Also applies to: 47-58

router/internal/graphqlmetrics/graphql_exporter.go (1)

31-39: LGTM on GraphQL exporter wrapper.

API is minimal and mirrors the generic exporter cleanly.

Also applies to: 41-52

router-tests/prometheus_improved_test.go (1)

351-358: Review comment is incorrect; no changes required.

The repository targets Go 1.25 (confirmed across all go.mod files), which fully supports the for range n syntax introduced in Go 1.22. The codebase consistently uses this syntax in multiple test files (prometheus_improved_test.go, authentication_test.go, plugintest/tracing_test.go). No compatibility concern exists; the code is idiomatic for the current toolchain.

Likely an incorrect or invalid review comment.

router/internal/graphqlmetrics/prometheus_sink.go (1)

79-86: Remove this review comment; the Go version compatibility concern is invalid.

The codebase targets Go 1.25, and slices.Concat was introduced in Go 1.22, so the "newer Go" compatibility argument doesn't apply. The code is compatible with the project's Go version.

While the nil vs. empty slice suggestion ([]attribute.KeyValue{} → nil) is a minor stylistic improvement, the original review overstates the concern about Go compatibility and allocation efficiency in this context.

router/pkg/config/config.go (1)

115-126: No changes required; validation is already in place.

The PrometheusSchemaFieldUsageExporter fields are validated both at the JSON schema level and at runtime:

  • Runtime: The Exporter[T].validate() method (router/internal/graphqlmetrics/exporter.go:113–142) enforces BatchSize > 0, QueueSize > 0, Interval > 0, and ExportTimeout > 0.
  • Schema: The config.schema.json (lines 1252–1278) defines minimum: 1 for batch_size/queue_size and duration: {minimum: "1s"} for interval/export_timeout.

Both layers prevent misconfiguration, so no additional validation is needed.

router/core/router.go (1)

821-839: GraphQL metrics exporter wiring looks consistent

Using NewGraphQLMetricsExporter with NewDefaultExporterSettings() cleanly adapts to the new generic exporter API; error handling and assignment into r.gqlMetricsExporter remain correct.

router/core/router_metrics.go (3)

14-21: RouterMetrics wiring for dual exporters is sound

Extending RouterMetrics and routerMetrics to carry both the GraphQL and Prometheus exporters, plus an explicit exportEnabled flag for cloud usage, cleanly separates concerns. NewRouterMetrics correctly threads these through, and StartOperation’s PrometheusUsageInfoTrack: m.prometheusMetricsExporter != nil gives a cheap, stable switch for local Prometheus tracking.

Also applies to: 26-41, 44-51


57-67: StartOperation gating correctly enables Prometheus usage tracking

Passing TrackUsageInfo and PrometheusUsageInfoTrack into OperationMetricsOptions ensures:

  • Cloud GraphQL usage export is still controlled centrally via exportEnabled.
  • Prometheus schema usage export is enabled solely by the presence of a Prometheus exporter.
    This matches the “track all operations” goal without coupling the two sinks.

75-81: Exporter accessors are minimal and sufficient

GqlMetricsExporter() and PrometheusMetricsExporter() simply expose the underlying exporters, matching the new Prometheus path needed by the graph mux. No additional state is leaked.

router/core/operation_metrics.go (2)

32-40: OperationMetrics option plumbing is consistent

Adding trackUsageInfo and prometheusUsageInfoTrack to OperationMetrics and threading them via OperationMetricsOptions keeps the decision about which sinks to hit local to the operation instance. The constructor correctly snapshots these booleans at operation start, avoiding any mid-request config races.

Also applies to: 91-100, 105-117


77-88: Dual export in Finish aligns with tracking goals

The new export block:

  • Respects SkipLoader and reqContext.operation == nil as before.
  • Uses trackUsageInfo to guard GraphQL (cloud) schema usage export.
  • Uses prometheusUsageInfoTrack to independently guard the Prometheus schema usage path.

This achieves “always track operations” for Prometheus while preserving existing cloud usage behavior.

router/internal/graphqlmetrics/exporter.go (4)

13-31: Generic exporter structure and defaults look solid

The generic Exporter[T] plus ExporterSettings and NewDefaultExporterSettings() give a reusable, thread-safe batching primitive. The fields (queue, inflightBatches, exportRequestContext) and defaults (batch size, queue size, interval, timeouts, retry) are reasonable and match the use cases for GraphQL and Prometheus sinks.

Also applies to: 50-76


145-177: Traffic gating and queue backpressure are correctly implemented

acceptTraffic() uses a closed channel to signal shutdown, and Record:

  • Sends synchronously via exportBatch when requested.
  • Uses a non-blocking send to a bounded queue otherwise, logging and dropping when full.

This prevents producer goroutines from blocking indefinitely on shutdown or overload. The synchronous path intentionally ignores the error (just logs inside exportBatch), which matches a “best-effort metrics” philosophy.
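
The non-blocking enqueue described here is essentially a bounded channel with select/default; a sketch under those assumptions (names are illustrative):

package graphqlmetrics

import "go.uber.org/zap"

// recorder sketches the bounded-queue producer path: record never blocks;
// when the queue is full the item is logged and dropped (best-effort).
type recorder[T any] struct {
	queue  chan T
	logger *zap.Logger
}

func (r *recorder[T]) record(item T) {
	select {
	case r.queue <- item:
		// handed off to the batching goroutine
	default:
		r.logger.Debug("metrics queue full, dropping item")
	}
}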


285-319: Batching loop and shutdown signal handling look correct

The start loop:

  • Maintains a preallocated buffer sized to BatchSize.
  • Flushes on timer ticks and when the buffer fills.
  • On shutdown, calls drainQueue(buffer) and exits cleanly.

This ensures no items are lost between tick and shutdown, and keeps the batching goroutine single-responsibility.


321-342: drainQueue correctly flushes remaining items during shutdown

drainQueue empties the queue into the buffer, sending full batches as they accumulate and finally sending any partial batch, with helpful debug logging. Given acceptTrafficSema is closed before shutdownSignal, no new async items should be enqueued during draining, so this is race-free.
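
Put together, the tick/fill/drain behavior described in the last two comments follows a standard shape; a condensed sketch, assuming flush completes synchronously so the buffer can be reused (names are illustrative):

package graphqlmetrics

import "time"

// runBatcher flushes on timer ticks, on a full buffer, and drains the queue
// on shutdown before exiting.
func runBatcher[T any](queue <-chan T, shutdown <-chan struct{}, batchSize int, interval time.Duration, flush func([]T)) {
	buffer := make([]T, 0, batchSize)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case item := <-queue:
			buffer = append(buffer, item)
			if len(buffer) == batchSize {
				flush(buffer)
				buffer = buffer[:0]
			}
		case <-ticker.C:
			if len(buffer) > 0 {
				flush(buffer)
				buffer = buffer[:0]
			}
		case <-shutdown:
			// Drain: no new async items arrive because traffic was stopped
			// before the shutdown signal.
			for {
				select {
				case item := <-queue:
					buffer = append(buffer, item)
					if len(buffer) == batchSize {
						flush(buffer)
						buffer = buffer[:0]
					}
				default:
					if len(buffer) > 0 {
						flush(buffer)
					}
					return
				}
			}
		}
	}
}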

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
router/pkg/config/config_test.go (1)

764-787: LGTM! Precedence test correctly validates file config priority.

The test correctly verifies that file-specified exporter values override environment variables, using clearly distinguishable values (1024 vs 9999, 5s vs 99s) to prove the behavior.

Optional: Consider testing the precedence behavior for all four exporter fields (BatchSize, QueueSize, Interval, ExportTimeout) for completeness, though testing two fields is sufficient to prove the precedence mechanism works.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcddb08 and 6f21eb9.

📒 Files selected for processing (1)
  • router/pkg/config/config_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/config/config_test.go
🧬 Code graph analysis (1)
router/pkg/config/config_test.go (1)
router/pkg/config/config.go (5)
  • LoadConfig (1120-1232)
  • Config (1014-1089)
  • Telemetry (157-163)
  • Metrics (137-142)
  • Prometheus (99-113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: build-router
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (3)
router/pkg/config/config_test.go (3)

689-707: LGTM! Well-structured test for exporter defaults.

The test correctly verifies that exporter configuration falls back to sensible defaults (8192/16384 for batch/queue sizes, 10s for interval/timeout) when only enabled: true is specified.


709-736: LGTM! Comprehensive test for custom file configuration.

The test correctly verifies that all exporter fields (batch_size, queue_size, interval, export_timeout) can be customized via YAML configuration.


738-762: LGTM! Proper environment variable loading test.

The test correctly verifies that exporter configuration can be loaded from environment variables using the PROMETHEUS_SCHEMA_FIELD_USAGE_EXPORTER_* prefix pattern.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0e9d36 and 3fd209b.

📒 Files selected for processing (2)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/testdata/config_defaults.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (4)
router/pkg/config/testdata/config_full.json (2)

93-102: Schema change is properly tested via fixture-based tests.

The config_full.json file is exercised by config parsing tests in router/pkg/config/config_test.go at lines 320 and 1171. These tests use goldie for fixture-based JSON comparison: they load the YAML configuration and validate the parsed output matches the config_full.json fixture exactly. Since the fixture already contains the new Exporter schema with BatchSize, QueueSize, Interval, and ExportTimeout fields, the goldie tests will automatically catch any schema mismatches. If tests are passing, the schema change is validated.


96-101: Configuration field names correctly match Go struct definition.

Verification confirms the Exporter configuration in config_full.json (lines 96-101) is properly structured:

  • JSON field names (BatchSize, QueueSize, Interval, ExportTimeout) match Go struct field names in router/pkg/config/config.go:121-126
  • Field values are correct (integers for int fields, nanoseconds for time.Duration fields)
  • Both config_full.json and config_defaults.json consistently reflect the schema change
  • Tests and production code in router/core/router.go:2327-2330 correctly use these fields
router/pkg/config/testdata/config_defaults.json (2)

63-72: Structural change from SampleRate to Exporter is sound.

The replacement of the probabilistic SampleRate sampling approach with a configurable Exporter object containing BatchSize, QueueSize, Interval, and ExportTimeout properly reflects the PR objective to track all operations deterministically using batching instead of sampling.


66-71: Testdata values are correct and consistent with config schema defaults.

Verification confirms the Exporter values in the testdata file align with the actual configuration schema. The router/pkg/config/config.go struct defines PrometheusSchemaFieldUsageExporter with envDefault tags that match exactly:

  • BatchSize: envDefault:"8192" (testdata: 8192) ✓
  • QueueSize: envDefault:"16384" (testdata: 16384) ✓
  • Interval: envDefault:"10s" (testdata: 10000000000 ns = 10s) ✓
  • ExportTimeout: envDefault:"10s" (testdata: 10000000000 ns = 10s) ✓

The testdata correctly reflects the source-of-truth defaults. Any discrepancy with the PR description likely reflects earlier proposal values that were finalized to these values in the implementation.

Comment on lines 93 to 102
"SchemaFieldUsage": {
"Enabled": true,
"IncludeOperationSha": true,
"SampleRate": 1
"Exporter": {
"BatchSize": 8192,
"QueueSize": 16384,
"Interval": 10000000000,
"ExportTimeout": 10000000000
}
}
@coderabbitai coderabbitai bot commented Nov 14, 2025

⚠️ Potential issue | 🟠 Major

Assess backward compatibility: old configs with SampleRate will now fail to parse.

Removing SampleRate and replacing it with Exporter is a breaking change. Existing user configs that specify SampleRate will fail validation/parsing. Consider documenting migration path or implementing a compatibility layer (e.g., mapping old SampleRate to new Exporter batching parameters).

If this is intentional and part of a documented breaking change, add a comment in the config or release notes. If backward compatibility is required, implement a migration/fallback mechanism.

@endigma (Member) commented:

Should we leave it in schema as a no-op with a deprecation warning or remove it?

@coderabbitai coderabbitai bot replied:

@endigma I'd recommend removing it entirely rather than keeping it as a deprecated no-op. Here's why:

  1. Fundamental semantic change: SampleRate was about probabilistic sampling (whether to record an operation), while Exporter is about batch processing (how to export recorded operations). Since sampling is completely removed and all operations are now tracked, keeping SampleRate would be misleading about what the system actually does.

  2. No meaningful mapping: There's no sensible way to translate SampleRate: 1.0 into BatchSize, QueueSize, Interval, and ExportTimeout values. A deprecated no-op would silently ignore user intent without clear guidance.

  3. Clearer migration path: A clean break forces users to understand the new model, which is important since the behavior has fundamentally changed (probabilistic sampling → guaranteed tracking with batching).

Recommendation:

  • Remove SampleRate from the schema
  • Add validation logic that detects if SampleRate is present in the config and returns a helpful error message like:
    "SampleRate has been removed. All operations are now tracked. 
     Please use Exporter configuration instead. See: [link to docs/migration guide]"
    
  • Document the breaking change in release notes with migration examples
  • Consider if this warrants a major version bump

Would you like me to search the codebase for any existing patterns for handling config migrations, or help draft the error message/migration documentation?
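
If the error-message route is taken, the detection could be a loose YAML probe run before strict parsing; a sketch assuming gopkg.in/yaml.v3 and the config path shown in this PR (the helper and its wiring are hypothetical):

package config

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// rejectRemovedSampleRate walks the raw YAML and fails fast with a migration
// hint if the removed sample_rate key is still present.
func rejectRemovedSampleRate(raw []byte) error {
	var doc map[string]any
	if err := yaml.Unmarshal(raw, &doc); err != nil {
		return err
	}
	node := doc
	for _, key := range []string{"telemetry", "metrics", "prometheus", "schema_usage"} {
		next, ok := node[key].(map[string]any)
		if !ok {
			return nil // section absent; nothing to check
		}
		node = next
	}
	if _, ok := node["sample_rate"]; ok {
		return fmt.Errorf("schema_usage.sample_rate has been removed; all operations are now tracked. Configure schema_usage.exporter instead")
	}
	return nil
}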


🧠 Learnings used
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
router/internal/graphqlmetrics/prometheus_sink.go (1)

81-86: Consider passing nil instead of an empty slice for the unused parameter.

Since attributes are being passed via WithAttributeSet option, the sliceAttr parameter is unused. Passing nil instead of []attribute.KeyValue{} would be more idiomatic and avoid an unnecessary empty slice allocation.

Apply this diff:

 		s.metricStore.MeasureSchemaFieldUsage(
 			ctx,
 			int64(count),
-			[]attribute.KeyValue{},
+			nil,
 			otelmetric.WithAttributeSet(attribute.NewSet(allAttrs...)),
 		)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3fd209b and 0200c3c.

📒 Files selected for processing (1)
  • router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
  • TypeFieldMetrics (28-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-router
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/internal/graphqlmetrics/prometheus_sink.go (1)

127-130: No change needed — parent type and field-name extraction are correct.

TypeFieldUsageInfo.TypeNames is populated from ParentTypeNames such that index 0 is the immediate parent type, and Path contains the path segments with the field name as the last element. Verified in router/pkg/graphqlschemausage/schemausage.go (ParentTypeNames -> TypeNames), the protobuf comment (graphqlmetrics/gen/proto/.../graphqlmetrics.pb.go), and tests (router/internal/graphqlmetrics/exporter_test.go, aggregate_test.go) showing examples like Path ["user","name"] with TypeNames ["User","String"].

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
router/pkg/config/config.go (1)

116-119: Back-compat: removal of sample_rate needs migration story or explicit deprecation notice.

Dropping SampleRate in favor of Exporter will break existing configs. Either:

  • accept the breaking change and document a clear migration; or
  • add a temporary compatibility path that maps sample_rate to exporter settings.

Also ensure any references to SampleRate are removed across code, tests, docs. Reusing the schema’s deprecation mechanism can help guide users.

#!/bin/bash
# Find lingering references (code, tests, docs)
rg -n -C2 -S -i '\bsample_rate\b|SampleRate|schema_usage\.sample_rate|PROMETHEUS_SCHEMA_FIELD_USAGE_SAMPLE_RATE'
🧹 Nitpick comments (2)
router/pkg/config/config.go (1)

121-126: Type naming duplication and defaults: consider renaming and confirm queue_size default.

  • PrometheusSchemaFieldUsageExporter is also declared in router/pkg/metric/config.go. Keeping the same name in two packages is legal but confusing. Consider renaming this one to PrometheusSchemaFieldUsageExporterConfig or PrometheusSchemaFieldUsageSettings to reduce ambiguity.
  • Default queue_size here is 12800, while the PR example shows 10240. Please reconcile or update examples to avoid drift.
router/pkg/config/config.schema.json (1)

1249-1281: Harden exporter schema: disallow unknown keys and (optionally) bound sizes.

Add additionalProperties: false under schema_usage.exporter to catch typos (e.g., “batchsize”). Optionally add conservative maximums to bound memory use.

-                    "exporter": {
-                      "type": "object",
-                      "description": "Configuration for the schema usage exporter",
+                    "exporter": {
+                      "type": "object",
+                      "description": "Configuration for the schema usage exporter",
+                      "additionalProperties": false,
                       "properties": {
                         "batch_size": {
                           "type": "integer",
                           "default": 4096,
                           "minimum": 1,
+                          "maximum": 1048576,
                           "description": "The maximum number of schema usage items to be applied in a single batch. The default value is 4096."
                         },
                         "queue_size": {
                           "type": "integer",
                           "default": 12800,
                           "minimum": 1,
+                          "maximum": 10485760,
                           "description": "The maximum number of schema usage items allowed in queue at a given time. The default value is 12800."
                         },
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0200c3c and 45c510b.

📒 Files selected for processing (5)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/config_test.go (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router/pkg/config/testdata/config_defaults.json
  • router/pkg/config/config_test.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/config/testdata/config_full.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/testdata/config_full.json
  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router/pkg/config/testdata/config_full.json
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/pkg/config/config.schema.json
🧬 Code graph analysis (1)
router/pkg/config/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/pkg/config/testdata/config_full.json (1)

96-101: Fixture update aligns with new exporter config.

Values and duration encodings look correct. LGTM.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45c510b and 4b706fc.

📒 Files selected for processing (2)
  • router/core/context.go (2 hunks)
  • router/core/router_metrics.go (5 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • router/core/context.go
🧬 Code graph analysis (2)
router/core/router_metrics.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/core/context.go (4)
  • OperationType (503-503)
  • OperationTypeQuery (506-506)
  • OperationTypeMutation (507-507)
  • OperationTypeSubscription (508-508)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (16)
  • OperationType (26-26)
  • OperationType (58-60)
  • OperationType (62-64)
  • OperationType (71-73)
  • OperationType_QUERY (29-29)
  • OperationType_MUTATION (30-30)
  • OperationType_SUBSCRIPTION (31-31)
  • SchemaUsageInfo (130-153)
  • SchemaUsageInfo (168-168)
  • SchemaUsageInfo (183-185)
  • OperationInfo (362-373)
  • OperationInfo (388-388)
  • OperationInfo (403-405)
  • SchemaInfo (428-435)
  • SchemaInfo (450-450)
  • SchemaInfo (465-467)
router/core/context.go (2)
router/pkg/graphqlschemausage/schemausage.go (2)
  • TypeFieldMetrics (28-28)
  • TypeFieldUsageInfo (40-48)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (3)
  • TypeFieldUsageInfo (476-493)
  • TypeFieldUsageInfo (508-508)
  • TypeFieldUsageInfo (523-525)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-router
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (8)
router/core/context.go (2)

546-549: Context for new caching field.

The addition of typeFieldUsageInfoMetrics as a cached conversion result aligns with the PR's goal to efficiently track all operations without sampling. This avoids repeated allocations when multiple exporters need the same data.


601-608: The review comment is incorrect and should be disregarded.

The premise of a race condition is based on a misunderstanding of the codebase. Here's why:

  1. operationContext is per-request: In graphql_prehandler.go line 229, a new operationContext instance is created fresh for each HTTP request: requestContext.operation = &operationContext{clientInfo: clientInfo}. This is a standard per-request context pattern.

  2. No concurrent access to the same instance: Since each request gets its own operationContext instance, different requests cannot race on the same instance's fields. The lazy initialization on lines 604-605 is only accessed by the single request handler processing that specific request.

  3. Confusion between shared data and shared context: The comment in router_metrics.go:100-102 refers to the cached plan data (typeFieldUsageInfo) being reused across requests via the planner cache. This is separate from the operationContext wrapper itself, which is always fresh per request. The two concepts are distinct.

Likely an incorrect or invalid review comment.
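
For illustration, the per-request lazy caching pattern being defended here could look like the sketch below. This is an inferred shape, not code copied from the PR: the field layout is simplified, and the IntoGraphQLMetrics conversion method is assumed from the benchmark names mentioned later in this review.

package core

import (
	graphqlmetricsv1 "github.com/wundergraph/cosmo/router/gen/proto/wg/cosmo/graphqlmetrics/v1"
	"github.com/wundergraph/cosmo/router/pkg/graphqlschemausage"
)

type operationContext struct {
	// Set at plan time; the underlying plan data may be shared across requests
	// via the planner cache.
	typeFieldUsageInfo graphqlschemausage.TypeFieldMetrics

	// Lazily cached proto conversion. No locking is needed because each HTTP
	// request gets its own operationContext instance.
	typeFieldUsageInfoMetrics []*graphqlmetricsv1.TypeFieldUsageInfo
}

func (o *operationContext) GetTypeFieldUsageInfoMetrics() []*graphqlmetricsv1.TypeFieldUsageInfo {
	if o.typeFieldUsageInfoMetrics == nil {
		// Convert once; the GraphQL and Prometheus exporters reuse the same result.
		o.typeFieldUsageInfoMetrics = o.typeFieldUsageInfo.IntoGraphQLMetrics()
	}
	return o.typeFieldUsageInfoMetrics
}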

router/core/router_metrics.go (6)

17-19: LGTM: Clean interface extension for Prometheus support.

The interface changes clearly separate GraphQL and Prometheus export paths while maintaining backward compatibility with the updated GraphQLMetricsExporter type.


26-31: LGTM: Struct fields updated to support dual export paths.

The addition of prometheusMetricsExporter alongside the updated gqlMetricsExporter type enables the router to export metrics to both GraphQL and Prometheus sinks.


43-52: LGTM: Constructor properly initializes new fields.

All new fields are correctly initialized from the config.


57-69: LGTM: Prometheus tracking enabled conditionally.

Line 64 correctly sets PrometheusUsageInfoTrack based on exporter presence, ensuring tracking overhead is only incurred when the Prometheus exporter is configured.


75-81: LGTM: Accessor methods updated.

The accessor methods correctly return the new exporter types.


83-133: LGTM: Efficient use of cached metrics conversion.

Line 109 now uses GetTypeFieldUsageInfoMetrics() which avoids repeated allocations by caching the converted metrics data. This optimization aligns with the PR's goal to track all operations efficiently.

Note: See the comment on context.go lines 601-608 regarding potential race conditions in the caching implementation.

routerConfigVersion string
logger *zap.Logger
trackUsageInfo bool
prometheusUsageInfoTrack bool
Member

Suggested change
prometheusUsageInfoTrack bool
prometheusTrackUsageInfo bool

Comment on lines +80 to 87
if m.trackUsageInfo {
m.routerMetrics.ExportSchemaUsageInfo(reqContext.operation, statusCode, reqContext.error != nil, exportSynchronous)
}

opAttrs := []attribute.KeyValue{
rotel.WgOperationName.String(reqContext.operation.name),
rotel.WgOperationType.String(reqContext.operation.opType),
// Prometheus metrics export (to local Prometheus metrics)
if m.prometheusUsageInfoTrack {
m.routerMetrics.ExportSchemaUsageInfoPrometheus(reqContext.operation, statusCode, reqContext.error != nil, exportSynchronous)
}
Member

Is there a case where we want both? Should this just be one function that has an interface implementation?
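
For context, the single-function shape could look roughly like this — a hedged sketch of the suggestion, not code from this PR; the schemaUsageExporter interface and the schemaUsageExporters field are hypothetical:

// Hypothetical: one export hook, with one implementation per backend.
type schemaUsageExporter interface {
	ExportSchemaUsage(op *operationContext, statusCode int, hasError bool, exportSynchronous bool)
}

// Both export paths collapse into a single loop over the configured backends.
func (m *OperationMetrics) exportSchemaUsage(reqContext *requestContext, statusCode int, exportSynchronous bool) {
	for _, exp := range m.schemaUsageExporters { // e.g. GraphQL and Prometheus implementations
		exp.ExportSchemaUsage(reqContext.operation, statusCode, reqContext.error != nil, exportSynchronous)
	}
}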

if !m.shouldSampleOperation() {
return
// Export schema usage info to configured exporters
if reqContext.operation != nil && !reqContext.operation.executionOptions.SkipLoader {
Member

@endigma endigma Nov 14, 2025

Suggested change
if reqContext.operation != nil && !reqContext.operation.executionOptions.SkipLoader {
if reqContext.operation != nil {

This check is why we were unable to reproduce a customer issue locally without implementing their subgraph. Is it necessary?

Contributor Author

@StarpTech StarpTech Nov 14, 2025

It ensures that we don't emit schema usage data for planning-only operations, which are used for testing.

RouterMetrics RouterMetrics
Logger *zap.Logger
TrackUsageInfo bool
PrometheusUsageInfoTrack bool
Member

Suggested change
PrometheusUsageInfoTrack bool
PrometheusTrackUsageInfo bool

ExportSchemaUsageInfo(operationContext *operationContext, statusCode int, hasError bool, exportSynchronous bool)
GqlMetricsExporter() *graphqlmetrics.Exporter
ExportSchemaUsageInfoPrometheus(operationContext *operationContext, statusCode int, hasError bool, exportSynchronous bool)
GqlMetricsExporter() *graphqlmetrics.GraphQLMetricsExporter
Member

Suggested change
GqlMetricsExporter() *graphqlmetrics.GraphQLMetricsExporter
GQLMetricsExporter() *graphqlmetrics.GraphQLMetricsExporter

https://go.dev/wiki/CodeReviewComments#initialisms

Comment on lines +9 to +18
type Sink[T any] interface {
// Export sends a batch of items to the sink destination.
// It returns an error if the export fails.
// The context can be used to cancel or timeout the export operation.
Export(ctx context.Context, batch []T) error

// Close performs any cleanup needed when shutting down the sink.
// It should block until all cleanup is complete or the context is cancelled.
Close(ctx context.Context) error
}
Member

I wonder if this should be put in pkg instead? It seems it might be useful as a generic batcher, and as far as I can tell it doesn't include anything graphqlmetrics-specific.


// Close performs any cleanup needed when shutting down the sink.
// It should block until all cleanup is complete or the context is cancelled.
Close(ctx context.Context) error
Member

Neither current sink actually has a Close method; is this necessary?
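
If Close stays in the contract, a sink without resources could satisfy it with an embedded no-op — a minimal sketch under that assumption; noopCloser is illustrative, not part of the PR:

package graphqlmetrics

import "context"

// noopCloser can be embedded by sinks that have no resources to release,
// keeping Close in the Sink contract for future sinks that do need cleanup
// (pooled clients, open files, etc.).
type noopCloser struct{}

func (noopCloser) Close(ctx context.Context) error { return nil }

// A sink with nothing to release could embed noopCloser and implement only Export.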

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/pkg/config/config.schema.json (1)

1236-1282: Lock down schema_usage to reject legacy SampleRate and unknown keys

Currently additionalProperties is not set on schema_usage, so legacy sample_rate may be silently accepted and ignored. Tighten it and keep exporter properties as-is.

-                "schema_usage": {
-                  "type": "object",
-                  "description": "Configure schema field usage metrics for Prometheus",
-                  "properties": {
+                "schema_usage": {
+                  "type": "object",
+                  "description": "Configure schema field usage metrics for Prometheus",
+                  "additionalProperties": false,
+                  "properties": {
                     "enabled": {

Optionally, add a custom pre-validation check to emit a clearer error if sample_rate is present.

♻️ Duplicate comments (1)
router/core/router_metrics.go (1)

19-19: Consider using GQLMetricsExporter for proper Go initialism formatting.

As noted in a previous review, Go's style guide recommends using all caps for initialisms like "GQL" rather than "Gql".

🧹 Nitpick comments (10)
router/internal/graphqlmetrics/exporter.go (2)

52-63: Clarify QueueSize comment

The queue holds items (T), not batches. Update the comment to avoid confusion.


382-416: Shutdown sequencing could prolong termination

Requests are canceled only in a defer, and backoff sleeps are not cancelable. The diff proposed in the critical buffer-pool comment below cancels earlier and makes sleeps cancelable to avoid long waits on shutdown.

router/pkg/config/config.schema.json (3)

1179-1185: Default mismatch: listen_addr

Schema default is "localhost:8088" but code (config.go Line 102) uses "127.0.0.1:8088" and defaults files do too. Align schema default.

-                  "default": "localhost:8088"
+                  "default": "127.0.0.1:8088"

462-466: Default + description mismatch: websocket.forward_upgrade_query_params.enabled

Schema sets default false, description says default true, and code default is true. Align to true.

-              "default": false,
-              "description": "Forward upgrade request query parameters in the extensions payload when starting a subscription on a Subgraph. The default value is true."
+              "default": true,
+              "description": "Forward upgrade request query parameters in the extensions payload when starting a subscription on a Subgraph. The default value is true."

633-637: Default mismatch: access_logs.add_stacktrace

Schema default true; code default is false (config.go Line 883). Align to false.

-          "default": true
+          "default": false
router/core/router_metrics.go (2)

140-170: Document the hash field choice.

The nil check at lines 141-143 properly addresses the defensive programming concern from the previous review. However, line 159 uses operationContext.sha256Hash (original operation hash) while the GraphQL export path uses operationContext.HashString() (normalized hash with variables).

According to the previous review analysis, this difference is intentional: Prometheus uses the original hash for operation correlation. Please add a brief comment at line 159 explaining this choice to prevent future confusion.

Apply this diff to document the hash semantics:

 	item := &graphqlmetricsv1.SchemaUsageInfo{
 		TypeFieldMetrics: operationContext.GetTypeFieldUsageInfoMetrics(),
 		OperationInfo: &graphqlmetricsv1.OperationInfo{
 			Type: opType,
-			Hash: operationContext.sha256Hash,
+			Hash: operationContext.sha256Hash, // Original operation hash for Prometheus correlation (vs normalized hash used in GraphQL export)
 			// parsed operation names are re-used across requests
 			// for that reason, we need to copy the name, or it might get corrupted
 			Name: strings.Clone(operationContext.name),
 		},

145-153: Consider extracting the repeated operation type mapping.

The operation type switch (lines 145-153) is duplicated from the GraphQL export method (lines 93-101). Extracting this into a helper function would improve maintainability.

Example refactor:

func operationTypeToProto(opType OperationType) graphqlmetricsv1.OperationType {
	switch opType {
	case OperationTypeQuery:
		return graphqlmetricsv1.OperationType_QUERY
	case OperationTypeMutation:
		return graphqlmetricsv1.OperationType_MUTATION
	case OperationTypeSubscription:
		return graphqlmetricsv1.OperationType_SUBSCRIPTION
	default:
		return graphqlmetricsv1.OperationType_QUERY
	}
}

Then use opType := operationTypeToProto(operationContext.opType) in both export methods.

router/pkg/graphqlschemausage/schemausage.go (3)

64-104: Type field visitor preallocation and path handling look correct

Preallocating typeFieldUsageInfo with a small fixed capacity and switching to pathCopy := make([]string, len(path)+1) plus copy (see the sketch after this list) avoids the extra allocation of slices.Clone(append(...)) while still guaranteeing that:

  • path == nil is handled safely (len(path) is 0, and copy is a no-op).
  • Each pathCopy is independent and only shared intentionally between the direct field entry and its IndirectInterfaceField variant.
  • Recursion passes pathCopy down and always re-allocates for children, so parent paths are never mutated.

The constants (capacity 32 and per-field path allocations) are modest and appropriate for typical query shapes; if future workloads show much larger field counts, benchmarks in this package can guide tuning, but there’s no correctness or obvious perf risk here.
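
For reference, the allocation pattern described above reduces to a helper like this (a sketch; appendPath is illustrative shorthand, not the actual visitor code):

// appendPath returns a fresh, exactly-sized slice with fieldName appended,
// without aliasing the parent's backing array. Safe for a nil path.
func appendPath(path []string, fieldName string) []string {
	pathCopy := make([]string, len(path)+1)
	copy(pathCopy, path) // copying from a nil slice copies nothing
	pathCopy[len(path)] = fieldName
	return pathCopy
}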


107-131: Argument usage slice preallocation is safe and aligns with usage pattern

Initializing visitor.usage with make([]*graphqlmetrics.ArgumentUsageInfo, 0, 16) is a straightforward allocation win for common cases and doesn’t change behavior; all appends still produce the same sequence of entries as before, and the walker lifecycle remains unchanged.

If production traces show higher argument counts per operation, you can revisit the 16 heuristic, but as-is this is a low-risk, localized improvement.


159-244: Input usage preallocation and enum handling changes preserve semantics

The new preallocations in input usage look good:

  • usage: make([]*graphqlmetrics.InputUsageInfo, 0, 16) in GetInputUsageInfo matches the typical small number of variables and reduces slice growth.
  • In traverseVariable, setting usageInfo.Path = []string{parentTypeName, fieldName} when parentTypeName != "" allocates an exact-length slice and is equivalent to the previous append-based construction, while avoiding extra capacity and potential future aliasing.
  • For enum arrays, usageInfo.EnumValues = make([]string, len(arr)) followed by index assignment is correct and avoids repeated append calls.

These changes don’t alter how appendUniqueUsage and infoEquals behave, so deduplication semantics remain the same while allocations are slightly reduced.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86480f7 and bb6d441.

📒 Files selected for processing (9)
  • router/core/router_metrics.go (5 hunks)
  • router/internal/graphqlmetrics/exporter.go (5 hunks)
  • router/internal/graphqlmetrics/exporter_bench_test.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
  • router/pkg/graphqlschemausage/schemausage.go (6 hunks)
  • router/pkg/graphqlschemausage/schemausage_bench_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/testdata/config_defaults.json
  • router/pkg/config/config.schema.json
  • router/core/router_metrics.go
  • router/pkg/config/testdata/config_full.json
  • router/pkg/graphqlschemausage/schemausage.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/pkg/config/config.schema.json
  • router/core/router_metrics.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function at lines 571-578, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the computed query hash before any APQ operations occur. There's also a test case that verifies this behavior.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router/core/router_metrics.go
  • router/pkg/config/testdata/config_full.json
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the query body before any APQ operations occur.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-12T13:50:45.964Z
Learnt from: Noroth
Repo: wundergraph/cosmo PR: 2132
File: router-plugin/plugin.go:139-146
Timestamp: 2025-08-12T13:50:45.964Z
Learning: In the Cosmo router plugin system, the plugin framework creates its own logger independently. When creating a logger in NewRouterPlugin, it's only needed for the gRPC server setup (passed via setup.GrpcServerInitOpts.Logger) and doesn't need to be assigned back to serveConfig.Logger.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/config/testdata/config_full.json
🧬 Code graph analysis (6)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
  • Sink (9-18)
  • SinkErrorHandler (23-23)
router/pkg/config/config.go (2)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (17-33)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/core/router_metrics.go (5)
router/core/operation_metrics.go (1)
  • OperationMetrics (31-40)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (10)
  • OperationType (26-26)
  • OperationType (58-60)
  • OperationType (62-64)
  • OperationType (71-73)
  • OperationType_QUERY (29-29)
  • OperationType_MUTATION (30-30)
  • OperationType_SUBSCRIPTION (31-31)
  • SchemaUsageInfo (130-153)
  • SchemaUsageInfo (168-168)
  • SchemaUsageInfo (183-185)
router/core/context.go (4)
  • OperationType (503-503)
  • OperationTypeQuery (506-506)
  • OperationTypeMutation (507-507)
  • OperationTypeSubscription (508-508)
router/internal/graphqlmetrics/exporter_bench_test.go (1)
router/internal/graphqlmetrics/exporter.go (3)
  • ExporterSettings (52-63)
  • RetryOptions (35-40)
  • NewExporter (84-120)
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)
router/pkg/graphqlschemausage/schemausage_test.go (1)
  • FakeFactory (527-529)
router/pkg/graphqlschemausage/schemausage.go (4)
  • GetTypeFieldUsageInfo (14-23)
  • GetArgumentUsageInfo (107-124)
  • GetInputUsageInfo (159-171)
  • TypeFieldMetrics (26-26)
router/pkg/graphqlschemausage/schemausage.go (1)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (9)
  • TypeFieldUsageInfo (476-493)
  • TypeFieldUsageInfo (508-508)
  • TypeFieldUsageInfo (523-525)
  • ArgumentUsageInfo (569-582)
  • ArgumentUsageInfo (597-597)
  • ArgumentUsageInfo (612-614)
  • InputUsageInfo (644-659)
  • InputUsageInfo (674-674)
  • InputUsageInfo (689-691)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: build_test
🔇 Additional comments (10)
router/internal/graphqlmetrics/exporter_bench_test.go (1)

27-66: Benchmarks cover key hot paths; LGTM

Good coverage of buffer pooling, high-throughput, batch cycling, growth, and parallel Record. Minor: time.Sleep in benchmarks can add variance, but acceptable here.

Also applies to: 68-111, 112-156, 157-197, 198-242

router/pkg/config/testdata/config_defaults.json (1)

66-71: Defaults alignment looks good

Exporter defaults match code/env defaults (4096/12800/1s/10s). LGTM.

router/pkg/config/testdata/config_full.json (1)

96-101: Config shape migration verified

Exporter object replaces SampleRate as intended; values consistent with defaults/examples. LGTM.

router/pkg/config/config.go (1)

116-119: Duplicate PrometheusSchemaFieldUsageExporter definitions are intentional.

The two PrometheusSchemaFieldUsageExporter definitions serve distinct purposes and should not be unified:

  • router/pkg/config/config.go:121 is a configuration struct with YAML and environment variable tags for unmarshaling external configuration with defaults
  • router/pkg/metric/config.go:48 is an internal runtime struct used by the metrics package to track exporter state, with no serialization concerns

This separation of concerns is intentional and appropriate. Consolidating them would couple the metric package to configuration unmarshaling logic and add unnecessary struct tags where they don't belong. The apparent duplication is justified.

Likely an incorrect or invalid review comment.

router/core/router_metrics.go (4)

3-5: Good use of standard library.

The strings package import supports the replacement of the custom strCopy implementation with strings.Clone(), which is a cleaner approach.


24-53: LGTM!

The struct fields and constructor are updated consistently. The new prometheusMetricsExporter field is properly initialized in NewRouterMetrics.


76-82: LGTM!

The getter methods correctly expose both exporters through the RouterMetrics interface.


84-138: LGTM!

The method improvements are well-executed:

  • Added defensive nil check for the exporter (lines 89-91)
  • Replaced custom strCopy with standard strings.Clone (line 122)
  • Updated to use GetTypeFieldUsageInfoMetrics() for cleaner data access
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)

106-194: Benchmarks are well-structured and focus on the right hot paths

Each benchmark:

  • Calls setupBenchmark once, then uses b.ResetTimer() and b.ReportAllocs() to keep setup out of the measured section.
  • Exercises a single concern per benchmark (GetTypeFieldUsageInfo, GetArgumentUsageInfo, GetInputUsageInfo, IntoGraphQLMetrics, and the combined end-to-end flow), with errors surfaced via b.Fatal.
  • Prevents compiler optimizations by assigning results to a blank variable.

The end-to-end benchmark intentionally reuses the precomputed plan, docs, and variables to focus on schema usage extraction rather than planning, which matches the PR’s perf focus. This all looks good and should give you stable allocation/CPU signals for regression tracking.
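
For reference, the measurement pattern described above looks roughly like the following self-contained sketch, where setupBenchmarkInput and extractUsage are placeholders standing in for the real setup and the functions under test:

package graphqlschemausage_test

import "testing"

// Placeholders for this sketch only.
func setupBenchmarkInput() []string { return []string{"id", "name", "price"} }
func extractUsage(in []string) ([]string, error) { return in, nil }

func BenchmarkSchemaUsageExtraction(b *testing.B) {
	input := setupBenchmarkInput() // expensive setup runs once, outside the timed loop

	b.ReportAllocs() // report allocations per iteration
	b.ResetTimer()   // exclude setup cost from the measurement

	var sink int
	for i := 0; i < b.N; i++ {
		out, err := extractUsage(input)
		if err != nil {
			b.Fatal(err) // surface errors without polluting the timing
		}
		sink = len(out)
	}
	_ = sink // keep the result live so the compiler cannot elide the call
}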


19-104: No issues found; original concern is unfounded.

The pattern of calling astjson.ParseBytes(op.Input.Variables) is safe and well-established. The Variables field originates as json.RawMessage (a []byte type with omitempty tag), which defaults to an empty slice rather than nil when unmarshaled. This empty slice is valid JSON input for astjson.ParseBytes. The identical pattern already exists and is tested in schemausage_test.go (line 206) with successful results, confirming the code path is sound.

Comment on lines +24 to 27
queue chan T
inflightBatches *atomic.Int64
batchBufferPool *sync.Pool // Pool for batch slice buffers to reduce allocations


⚠️ Potential issue | 🔴 Critical

Critical: batch buffer returned to pool before export completes (data race/corruption)

Buffer is enqueued for export and immediately returned to the pool (lines 332-333, 341-342, 364-366, 371-372). The goroutine then exports a slice referencing memory that may be reused/modified, leading to corrupted batches and races. Also, the pool stores *[]T and rewraps slice headers, which is error‑prone.

Fix by:

  • Pooling []T values (not pointers).
  • Returning buffers to the pool only after export finishes (inside the export goroutine).
  • Making retry backoff sleep cancelable and cancel export requests earlier during Shutdown.

Apply this diff:

@@
-    "sync"
+    "sync"
@@
-    batchBufferPool   *sync.Pool // Pool for batch slice buffers to reduce allocations
+    batchBufferPool   *sync.Pool // Pool of []T to reduce allocations
@@
-        batchBufferPool: &sync.Pool{
-            New: func() any {
-                // Pre-allocate slice with batch size capacity
-                buffer := make([]T, 0, settings.BatchSize)
-                return &buffer
-            },
-        },
+        batchBufferPool: &sync.Pool{
+            New: func() any {
+                // Pre-allocate slice with batch size capacity
+                return make([]T, 0, settings.BatchSize)
+            },
+        },
@@
-func (e *Exporter[T]) getBatchBuffer() []T {
-    bufferPtr := e.batchBufferPool.Get().(*[]T)
-    buffer := *bufferPtr
-    // Ensure the buffer is empty (should already be, but be defensive)
-    return buffer[:0]
-}
+func (e *Exporter[T]) getBatchBuffer() []T {
+    bufAny := e.batchBufferPool.Get()
+    if bufAny == nil {
+        return make([]T, 0, e.settings.BatchSize)
+    }
+    buf := bufAny.([]T)
+    return buf[:0]
+}
@@
-func (e *Exporter[T]) putBatchBuffer(buffer []T) {
-    // Reset the slice to zero length while keeping capacity
-    buffer = buffer[:0]
-    e.batchBufferPool.Put(&buffer)
-}
+func (e *Exporter[T]) putBatchBuffer(buf []T) {
+    e.batchBufferPool.Put(buf[:0])
+}
@@
 func (e *Exporter[T]) prepareAndSendBatch(batch []T) {
     e.logger.Debug("Preparing to send batch", zap.Int("batch_size", len(batch)))
     e.inflightBatches.Inc()
-    go func() {
+    go func(b []T) {
         defer e.inflightBatches.Dec()
-        e.exportBatchWithRetry(batch)
-    }()
+        defer e.putBatchBuffer(b) // return buffer only after export completes
+        e.exportBatchWithRetry(b)
+    }(batch)
 }
@@
-        case <-ticker.C:
+        case <-ticker.C:
             e.logger.Debug("Tick: flushing buffer", zap.Int("buffer_size", len(buffer)))
             if len(buffer) > 0 {
-                e.prepareAndSendBatch(buffer)
-                // Return buffer to pool after sending
-                e.putBatchBuffer(buffer)
+                e.prepareAndSendBatch(buffer) // buffer will be returned by goroutine
                 buffer = nil
             }
@@
             if len(buffer) == e.settings.BatchSize {
                 e.logger.Debug("Buffer full, sending batch", zap.Int("batch_size", len(buffer)))
-                e.prepareAndSendBatch(buffer)
-                // Return buffer to pool after sending
-                e.putBatchBuffer(buffer)
+                e.prepareAndSendBatch(buffer) // buffer will be returned by goroutine
                 buffer = nil
             }
@@
-            if len(buffer) == e.settings.BatchSize {
-                e.prepareAndSendBatch(buffer)
-                // Return buffer to pool and get a new one
-                e.putBatchBuffer(buffer)
-                buffer = e.getBatchBuffer()
-            }
+            if len(buffer) == e.settings.BatchSize {
+                e.prepareAndSendBatch(buffer) // buffer returned by goroutine
+                buffer = e.getBatchBuffer()
+            }
@@
-            if len(buffer) > 0 {
-                e.prepareAndSendBatch(buffer)
-                // Return buffer to pool before returning
-                e.putBatchBuffer(buffer)
-            }
+            if len(buffer) > 0 {
+                e.prepareAndSendBatch(buffer) // buffer returned by goroutine
+            }

And make backoff sleeps cancelable and cancel export requests earlier during shutdown:

@@
 func (e *Exporter[T]) exportBatchWithRetry(batch []T) {
@@
-        // Wait for the specified backoff period
-        sleepDuration := b.Duration()
-        e.logger.Debug("Retrying export after backoff",
+        // Wait for the specified backoff period (cancelable)
+        sleepDuration := b.Duration()
+        e.logger.Debug("Retrying export after backoff",
             zap.Int("batch_size", len(batch)),
             zap.Int("retry", retry),
             zap.Duration("sleep", sleepDuration),
         )
-
-        time.Sleep(sleepDuration)
+        select {
+        case <-e.exportRequestContext.Done():
+            e.logger.Debug("Export cancelled during backoff")
+            return
+        case <-time.After(sleepDuration):
+        }
@@
 func (e *Exporter[T]) Shutdown(ctx context.Context) error {
     e.logger.Debug("Shutdown started")
@@
-    defer func() {
-        ticker.Stop()
-        // Cancel all export requests
-        e.cancelAllExportRequests()
+    // Cancel export requests early to unblock any long requests/backoff
+    e.cancelAllExportRequests()
+    defer func() {
+        ticker.Stop()
         // Close the sink
         if err := e.sink.Close(ctx); err != nil {
             e.logger.Error("Error closing sink", zap.Error(err))
         }
         e.logger.Debug("Shutdown complete")
     }()

Minor nit: the comment for QueueSize says “number of batches” but the queue holds items. Consider updating it for clarity.

Also applies to: 96-114, 319-344, 352-377

🤖 Prompt for AI Agents
router/internal/graphqlmetrics/exporter.go around lines 24-27 and the other
affected ranges (96-114, 319-344, 352-377): the current implementation returns
*[]T buffers to the sync.Pool immediately after enqueueing, which allows the
buffer memory to be reused while an export goroutine is still referencing it,
and the pool stores pointers to slices which rewraps slice headers unsafely.
Change the pool to hold []T (not *[]T), stop returning buffers to the pool
before export completes, and move the buffer Put(...) into the export goroutine
after the export attempt (including after retries) so the goroutine owns the
slice lifetime; make any retry backoff sleeps cancelable by using a select on
ctx.Done() and a time.After or time.NewTimer channel so sleeps abort immediately
on Shutdown; ensure Shutdown cancels in-flight export requests earlier by
cancelling the export context (or closing a done channel) before waiting for
goroutines, and update the QueueSize comment to accurately say it’s the queue
capacity in items (not "number of batches"). Ensure all places that currently
Put a *[]T are updated to Put a []T and that enqueued values are copies of the
slice header or newly allocated slices if needed so the pool semantics are safe.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/graphqlmetrics/exporter_test.go (1)

70-71: Minor: prefer require.NoError for errors

Switch to require.NoError for clarity. Applies to all similar occurrences.

- require.Nil(t, err)
+ require.NoError(t, err)

Also applies to: 149-150, 188-189, 222-223, 345-346, 421-422

♻️ Duplicate comments (3)
router/internal/graphqlmetrics/exporter.go (3)

56-57: Doc nit: QueueSize counts items, not batches

The queue holds items (chan T), not batches. Update comment for accuracy.

- // QueueSize is the maximum number of batches allowed in queue at a given time.
+ // QueueSize is the maximum number of items allowed in the queue at a given time.

272-290: Backoff sleep not cancelable; goroutines may linger on shutdown

Make backoff waits respect e.exportRequestContext so Shutdown can exit promptly.

-    // Wait for the specified backoff period
-    sleepDuration := b.Duration()
-    e.logger.Debug("Retrying export after backoff",
+    // Wait for the specified backoff period (cancelable)
+    sleepDuration := b.Duration()
+    e.logger.Debug("Retrying export after backoff",
       zap.Int("batch_size", len(batch)),
       zap.Int("retry", retry),
       zap.Duration("sleep", sleepDuration),
     )
-
-    time.Sleep(sleepDuration)
+    select {
+    case <-e.exportRequestContext.Done():
+      e.logger.Debug("Export cancelled during backoff")
+      return
+    case <-time.After(sleepDuration):
+    }

381-394: Shutdown can hang; cancel exports earlier

Cancel export requests before waiting, not in defer, so sleeping/blocked exports exit promptly.

 func (e *Exporter[T]) Shutdown(ctx context.Context) error {
   e.logger.Debug("Shutdown started")
 
   ticker := time.NewTicker(time.Millisecond * 100)
-  defer func() {
-    ticker.Stop()
-    // Cancel all export requests
-    e.cancelAllExportRequests()
-    // Close the sink
-    if err := e.sink.Close(ctx); err != nil {
-      e.logger.Error("Error closing sink", zap.Error(err))
-    }
-    e.logger.Debug("Shutdown complete")
-  }()
+  defer func() {
+    ticker.Stop()
+    // Close the sink
+    if err := e.sink.Close(ctx); err != nil {
+      e.logger.Error("Error closing sink", zap.Error(err))
+    }
+    e.logger.Debug("Shutdown complete")
+  }()
 
   // First close the acceptTrafficSema to stop accepting new items
   close(e.acceptTrafficSema)
   // Then trigger the shutdown signal for the exporter goroutine to stop
   // It will drain the queue and send the remaining items
   close(e.shutdownSignal)
+  // Cancel all export requests to unblock backoff/sinks started before shutdown
+  e.cancelAllExportRequests()
🧹 Nitpick comments (5)
router/internal/graphqlmetrics/prometheus_sink.go (1)

104-106: Use int64 for counts to match metric API and avoid unnecessary casts

Store counts as int64; remove per-iteration int conversions.

-func (s *PrometheusSink) aggregateBatch(batch []*graphqlmetrics.SchemaUsageInfo) map[aggregatedUsageKey]int {
-	aggregatedCounts := make(map[aggregatedUsageKey]int)
+func (s *PrometheusSink) aggregateBatch(batch []*graphqlmetrics.SchemaUsageInfo) map[aggregatedUsageKey]int64 {
+	aggregatedCounts := make(map[aggregatedUsageKey]int64)
@@
-			// Increment count, using field.Count if available, otherwise 1
-			if field.Count > 0 {
-				aggregatedCounts[key] += int(field.Count)
-			} else {
-				aggregatedCounts[key]++
-			}
+			// Increment count, using field.Count if available, otherwise 1
+			if field.Count > 0 {
+				aggregatedCounts[key] += int64(field.Count)
+			} else {
+				aggregatedCounts[key] += 1
+			}

And in Export:

-	for key, count := range aggregatedCounts {
+	for key, count := range aggregatedCounts {
@@
-			int64(count),
+			count,

Also applies to: 140-145, 81-87

router/demo.config.yaml (1)

16-25: LGTM: concise demo of Prometheus schema_usage

Sample config is coherent. Optional: add a comment noting include_operation_sha increases label cardinality.

router/internal/graphqlmetrics/exporter_test.go (1)

385-391: Reduce flakiness: prefer Eventually over fixed Sleep

Replace fixed sleep with Eventually to wait for flush deterministically.

- time.Sleep(200 * time.Millisecond)
- 
- defer require.Nil(t, e.Shutdown(context.Background()))
- 
- require.Equal(t, 1, len(c.publishedAggregations))
- require.Equal(t, 5, len(c.publishedAggregations[0]))
+ require.Eventually(t, func() bool {
+   c.mu.Lock()
+   defer c.mu.Unlock()
+   return len(c.publishedAggregations) == 1 && len(c.publishedAggregations[0]) == 5
+ }, 2*time.Second, 50*time.Millisecond)
+ defer require.NoError(t, e.Shutdown(context.Background()))
router/pkg/config/config.schema.json (1)

1249-1281: Tighten exporter schema and fix unit doc

Disallow unknown keys and include 'ms' in interval docs to match usage.

-                    "exporter": {
-                      "type": "object",
-                      "description": "Configuration for the schema usage exporter",
-                      "properties": {
+                    "exporter": {
+                      "type": "object",
+                      "description": "Configuration for the schema usage exporter",
+                      "additionalProperties": false,
+                      "properties": {
@@
-                        "interval": {
+                        "interval": {
                           "type": "string",
                           "default": "2s",
                           "duration": {
                             "minimum": "100ms"
                           },
-                          "description": "The interval at which the schema usage queue is flushed. The period is specified as a string with a number and a unit, e.g. 10s, 1m. The supported units are 's', 'm', 'h'."
+                          "description": "The interval at which the schema usage queue is flushed. Specify a duration, e.g. 100ms, 10s, 1m. Supported units: 'ms', 's', 'm', 'h'."
                         },
router/internal/graphqlmetrics/exporter.go (1)

107-114: Prefer pooling []T (not *[]T) to simplify and avoid pointer indirection

Safer and simpler: pool raw slices; ownership already fixed by deferring put in goroutine.

-    batchBufferPool: &sync.Pool{
-      New: func() any {
-        // Pre-allocate slice with batch size capacity
-        buffer := make([]T, 0, settings.BatchSize)
-        return &buffer
-      },
-    },
+    batchBufferPool: &sync.Pool{
+      New: func() any {
+        return make([]T, 0, settings.BatchSize)
+      },
+    },
@@
-func (e *Exporter[T]) getBatchBuffer() []T {
-  bufferPtr := e.batchBufferPool.Get().(*[]T)
-  buffer := *bufferPtr
-  // Ensure the buffer is empty (should already be, but be defensive)
-  return buffer[:0]
-}
+func (e *Exporter[T]) getBatchBuffer() []T {
+  bufAny := e.batchBufferPool.Get()
+  if bufAny == nil {
+    return make([]T, 0, e.settings.BatchSize)
+  }
+  buf := bufAny.([]T)
+  return buf[:0]
+}
@@
-func (e *Exporter[T]) putBatchBuffer(buffer []T) {
-  // Reset the slice to zero length while keeping capacity
-  buffer = buffer[:0]
-  e.batchBufferPool.Put(&buffer)
-}
+func (e *Exporter[T]) putBatchBuffer(buf []T) {
+  e.batchBufferPool.Put(buf[:0])
+}

Also applies to: 154-161, 163-169

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb6d441 and 841d22a.

📒 Files selected for processing (10)
  • router/demo.config.yaml (2 hunks)
  • router/internal/graphqlmetrics/exporter.go (5 hunks)
  • router/internal/graphqlmetrics/exporter_test.go (10 hunks)
  • router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/config_test.go (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
  • router/pkg/graphqlschemausage/schemausage.go (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • router/pkg/config/config_test.go
  • router/pkg/graphqlschemausage/schemausage.go
  • router/pkg/config/testdata/config_full.json
  • router/pkg/config/testdata/config_defaults.json
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/internal/graphqlmetrics/exporter_test.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/config.schema.json
🧬 Code graph analysis (4)
router/internal/graphqlmetrics/exporter_test.go (2)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (19-39)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (3)
  • RequestInfo (75-82)
  • RequestInfo (97-97)
  • RequestInfo (112-114)
router/pkg/config/config.go (2)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
  • TypeFieldMetrics (26-26)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
  • Sink (9-18)
  • SinkErrorHandler (23-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/pkg/config/config.go (1)

116-119: LGTM: exporter config shape and defaults

Exporter block and env tags look consistent with schema and intended wiring.

Also applies to: 121-126

router/internal/graphqlmetrics/prometheus_sink.go (1)

121-131: Avoid potential nil deref on field entries

TypeFieldMetrics is a slice of pointers; a nil element will panic on Path/TypeNames access.

-		for _, field := range usageInfo.TypeFieldMetrics {
+		for _, field := range usageInfo.TypeFieldMetrics {
+			if field == nil {
+				continue
+			}
 			// Skip fields without valid parent type or path
 			if len(field.Path) == 0 || len(field.TypeNames) < 1 {
 				continue
 			}
⛔ Skipped due to learnings
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Comment on lines 139 to 149
if e.settings.RetryOptions.MaxDuration <= 0 {
return errors.New("retry max duration must be positive")
return fmt.Errorf("retry max duration must be positive")
}

if e.settings.RetryOptions.Interval <= 0 {
return errors.New("retry interval must be positive")
return fmt.Errorf("retry interval must be positive")
}

if e.settings.RetryOptions.MaxRetry <= 0 {
return errors.New("retry max retry must be positive")
return fmt.Errorf("retry max retry must be positive")
}

⚠️ Potential issue | 🟠 Major

Validation rejects zero retry values even when retries are disabled

Only enforce retry field positivity if RetryOptions.Enabled is true.

-  if e.settings.RetryOptions.MaxDuration <= 0 {
-    return fmt.Errorf("retry max duration must be positive")
-  }
-  if e.settings.RetryOptions.Interval <= 0 {
-    return fmt.Errorf("retry interval must be positive")
-  }
-  if e.settings.RetryOptions.MaxRetry <= 0 {
-    return fmt.Errorf("retry max retry must be positive")
-  }
+  if e.settings.RetryOptions.Enabled {
+    if e.settings.RetryOptions.MaxDuration <= 0 {
+      return fmt.Errorf("retry max duration must be positive")
+    }
+    if e.settings.RetryOptions.Interval <= 0 {
+      return fmt.Errorf("retry interval must be positive")
+    }
+    if e.settings.RetryOptions.MaxRetry <= 0 {
+      return fmt.Errorf("retry max retry must be positive")
+    }
+  }
🤖 Prompt for AI Agents
In router/internal/graphqlmetrics/exporter.go around lines 139 to 149, the
current validation unconditionally rejects zero/negative retry values; change
the checks so they only run when e.settings.RetryOptions.Enabled is true (and
ensure RetryOptions is non-nil before accessing Enabled). Concretely, wrap the
three positivity checks for MaxDuration, Interval and MaxRetry in an if
e.settings.RetryOptions != nil && e.settings.RetryOptions.Enabled { ... } block
so disabled retries may have zero values, and return the same error messages
when validation runs.

Comment on lines +41 to +45
return &PrometheusSink{
metricStore: cfg.MetricStore,
logger: cfg.Logger.With(zap.String("component", "prometheus_sink")),
includeOpSha: cfg.IncludeOpSha,
}

⚠️ Potential issue | 🟠 Major

Guard against nil logger in constructor

cfg.Logger can be nil; calling .With will panic. Default to zap.NewNop().

 func NewPrometheusSink(cfg PrometheusSinkConfig) *PrometheusSink {
-	return &PrometheusSink{
-		metricStore:  cfg.MetricStore,
-		logger:       cfg.Logger.With(zap.String("component", "prometheus_sink")),
+	lg := cfg.Logger
+	if lg == nil {
+		lg = zap.NewNop()
+	}
+	return &PrometheusSink{
+		metricStore:  cfg.MetricStore,
+		logger:       lg.With(zap.String("component", "prometheus_sink")),
 		includeOpSha: cfg.IncludeOpSha,
 	}
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In router/internal/graphqlmetrics/prometheus_sink.go around lines 41 to 45, the
constructor uses cfg.Logger.With(...) but cfg.Logger may be nil and will panic;
guard by defaulting to zap.NewNop() when cfg.Logger is nil, assign a local
logger variable like logger := cfg.Logger; if logger == nil { logger =
zap.NewNop() } and then call
logger.With(zap.String("component","prometheus_sink")) when constructing the
PrometheusSink.
