
Conversation

@StarpTech StarpTech (Contributor) commented Nov 13, 2025

This PR ensures that we never drop operation usage data due to sampling. Sampling is eliminated entirely while maintaining the same CPU utilization. An increase in total allocated memory is expected. With this change, operations with low occurrence are always tracked.

I achieved this by making the gqlmetrics batch collector generic and using it for both implementations.

Fixes ENG-8486

CPU

CleanShot 2025-11-13 at 23 36 01@2x

Memory

CleanShot 2025-11-13 at 23 38 28@2x

Config

telemetry:
  metrics:
    otlp:
      enabled: false
    prometheus:
      schema_usage:
        enabled: true
        exporter:
          batch_size: 4096
          export_timeout: 10s
          interval: 10s
          queue_size: 10240

Summary by CodeRabbit

  • New Features

    • Schema field usage now tracks all GraphQL requests (sampling removed) and adds configurable exporter settings for batching, queueing, flush interval, and export timeout.
    • New pluggable metrics exporters and sinks for Prometheus and GraphQL metrics with synchronous/asynchronous recording and graceful shutdown.
    • Added cached per-operation field metrics to reduce repeated work.
  • Tests

    • Improved reliability with time-based flush waits, exporter-driven scenarios, dynamic assertions, aggregation-focused expectations, and new benchmarks for exporter throughput and buffering.


@coderabbitai coderabbitai bot commented Nov 13, 2025

Walkthrough

Replaces probabilistic SampleRate-based schema-field sampling with a configurable exporter pipeline: introduces a generic Exporter[T] and Sink[T] plus GraphQL and Prometheus sinks and exporters, wires a Prometheus exporter into the graph mux and router metrics, and updates configs and tests to use batching and time-based flushing.

Changes

• Config & JSON schema (router/pkg/config/config.go, router/pkg/config/config.schema.json, router/pkg/metric/config.go, router/pkg/config/testdata/config_defaults.json, router/pkg/config/testdata/config_full.json, router/pkg/config/config_test.go): Removed SampleRate from PrometheusSchemaFieldUsage; added a nested Exporter (PrometheusSchemaFieldUsageExporter) with BatchSize, QueueSize, Interval, and ExportTimeout; updated the JSON schema, defaults, testdata, and tests to validate exporter settings and precedence.
• Router core — wiring & lifecycle (router/core/router.go, router/core/router_config.go, router/core/graph_server.go): Replaced the metrics exporter constructor usage with the new GraphQL exporter constructor/signature; added an internal prometheusMetricsExporter on graphMux; created and wired the Prometheus exporter when PromSchemaFieldUsage is enabled; passed the exporter into RouterMetrics; and ensured Shutdown on graph mux shutdown.
• Router core — metrics & operation tracking (router/core/router_metrics.go, router/core/operation_metrics.go, router/core/context.go): Removed the sampling machinery and usageKey; replaced sampling flags with PrometheusUsageInfoTrack; added ExportSchemaUsageInfoPrometheus and a PrometheusMetricsExporter() accessor; added operationContext.GetTypeFieldUsageInfoMetrics() caching; simplified the OperationMetrics finish path to use exporter-driven export.
• Generic exporter & sink abstractions (router/internal/graphqlmetrics/exporter.go, router/internal/graphqlmetrics/sink.go): Introduced a generic Exporter[T any] with queueing, batching, retry/backoff, synchronous flush/drain, and shutdown semantics; added the Sink[T any] and SinkErrorHandler abstractions.
• GraphQL metrics exporter & sink (router/internal/graphqlmetrics/graphql_exporter.go, router/internal/graphqlmetrics/graphql_metrics_sink.go): Added GraphQLMetricsExporter, wrapping the generic exporter, and GraphQLMetricsSink, which aggregates SchemaUsageInfo and publishes via Connect RPC; constructors, Export/Close, and retry classification implemented.
• Prometheus metrics exporter & sink (router/internal/graphqlmetrics/prometheus_exporter.go, router/internal/graphqlmetrics/prometheus_sink.go): Added PrometheusMetricsExporter, wrapping the generic exporter, and PrometheusSink, which aggregates schema-usage batches and emits OpenTelemetry/Prometheus metrics; aggregation and attribute composition implemented.
• Exporter tests & benchmarks (router/internal/graphqlmetrics/exporter_test.go, router/internal/graphqlmetrics/exporter_bench_test.go): Updated tests to call NewGraphQLMetricsExporter and adapted expectations to aggregation-based publishing; added new exporter benchmarks and adjusted tests to the new RequestInfo shape and aggregation semantics.
• Prometheus integration tests & testenv (router-tests/testenv/testenv.go, router-tests/prometheus_improved_test.go): Replaced the inline SampleRate with Exporter config and test-friendly defaults; tests use custom exporter settings, add time.Sleep waits for flushes, and verify metrics by aggregated presence checks instead of fixed index assertions.
• Schema usage traversal & benchmarks (router/pkg/graphqlschemausage/schemausage.go, router/pkg/graphqlschemausage/schemausage_bench_test.go): Allocation optimizations (preallocations, manual slice copies) in the schema usage traversal; new schema-usage benchmarks added.
• Module manifest (go.mod): Dependency/import updates to accommodate the new internal/graphqlmetrics types and sinks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Focus review on:
    • Generic Exporter[T] concurrency: queue sizing, drain/shutdown ordering, synchronous flush semantics and context/cancel handling.
    • PrometheusSink aggregation correctness: grouping key, attribute composition, and accurate counts per operation/field.
    • Router/graphMux wiring and lifecycle: creation order, nil-checks, and Shutdown sequencing for the new prometheus exporter.
    • Tests using time.Sleep: potential flakiness and whether synchronous flush paths guarantee delivery during teardown.


Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 61.29%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the PR title accurately describes the main change, introducing a Prometheus sink for gqlmetrics with comprehensive operation tracking, and directly reflects the core objective of removing sampling to track all operations efficiently.

@github-actions github-actions bot commented Nov 13, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-080c190f4ab7b9bece2cdad63a810c99627074ec-nonroot

@github-actions github-actions bot commented

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-ced880bbbe6f9957e486f7deb9115c6b40f2a1bb-nonroot

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/core/graph_server.go (1)

758-775: Shutdown ordering: stop exporter before closing metric store.

Currently metricStore is shut down before prometheusMetricsExporter. If the exporter is still flushing, its writes to the metric store can fail. Shut down the exporter first, then the stores.

Apply:

- if s.metricStore != nil {
-   if aErr := s.metricStore.Shutdown(ctx); aErr != nil {
-     err = errors.Join(err, aErr)
-   }
- }
- ...
- if s.prometheusMetricsExporter != nil {
-   if aErr := s.prometheusMetricsExporter.Shutdown(ctx); aErr != nil {
-     err = errors.Join(err, aErr)
-   }
- }
+ if s.prometheusMetricsExporter != nil {
+   if aErr := s.prometheusMetricsExporter.Shutdown(ctx); aErr != nil {
+     err = errors.Join(err, aErr)
+   }
+ }
+ if s.metricStore != nil {
+   if aErr := s.metricStore.Shutdown(ctx); aErr != nil {
+     err = errors.Join(err, aErr)
+   }
+ }
🧹 Nitpick comments (14)
router/internal/graphqlmetrics/graphql_metrics_sink.go (2)

29-36: Handle potential nil logger in NewGraphQLMetricsSink

cfg.Logger.With(...) will panic if Logger is ever nil (e.g. in tests or ad-hoc usage). Consider either guarding and falling back to zap.NewNop() or documenting/enforcing a non-nil logger requirement at call sites.
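
A minimal sketch of the suggested guard, assuming the config carries a *zap.Logger (type and field names are illustrative, not the actual constructor):

package graphqlmetrics

import "go.uber.org/zap"

// Illustrative config shape; the real GraphQLMetricsSinkConfig may differ.
type sinkConfig struct {
	Logger *zap.Logger
}

type metricsSink struct {
	logger *zap.Logger
}

// newSink falls back to a no-op logger so logger.With(...) can never panic
// when callers leave Logger unset.
func newSink(cfg sinkConfig) *metricsSink {
	logger := cfg.Logger
	if logger == nil {
		logger = zap.NewNop()
	}
	return &metricsSink{
		logger: logger.With(zap.String("component", "graphql_metrics_sink")),
	}
}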


70-91: Consider tightening retry classification in IsRetryableError

Right now any non-connect.Error (including context cancellations or local/config errors) is treated as retryable. If such errors can reach this helper, consider explicitly short‑circuiting on context.Canceled / context.DeadlineExceeded or other known permanent failures in the caller to avoid pointless retries.
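
A hedged sketch of the tightened classification using connect-go error semantics; the exact set of retryable codes here is an assumption, not the sink's actual policy:

package graphqlmetrics

import (
	"context"
	"errors"

	"connectrpc.com/connect"
)

// isRetryable short-circuits on context cancellation/deadline errors before
// falling back to connect error-code classification.
func isRetryable(err error) bool {
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return false // local/permanent: retrying cannot help
	}
	var connectErr *connect.Error
	if errors.As(err, &connectErr) {
		switch connectErr.Code() {
		case connect.CodeUnavailable, connect.CodeDeadlineExceeded, connect.CodeResourceExhausted:
			return true // transient server-side conditions
		default:
			return false
		}
	}
	// Unknown error types: treated as retryable today; see note above.
	return true
}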

router/pkg/config/config.schema.json (1)

1249-1281: Exporter schema matches the new Prometheus usage exporter config

The schema_usage.exporter block (batch_size, queue_size, interval, export_timeout) and defaults are consistent and align with the new Go config type. If you want stricter validation, you could optionally add "additionalProperties": false to the exporter object to catch typos in config keys, but the current schema is functionally fine.

router/internal/graphqlmetrics/exporter_test.go (1)

369-375: Avoid fixed sleeps in tests; poll until metrics observed.

Replace time.Sleep with polling (e.g., require.Eventually) or call Shutdown before assertions to reduce flakiness.

Example:

- time.Sleep(200 * time.Millisecond)
+ require.Eventually(t, func() bool {
+   _, err := promRegistry.Gather()
+   return err == nil && len(c.publishedAggregations) > 0
+ }, 2*time.Second, 50*time.Millisecond)

Also applies to: 371-375

router-tests/testenv/testenv.go (1)

279-290: Clamp exporter settings to safe minimums in tests.

If a caller supplies zeros, normalize to sane minimums (e.g., BatchSize>=1, QueueSize>=1, Interval>0) before wiring to router metric config to prevent silent stalls.

Also applies to: 1508-1552
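
A small sketch of such a clamp, with a struct that mirrors the PrometheusSchemaFieldUsageExporter fields referenced above (the helper itself is hypothetical):

package testenv

import "time"

// exporterSettings mirrors the exporter config fields for illustration.
type exporterSettings struct {
	BatchSize     int
	QueueSize     int
	Interval      time.Duration
	ExportTimeout time.Duration
}

// clampExporterSettings normalizes zero/negative values to safe minimums so
// misconfigured tests cannot stall the exporter silently.
func clampExporterSettings(s exporterSettings) exporterSettings {
	if s.BatchSize < 1 {
		s.BatchSize = 1
	}
	if s.QueueSize < 1 {
		s.QueueSize = 1
	}
	if s.Interval <= 0 {
		s.Interval = 100 * time.Millisecond
	}
	if s.ExportTimeout <= 0 {
		s.ExportTimeout = time.Second
	}
	return s
}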

router-tests/prometheus_improved_test.go (1)

82-96: Reduce flakiness: replace fixed sleeps with Eventually/polling.

Use require.Eventually to wait for metrics flush instead of fixed sleeps. Keeps tests stable on slow CI.

Also applies to: 166-171, 229-233, 269-274, 315-321, 359-365, 413-419

router/core/graph_server.go (1)

923-949: Defaults/validation pass-through for exporter settings.

When wiring settings from cfg.Exporter, ensure they are validated/clamped (BatchSize, QueueSize, Interval, ExportTimeout) before use to prevent zero/negative configs from stalling the exporter.

router/core/router.go (2)

841-851: Prometheus schema usage: consider logging exporter settings as well

The enablement log for Prometheus schema field usage is clear. Since batching/queue behavior is now configurable, consider also logging the effective exporter settings (batch size, queue size, interval, export timeout) here to simplify debugging misconfigurations.
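
For illustration, such a log line might look like the following sketch; the parameter plumbing is assumed:

package core

import (
	"time"

	"go.uber.org/zap"
)

// logSchemaUsageEnabled logs the effective exporter settings alongside the
// enablement message, making misconfigurations visible at startup.
func logSchemaUsageEnabled(logger *zap.Logger, batchSize, queueSize int, interval, exportTimeout time.Duration) {
	logger.Info("Prometheus schema field usage tracking enabled",
		zap.Int("batch_size", batchSize),
		zap.Int("queue_size", queueSize),
		zap.Duration("interval", interval),
		zap.Duration("export_timeout", exportTimeout),
	)
}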


2324-2332: Exporter settings mapping is correct; be aware of validation constraints

The mapping from cfg.Metrics.Prometheus.SchemaFieldUsage.Exporter into rmetric.PrometheusSchemaFieldUsageExporter is straightforward and correct. Note that the generic exporter’s validate() requires all these numeric fields to be strictly positive, so a YAML/env override of 0 or negative values will now fail exporter creation. If that’s not desired, you may want to clamp or normalize these values earlier in config handling.

router/core/router_metrics.go (1)

135-159: Prometheus export path is lightweight; consider extra guard and hash semantics

The Prometheus-specific SchemaUsageInfo payload deliberately includes only type field metrics, operation hash/name, and schema version, which is a good fit for local metric cardinality and memory use.

Two minor points:

  • To harden against future misuse, you could early-return if m.prometheusMetricsExporter is nil, even though current callers gate on PrometheusUsageInfoTrack.
  • This path uses operationContext.sha256Hash directly instead of HashString(). Please confirm that sha256Hash is always initialized and matches the intended “operation SHA” semantics for all operation types.
router/internal/graphqlmetrics/exporter.go (4)

78-111: NewExporter wiring is correct; consider guarding nil settings

Constructor behavior is good: it enforces a non-nil sink, installs a default retryability handler, validates settings, and starts the batching goroutine with a component-scoped logger.

One robustness nit: settings is dereferenced without a nil check when creating the queue and during validation. Since this is an internal API, it’s probably fine, but you might want to either:

  • Treat nil as “use NewDefaultExporterSettings()”, or
  • Explicitly return an error like fmt.Errorf("settings cannot be nil").

This avoids surprising panics if a future caller accidentally passes nil.


113-143: Validation enforces positive values; retries disabled still require fields

validate() enforces strictly positive values for batch/queue sizes, interval, export timeout, and all retry fields. That’s good for catching misconfigurations early.

Note that it requires retry fields to be positive even when RetryOptions.Enabled is false. If you later expose Enabled as a user-facing toggle, you may want to relax validation so that only enabled retries need fully-populated options.
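
A sketch of the relaxed rule, with illustrative settings types (the real ExporterSettings/RetryOptions may differ):

package graphqlmetrics

import (
	"fmt"
	"time"
)

type retryOptions struct {
	Enabled     bool
	MaxDuration time.Duration
	Interval    time.Duration
	MaxAttempts int
}

type settings struct {
	BatchSize     int
	QueueSize     int
	Interval      time.Duration
	ExportTimeout time.Duration
	Retry         retryOptions
}

// validate only requires retry fields to be positive when retries are
// actually enabled, per the suggestion above.
func (s *settings) validate() error {
	if s.BatchSize <= 0 || s.QueueSize <= 0 || s.Interval <= 0 || s.ExportTimeout <= 0 {
		return fmt.Errorf("batch size, queue size, interval, and export timeout must be positive")
	}
	if s.Retry.Enabled && (s.Retry.MaxDuration <= 0 || s.Retry.Interval <= 0 || s.Retry.MaxAttempts <= 0) {
		return fmt.Errorf("retry options must be positive when retries are enabled")
	}
	return nil
}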


179-282: Batch export and retry logic are reasonable; consider respecting shutdown context for backoff

exportBatch wraps the shared export context with a per-call timeout and delegates to the sink, logging failures at debug. exportBatchWithRetry adds bounded exponential backoff with:

  • Retryability decided by isRetryableError.
  • Max duration, interval, and max attempts from settings.
  • Clear logs on non-retryable errors and after exhausting retries.

One potential refinement: the retry loop currently sleeps with time.Sleep and doesn’t inspect any shutdown signal or context, so a stubbornly failing sink can delay shutdown up to the cumulative backoff time. If you ever need faster shutdowns under failure, you could periodically check e.exportRequestContext.Done() or a dedicated shutdown flag inside the retry loop to abort early.
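
A minimal sketch of a context-aware backoff wait that could replace the bare time.Sleep (names are illustrative); the retry loop would call it with the exporter's context and stop retrying when it returns false:

package graphqlmetrics

import (
	"context"
	"time"
)

// sleepOrDone waits for the backoff interval but aborts immediately if the
// exporter's shutdown/export context is cancelled. Returns false on abort.
func sleepOrDone(ctx context.Context, d time.Duration) bool {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-timer.C:
		return true
	case <-ctx.Done():
		return false
	}
}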


344-381: Shutdown sequence is mostly robust; document single-use expectation

Shutdown:

  • Stops accepting traffic, signals the batching goroutine to drain and exit, and then polls inflightBatches until zero or context cancel.
  • Defers cancellation of export contexts and sink closure, with logs on errors and on completion.

Two small caveats:

  • Shutdown is not idempotent: calling it twice will panic on the second close of acceptTrafficSema/shutdownSignal. It would help to either document “must only be called once” or add a guard (e.g., sync.Once).
  • Synchronous Record(..., true) calls are not tracked in inflightBatches, so they won’t delay shutdown waiting; they may still race with sink.Close if invoked concurrently. That’s likely acceptable for metrics but worth keeping in mind.

Overall, the shutdown strategy is consistent with a best-effort metrics pipeline.
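
A sketch of the sync.Once guard (field names are illustrative):

package graphqlmetrics

import "sync"

// exporterShutdown shows how a sync.Once makes Shutdown idempotent: only the
// first call closes the channels; subsequent calls are no-ops.
type exporterShutdown struct {
	once              sync.Once
	acceptTrafficSema chan struct{}
	shutdownSignal    chan struct{}
}

func (e *exporterShutdown) shutdown() {
	e.once.Do(func() {
		close(e.acceptTrafficSema)
		close(e.shutdownSignal)
	})
}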

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac085ab and bcddb08.

📒 Files selected for processing (17)
  • router-tests/prometheus_improved_test.go (16 hunks)
  • router-tests/testenv/testenv.go (3 hunks)
  • router/core/graph_server.go (4 hunks)
  • router/core/operation_metrics.go (3 hunks)
  • router/core/router.go (3 hunks)
  • router/core/router_config.go (1 hunks)
  • router/core/router_metrics.go (4 hunks)
  • router/internal/graphqlmetrics/exporter.go (5 hunks)
  • router/internal/graphqlmetrics/exporter_test.go (7 hunks)
  • router/internal/graphqlmetrics/graphql_exporter.go (1 hunks)
  • router/internal/graphqlmetrics/graphql_metrics_sink.go (1 hunks)
  • router/internal/graphqlmetrics/prometheus_exporter.go (1 hunks)
  • router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
  • router/internal/graphqlmetrics/sink.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/metric/config.go (2 hunks)
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/internal/graphqlmetrics/prometheus_exporter.go
  • router/core/router_config.go
  • router/core/router_metrics.go
  • router/core/graph_server.go
  • router/core/router.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router-tests/testenv/testenv.go
  • router-tests/prometheus_improved_test.go
  • router/internal/graphqlmetrics/exporter_test.go
  • router/core/operation_metrics.go
📚 Learning: 2025-09-29T10:28:07.122Z
Learnt from: dkorittki
Repo: wundergraph/cosmo PR: 2192
File: router/pkg/config/config.go:1028-1029
Timestamp: 2025-09-29T10:28:07.122Z
Learning: The deprecation strategy for IntrospectionEnabled field in router/pkg/config/config.go is to first mark it as deprecated, then migrate all call sites to use IntrospectionConfig.Enabled, and finally remove the deprecated field. The envDefault tag should be kept during the deprecation period for backward compatibility.

Applied to files:

  • router-tests/testenv/testenv.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/router_config.go
  • router/core/graph_server.go
  • router/core/router.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/core/router_config.go
  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/core/router_config.go
  • router/core/router_metrics.go
  • router/core/graph_server.go
📚 Learning: 2025-08-20T22:13:25.222Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2157
File: router-tests/go.mod:16-16
Timestamp: 2025-08-20T22:13:25.222Z
Learning: github.com/mark3labs/mcp-go v0.38.0 has regressions and should not be used in the wundergraph/cosmo project. v0.36.0 is the stable version that should be used across router-tests and other modules.

Applied to files:

  • router-tests/prometheus_improved_test.go
  • router/core/graph_server.go
  • router/core/router.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/config.schema.json
📚 Learning: 2025-07-21T15:06:36.664Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2067
File: router/pkg/config/config.schema.json:1637-1644
Timestamp: 2025-07-21T15:06:36.664Z
Learning: In the Cosmo router project, when extending JSON schema validation for security-sensitive fields like JWKS secrets, backwards compatibility is maintained by implementing warnings in the Go code rather than hard validation constraints in the schema. This allows existing configurations to continue working while alerting users to potential security issues.

Applied to files:

  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/operation_metrics.go
  • router/core/router_metrics.go
  • router/core/graph_server.go
📚 Learning: 2025-08-26T17:19:28.178Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2167
File: router/internal/expr/retry_context.go:3-10
Timestamp: 2025-08-26T17:19:28.178Z
Learning: In the Cosmo router codebase, prefer using netErr.Timeout() checks over explicit context.DeadlineExceeded checks for timeout detection in retry logic, as Go's networking stack typically wraps context deadline exceeded errors in proper net.Error implementations.

Applied to files:

  • router/core/operation_metrics.go
  • router/pkg/metric/config.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • router/core/graph_server.go
🧬 Code graph analysis (15)
router/internal/graphqlmetrics/graphql_exporter.go (2)
router/internal/graphqlmetrics/exporter.go (3)
  • Exporter (16-31)
  • ExporterSettings (50-61)
  • NewExporter (82-111)
router/internal/graphqlmetrics/graphql_metrics_sink.go (3)
  • NewGraphQLMetricsSink (30-36)
  • GraphQLMetricsSinkConfig (23-27)
  • IsRetryableError (72-91)
router/internal/graphqlmetrics/prometheus_exporter.go (3)
router/internal/graphqlmetrics/exporter.go (3)
  • Exporter (16-31)
  • ExporterSettings (50-61)
  • NewExporter (82-111)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/internal/graphqlmetrics/prometheus_sink.go (2)
  • NewPrometheusSink (41-47)
  • PrometheusSinkConfig (25-29)
router-tests/testenv/testenv.go (2)
router/pkg/config/config.go (2)
  • PrometheusSchemaFieldUsageExporter (121-126)
  • PrometheusSchemaFieldUsage (115-119)
router/pkg/metric/config.go (2)
  • PrometheusSchemaFieldUsageExporter (48-53)
  • PrometheusSchemaFieldUsage (42-46)
router/pkg/config/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
  • TypeFieldMetrics (28-28)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1)
router/internal/graphqlmetrics/aggregate.go (1)
  • AggregateSchemaUsageInfoBatch (9-27)
router/core/router_config.go (2)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
  • Sink (9-18)
  • SinkErrorHandler (23-23)
router-tests/prometheus_improved_test.go (4)
router-tests/testenv/testenv.go (3)
  • Run (105-122)
  • PrometheusSchemaFieldUsageExporter (285-290)
  • Environment (1763-1799)
router/pkg/config/config.go (1)
  • PrometheusSchemaFieldUsageExporter (121-126)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/pkg/otel/attributes.go (2)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
router/internal/graphqlmetrics/exporter_test.go (1)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (19-39)
router/core/operation_metrics.go (1)
router/core/router_metrics.go (1)
  • RouterMetrics (14-21)
router/core/router_metrics.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/core/context.go (4)
  • OperationType (503-503)
  • OperationTypeQuery (506-506)
  • OperationTypeMutation (507-507)
  • OperationTypeSubscription (508-508)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (16)
  • OperationType (26-26)
  • OperationType (58-60)
  • OperationType (62-64)
  • OperationType (71-73)
  • OperationType_QUERY (29-29)
  • OperationType_MUTATION (30-30)
  • OperationType_SUBSCRIPTION (31-31)
  • SchemaUsageInfo (130-153)
  • SchemaUsageInfo (168-168)
  • SchemaUsageInfo (183-185)
  • OperationInfo (362-373)
  • OperationInfo (388-388)
  • OperationInfo (403-405)
  • SchemaInfo (428-435)
  • SchemaInfo (450-450)
  • SchemaInfo (465-467)
router/pkg/metric/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/pkg/config/config.go (1)
  • PrometheusSchemaFieldUsageExporter (121-126)
router/core/graph_server.go (4)
router/internal/graphqlmetrics/prometheus_exporter.go (2)
  • PrometheusMetricsExporter (13-15)
  • NewPrometheusMetricsExporter (19-45)
router/pkg/config/config.go (1)
  • Prometheus (99-113)
router/internal/graphqlmetrics/exporter.go (3)
  • ExporterSettings (50-61)
  • Exporter (16-31)
  • RetryOptions (33-38)
router/core/router_metrics.go (1)
  • NewRouterMetrics (43-52)
router/core/router.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (19-39)
router/pkg/config/config.go (3)
  • Prometheus (99-113)
  • PrometheusSchemaFieldUsageExporter (121-126)
  • Metrics (137-142)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: image_scan
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
🔇 Additional comments (19)
router/internal/graphqlmetrics/graphql_metrics_sink.go (1)

40-61: Export path and aggregation look correct

Batch empty-check, aggregation via AggregateSchemaUsageInfoBatch, auth header setup, and debug logging are all straightforward and concurrency-safe for this sink implementation. No functional issues from this diff.

router/core/router_config.go (1)

37-38: New exporter fields fit the updated metrics wiring

Adding gqlMetricsExporter and prometheusMetricsExporter to Config keeps the router’s metrics wiring explicit and type‑safe; no issues with this change.

router/internal/graphqlmetrics/sink.go (1)

7-23: Sink[T] abstraction is clear and compatible with the exporter design

The generic Sink[T] interface and SinkErrorHandler are minimal, well‑documented, and match the batch‑export semantics used by the generic exporter. No changes needed here.
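
For orientation, a plausible shape of that contract; the real definitions live in router/internal/graphqlmetrics/sink.go and may differ in detail, so treat this as a sketch:

package graphqlmetrics

import "context"

// Sink consumes batches produced by the generic exporter; the exporter owns
// batching, retries, and timeouts around Export.
type Sink[T any] interface {
	Export(ctx context.Context, batch []T) error
	Close() error
}

// SinkErrorHandler classifies export errors, e.g. deciding whether a failed
// batch should be retried. The exact signature is an assumption.
type SinkErrorHandler func(err error) bool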

router/pkg/metric/config.go (1)

42-53: Exporter initialization verified; no changes needed.

Verification confirms that all constructions of PrometheusSchemaFieldUsage with the Exporter field properly handle initialization:

  1. Production (router.go:2324): Reads all Exporter fields from validated config source
  2. Tests (testenv.go:1509-1530): Explicitly provides test-friendly defaults when enabled (BatchSize: 100, QueueSize: 1000, Interval: 100ms, ExportTimeout set)

No manual constructions with uninitialized Exporter fields exist in the codebase. The code is correct as-is.

router/internal/graphqlmetrics/prometheus_exporter.go (1)

31-41: LGTM on exporter wrapper.

Constructor and delegation are clear; disabling retries for local Prom sink is appropriate.

Also applies to: 47-58

router/internal/graphqlmetrics/graphql_exporter.go (1)

31-39: LGTM on GraphQL exporter wrapper.

API is minimal and mirrors the generic exporter cleanly.

Also applies to: 41-52

router-tests/prometheus_improved_test.go (1)

351-358: Review comment is incorrect; no changes required.

The repository targets Go 1.25 (confirmed across all go.mod files), which fully supports the for range n syntax introduced in Go 1.22. The codebase consistently uses this syntax in multiple test files (prometheus_improved_test.go, authentication_test.go, plugintest/tracing_test.go). No compatibility concern exists; the code is idiomatic for the current toolchain.

Likely an incorrect or invalid review comment.

router/internal/graphqlmetrics/prometheus_sink.go (1)

79-86: Remove this review comment; the Go version compatibility concern is invalid.

The codebase targets Go 1.25, and slices.Concat was introduced in Go 1.22, so the "newer Go" compatibility argument doesn't apply. The code is compatible with the project's Go version.

While the nil vs. empty slice suggestion ([]attribute.KeyValue{} → nil) is a minor stylistic improvement, the original review overstates the concern about Go compatibility and allocation efficiency in this context.

router/pkg/config/config.go (1)

115-126: No changes required; validation is already in place.

The PrometheusSchemaFieldUsageExporter fields are validated both at the JSON schema level and at runtime:

  • Runtime: The Exporter[T].validate() method (router/internal/graphqlmetrics/exporter.go:113–142) enforces BatchSize > 0, QueueSize > 0, Interval > 0, and ExportTimeout > 0.
  • Schema: The config.schema.json (lines 1252–1278) defines minimum: 1 for batch_size/queue_size and duration: {minimum: "1s"} for interval/export_timeout.

Both layers prevent misconfiguration, so no additional validation is needed.

router/core/router.go (1)

821-839: GraphQL metrics exporter wiring looks consistent

Using NewGraphQLMetricsExporter with NewDefaultExporterSettings() cleanly adapts to the new generic exporter API; error handling and assignment into r.gqlMetricsExporter remain correct.

router/core/router_metrics.go (3)

14-21: RouterMetrics wiring for dual exporters is sound

Extending RouterMetrics and routerMetrics to carry both the GraphQL and Prometheus exporters, plus an explicit exportEnabled flag for cloud usage, cleanly separates concerns. NewRouterMetrics correctly threads these through, and StartOperation’s PrometheusUsageInfoTrack: m.prometheusMetricsExporter != nil gives a cheap, stable switch for local Prometheus tracking.

Also applies to: 26-41, 44-51


57-67: StartOperation gating correctly enables Prometheus usage tracking

Passing TrackUsageInfo and PrometheusUsageInfoTrack into OperationMetricsOptions ensures:

  • Cloud GraphQL usage export is still controlled centrally via exportEnabled.
  • Prometheus schema usage export is enabled solely by the presence of a Prometheus exporter.
    This matches the “track all operations” goal without coupling the two sinks.

75-81: Exporter accessors are minimal and sufficient

GqlMetricsExporter() and PrometheusMetricsExporter() simply expose the underlying exporters, matching the new Prometheus path needed by the graph mux. No additional state is leaked.

router/core/operation_metrics.go (2)

32-40: OperationMetrics option plumbing is consistent

Adding trackUsageInfo and prometheusUsageInfoTrack to OperationMetrics and threading them via OperationMetricsOptions keeps the decision about which sinks to hit local to the operation instance. The constructor correctly snapshots these booleans at operation start, avoiding any mid-request config races.

Also applies to: 91-100, 105-117


77-88: Dual export in Finish aligns with tracking goals

The new export block:

  • Respects SkipLoader and reqContext.operation == nil as before.
  • Uses trackUsageInfo to guard GraphQL (cloud) schema usage export.
  • Uses prometheusUsageInfoTrack to independently guard the Prometheus schema usage path.

This achieves “always track operations” for Prometheus while preserving existing cloud usage behavior.

router/internal/graphqlmetrics/exporter.go (4)

13-31: Generic exporter structure and defaults look solid

The generic Exporter[T] plus ExporterSettings and NewDefaultExporterSettings() give a reusable, thread-safe batching primitive. The fields (queue, inflightBatches, exportRequestContext) and defaults (batch size, queue size, interval, timeouts, retry) are reasonable and match the use cases for GraphQL and Prometheus sinks.

Also applies to: 50-76


145-177: Traffic gating and queue backpressure are correctly implemented

acceptTraffic() uses a closed channel to signal shutdown, and Record:

  • Sends synchronously via exportBatch when requested.
  • Uses a non-blocking send to a bounded queue otherwise, logging and dropping when full.

This prevents producer goroutines from blocking indefinitely on shutdown or overload. The synchronous path intentionally ignores the error (just logs inside exportBatch), which matches a “best-effort metrics” philosophy.
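
The non-blocking enqueue described here is essentially a bounded channel with select/default; a sketch under those assumptions (names are illustrative):

package graphqlmetrics

import "go.uber.org/zap"

// recorder sketches the bounded-queue producer path: record never blocks;
// when the queue is full the item is logged and dropped (best-effort).
type recorder[T any] struct {
	queue  chan T
	logger *zap.Logger
}

func (r *recorder[T]) record(item T) {
	select {
	case r.queue <- item:
		// handed off to the batching goroutine
	default:
		r.logger.Debug("metrics queue full, dropping item")
	}
}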


285-319: Batching loop and shutdown signal handling look correct

The start loop:

  • Maintains a preallocated buffer sized to BatchSize.
  • Flushes on timer ticks and when the buffer fills.
  • On shutdown, calls drainQueue(buffer) and exits cleanly.

This ensures no items are lost between tick and shutdown, and keeps the batching goroutine single-responsibility.


321-342: drainQueue correctly flushes remaining items during shutdown

drainQueue empties the queue into the buffer, sending full batches as they accumulate and finally sending any partial batch, with helpful debug logging. Given acceptTrafficSema is closed before shutdownSignal, no new async items should be enqueued during draining, so this is race-free.
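
Put together, the tick/fill/drain behavior described in the last two comments follows a standard shape; a condensed sketch, assuming flush completes synchronously so the buffer can be reused (names are illustrative):

package graphqlmetrics

import "time"

// runBatcher flushes on timer ticks, on a full buffer, and drains the queue
// on shutdown before exiting.
func runBatcher[T any](queue <-chan T, shutdown <-chan struct{}, batchSize int, interval time.Duration, flush func([]T)) {
	buffer := make([]T, 0, batchSize)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case item := <-queue:
			buffer = append(buffer, item)
			if len(buffer) == batchSize {
				flush(buffer)
				buffer = buffer[:0]
			}
		case <-ticker.C:
			if len(buffer) > 0 {
				flush(buffer)
				buffer = buffer[:0]
			}
		case <-shutdown:
			// Drain: no new async items arrive because traffic was stopped
			// before the shutdown signal.
			for {
				select {
				case item := <-queue:
					buffer = append(buffer, item)
					if len(buffer) == batchSize {
						flush(buffer)
						buffer = buffer[:0]
					}
				default:
					if len(buffer) > 0 {
						flush(buffer)
					}
					return
				}
			}
		}
	}
}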

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
router/pkg/config/config_test.go (1)

764-787: LGTM! Precedence test correctly validates file config priority.

The test correctly verifies that file-specified exporter values override environment variables, using clearly distinguishable values (1024 vs 9999, 5s vs 99s) to prove the behavior.

Optional: Consider testing the precedence behavior for all four exporter fields (BatchSize, QueueSize, Interval, ExportTimeout) for completeness, though testing two fields is sufficient to prove the precedence mechanism works.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcddb08 and 6f21eb9.

📒 Files selected for processing (1)
  • router/pkg/config/config_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/config/config_test.go
🧬 Code graph analysis (1)
router/pkg/config/config_test.go (1)
router/pkg/config/config.go (5)
  • LoadConfig (1120-1232)
  • Config (1014-1089)
  • Telemetry (157-163)
  • Metrics (137-142)
  • Prometheus (99-113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: build-router
  • GitHub Check: image_scan
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (3)
router/pkg/config/config_test.go (3)

689-707: LGTM! Well-structured test for exporter defaults.

The test correctly verifies that exporter configuration falls back to sensible defaults (8192/16384 for batch/queue sizes, 10s for interval/timeout) when only enabled: true is specified.


709-736: LGTM! Comprehensive test for custom file configuration.

The test correctly verifies that all exporter fields (batch_size, queue_size, interval, export_timeout) can be customized via YAML configuration.


738-762: LGTM! Proper environment variable loading test.

The test correctly verifies that exporter configuration can be loaded from environment variables using the PROMETHEUS_SCHEMA_FIELD_USAGE_EXPORTER_* prefix pattern.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0e9d36 and 3fd209b.

📒 Files selected for processing (2)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/testdata/config_defaults.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (4)
router/pkg/config/testdata/config_full.json (2)

93-102: Schema change is properly tested via fixture-based tests.

The config_full.json file is exercised by config parsing tests in router/pkg/config/config_test.go at lines 320 and 1171. These tests use goldie for fixture-based JSON comparison: they load the YAML configuration and validate the parsed output matches the config_full.json fixture exactly. Since the fixture already contains the new Exporter schema with BatchSize, QueueSize, Interval, and ExportTimeout fields, the goldie tests will automatically catch any schema mismatches. If tests are passing, the schema change is validated.


96-101: Configuration field names correctly match Go struct definition.

Verification confirms the Exporter configuration in config_full.json (lines 96-101) is properly structured:

  • JSON field names (BatchSize, QueueSize, Interval, ExportTimeout) match Go struct field names in router/pkg/config/config.go:121-126
  • Field values are correct (integers for int fields, nanoseconds for time.Duration fields)
  • Both config_full.json and config_defaults.json consistently reflect the schema change
  • Tests and production code in router/core/router.go:2327-2330 correctly use these fields
router/pkg/config/testdata/config_defaults.json (2)

63-72: Structural change from SampleRate to Exporter is sound.

The replacement of the probabilistic SampleRate sampling approach with a configurable Exporter object containing BatchSize, QueueSize, Interval, and ExportTimeout properly reflects the PR objective to track all operations deterministically using batching instead of sampling.


66-71: Testdata values are correct and consistent with config schema defaults.

Verification confirms the Exporter values in the testdata file align with the actual configuration schema. The router/pkg/config/config.go struct defines PrometheusSchemaFieldUsageExporter with envDefault tags that match exactly:

  • BatchSize: envDefault:"8192" (testdata: 8192) ✓
  • QueueSize: envDefault:"16384" (testdata: 16384) ✓
  • Interval: envDefault:"10s" (testdata: 10000000000 ns = 10s) ✓
  • ExportTimeout: envDefault:"10s" (testdata: 10000000000 ns = 10s) ✓

The testdata correctly reflects the source-of-truth defaults. Any discrepancy with the PR description likely reflects earlier proposal values that were finalized to these values in the implementation.

Comment on lines 93 to 102
"SchemaFieldUsage": {
"Enabled": true,
"IncludeOperationSha": true,
"SampleRate": 1
"Exporter": {
"BatchSize": 8192,
"QueueSize": 16384,
"Interval": 10000000000,
"ExportTimeout": 10000000000
}
}
@coderabbitai coderabbitai bot commented Nov 14, 2025

⚠️ Potential issue | 🟠 Major

Assess backward compatibility: old configs with SampleRate will now fail to parse.

Removing SampleRate and replacing it with Exporter is a breaking change. Existing user configs that specify SampleRate will fail validation/parsing. Consider documenting migration path or implementing a compatibility layer (e.g., mapping old SampleRate to new Exporter batching parameters).

If this is intentional and part of a documented breaking change, add a comment in the config or release notes. If backward compatibility is required, implement a migration/fallback mechanism.

@endigma (Member) commented:

Should we leave it in schema as a no-op with a deprecation warning or remove it?

@coderabbitai coderabbitai bot replied:

@endigma I'd recommend removing it entirely rather than keeping it as a deprecated no-op. Here's why:

  1. Fundamental semantic change: SampleRate was about probabilistic sampling (whether to record an operation), while Exporter is about batch processing (how to export recorded operations). Since sampling is completely removed and all operations are now tracked, keeping SampleRate would be misleading about what the system actually does.

  2. No meaningful mapping: There's no sensible way to translate SampleRate: 1.0 into BatchSize, QueueSize, Interval, and ExportTimeout values. A deprecated no-op would silently ignore user intent without clear guidance.

  3. Clearer migration path: A clean break forces users to understand the new model, which is important since the behavior has fundamentally changed (probabilistic sampling → guaranteed tracking with batching).

Recommendation:

  • Remove SampleRate from the schema
  • Add validation logic that detects if SampleRate is present in the config and returns a helpful error message like:
    "SampleRate has been removed. All operations are now tracked. 
     Please use Exporter configuration instead. See: [link to docs/migration guide]"
    
  • Document the breaking change in release notes with migration examples
  • Consider if this warrants a major version bump

Would you like me to search the codebase for any existing patterns for handling config migrations, or help draft the error message/migration documentation?
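
If the error-message route is taken, the detection could be a loose YAML probe run before strict parsing; a sketch assuming gopkg.in/yaml.v3 and the config path shown in this PR (the helper and its wiring are hypothetical):

package config

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// rejectRemovedSampleRate walks the raw YAML and fails fast with a migration
// hint if the removed sample_rate key is still present.
func rejectRemovedSampleRate(raw []byte) error {
	var doc map[string]any
	if err := yaml.Unmarshal(raw, &doc); err != nil {
		return err
	}
	node := doc
	for _, key := range []string{"telemetry", "metrics", "prometheus", "schema_usage"} {
		next, ok := node[key].(map[string]any)
		if !ok {
			return nil // section absent; nothing to check
		}
		node = next
	}
	if _, ok := node["sample_rate"]; ok {
		return fmt.Errorf("schema_usage.sample_rate has been removed; all operations are now tracked. Configure schema_usage.exporter instead")
	}
	return nil
}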


🧠 Learnings used
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
router/internal/graphqlmetrics/prometheus_sink.go (1)

81-86: Consider passing nil instead of an empty slice for the unused parameter.

Since attributes are being passed via WithAttributeSet option, the sliceAttr parameter is unused. Passing nil instead of []attribute.KeyValue{} would be more idiomatic and avoid an unnecessary empty slice allocation.

Apply this diff:

 		s.metricStore.MeasureSchemaFieldUsage(
 			ctx,
 			int64(count),
-			[]attribute.KeyValue{},
+			nil,
 			otelmetric.WithAttributeSet(attribute.NewSet(allAttrs...)),
 		)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3fd209b and 0200c3c.

📒 Files selected for processing (1)
  • router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
  • TypeFieldMetrics (28-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-router
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/internal/graphqlmetrics/prometheus_sink.go (1)

127-130: No change needed — parent type and field-name extraction are correct.

TypeFieldUsageInfo.TypeNames is populated from ParentTypeNames such that index 0 is the immediate parent type, and Path contains the path segments with the field name as the last element. Verified in router/pkg/graphqlschemausage/schemausage.go (ParentTypeNames -> TypeNames), the protobuf comment (graphqlmetrics/gen/proto/.../graphqlmetrics.pb.go), and tests (router/internal/graphqlmetrics/exporter_test.go, aggregate_test.go) showing examples like Path ["user","name"] with TypeNames ["User","String"].

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
router/pkg/config/config.go (1)

116-119: Back-compat: removal of sample_rate needs migration story or explicit deprecation notice.

Dropping SampleRate in favor of Exporter will break existing configs. Either:

  • accept the breaking change and document a clear migration; or
  • add a temporary compatibility path that maps sample_rate to exporter settings.

Also ensure any references to SampleRate are removed across code, tests, docs. Reusing the schema’s deprecation mechanism can help guide users.

#!/bin/bash
# Find lingering references (code, tests, docs)
rg -n -C2 -S -i '\bsample_rate\b|SampleRate|schema_usage\.sample_rate|PROMETHEUS_SCHEMA_FIELD_USAGE_SAMPLE_RATE'
🧹 Nitpick comments (2)
router/pkg/config/config.go (1)

121-126: Type naming duplication and defaults: consider renaming and confirm queue_size default.

  • PrometheusSchemaFieldUsageExporter is also declared in router/pkg/metric/config.go. Keeping the same name in two packages is legal but confusing. Consider renaming this one to PrometheusSchemaFieldUsageExporterConfig or PrometheusSchemaFieldUsageSettings to reduce ambiguity.
  • Default queue_size here is 12800, while the PR example shows 10240. Please reconcile or update examples to avoid drift.
router/pkg/config/config.schema.json (1)

1249-1281: Harden exporter schema: disallow unknown keys and (optionally) bound sizes.

Add additionalProperties: false under schema_usage.exporter to catch typos (e.g., “batchsize”). Optionally add conservative maximums to bound memory use.

-                    "exporter": {
-                      "type": "object",
-                      "description": "Configuration for the schema usage exporter",
+                    "exporter": {
+                      "type": "object",
+                      "description": "Configuration for the schema usage exporter",
+                      "additionalProperties": false,
                       "properties": {
                         "batch_size": {
                           "type": "integer",
                           "default": 4096,
                           "minimum": 1,
+                          "maximum": 1048576,
                           "description": "The maximum number of schema usage items to be applied in a single batch. The default value is 4096."
                         },
                         "queue_size": {
                           "type": "integer",
                           "default": 12800,
                           "minimum": 1,
+                          "maximum": 10485760,
                           "description": "The maximum number of schema usage items allowed in queue at a given time. The default value is 12800."
                         },
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0200c3c and 45c510b.

📒 Files selected for processing (5)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/config_test.go (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • router/pkg/config/testdata/config_defaults.json
  • router/pkg/config/config_test.go
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/config/testdata/config_full.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/testdata/config_full.json
  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router/pkg/config/testdata/config_full.json
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/pkg/config/config.schema.json
🧬 Code graph analysis (1)
router/pkg/config/config.go (3)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (16-31)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: build_test
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
router/pkg/config/testdata/config_full.json (1)

96-101: Fixture update aligns with new exporter config.

Values and duration encodings look correct. LGTM.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45c510b and 4b706fc.

📒 Files selected for processing (2)
  • router/core/context.go (2 hunks)
  • router/core/router_metrics.go (5 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • router/core/context.go
🧬 Code graph analysis (2)
router/core/router_metrics.go (4)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/core/context.go (4)
  • OperationType (503-503)
  • OperationTypeQuery (506-506)
  • OperationTypeMutation (507-507)
  • OperationTypeSubscription (508-508)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (16)
  • OperationType (26-26)
  • OperationType (58-60)
  • OperationType (62-64)
  • OperationType (71-73)
  • OperationType_QUERY (29-29)
  • OperationType_MUTATION (30-30)
  • OperationType_SUBSCRIPTION (31-31)
  • SchemaUsageInfo (130-153)
  • SchemaUsageInfo (168-168)
  • SchemaUsageInfo (183-185)
  • OperationInfo (362-373)
  • OperationInfo (388-388)
  • OperationInfo (403-405)
  • SchemaInfo (428-435)
  • SchemaInfo (450-450)
  • SchemaInfo (465-467)
router/core/context.go (2)
router/pkg/graphqlschemausage/schemausage.go (2)
  • TypeFieldMetrics (28-28)
  • TypeFieldUsageInfo (40-48)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (3)
  • TypeFieldUsageInfo (476-493)
  • TypeFieldUsageInfo (508-508)
  • TypeFieldUsageInfo (523-525)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: build-router
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: build_push_image
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_test
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (8)
router/core/context.go (2)

546-549: Context for new caching field.

The addition of typeFieldUsageInfoMetrics as a cached conversion result aligns with the PR's goal to efficiently track all operations without sampling. This avoids repeated allocations when multiple exporters need the same data.


601-608: The review comment is incorrect and should be disregarded.

The premise of a race condition is based on a misunderstanding of the codebase. Here's why:

  1. operationContext is per-request: In graphql_prehandler.go line 229, a new operationContext instance is created fresh for each HTTP request: requestContext.operation = &operationContext{clientInfo: clientInfo}. This is a standard per-request context pattern.

  2. No concurrent access to the same instance: Since each request gets its own operationContext instance, different requests cannot race on the same instance's fields. The lazy initialization on lines 604-605 is only accessed by the single request handler processing that specific request.

  3. Confusion between shared data and shared context: The comment in router_metrics.go:100-102 refers to the cached plan data (typeFieldUsageInfo) being reused across requests via the planner cache. This is separate from the operationContext wrapper itself, which is always fresh per request. The two concepts are distinct.

Likely an incorrect or invalid review comment.
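
For illustration, the per-request lazy caching pattern being defended here could look like the sketch below. This is an inferred shape, not code copied from the PR: the field layout is simplified, and the IntoGraphQLMetrics conversion method is assumed from the benchmark names mentioned later in this review.

package core

import (
	graphqlmetricsv1 "github.com/wundergraph/cosmo/router/gen/proto/wg/cosmo/graphqlmetrics/v1"
	"github.com/wundergraph/cosmo/router/pkg/graphqlschemausage"
)

type operationContext struct {
	// Set at plan time; the underlying plan data may be shared across requests
	// via the planner cache.
	typeFieldUsageInfo graphqlschemausage.TypeFieldMetrics

	// Lazily cached proto conversion. No locking is needed because each HTTP
	// request gets its own operationContext instance.
	typeFieldUsageInfoMetrics []*graphqlmetricsv1.TypeFieldUsageInfo
}

func (o *operationContext) GetTypeFieldUsageInfoMetrics() []*graphqlmetricsv1.TypeFieldUsageInfo {
	if o.typeFieldUsageInfoMetrics == nil {
		// Convert once; the GraphQL and Prometheus exporters reuse the same result.
		o.typeFieldUsageInfoMetrics = o.typeFieldUsageInfo.IntoGraphQLMetrics()
	}
	return o.typeFieldUsageInfoMetrics
}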

router/core/router_metrics.go (6)

17-19: LGTM: Clean interface extension for Prometheus support.

The interface changes clearly separate GraphQL and Prometheus export paths while maintaining backward compatibility with the updated GraphQLMetricsExporter type.


26-31: LGTM: Struct fields updated to support dual export paths.

The addition of prometheusMetricsExporter alongside the updated gqlMetricsExporter type enables the router to export metrics to both GraphQL and Prometheus sinks.


43-52: LGTM: Constructor properly initializes new fields.

All new fields are correctly initialized from the config.


57-69: LGTM: Prometheus tracking enabled conditionally.

Line 64 correctly sets PrometheusUsageInfoTrack based on exporter presence, ensuring tracking overhead is only incurred when the Prometheus exporter is configured.


75-81: LGTM: Accessor methods updated.

The accessor methods correctly return the new exporter types.


83-133: LGTM: Efficient use of cached metrics conversion.

Line 109 now uses GetTypeFieldUsageInfoMetrics() which avoids repeated allocations by caching the converted metrics data. This optimization aligns with the PR's goal to track all operations efficiently.

Note: See the comment on context.go lines 601-608 regarding potential race conditions in the caching implementation.

routerConfigVersion string
logger *zap.Logger
trackUsageInfo bool
prometheusUsageInfoTrack bool
Member

Suggested change
prometheusUsageInfoTrack bool
prometheusTrackUsageInfo bool

Comment on lines +80 to 87
if m.trackUsageInfo {
m.routerMetrics.ExportSchemaUsageInfo(reqContext.operation, statusCode, reqContext.error != nil, exportSynchronous)
}

opAttrs := []attribute.KeyValue{
rotel.WgOperationName.String(reqContext.operation.name),
rotel.WgOperationType.String(reqContext.operation.opType),
// Prometheus metrics export (to local Prometheus metrics)
if m.prometheusUsageInfoTrack {
m.routerMetrics.ExportSchemaUsageInfoPrometheus(reqContext.operation, statusCode, reqContext.error != nil, exportSynchronous)
}
Member

Is there a case where we want both? Should this just be one function that has an interface implementation?
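
For context, the single-function shape could look roughly like this — a hedged sketch of the suggestion, not code from this PR; the schemaUsageExporter interface and the schemaUsageExporters field are hypothetical:

// Hypothetical: one export hook, with one implementation per backend.
type schemaUsageExporter interface {
	ExportSchemaUsage(op *operationContext, statusCode int, hasError bool, exportSynchronous bool)
}

// Both export paths collapse into a single loop over the configured backends.
func (m *OperationMetrics) exportSchemaUsage(reqContext *requestContext, statusCode int, exportSynchronous bool) {
	for _, exp := range m.schemaUsageExporters { // e.g. GraphQL and Prometheus implementations
		exp.ExportSchemaUsage(reqContext.operation, statusCode, reqContext.error != nil, exportSynchronous)
	}
}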

if !m.shouldSampleOperation() {
return
// Export schema usage info to configured exporters
if reqContext.operation != nil && !reqContext.operation.executionOptions.SkipLoader {
Member

@endigma endigma Nov 14, 2025

Suggested change
if reqContext.operation != nil && !reqContext.operation.executionOptions.SkipLoader {
if reqContext.operation != nil {

This check is why we were unable to reproduce a customer issue locally without implementing their subgraph. Is it necessary?

Contributor Author

@StarpTech StarpTech Nov 14, 2025

It ensures that we don't emit schema usage data for planning-only operations, which are used for testing.

RouterMetrics RouterMetrics
Logger *zap.Logger
TrackUsageInfo bool
PrometheusUsageInfoTrack bool
Member

Suggested change
PrometheusUsageInfoTrack bool
PrometheusTrackUsageInfo bool

ExportSchemaUsageInfo(operationContext *operationContext, statusCode int, hasError bool, exportSynchronous bool)
GqlMetricsExporter() *graphqlmetrics.Exporter
ExportSchemaUsageInfoPrometheus(operationContext *operationContext, statusCode int, hasError bool, exportSynchronous bool)
GqlMetricsExporter() *graphqlmetrics.GraphQLMetricsExporter
Member

Suggested change
GqlMetricsExporter() *graphqlmetrics.GraphQLMetricsExporter
GQLMetricsExporter() *graphqlmetrics.GraphQLMetricsExporter

https://go.dev/wiki/CodeReviewComments#initialisms

Comment on lines +9 to +18
type Sink[T any] interface {
// Export sends a batch of items to the sink destination.
// It returns an error if the export fails.
// The context can be used to cancel or timeout the export operation.
Export(ctx context.Context, batch []T) error

// Close performs any cleanup needed when shutting down the sink.
// It should block until all cleanup is complete or the context is cancelled.
Close(ctx context.Context) error
}
Member

I wonder if this should be put in pkg instead? It seems it might be useful as a generic batcher, and as far as I can tell it doesn't include anything graphqlmetrics-specific.


// Close performs any cleanup needed when shutting down the sink.
// It should block until all cleanup is complete or the context is cancelled.
Close(ctx context.Context) error
Member

Neither current sink actually has a Close method; is this necessary?
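
If Close stays in the contract, a sink without resources could satisfy it with an embedded no-op — a minimal sketch under that assumption; noopCloser is illustrative, not part of the PR:

package graphqlmetrics

import "context"

// noopCloser can be embedded by sinks that have no resources to release,
// keeping Close in the Sink contract for future sinks that do need cleanup
// (pooled clients, open files, etc.).
type noopCloser struct{}

func (noopCloser) Close(ctx context.Context) error { return nil }

// A sink with nothing to release could embed noopCloser and implement only Export.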

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/pkg/config/config.schema.json (1)

1236-1282: Lock down schema_usage to reject legacy SampleRate and unknown keys

Currently additionalProperties is not set on schema_usage, so legacy sample_rate may be silently accepted and ignored. Tighten it and keep exporter properties as-is.

-                "schema_usage": {
-                  "type": "object",
-                  "description": "Configure schema field usage metrics for Prometheus",
-                  "properties": {
+                "schema_usage": {
+                  "type": "object",
+                  "description": "Configure schema field usage metrics for Prometheus",
+                  "additionalProperties": false,
+                  "properties": {
                     "enabled": {

Optionally, add a custom pre-validation check to emit a clearer error if sample_rate is present.

♻️ Duplicate comments (1)
router/core/router_metrics.go (1)

19-19: Consider using GQLMetricsExporter for proper Go initialism formatting.

As noted in a previous review, Go's style guide recommends using all caps for initialisms like "GQL" rather than "Gql".

🧹 Nitpick comments (10)
router/internal/graphqlmetrics/exporter.go (2)

52-63: Clarify QueueSize comment

The queue holds items (T), not batches. Update the comment to avoid confusion.


382-416: Shutdown sequencing could prolong termination

Requests are canceled only in a defer, and backoff sleeps are not cancelable. The diff proposed in the critical buffer-pool comment below cancels earlier and makes sleeps cancelable to avoid long waits on shutdown.

router/pkg/config/config.schema.json (3)

1179-1185: Default mismatch: listen_addr

Schema default is "localhost:8088" but code (config.go Line 102) uses "127.0.0.1:8088" and defaults files do too. Align schema default.

-                  "default": "localhost:8088"
+                  "default": "127.0.0.1:8088"

462-466: Default + description mismatch: websocket.forward_upgrade_query_params.enabled

Schema sets default false, description says default true, and code default is true. Align to true.

-              "default": false,
-              "description": "Forward upgrade request query parameters in the extensions payload when starting a subscription on a Subgraph. The default value is true."
+              "default": true,
+              "description": "Forward upgrade request query parameters in the extensions payload when starting a subscription on a Subgraph. The default value is true."

633-637: Default mismatch: access_logs.add_stacktrace

Schema default true; code default is false (config.go Line 883). Align to false.

-          "default": true
+          "default": false
router/core/router_metrics.go (2)

140-170: Document the hash field choice.

The nil check at lines 141-143 properly addresses the defensive programming concern from the previous review. However, line 159 uses operationContext.sha256Hash (original operation hash) while the GraphQL export path uses operationContext.HashString() (normalized hash with variables).

According to the previous review analysis, this difference is intentional: Prometheus uses the original hash for operation correlation. Please add a brief comment at line 159 explaining this choice to prevent future confusion.

Apply this diff to document the hash semantics:

 	item := &graphqlmetricsv1.SchemaUsageInfo{
 		TypeFieldMetrics: operationContext.GetTypeFieldUsageInfoMetrics(),
 		OperationInfo: &graphqlmetricsv1.OperationInfo{
 			Type: opType,
-			Hash: operationContext.sha256Hash,
+			Hash: operationContext.sha256Hash, // Original operation hash for Prometheus correlation (vs normalized hash used in GraphQL export)
 			// parsed operation names are re-used across requests
 			// for that reason, we need to copy the name, or it might get corrupted
 			Name: strings.Clone(operationContext.name),
 		},

145-153: Consider extracting the repeated operation type mapping.

The operation type switch (lines 145-153) is duplicated from the GraphQL export method (lines 93-101). Extracting this into a helper function would improve maintainability.

Example refactor:

func operationTypeToProto(opType OperationType) graphqlmetricsv1.OperationType {
	switch opType {
	case OperationTypeQuery:
		return graphqlmetricsv1.OperationType_QUERY
	case OperationTypeMutation:
		return graphqlmetricsv1.OperationType_MUTATION
	case OperationTypeSubscription:
		return graphqlmetricsv1.OperationType_SUBSCRIPTION
	default:
		return graphqlmetricsv1.OperationType_QUERY
	}
}

Then use opType := operationTypeToProto(operationContext.opType) in both export methods.

router/pkg/graphqlschemausage/schemausage.go (3)

64-104: Type field visitor preallocation and path handling look correct

Preallocating typeFieldUsageInfo with a small fixed capacity and switching to pathCopy := make([]string, len(path)+1) plus copy (see the sketch after this list) avoids the extra allocation of slices.Clone(append(...)) while still guaranteeing that:

  • path == nil is handled safely (len(path) is 0, and copy is a no-op).
  • Each pathCopy is independent and only shared intentionally between the direct field entry and its IndirectInterfaceField variant.
  • Recursion passes pathCopy down and always re-allocates for children, so parent paths are never mutated.

The constants (capacity 32 and per-field path allocations) are modest and appropriate for typical query shapes; if future workloads show much larger field counts, benchmarks in this package can guide tuning, but there’s no correctness or obvious perf risk here.
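
For reference, the allocation pattern described above reduces to a helper like this (a sketch; appendPath is illustrative shorthand, not the actual visitor code):

// appendPath returns a fresh, exactly-sized slice with fieldName appended,
// without aliasing the parent's backing array. Safe for a nil path.
func appendPath(path []string, fieldName string) []string {
	pathCopy := make([]string, len(path)+1)
	copy(pathCopy, path) // copying from a nil slice copies nothing
	pathCopy[len(path)] = fieldName
	return pathCopy
}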


107-131: Argument usage slice preallocation is safe and aligns with usage pattern

Initializing visitor.usage with make([]*graphqlmetrics.ArgumentUsageInfo, 0, 16) is a straightforward allocation win for common cases and doesn’t change behavior; all appends still produce the same sequence of entries as before, and the walker lifecycle remains unchanged.

If production traces show higher argument counts per operation, you can revisit the 16 heuristic, but as-is this is a low-risk, localized improvement.


159-244: Input usage preallocation and enum handling changes preserve semantics

The new preallocations in input usage look good:

  • usage: make([]*graphqlmetrics.InputUsageInfo, 0, 16) in GetInputUsageInfo matches the typical small number of variables and reduces slice growth.
  • In traverseVariable, setting usageInfo.Path = []string{parentTypeName, fieldName} when parentTypeName != "" allocates an exact-length slice and is equivalent to the previous append-based construction, while avoiding extra capacity and potential future aliasing.
  • For enum arrays, usageInfo.EnumValues = make([]string, len(arr)) followed by index assignment is correct and avoids repeated append calls.

These changes don’t alter how appendUniqueUsage and infoEquals behave, so deduplication semantics remain the same while allocations are slightly reduced.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86480f7 and bb6d441.

📒 Files selected for processing (9)
  • router/core/router_metrics.go (5 hunks)
  • router/internal/graphqlmetrics/exporter.go (5 hunks)
  • router/internal/graphqlmetrics/exporter_bench_test.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
  • router/pkg/graphqlschemausage/schemausage.go (6 hunks)
  • router/pkg/graphqlschemausage/schemausage_bench_test.go (1 hunks)
🧰 Additional context used
🧠 Learnings (13)
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/testdata/config_defaults.json
  • router/pkg/config/config.schema.json
  • router/core/router_metrics.go
  • router/pkg/config/testdata/config_full.json
  • router/pkg/graphqlschemausage/schemausage.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/pkg/config/config.schema.json
  • router/core/router_metrics.go
📚 Learning: 2025-09-17T20:55:39.456Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2172
File: router/core/graph_server.go:0-0
Timestamp: 2025-09-17T20:55:39.456Z
Learning: The Initialize method in router/internal/retrytransport/manager.go has been updated to properly handle feature-flag-only subgraphs by collecting subgraphs from both routerConfig.GetSubgraphs() and routerConfig.FeatureFlagConfigs.ConfigByFeatureFlagName, ensuring all subgraphs receive retry configuration.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function at lines 571-578, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the computed query hash before any APQ operations occur. There's also a test case that verifies this behavior.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: In the Cosmo router codebase, JSON schema validation prevents null values in TrafficShapingRules subgraph configurations, making nil checks unnecessary when dereferencing subgraph rule pointers in NewSubgraphTransportOptions.

Applied to files:

  • router/core/router_metrics.go
  • router/pkg/config/testdata/config_full.json
📚 Learning: 2025-09-02T12:52:27.677Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2181
File: router/core/operation_processor.go:0-0
Timestamp: 2025-09-02T12:52:27.677Z
Learning: Hash validation for persisted queries with query bodies is performed in router/core/graphql_prehandler.go in the handleOperation function, not in the APQ processing logic in operation_processor.go. This validates that extensions.persistedQuery.sha256Hash matches the query body before any APQ operations occur.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-14T17:46:00.930Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/metric/oltp_event_metric_store.go:68-68
Timestamp: 2025-08-14T17:46:00.930Z
Learning: In the Cosmo router codebase, meter providers are shut down centrally as part of the router lifecycle, so individual metric stores like otlpEventMetrics use no-op Shutdown() methods rather than calling meterProvider.Shutdown(ctx) directly.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-08-12T13:50:45.964Z
Learnt from: Noroth
Repo: wundergraph/cosmo PR: 2132
File: router-plugin/plugin.go:139-146
Timestamp: 2025-08-12T13:50:45.964Z
Learning: In the Cosmo router plugin system, the plugin framework creates its own logger independently. When creating a logger in NewRouterPlugin, it's only needed for the gRPC server setup (passed via setup.GrpcServerInitOpts.Logger) and doesn't need to be assigned back to serveConfig.Logger.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-11-12T18:38:46.619Z
Learnt from: StarpTech
Repo: wundergraph/cosmo PR: 2327
File: graphqlmetrics/core/metrics_service.go:435-442
Timestamp: 2025-11-12T18:38:46.619Z
Learning: In graphqlmetrics/core/metrics_service.go, the opGuardCache TTL (30 days) is intentionally shorter than the gql_metrics_operations database retention period (90 days). This design reduces memory usage while relying on database constraints to prevent duplicate inserts after cache expiration.

Applied to files:

  • router/core/router_metrics.go
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/pkg/config/testdata/config_full.json
🧬 Code graph analysis (6)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
  • Sink (9-18)
  • SinkErrorHandler (23-23)
router/pkg/config/config.go (2)
router/internal/graphqlmetrics/exporter.go (1)
  • Exporter (17-33)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router/core/router_metrics.go (5)
router/core/operation_metrics.go (1)
  • OperationMetrics (31-40)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • GraphQLMetricsExporter (13-15)
router/internal/graphqlmetrics/prometheus_exporter.go (1)
  • PrometheusMetricsExporter (13-15)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (10)
  • OperationType (26-26)
  • OperationType (58-60)
  • OperationType (62-64)
  • OperationType (71-73)
  • OperationType_QUERY (29-29)
  • OperationType_MUTATION (30-30)
  • OperationType_SUBSCRIPTION (31-31)
  • SchemaUsageInfo (130-153)
  • SchemaUsageInfo (168-168)
  • SchemaUsageInfo (183-185)
router/core/context.go (4)
  • OperationType (503-503)
  • OperationTypeQuery (506-506)
  • OperationTypeMutation (507-507)
  • OperationTypeSubscription (508-508)
router/internal/graphqlmetrics/exporter_bench_test.go (1)
router/internal/graphqlmetrics/exporter.go (3)
  • ExporterSettings (52-63)
  • RetryOptions (35-40)
  • NewExporter (84-120)
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)
router/pkg/graphqlschemausage/schemausage_test.go (1)
  • FakeFactory (527-529)
router/pkg/graphqlschemausage/schemausage.go (4)
  • GetTypeFieldUsageInfo (14-23)
  • GetArgumentUsageInfo (107-124)
  • GetInputUsageInfo (159-171)
  • TypeFieldMetrics (26-26)
router/pkg/graphqlschemausage/schemausage.go (1)
graphqlmetrics/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (9)
  • TypeFieldUsageInfo (476-493)
  • TypeFieldUsageInfo (508-508)
  • TypeFieldUsageInfo (523-525)
  • ArgumentUsageInfo (569-582)
  • ArgumentUsageInfo (597-597)
  • ArgumentUsageInfo (612-614)
  • InputUsageInfo (644-659)
  • InputUsageInfo (674-674)
  • InputUsageInfo (689-691)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-router
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: build_test
  • GitHub Check: build_push_image
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: build_test
🔇 Additional comments (10)
router/internal/graphqlmetrics/exporter_bench_test.go (1)

27-66: Benchmarks cover key hot paths; LGTM

Good coverage of buffer pooling, high-throughput, batch cycling, growth, and parallel Record. Minor: time.Sleep in benchmarks can add variance, but acceptable here.

Also applies to: 68-111, 112-156, 157-197, 198-242

router/pkg/config/testdata/config_defaults.json (1)

66-71: Defaults alignment looks good

Exporter defaults match code/env defaults (4096/12800/1s/10s). LGTM.

router/pkg/config/testdata/config_full.json (1)

96-101: Config shape migration verified

Exporter object replaces SampleRate as intended; values consistent with defaults/examples. LGTM.

router/pkg/config/config.go (1)

116-119: Duplicate PrometheusSchemaFieldUsageExporter definitions are intentional.

The two PrometheusSchemaFieldUsageExporter definitions serve distinct purposes and should not be unified:

  • router/pkg/config/config.go:121 is a configuration struct with YAML and environment variable tags for unmarshaling external configuration with defaults
  • router/pkg/metric/config.go:48 is an internal runtime struct used by the metrics package to track exporter state, with no serialization concerns

This separation of concerns is intentional and appropriate. Consolidating them would couple the metric package to configuration unmarshaling logic and add unnecessary struct tags where they don't belong. The apparent duplication is justified.

Likely an incorrect or invalid review comment.

router/core/router_metrics.go (4)

3-5: Good use of standard library.

The strings package import supports the replacement of the custom strCopy implementation with strings.Clone(), which is a cleaner approach.


24-53: LGTM!

The struct fields and constructor are updated consistently. The new prometheusMetricsExporter field is properly initialized in NewRouterMetrics.


76-82: LGTM!

The getter methods correctly expose both exporters through the RouterMetrics interface.


84-138: LGTM!

The method improvements are well-executed:

  • Added defensive nil check for the exporter (lines 89-91)
  • Replaced custom strCopy with standard strings.Clone (line 122)
  • Updated to use GetTypeFieldUsageInfoMetrics() for cleaner data access
router/pkg/graphqlschemausage/schemausage_bench_test.go (2)

106-194: Benchmarks are well-structured and focus on the right hot paths

Each benchmark:

  • Calls setupBenchmark once, then uses b.ResetTimer() and b.ReportAllocs() to keep setup out of the measured section.
  • Exercises a single concern per benchmark (GetTypeFieldUsageInfo, GetArgumentUsageInfo, GetInputUsageInfo, IntoGraphQLMetrics, and the combined end-to-end flow), with errors surfaced via b.Fatal.
  • Prevents compiler optimizations by assigning results to a blank variable.

The end-to-end benchmark intentionally reuses the precomputed plan, docs, and variables to focus on schema usage extraction rather than planning, which matches the PR’s perf focus. This all looks good and should give you stable allocation/CPU signals for regression tracking.
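
For reference, the measurement pattern described above looks roughly like the following self-contained sketch, where setupBenchmarkInput and extractUsage are placeholders standing in for the real setup and the functions under test:

package graphqlschemausage_test

import "testing"

// Placeholders for this sketch only.
func setupBenchmarkInput() []string { return []string{"id", "name", "price"} }
func extractUsage(in []string) ([]string, error) { return in, nil }

func BenchmarkSchemaUsageExtraction(b *testing.B) {
	input := setupBenchmarkInput() // expensive setup runs once, outside the timed loop

	b.ReportAllocs() // report allocations per iteration
	b.ResetTimer()   // exclude setup cost from the measurement

	var sink int
	for i := 0; i < b.N; i++ {
		out, err := extractUsage(input)
		if err != nil {
			b.Fatal(err) // surface errors without polluting the timing
		}
		sink = len(out)
	}
	_ = sink // keep the result live so the compiler cannot elide the call
}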


19-104: No issues found; original concern is unfounded.

The pattern of calling astjson.ParseBytes(op.Input.Variables) is safe and well-established. The Variables field originates as json.RawMessage (a []byte type with omitempty tag), which defaults to an empty slice rather than nil when unmarshaled. This empty slice is valid JSON input for astjson.ParseBytes. The identical pattern already exists and is tested in schemausage_test.go (line 206) with successful results, confirming the code path is sound.

Comment on lines +24 to 27
queue chan T
inflightBatches *atomic.Int64
batchBufferPool *sync.Pool // Pool for batch slice buffers to reduce allocations


⚠️ Potential issue | 🔴 Critical

Critical: batch buffer returned to pool before export completes (data race/corruption)

Buffer is enqueued for export and immediately returned to the pool (lines 332-333, 341-342, 364-366, 371-372). The goroutine then exports a slice referencing memory that may be reused/modified, leading to corrupted batches and races. Also, the pool stores *[]T and rewraps slice headers, which is error‑prone.

Fix by:

  • Pooling []T values (not pointers).
  • Returning buffers to the pool only after export finishes (inside the export goroutine).
  • Making retry backoff sleep cancelable and cancel export requests earlier during Shutdown.

Apply this diff:

@@
-    "sync"
+    "sync"
@@
-    batchBufferPool   *sync.Pool // Pool for batch slice buffers to reduce allocations
+    batchBufferPool   *sync.Pool // Pool of []T to reduce allocations
@@
-        batchBufferPool: &sync.Pool{
-            New: func() any {
-                // Pre-allocate slice with batch size capacity
-                buffer := make([]T, 0, settings.BatchSize)
-                return &buffer
-            },
-        },
+        batchBufferPool: &sync.Pool{
+            New: func() any {
+                // Pre-allocate slice with batch size capacity
+                return make([]T, 0, settings.BatchSize)
+            },
+        },
@@
-func (e *Exporter[T]) getBatchBuffer() []T {
-    bufferPtr := e.batchBufferPool.Get().(*[]T)
-    buffer := *bufferPtr
-    // Ensure the buffer is empty (should already be, but be defensive)
-    return buffer[:0]
-}
+func (e *Exporter[T]) getBatchBuffer() []T {
+    bufAny := e.batchBufferPool.Get()
+    if bufAny == nil {
+        return make([]T, 0, e.settings.BatchSize)
+    }
+    buf := bufAny.([]T)
+    return buf[:0]
+}
@@
-func (e *Exporter[T]) putBatchBuffer(buffer []T) {
-    // Reset the slice to zero length while keeping capacity
-    buffer = buffer[:0]
-    e.batchBufferPool.Put(&buffer)
-}
+func (e *Exporter[T]) putBatchBuffer(buf []T) {
+    e.batchBufferPool.Put(buf[:0])
+}
@@
 func (e *Exporter[T]) prepareAndSendBatch(batch []T) {
     e.logger.Debug("Preparing to send batch", zap.Int("batch_size", len(batch)))
     e.inflightBatches.Inc()
-    go func() {
+    go func(b []T) {
         defer e.inflightBatches.Dec()
-        e.exportBatchWithRetry(batch)
-    }()
+        defer e.putBatchBuffer(b) // return buffer only after export completes
+        e.exportBatchWithRetry(b)
+    }(batch)
 }
@@
-        case <-ticker.C:
+        case <-ticker.C:
             e.logger.Debug("Tick: flushing buffer", zap.Int("buffer_size", len(buffer)))
             if len(buffer) > 0 {
-                e.prepareAndSendBatch(buffer)
-                // Return buffer to pool after sending
-                e.putBatchBuffer(buffer)
+                e.prepareAndSendBatch(buffer) // buffer will be returned by goroutine
                 buffer = nil
             }
@@
             if len(buffer) == e.settings.BatchSize {
                 e.logger.Debug("Buffer full, sending batch", zap.Int("batch_size", len(buffer)))
-                e.prepareAndSendBatch(buffer)
-                // Return buffer to pool after sending
-                e.putBatchBuffer(buffer)
+                e.prepareAndSendBatch(buffer) // buffer will be returned by goroutine
                 buffer = nil
             }
@@
-            if len(buffer) == e.settings.BatchSize {
-                e.prepareAndSendBatch(buffer)
-                // Return buffer to pool and get a new one
-                e.putBatchBuffer(buffer)
-                buffer = e.getBatchBuffer()
-            }
+            if len(buffer) == e.settings.BatchSize {
+                e.prepareAndSendBatch(buffer) // buffer returned by goroutine
+                buffer = e.getBatchBuffer()
+            }
@@
-            if len(buffer) > 0 {
-                e.prepareAndSendBatch(buffer)
-                // Return buffer to pool before returning
-                e.putBatchBuffer(buffer)
-            }
+            if len(buffer) > 0 {
+                e.prepareAndSendBatch(buffer) // buffer returned by goroutine
+            }

And make backoff sleeps cancelable and cancel export requests earlier during shutdown:

@@
 func (e *Exporter[T]) exportBatchWithRetry(batch []T) {
@@
-        // Wait for the specified backoff period
-        sleepDuration := b.Duration()
-        e.logger.Debug("Retrying export after backoff",
+        // Wait for the specified backoff period (cancelable)
+        sleepDuration := b.Duration()
+        e.logger.Debug("Retrying export after backoff",
             zap.Int("batch_size", len(batch)),
             zap.Int("retry", retry),
             zap.Duration("sleep", sleepDuration),
         )
-
-        time.Sleep(sleepDuration)
+        select {
+        case <-e.exportRequestContext.Done():
+            e.logger.Debug("Export cancelled during backoff")
+            return
+        case <-time.After(sleepDuration):
+        }
@@
 func (e *Exporter[T]) Shutdown(ctx context.Context) error {
     e.logger.Debug("Shutdown started")
@@
-    defer func() {
-        ticker.Stop()
-        // Cancel all export requests
-        e.cancelAllExportRequests()
+    // Cancel export requests early to unblock any long requests/backoff
+    e.cancelAllExportRequests()
+    defer func() {
+        ticker.Stop()
         // Close the sink
         if err := e.sink.Close(ctx); err != nil {
             e.logger.Error("Error closing sink", zap.Error(err))
         }
         e.logger.Debug("Shutdown complete")
     }()

Minor nit: the comment for QueueSize says “number of batches” but the queue holds items. Consider updating it for clarity.

Also applies to: 96-114, 319-344, 352-377

🤖 Prompt for AI Agents
router/internal/graphqlmetrics/exporter.go around lines 24-27 and the other
affected ranges (96-114, 319-344, 352-377): the current implementation returns
*[]T buffers to the sync.Pool immediately after enqueueing, which allows the
buffer memory to be reused while an export goroutine is still referencing it,
and the pool stores pointers to slices which rewraps slice headers unsafely.
Change the pool to hold []T (not *[]T), stop returning buffers to the pool
before export completes, and move the buffer Put(...) into the export goroutine
after the export attempt (including after retries) so the goroutine owns the
slice lifetime; make any retry backoff sleeps cancelable by using a select on
ctx.Done() and a time.After or time.NewTimer channel so sleeps abort immediately
on Shutdown; ensure Shutdown cancels in-flight export requests earlier by
cancelling the export context (or closing a done channel) before waiting for
goroutines, and update the QueueSize comment to accurately say it’s the queue
capacity in items (not "number of batches"). Ensure all places that currently
Put a *[]T are updated to Put a []T and that enqueued values are copies of the
slice header or newly allocated slices if needed so the pool semantics are safe.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
router/internal/graphqlmetrics/exporter_test.go (1)

70-71: Minor: prefer require.NoError for errors

Switch to require.NoError for clarity. Applies to all similar occurrences.

- require.Nil(t, err)
+ require.NoError(t, err)

Also applies to: 149-150, 188-189, 222-223, 345-346, 421-422

♻️ Duplicate comments (3)
router/internal/graphqlmetrics/exporter.go (3)

56-57: Doc nit: QueueSize counts items, not batches

The queue holds items (chan T), not batches. Update comment for accuracy.

- // QueueSize is the maximum number of batches allowed in queue at a given time.
+ // QueueSize is the maximum number of items allowed in the queue at a given time.

272-290: Backoff sleep not cancelable; goroutines may linger on shutdown

Make backoff waits respect e.exportRequestContext so Shutdown can exit promptly.

-    // Wait for the specified backoff period
-    sleepDuration := b.Duration()
-    e.logger.Debug("Retrying export after backoff",
+    // Wait for the specified backoff period (cancelable)
+    sleepDuration := b.Duration()
+    e.logger.Debug("Retrying export after backoff",
       zap.Int("batch_size", len(batch)),
       zap.Int("retry", retry),
       zap.Duration("sleep", sleepDuration),
     )
-
-    time.Sleep(sleepDuration)
+    select {
+    case <-e.exportRequestContext.Done():
+      e.logger.Debug("Export cancelled during backoff")
+      return
+    case <-time.After(sleepDuration):
+    }

381-394: Shutdown can hang; cancel exports earlier

Cancel export requests before waiting, not in defer, so sleeping/blocked exports exit promptly.

 func (e *Exporter[T]) Shutdown(ctx context.Context) error {
   e.logger.Debug("Shutdown started")
 
   ticker := time.NewTicker(time.Millisecond * 100)
-  defer func() {
-    ticker.Stop()
-    // Cancel all export requests
-    e.cancelAllExportRequests()
-    // Close the sink
-    if err := e.sink.Close(ctx); err != nil {
-      e.logger.Error("Error closing sink", zap.Error(err))
-    }
-    e.logger.Debug("Shutdown complete")
-  }()
+  defer func() {
+    ticker.Stop()
+    // Close the sink
+    if err := e.sink.Close(ctx); err != nil {
+      e.logger.Error("Error closing sink", zap.Error(err))
+    }
+    e.logger.Debug("Shutdown complete")
+  }()
 
   // First close the acceptTrafficSema to stop accepting new items
   close(e.acceptTrafficSema)
   // Then trigger the shutdown signal for the exporter goroutine to stop
   // It will drain the queue and send the remaining items
   close(e.shutdownSignal)
+  // Cancel all export requests to unblock backoff/sinks started before shutdown
+  e.cancelAllExportRequests()
🧹 Nitpick comments (5)
router/internal/graphqlmetrics/prometheus_sink.go (1)

104-106: Use int64 for counts to match metric API and avoid unnecessary casts

Store counts as int64; remove per-iteration int conversions.

-func (s *PrometheusSink) aggregateBatch(batch []*graphqlmetrics.SchemaUsageInfo) map[aggregatedUsageKey]int {
-	aggregatedCounts := make(map[aggregatedUsageKey]int)
+func (s *PrometheusSink) aggregateBatch(batch []*graphqlmetrics.SchemaUsageInfo) map[aggregatedUsageKey]int64 {
+	aggregatedCounts := make(map[aggregatedUsageKey]int64)
@@
-			// Increment count, using field.Count if available, otherwise 1
-			if field.Count > 0 {
-				aggregatedCounts[key] += int(field.Count)
-			} else {
-				aggregatedCounts[key]++
-			}
+			// Increment count, using field.Count if available, otherwise 1
+			if field.Count > 0 {
+				aggregatedCounts[key] += int64(field.Count)
+			} else {
+				aggregatedCounts[key] += 1
+			}

And in Export:

-	for key, count := range aggregatedCounts {
+	for key, count := range aggregatedCounts {
@@
-			int64(count),
+			count,

Also applies to: 140-145, 81-87

router/demo.config.yaml (1)

16-25: LGTM: concise demo of Prometheus schema_usage

Sample config is coherent. Optional: add a comment noting include_operation_sha increases label cardinality.

router/internal/graphqlmetrics/exporter_test.go (1)

385-391: Reduce flakiness: prefer Eventually over fixed Sleep

Replace fixed sleep with Eventually to wait for flush deterministically.

- time.Sleep(200 * time.Millisecond)
- 
- defer require.Nil(t, e.Shutdown(context.Background()))
- 
- require.Equal(t, 1, len(c.publishedAggregations))
- require.Equal(t, 5, len(c.publishedAggregations[0]))
+ require.Eventually(t, func() bool {
+   c.mu.Lock()
+   defer c.mu.Unlock()
+   return len(c.publishedAggregations) == 1 && len(c.publishedAggregations[0]) == 5
+ }, 2*time.Second, 50*time.Millisecond)
+ defer require.NoError(t, e.Shutdown(context.Background()))
router/pkg/config/config.schema.json (1)

1249-1281: Tighten exporter schema and fix unit doc

Disallow unknown keys and include 'ms' in interval docs to match usage.

-                    "exporter": {
-                      "type": "object",
-                      "description": "Configuration for the schema usage exporter",
-                      "properties": {
+                    "exporter": {
+                      "type": "object",
+                      "description": "Configuration for the schema usage exporter",
+                      "additionalProperties": false,
+                      "properties": {
@@
-                        "interval": {
+                        "interval": {
                           "type": "string",
                           "default": "2s",
                           "duration": {
                             "minimum": "100ms"
                           },
-                          "description": "The interval at which the schema usage queue is flushed. The period is specified as a string with a number and a unit, e.g. 10s, 1m. The supported units are 's', 'm', 'h'."
+                          "description": "The interval at which the schema usage queue is flushed. Specify a duration, e.g. 100ms, 10s, 1m. Supported units: 'ms', 's', 'm', 'h'."
                         },
router/internal/graphqlmetrics/exporter.go (1)

107-114: Prefer pooling []T (not *[]T) to simplify and avoid pointer indirection

Safer and simpler: pool raw slices; ownership already fixed by deferring put in goroutine.

-    batchBufferPool: &sync.Pool{
-      New: func() any {
-        // Pre-allocate slice with batch size capacity
-        buffer := make([]T, 0, settings.BatchSize)
-        return &buffer
-      },
-    },
+    batchBufferPool: &sync.Pool{
+      New: func() any {
+        return make([]T, 0, settings.BatchSize)
+      },
+    },
@@
-func (e *Exporter[T]) getBatchBuffer() []T {
-  bufferPtr := e.batchBufferPool.Get().(*[]T)
-  buffer := *bufferPtr
-  // Ensure the buffer is empty (should already be, but be defensive)
-  return buffer[:0]
-}
+func (e *Exporter[T]) getBatchBuffer() []T {
+  bufAny := e.batchBufferPool.Get()
+  if bufAny == nil {
+    return make([]T, 0, e.settings.BatchSize)
+  }
+  buf := bufAny.([]T)
+  return buf[:0]
+}
@@
-func (e *Exporter[T]) putBatchBuffer(buffer []T) {
-  // Reset the slice to zero length while keeping capacity
-  buffer = buffer[:0]
-  e.batchBufferPool.Put(&buffer)
-}
+func (e *Exporter[T]) putBatchBuffer(buf []T) {
+  e.batchBufferPool.Put(buf[:0])
+}

Also applies to: 154-161, 163-169

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb6d441 and 841d22a.

📒 Files selected for processing (10)
  • router/demo.config.yaml (2 hunks)
  • router/internal/graphqlmetrics/exporter.go (5 hunks)
  • router/internal/graphqlmetrics/exporter_test.go (10 hunks)
  • router/internal/graphqlmetrics/prometheus_sink.go (1 hunks)
  • router/pkg/config/config.go (1 hunks)
  • router/pkg/config/config.schema.json (1 hunks)
  • router/pkg/config/config_test.go (1 hunks)
  • router/pkg/config/testdata/config_defaults.json (1 hunks)
  • router/pkg/config/testdata/config_full.json (1 hunks)
  • router/pkg/graphqlschemausage/schemausage.go (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • router/pkg/config/config_test.go
  • router/pkg/graphqlschemausage/schemausage.go
  • router/pkg/config/testdata/config_full.json
  • router/pkg/config/testdata/config_defaults.json
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-01T20:39:16.113Z
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2252
File: router-tests/telemetry/telemetry_test.go:9684-9693
Timestamp: 2025-10-01T20:39:16.113Z
Learning: Repo preference: In router-tests/telemetry/telemetry_test.go, keep strict > 0 assertions for request.operation.*Time (parsingTime, normalizationTime, validationTime, planningTime) in telemetry-related tests; do not relax to >= 0 unless CI flakiness is observed.

Applied to files:

  • router/internal/graphqlmetrics/exporter_test.go
📚 Learning: 2025-07-03T10:33:25.778Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.778Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.

Applied to files:

  • router/pkg/config/config.schema.json
📚 Learning: 2025-08-20T10:08:17.857Z
Learnt from: endigma
Repo: wundergraph/cosmo PR: 2155
File: router/core/router.go:1857-1866
Timestamp: 2025-08-20T10:08:17.857Z
Learning: router/pkg/config/config.schema.json forbids null values for traffic_shaping.subgraphs: additionalProperties references $defs.traffic_shaping_subgraph_request_rule with type "object". Therefore, in core.NewSubgraphTransportOptions, dereferencing each subgraph rule pointer is safe under schema-validated configs, and a nil-check is unnecessary.

Applied to files:

  • router/pkg/config/config.schema.json
🧬 Code graph analysis (4)
router/internal/graphqlmetrics/exporter_test.go (2)
router/internal/graphqlmetrics/graphql_exporter.go (1)
  • NewGraphQLMetricsExporter (19-39)
router/gen/proto/wg/cosmo/graphqlmetrics/v1/graphqlmetrics.pb.go (3)
  • RequestInfo (75-82)
  • RequestInfo (97-97)
  • RequestInfo (112-114)
router/pkg/config/config.go (2)
router/pkg/metric/config.go (1)
  • PrometheusSchemaFieldUsageExporter (48-53)
router-tests/testenv/testenv.go (1)
  • PrometheusSchemaFieldUsageExporter (285-290)
router/internal/graphqlmetrics/prometheus_sink.go (3)
router/pkg/metric/metric_store.go (1)
  • Store (149-162)
router/pkg/otel/attributes.go (5)
  • WgOperationName (10-10)
  • WgOperationType (11-11)
  • WgOperationSha256 (59-59)
  • WgGraphQLFieldName (60-60)
  • WgGraphQLParentType (61-61)
router/pkg/graphqlschemausage/schemausage.go (1)
  • TypeFieldMetrics (26-26)
router/internal/graphqlmetrics/exporter.go (1)
router/internal/graphqlmetrics/sink.go (2)
  • Sink (9-18)
  • SinkErrorHandler (23-23)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: integration_test (./events)
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: image_scan
  • GitHub Check: build_test
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: Analyze (javascript-typescript)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
router/pkg/config/config.go (1)

116-119: LGTM: exporter config shape and defaults

Exporter block and env tags look consistent with schema and intended wiring.

Also applies to: 121-126

router/internal/graphqlmetrics/prometheus_sink.go (1)

121-131: Avoid potential nil deref on field entries

TypeFieldMetrics is a slice of pointers; a nil element will panic on Path/TypeNames access.

-		for _, field := range usageInfo.TypeFieldMetrics {
+		for _, field := range usageInfo.TypeFieldMetrics {
+			if field == nil {
+				continue
+			}
 			// Skip fields without valid parent type or path
 			if len(field.Path) == 0 || len(field.TypeNames) < 1 {
 				continue
 			}
⛔ Skipped due to learnings
Learnt from: SkArchon
Repo: wundergraph/cosmo PR: 2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Comment on lines 139 to 149
if e.settings.RetryOptions.MaxDuration <= 0 {
return errors.New("retry max duration must be positive")
return fmt.Errorf("retry max duration must be positive")
}

if e.settings.RetryOptions.Interval <= 0 {
return errors.New("retry interval must be positive")
return fmt.Errorf("retry interval must be positive")
}

if e.settings.RetryOptions.MaxRetry <= 0 {
return errors.New("retry max retry must be positive")
return fmt.Errorf("retry max retry must be positive")
}

⚠️ Potential issue | 🟠 Major

Validation rejects zero retry values even when retries are disabled

Only enforce retry field positivity if RetryOptions.Enabled is true.

-  if e.settings.RetryOptions.MaxDuration <= 0 {
-    return fmt.Errorf("retry max duration must be positive")
-  }
-  if e.settings.RetryOptions.Interval <= 0 {
-    return fmt.Errorf("retry interval must be positive")
-  }
-  if e.settings.RetryOptions.MaxRetry <= 0 {
-    return fmt.Errorf("retry max retry must be positive")
-  }
+  if e.settings.RetryOptions.Enabled {
+    if e.settings.RetryOptions.MaxDuration <= 0 {
+      return fmt.Errorf("retry max duration must be positive")
+    }
+    if e.settings.RetryOptions.Interval <= 0 {
+      return fmt.Errorf("retry interval must be positive")
+    }
+    if e.settings.RetryOptions.MaxRetry <= 0 {
+      return fmt.Errorf("retry max retry must be positive")
+    }
+  }
🤖 Prompt for AI Agents
In router/internal/graphqlmetrics/exporter.go around lines 139 to 149, the
current validation unconditionally rejects zero/negative retry values; change
the checks so they only run when e.settings.RetryOptions.Enabled is true (and
ensure RetryOptions is non-nil before accessing Enabled). Concretely, wrap the
three positivity checks for MaxDuration, Interval and MaxRetry in an if
e.settings.RetryOptions != nil && e.settings.RetryOptions.Enabled { ... } block
so disabled retries may have zero values, and return the same error messages
when validation runs.

Comment on lines +41 to +45
return &PrometheusSink{
metricStore: cfg.MetricStore,
logger: cfg.Logger.With(zap.String("component", "prometheus_sink")),
includeOpSha: cfg.IncludeOpSha,
}

⚠️ Potential issue | 🟠 Major

Guard against nil logger in constructor

cfg.Logger can be nil; calling .With will panic. Default to zap.NewNop().

 func NewPrometheusSink(cfg PrometheusSinkConfig) *PrometheusSink {
-	return &PrometheusSink{
-		metricStore:  cfg.MetricStore,
-		logger:       cfg.Logger.With(zap.String("component", "prometheus_sink")),
+	lg := cfg.Logger
+	if lg == nil {
+		lg = zap.NewNop()
+	}
+	return &PrometheusSink{
+		metricStore:  cfg.MetricStore,
+		logger:       lg.With(zap.String("component", "prometheus_sink")),
 		includeOpSha: cfg.IncludeOpSha,
 	}
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In router/internal/graphqlmetrics/prometheus_sink.go around lines 41 to 45, the
constructor uses cfg.Logger.With(...) but cfg.Logger may be nil and will panic;
guard by defaulting to zap.NewNop() when cfg.Logger is nil, assign a local
logger variable like logger := cfg.Logger; if logger == nil { logger =
zap.NewNop() } and then call
logger.With(zap.String("component","prometheus_sink")) when constructing the
PrometheusSink.
