[beats receivers] Replace OTel collector internal telemetry monitoring with a scalable approach #10220

@faec

The initial implementation for ingesting OTel collector internal telemetry has some major limitations:

  • Conversion to Beats-compatible fields for the Agent dashboards is done with a hard-coded JavaScript processor, which is hard to maintain and hard to test.
  • Because the telemetry scrape is based on Metricbeat's prometheus input, we can't assume all relevant metrics will appear in the same event: this input partitions fields by their label set, and the labels for the relevant fields vary significantly and are not guaranteed to be stable. This rules out support for some important metrics, like output.events.active, that require aggregating data from multiple collector telemetry variables. It also forces questionable mitigations, such as mangling the metricbeat id so that Elasticsearch doesn't reject events with disjoint variable sets as duplicates just because they share a timestamp and source metadata.
  • Some aspects of the config generation and field conversion only work for the monitoring case. Extending them to support the general case would significantly increase the complexity, and might be entirely infeasible with this approach.
  • Less severe but still undesirable: this approach requires fetching the data over an open TCP port, even when running in the same process as the collector.
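
To make the label-set problem concrete, here is a minimal sketch of what partitioning by label set does to a scrape. It is a toy model, not the actual Metricbeat code: the `Sample` type, the `PartitionByLabels` helper, and the metric names `example_queue_size` and `example_in_flight` are all made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is one scraped value with its label set (hypothetical type).
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// labelKey builds a canonical string for a label set, so that samples
// with identical labels map to the same key regardless of map order.
func labelKey(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

// PartitionByLabels groups samples into one event per distinct label set.
// Each event maps metric name to value.
func PartitionByLabels(samples []Sample) map[string]map[string]float64 {
	events := make(map[string]map[string]float64)
	for _, s := range samples {
		key := labelKey(s.Labels)
		if events[key] == nil {
			events[key] = make(map[string]float64)
		}
		events[key][s.Name] = s.Value
	}
	return events
}

func main() {
	// Two metrics from the same scrape whose label sets differ.
	samples := []Sample{
		{Name: "example_queue_size", Labels: map[string]string{"exporter": "es"}, Value: 4},
		{Name: "example_in_flight", Labels: map[string]string{"exporter": "es", "signal": "logs"}, Value: 3},
	}
	events := PartitionByLabels(samples)
	fmt.Println(len(events)) // prints 2: the metrics land in separate events
}
```

Since the two values end up in different events, a per-event processor never sees both at once, so it cannot compute anything that depends on the pair.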

A sustainable solution would have the following attributes:

  • Any necessary field conversions can access and logically depend on the full set of Collector telemetry fields (this requires at minimum a custom scraper for the data in Agent and/or Beats instead of adapting the existing Prometheus input).
  • Non-trivial field conversions should be written in unit-tested Go code and updated in sync with the Collector version.
  • If/when possible, scrape the data directly through in-process mechanisms, or through a private socket rather than a generic TCP port.

The concrete design should also allow for a viable migration path to pure OTLP metrics, as definitions and dependencies stabilize enough to reliably build dashboards / integration support on that basis.
