pkoutsovasilis
Contributor

@pkoutsovasilis pkoutsovasilis commented Oct 6, 2025

What does this PR do?

This PR improves error handling for Elasticsearch output configurations in the Hybrid Elastic Agent by:

  1. Partially moving ownership of configuration translation: Relocates some of the Elasticsearch output translation logic from the beats library (libbeat/otelbeat/oteltranslate/outputs/elasticsearch) into the elastic-agent package (internal/pkg/otel/translate/output_elasticsearch.go). In the future, we should complete the transition to the elastic-agent repo, since this gives elastic-agent full control over the translation.

  2. Enabling graceful error handling: Adds continue_on_error: true to the beatsauth extension configuration in getBeatsAuthExtensionConfig(). This prevents the OpenTelemetry collector from exiting on startup when it encounters an invalid SSL configuration (e.g., missing certificate files); see the respective PR.
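
As a rough illustration of the second point, the sketch below adds the continue_on_error key to an extension config represented as a plain map. The helper name withContinueOnError, the map-based config shape, and the ssl keys are assumptions made for this example; they are not taken from the actual getBeatsAuthExtensionConfig() implementation.

```go
package main

import "fmt"

// withContinueOnError returns a shallow copy of a beatsauth extension config
// with continue_on_error enabled. The helper name and the map-based config
// shape are hypothetical and only illustrate the idea described in the PR.
func withContinueOnError(extCfg map[string]any) map[string]any {
	out := make(map[string]any, len(extCfg)+1)
	for k, v := range extCfg {
		out[k] = v
	}
	// With this key set, an invalid SSL configuration is not supposed to
	// stop the collector at startup.
	out["continue_on_error"] = true
	return out
}

func main() {
	cfg := withContinueOnError(map[string]any{
		// Hypothetical SSL settings pointing at files that do not exist.
		"ssl": map[string]any{
			"enabled":     true,
			"certificate": "/etc/client/cert.pem",
			"key":         "/etc/client/cert.key",
		},
	})
	fmt.Println(cfg["continue_on_error"])
}
```

The copy is shallow on purpose: nested maps such as ssl are shared with the input, which is enough for a sketch but would need care in real translation code.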

Why is it important?

When an Elasticsearch output has invalid configuration (like a missing SSL certificate), the collector exits with a vague error message that doesn't identify which output caused the failure:

error found during service initialization: failed to build extensions: failed to create extension "beatsauth": failed unpacking config: open /etc/client/cert.pem: no such file or directory

Benefits of this PR:

  • The collector stays running instead of exiting at startup, allowing other pipelines that use different exporters with valid configuration to continue pushing data
  • Errors are surfaced at the exporter level when requests are made, making it clear which output failed

Screenshot shows the intended behavior: collector continues running and errors are properly surfaced at the exporter level.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

No disruptive user impact expected.

How to test this PR locally

Build and install elastic-agent from this branch with the following configuration:

  id: agent-pernode-debug
  outputs:
    default:
      hosts:
      - ${ES_HOST}
      password: ${ES_PASSWORD}
      type: elasticsearch
      username: ${ES_USERNAME}
      # invalid ssl settings
      ssl.certificate: /etc/client/cert.pem
      ssl.enabled: true
      ssl.key: /etc/client/cert.key
      ssl.key_passphrase: null
      ssl.key_passphrase_path: null
      ssl.verification_mode: none
    system:
      hosts:
      - ${ES_HOST}
      password: ${ES_PASSWORD}
      type: elasticsearch
      username: ${ES_USERNAME}
  secret_references: []
  agent:
    monitoring:
      # enable otel collector for self-monitoring
      _runtime_experimental: otel
      enabled: true
      logs: true
      metrics: true
      namespace: default
      use_output: default
  inputs:
    - data_stream:
        namespace: default
      id: system-logs
      streams:
      - data_stream:
          dataset: system.auth
          type: logs
        exclude_files:
        - \.gz$
        ignore_older: 72h
        multiline:
          match: after
          pattern: ^\s
        paths:
        - /var/log/auth.log*
        - /var/log/secure*
        processors:
        - add_locale: null
        tags:
        - system-auth
      - data_stream:
          dataset: system.syslog
          type: logs
        exclude_files:
        - \.gz$
        ignore_older: 72h
        multiline:
          match: after
          pattern: ^\s
        paths:
        - /var/log/messages*
        - /var/log/syslog*
        - /var/log/system*
        processors:
        - add_locale: null
        tags: null
      type: logfile
      use_output: system
    - data_stream:
        namespace: default
      id: system-metrics
      streams:
      - cpu.metrics:
        - percentages
        - normalized_percentages
        data_stream:
          dataset: system.cpu
          type: metrics
        metricsets:
        - cpu
        period: 10s
      - data_stream:
          dataset: system.diskio
          type: metrics
        diskio.include_devices: null
        metricsets:
        - diskio
        period: 10s
      - data_stream:
          dataset: system.filesystem
          type: metrics
        metricsets:
        - filesystem
        period: 1m
        processors:
        - drop_event.when.regexp:
            system.filesystem.mount_point: ^/(sys|cgroup|proc|dev|etc|host|lib|snap)($|/)
      - data_stream:
          dataset: system.fsstat
          type: metrics
        metricsets:
        - fsstat
        period: 1m
        processors:
        - drop_event.when.regexp:
            system.fsstat.mount_point: ^/(sys|cgroup|proc|dev|etc|host|lib|snap)($|/)
      - condition: ${host.platform} != 'windows'
        data_stream:
          dataset: system.load
          type: metrics
        metricsets:
        - load
        period: 10s
      - data_stream:
          dataset: system.memory
          type: metrics
        metricsets:
        - memory
        period: 10s
      - data_stream:
          dataset: system.network
          type: metrics
        metricsets:
        - network
        network.interfaces: null
        period: 10s
      - data_stream:
          dataset: system.process
          type: metrics
        metricsets:
        - process
        period: 10s
        process.cgroups.enabled: false
        process.cmdline.cache.enabled: true
        process.include_cpu_ticks: false
        process.include_top_n.by_cpu: 5
        process.include_top_n.by_memory: 5
        processes:
        - .*
      - data_stream:
          dataset: system.process_summary
          type: metrics
        metricsets:
        - process_summary
        period: 10s
      - data_stream:
          dataset: system.socket_summary
          type: metrics
        metricsets:
        - socket_summary
        period: 10s
      - data_stream:
          dataset: system.uptime
          type: metrics
        metricsets:
        - uptime
        period: 10s
      type: system/metrics
      use_output: system

Related issues

N/A

@pkoutsovasilis pkoutsovasilis self-assigned this Oct 6, 2025
@pkoutsovasilis pkoutsovasilis added the Team:Elastic-Agent-Control-Plane, skip-changelog, and backport-9.2 labels Oct 6, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the feat/rework_otel_es_output_translation branch from dab0b93 to edde990 Compare October 6, 2025 11:22
@pkoutsovasilis pkoutsovasilis force-pushed the feat/rework_otel_es_output_translation branch from 887033f to 2f276ab Compare October 7, 2025 10:38
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review October 7, 2025 13:03
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner October 7, 2025 13:03
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis changed the title feat: rework elasticsearch output translation to otel config to exclu… feat: utilise continue_on_err in beatsauthextension Oct 7, 2025
@pierrehilbert pierrehilbert requested a review from swiatekm October 7, 2025 13:07
blakerouse
blakerouse previously approved these changes Oct 7, 2025
Contributor

@blakerouse blakerouse left a comment

This change looks good to me. On the call, it was decided to just merge this with the otel bump. CI is green!

@swiatekm
Contributor

swiatekm commented Oct 7, 2025

This change looks good to me. On the call, it was decided to just merge this with the otel bump. CI is green!

We should also bump the version in beats, though.

@pkoutsovasilis
Contributor Author

We should also bump the version in beats, though.

@swiatekm does this need to happen before this PR, or after? How do these get bumped in beats today?

@swiatekm
Contributor

swiatekm commented Oct 7, 2025

We should also bump the version in beats, though.

@swiatekm does this need to happen before this PR, or after? How do these get bumped in beats today?

You bump in opentelemetry-collector-components, then in beats, and then in agent. I think it's fine to merge this PR, but if you want to be thorough, you should also bump component versions to ones which match the otel core version. Does that make sense?

@swiatekm
Contributor

swiatekm commented Oct 7, 2025

The changes look good to me, but I'd really like to have a test verifying that the basic function of this PR works - that is, if we provide invalid TLS configuration to an elasticsearch output used by a beats receiver, the collector still starts and we get the right status. I think an integration test is the right way to do this, but I'd also accept a unit test for the otel manager.

@pkoutsovasilis
Contributor Author

You bump in opentelemetry-collector-components, then in beats, and then in agent. I think it's fine to merge this PR, but if you want to be thorough, you should also bump component versions to ones which match the otel core version. Does that make sense?

I can’t confidently own that bump chain right now. I’ll move this PR to Draft and keep it scoped to the minimal change, letting the existing version-bump mechanisms land first. Once those are in, I’ll rebase and flip back to Ready for Review.

The changes look good to me, but I'd really like to have a test verifying that the basic function of this PR works - that is, if we provide invalid TLS configuration to an elasticsearch output used by a beats receiver, the collector still starts and we get the right status. I think an integration

Agreed. I’ll add a test for that

cc @ebeahan @cmacknz

@pkoutsovasilis pkoutsovasilis marked this pull request as draft October 7, 2025 18:39
Contributor

mergify bot commented Oct 7, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feat/rework_otel_es_output_translation upstream/feat/rework_otel_es_output_translation
git merge upstream/main
git push upstream feat/rework_otel_es_output_translation

@pkoutsovasilis pkoutsovasilis force-pushed the feat/rework_otel_es_output_translation branch from 79ea9a9 to 5255921 Compare October 8, 2025 18:30
@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review October 8, 2025 18:31
@pkoutsovasilis pkoutsovasilis added the backport-8.19 and backport-9.1 labels Oct 9, 2025
@pkoutsovasilis pkoutsovasilis merged commit 0c0dada into elastic:main Oct 9, 2025
24 checks passed
mergify bot pushed a commit that referenced this pull request Oct 9, 2025
* feat: rework elasticsearch output translation to otel config to exclude validation errors

* ci: add integration test

(cherry picked from commit 0c0dada)

# Conflicts:
#	internal/pkg/otel/translate/otelconfig.go
mergify bot pushed a commit that referenced this pull request Oct 9, 2025
* feat: rework elasticsearch output translation to otel config to exclude validation errors

* ci: add integration test

(cherry picked from commit 0c0dada)
mergify bot pushed a commit that referenced this pull request Oct 9, 2025
* feat: rework elasticsearch output translation to otel config to exclude validation errors

* ci: add integration test

(cherry picked from commit 0c0dada)
pkoutsovasilis added a commit that referenced this pull request Oct 9, 2025
* feat: rework elasticsearch output translation to otel config to exclude validation errors

* ci: add integration test

(cherry picked from commit 0c0dada)

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Oct 14, 2025
* feat: rework elasticsearch output translation to otel config to exclude validation errors

* ci: add integration test

(cherry picked from commit 0c0dada)

# Conflicts:
#	internal/pkg/otel/translate/otelconfig.go
pkoutsovasilis added a commit that referenced this pull request Oct 20, 2025
* feat: rework elasticsearch output translation to otel config to exclude validation errors

* ci: add integration test

(cherry picked from commit 0c0dada)

# Conflicts:
#	internal/pkg/otel/translate/otelconfig.go
pkoutsovasilis added a commit that referenced this pull request Oct 20, 2025
…tension (#10443)

* feat: utilise continue_on_err in beatsauthextension (#10343)

* feat: rework elasticsearch output translation to otel config to exclude validation errors

* ci: add integration test

(cherry picked from commit 0c0dada)

# Conflicts:
#	internal/pkg/otel/translate/otelconfig.go

* fix: resolve conflicts

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>

Labels

backport-8.19, backport-9.1, backport-9.2, skip-changelog, Team:Elastic-Agent-Control-Plane


5 participants