Skip to content

Conversation

swiatekm
Copy link
Contributor

@swiatekm swiatekm commented Oct 1, 2025

What does this PR do?

Uses a random port for the otel collector's Prometheus endpoint. Right now this is hardcoded at the default.

Normally, this port needs to be known during config generation (so we can add the right self-monitoring configuration), so it needs to be determined before the coordinator starts. In this PR, we instead put it into an environment variable and use the otel collector's ability to load configuration values from the variable. As a result, we can defer determining the actual port as late as possible, and even use a different port on each configuration reload, allowing us to recover from port binding conflicts.

This becomes a bit awkward with the embedded collector, where we need to call SetEnv, always an anti-pattern. But we eventually expect to retire that mode of execution, and it's not the default anymore, so it should be fine.

In the process, I'm also allowing both the metrics port and the healthcheck port to be passed into the otel manager as parameters. This is preparation for making them configurable in a follow-up.

Why is it important?

The port shouldn't be hardcoded. In the event of a conflict, the otel collector can't start.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

How to test this PR locally

Build the agent locally, run it with otel self-monitoring, generate diagnostics, then check the configuration.

Related issues

Copy link
Contributor

mergify bot commented Oct 1, 2025

This pull request does not have a backport label. Could you fix it @swiatekm? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@swiatekm swiatekm added skip-changelog bug Something isn't working labels Oct 1, 2025
@cmacknz cmacknz added the backport-9.2 Automated backport to the 9.2 branch label Oct 1, 2025
@swiatekm swiatekm added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Oct 2, 2025
@swiatekm swiatekm marked this pull request as ready for review October 2, 2025 08:35
@swiatekm swiatekm requested a review from a team as a code owner October 2, 2025 08:35
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@swiatekm swiatekm requested a review from cmacknz October 2, 2025 12:31
@leehinman
Copy link
Contributor

I'm probably paranoid from the npipe permission issue, but do we have a test that starts and stops the collector multiple times in rapid succession? Since we are re-using the same port number it might be worth double checking.

@swiatekm swiatekm marked this pull request as draft October 2, 2025 14:56
@swiatekm
Copy link
Contributor Author

swiatekm commented Oct 2, 2025

I'm going to try an approach using an environment variable, so I can inject this much later in the pipeline. Will undraft once I know if that's feasible.

@swiatekm swiatekm marked this pull request as ready for review October 6, 2025 10:15
@swiatekm swiatekm requested a review from cmacknz October 6, 2025 18:23
@swiatekm swiatekm requested a review from ycombinator October 7, 2025 11:37
ycombinator
ycombinator previously approved these changes Oct 7, 2025
Copy link
Contributor

@ycombinator ycombinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending resolution of #10240 (comment)

blakerouse
blakerouse previously approved these changes Oct 7, 2025
Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, glad the environment variable substitution worked out.

pkoutsovasilis
pkoutsovasilis previously approved these changes Oct 7, 2025
@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @swiatekm

@swiatekm swiatekm merged commit 5cb8c31 into main Oct 7, 2025
21 checks passed
@swiatekm swiatekm deleted the fix/collector-random-tcp-port branch October 7, 2025 19:10
mergify bot pushed a commit that referenced this pull request Oct 7, 2025
* Use a random port for otel collector monitoring endpoint

* mage notice

* Use an env variable

* Fix linter warnings

* Fix random port determination for the embedded otel collector

* Drop the ports functions from the utils package

* fixup! Fix random port determination for the embedded otel collector

* Ensure no port conflicts

* Clean up port assignment

* Verify that returned ports are unique

* Add port conflict test

* Fix docstring typo

* More comments

* Add comments explaining the port conflict test

* Update internal/pkg/otel/manager/execution_subprocess.go

Co-authored-by: Blake Rouse <[email protected]>

---------

Co-authored-by: Blake Rouse <[email protected]>
(cherry picked from commit 5cb8c31)
swiatekm added a commit that referenced this pull request Oct 8, 2025
…0390)

* Use a random port for otel collector monitoring endpoint

* mage notice

* Use an env variable

* Fix linter warnings

* Fix random port determination for the embedded otel collector

* Drop the ports functions from the utils package

* fixup! Fix random port determination for the embedded otel collector

* Ensure no port conflicts

* Clean up port assignment

* Verify that returned ports are unique

* Add port conflict test

* Fix docstring typo

* More comments

* Add comments explaining the port conflict test

* Update internal/pkg/otel/manager/execution_subprocess.go



---------


(cherry picked from commit 5cb8c31)

Co-authored-by: Mikołaj Świątek <[email protected]>
Co-authored-by: Blake Rouse <[email protected]>
@ebeahan
Copy link
Member

ebeahan commented Oct 13, 2025

@swiatekm there's a question about backporting this to 8.19: #10449 (comment). Please double check what branches this change should land in.

@swiatekm swiatekm added the backport-8.19 Automated backport to the 8.19 branch label Oct 13, 2025
mergify bot pushed a commit that referenced this pull request Oct 13, 2025
* Use a random port for otel collector monitoring endpoint

* mage notice

* Use an env variable

* Fix linter warnings

* Fix random port determination for the embedded otel collector

* Drop the ports functions from the utils package

* fixup! Fix random port determination for the embedded otel collector

* Ensure no port conflicts

* Clean up port assignment

* Verify that returned ports are unique

* Add port conflict test

* Fix docstring typo

* More comments

* Add comments explaining the port conflict test

* Update internal/pkg/otel/manager/execution_subprocess.go

Co-authored-by: Blake Rouse <[email protected]>

---------

Co-authored-by: Blake Rouse <[email protected]>
(cherry picked from commit 5cb8c31)

# Conflicts:
#	NOTICE-fips.txt
#	NOTICE.txt
#	go.mod
#	internal/pkg/otel/manager/execution_embedded.go
#	internal/pkg/otel/manager/manager.go
#	internal/pkg/otel/manager/manager_test.go
pierrehilbert pushed a commit that referenced this pull request Oct 14, 2025
…ing endpoint (#10520)

* Use a random port for otel collector monitoring endpoint (#10240)

* Use a random port for otel collector monitoring endpoint

* mage notice

* Use an env variable

* Fix linter warnings

* Fix random port determination for the embedded otel collector

* Drop the ports functions from the utils package

* fixup! Fix random port determination for the embedded otel collector

* Ensure no port conflicts

* Clean up port assignment

* Verify that returned ports are unique

* Add port conflict test

* Fix docstring typo

* More comments

* Add comments explaining the port conflict test

* Update internal/pkg/otel/manager/execution_subprocess.go

Co-authored-by: Blake Rouse <[email protected]>

---------

Co-authored-by: Blake Rouse <[email protected]>
(cherry picked from commit 5cb8c31)

# Conflicts:
#	NOTICE-fips.txt
#	NOTICE.txt
#	go.mod
#	internal/pkg/otel/manager/execution_embedded.go
#	internal/pkg/otel/manager/manager.go
#	internal/pkg/otel/manager/manager_test.go

* Fix conflicts

---------

Co-authored-by: Mikołaj Świątek <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-8.19 Automated backport to the 8.19 branch backport-9.2 Automated backport to the 9.2 branch bug Something isn't working skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants