Adding docker exec diagnostic logs to ContainerStepHost #5356
base: master
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
    string containerId);
}

public class ContainerDiagnosticsManager : AgentService, IContainerDiagnosticsManager
We will need L0/L1 tests for this class, please. We will also need E2E tests, in the form of canary tests, for this feature.
Added L0 tests; will add canary tests in a new PR.
var resourceState = await GetResourceState(dockerManager, containerId, trace);

trace.Info("Retrieving container logs from time of failure");
await GetContainerLogs(dockerManager, containerId, trace, resourceState);
These docker logs will be useful for docker initialization failure scenarios as well. Can we add them there too?
Container initialization failures already have basic log collection. These diagnostics are designed specifically for docker exec failures during task execution, so I don't think they overlap.
trace.Info("Checking PATH and available commands...");
if (containerOS == "windows")
{
    // Check PATH and common commands in Windows container
    await ExecuteDiagnosticCommand(dockerManager.DockerPath,
        $"exec {containerId} cmd /c \"echo PATH=%PATH% & where node 2^>nul ^|^| echo node not found & where npm 2^>nul ^|^| echo npm not found & where powershell 2^>nul ^|^| echo powershell not found\"",
        trace, "Windows PATH and Command Availability");
}
else
{
    // Check PATH and common commands in Linux container
    await ExecuteDiagnosticCommand(dockerManager.DockerPath,
        $"exec {containerId} sh -c \"echo PATH=$PATH; which node || echo 'node: not found'; which npm || echo 'npm: not found'; which bash || echo 'bash: not found'; which sh || echo 'sh: not found'\"",
        trace, "Linux PATH and Command Availability", maxLines: 10);
}
Is 127 the exit code for command not found? If so, doesn't the output already say that the command was not found? Since this only checks for certain tools (node, powershell, sh, etc.), it is not enough.
I would like to avoid doing a docker exec just for this. If there is a better way to get this information, let's find it; otherwise we can skip this, IMO.
BTW, please attach what the current logs look like and what the new logs look like for the case where the error code is 127.
The Azure Pipelines agent uses node to run task handlers inside containers. When docker exec fails with exit code 127, the existing log shows only the full docker command, e.g.:
Failed command: C:\Program Files\Docker\Docker\resources\bin\docker.EXE exec -i 783e59996df5856494f7a42f69f930fbafc12f58c92974d3de590c7a813d5832 node /__w/_temp/containerHandlerInvoker.js
This diagnostic checks for the commands (node, bash, etc.) needed to run the task inside the container.
If a command is missing within the task itself, that is handled using ##vso logging.
Context
When docker exec commands fail inside containers, users receive cryptic exit codes without diagnostic information, making it difficult to identify whether the failure was due to OOM, missing commands, permissions, or other causes.
Work Item : AB#2322289
Description
This PR implements comprehensive docker exec failure diagnostics that collect evidence (container state, resource limits, logs, daemon health) and provide platform-appropriate analysis: a definitive diagnosis for Linux containers, which use standard exit codes, and a full presentation of the evidence for Windows containers, where automatic diagnosis is unreliable because exit codes are non-standard.
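To illustrate what "standard exit codes" means for the Linux case, here is a minimal sketch of an exit-code lookup. `ClassifyExitCode` is a hypothetical helper, not code from this PR, and the messages are illustrative. The mapping follows the usual shell and Docker conventions: 126 for a command that cannot be invoked, 127 for a command that cannot be found, and 128 + N for termination by signal N.

```csharp
using System;

// Hypothetical helper (not the PR's actual code): maps a Linux docker exec
// exit code to a likely cause, using conventional shell/Docker meanings.
static string ClassifyExitCode(int exitCode)
{
    switch (exitCode)
    {
        case 0: return "success";
        case 125: return "docker command itself failed";
        case 126: return "command found but could not be invoked";
        case 127: return "command not found in the container";
        case 137: return "killed by SIGKILL (128 + 9), often the OOM killer";
        case 143: return "terminated by SIGTERM (128 + 15)";
    }
    // Other values above 128 conventionally mean "terminated by signal N".
    if (exitCode > 128 && exitCode < 256)
    {
        return $"terminated by signal {exitCode - 128}";
    }
    return "application-specific failure";
}

Console.WriteLine(ClassifyExitCode(127));
```

A lookup like this is only a first guess. Exit code 137, for example, would still need to be confirmed against the container's state and resource evidence before reporting OOM.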
Risk Assessment (Low / Medium / High)
Low
No impact on working pipelines - diagnostics only run when exec already fails
No core logic changes - purely additive diagnostic information
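The "purely additive" property can be sketched as a guard around the exec call. This is an assumed shape, not the PR's implementation: `RunWithDiagnostics` and both delegates are hypothetical names, and swallowing exceptions thrown by the diagnostics step is an assumption made here so that the original exit code is always what the caller sees.

```csharp
using System;

// Hypothetical wrapper (names are illustrative): diagnostics run only when
// exec has already failed, and a diagnostics failure never changes the
// exit code returned to the caller.
static int RunWithDiagnostics(Func<int> exec, Action<int> collectDiagnostics)
{
    int exitCode = exec();
    if (exitCode != 0)
    {
        try
        {
            collectDiagnostics(exitCode);
        }
        catch (Exception ex)
        {
            // Assumption: diagnostics are best-effort, so log and continue.
            Console.WriteLine($"Diagnostics failed: {ex.Message}");
        }
    }
    return exitCode;
}

int result = RunWithDiagnostics(
    () => 127,
    failedCode => Console.WriteLine($"Collecting diagnostics for exit code {failedCode}"));
Console.WriteLine($"Exit code: {result}");
```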
Unit Tests Added or Updated (Yes / No)
No
Additional Testing Performed
Local Testing on Private Org
Change Behind Feature Flag (Yes / No)
No
Tech Design / Approach
Yes
Documentation Changes Required (Yes/No)
No
Logging Added/Updated (Yes/No)
Yes
Rollback Scenario and Process (Yes/No)
Dependency Impact Assessed and Regression Tested (Yes/No)
Yes
Trace Logs
Phase 1: Exception Detection and Initial Logging (19:13:02Z)
Phase 2: Phase 1 - Collect Basic System Information (19:13:02Z - 19:13:06Z)
Phase 3: Container State Check (19:13:06Z)
Phase 5: Resource State Check (19:13:06Z - 19:13:07Z)
Phase 6: Container Logs Retrieval (19:13:07Z)
Phase 7: Docker Daemon Health Check (19:13:07Z)
Phase 9: Root Cause Analysis (19:13:08Z)
Other Exit Code Scenarios