@MantavyaDh MantavyaDh commented Oct 16, 2025

Context

When docker exec commands fail inside containers, users receive cryptic exit codes without diagnostic information, making it difficult to identify whether the failure was due to OOM, missing commands, permissions, or other causes.

Work Item : AB#2322289


Description

This PR implements docker exec failure diagnostics that collect evidence (container state, resource limits, logs, daemon health) and provide platform-appropriate analysis. For Linux containers, which use standard exit codes, the diagnostics report a definitive diagnosis. For Windows containers, where non-standard exit codes make automatic diagnosis unreliable, the diagnostics present the full evidence instead.
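
For readers unfamiliar with how container state can be read, here is a minimal sketch of collecting it with a single `docker inspect --format` call and parsing the result. It is not the code in this PR: `ContainerEvidence`, `InspectParser`, and the `|` separator are hypothetical names and choices made for illustration. Only the inspect fields (`.State.Running`, `.State.Status`, `.State.ExitCode`, `.State.OOMKilled`, `.HostConfig.Memory`) come from Docker itself.

```csharp
using System;
using System.Globalization;

// Hypothetical container for the values the trace logs report
// (Running, Status, ExitCode, OOMKilled, MemoryLimit).
public sealed record ContainerEvidence(
    bool Running, string Status, int ExitCode, bool OomKilled, long MemoryLimitBytes);

public static class InspectParser
{
    // Go-template fields of `docker inspect`; the '|' separator is an arbitrary choice.
    public const string Format =
        "{{.State.Running}}|{{.State.Status}}|{{.State.ExitCode}}|{{.State.OOMKilled}}|{{.HostConfig.Memory}}";

    // Builds the argument string passed to the docker executable.
    public static string BuildArguments(string containerId) =>
        $"inspect --format \"{Format}\" {containerId}";

    // Parses one line of output produced with the format above.
    public static ContainerEvidence Parse(string line)
    {
        string[] parts = line.Trim().Split('|');
        if (parts.Length != 5)
        {
            throw new FormatException($"Expected 5 fields but got {parts.Length}.");
        }

        return new ContainerEvidence(
            Running: bool.Parse(parts[0]),
            Status: parts[1],
            ExitCode: int.Parse(parts[2], CultureInfo.InvariantCulture),
            OomKilled: bool.Parse(parts[3]),
            // Docker reports 0 here when no memory limit is set.
            MemoryLimitBytes: long.Parse(parts[4], CultureInfo.InvariantCulture));
    }
}
```

Calling `InspectParser.Parse("false|exited|137|false|0")` would yield the same values as the "Container state collected" and "Resource state collected" lines in the trace logs below, with a memory limit of 0 standing for "unlimited".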


Risk Assessment (Low / Medium / High)

Low
No impact on working pipelines - diagnostics only run when exec already fails
No core logic changes - purely additive diagnostic information


Unit Tests Added or Updated (Yes / No)

No


Additional Testing Performed

Local Testing on Private Org


Change Behind Feature Flag (Yes / No)

No


Tech Design / Approach

Yes


Documentation Changes Required (Yes/No)

No


Logging Added/Updated (Yes/No)

Yes


Rollback Scenario and Process (Yes/No)

  • Rollback plan is documented.

Dependency Impact Assessed and Regression Tested (Yes/No)

Yes

Trace Logs

Phase 1: Exception Detection and Initial Logging (19:13:02Z)

[2025-10-28 19:13:14Z INFO ContainerStepHost] Docker exec diagnostics enabled, collecting diagnostics
[2025-10-28 19:13:14Z VERB ContainerDiagnosticsManager] Entering CollectDockerExecFailureDiagnosticsAsync
[2025-10-28 19:13:14Z ERR  ContainerDiagnosticsManager] Docker exec failure diagnostics started
[2025-10-28 19:13:14Z ERR  ContainerDiagnosticsManager] Exception: ProcessExitCodeException: Exit code 1 returned from process: file name 'C:\Program Files\Docker\Docker\resources\bin\docker.EXE', arguments 'exec -i   783e59996df5856494f7a42f69f930fbafc12f58c92974d3de590c7a813d5832 node /__w/_temp/containerHandlerInvoker.js'.
[2025-10-28 19:13:14Z ERR  ContainerDiagnosticsManager] Failed command: C:\Program Files\Docker\Docker\resources\bin\docker.EXE exec -i   783e59996df5856494f7a42f69f930fbafc12f58c92974d3de590c7a813d5832 node /__w/_temp/containerHandlerInvoker.js
[2025-10-28 19:13:14Z ERR  ContainerDiagnosticsManager] Exit code: 1
[2025-10-28 19:13:14Z INFO ContainerDiagnosticsManager] Container ID: 783e59996df5856494f7a42f69f930fbafc12f58c92974d3de590c7a813d5832
[2025-10-28 19:13:14Z INFO ContainerDiagnosticsManager] Collecting system information
[2025-10-28 19:13:14Z INFO ContainerDiagnosticsManager] Platform: Microsoft Windows 10.0.26100
[2025-10-28 19:13:14Z INFO ContainerDiagnosticsManager] Architecture: X64
[2025-10-28 19:13:14Z INFO ContainerDiagnosticsManager] Process Architecture: X64

Phase 2: Collect Basic System Information (19:13:02Z - 19:13:06Z)

[2025-10-28 19:13:17Z INFO ContainerDiagnosticsManager] System Information: Exit Code 0
[2025-10-28 19:13:17Z INFO ContainerDiagnosticsManager]   Host Name:                     CPC-mdhin-AEL40
[2025-10-28 19:13:17Z INFO ContainerDiagnosticsManager]   OS Name:                       Microsoft Windows 11 Enterprise
[2025-10-28 19:13:17Z INFO ContainerDiagnosticsManager]   OS Version:                    10.0.26100 N/A Build 26100
[2025-10-28 19:13:17Z INFO ContainerDiagnosticsManager]   OS Manufacturer:               Microsoft Corporation
[2025-10-28 19:13:17Z INFO ContainerDiagnosticsManager]   OS Configuration:              Standalone Workstation
[2025-10-28 19:13:17Z INFO ContainerDiagnosticsManager]   ... (56 more lines truncated)
[2025-10-28 19:13:17Z VERB ContainerDiagnosticsManager] Entering RunDiagnostics
[2025-10-28 19:13:17Z INFO ContainerDiagnosticsManager] Starting diagnostic evidence collection
[2025-10-28 19:13:17Z ERR  ContainerDiagnosticsManager] Docker exec failed with exit code: 1
[2025-10-28 19:13:17Z ERR  ContainerDiagnosticsManager] Failed command: docker exec -i   783e59996df5856494f7a42f69f930fbafc12f58c92974d3de590c7a813d5832 node /__w/_temp/containerHandlerInvoker.js

Phase 3: Container State Check (19:13:06Z)

[2025-10-28 19:13:17Z INFO ContainerDiagnosticsManager] Phase 1: Collecting diagnostic evidence
[2025-10-28 19:13:17Z INFO ContainerDiagnosticsManager] Checking container state and lifecycle
[2025-10-28 19:13:17Z VERB ProcessInvokerWrapper] Entering Initialize
... (Process Invoker logs omitted)
[2025-10-28 19:13:18Z INFO ContainerDiagnosticsManager] Container state collected: Running=False, Status=exited, ExitCode=137, OS=linux


Phase 5: Resource State Check (19:13:06Z - 19:13:07Z)

[2025-10-28 19:13:18Z INFO ContainerDiagnosticsManager] Checking resource constraints and OOM status
... (Process Invoker logs omitted)
[2025-10-28 19:13:18Z INFO ContainerDiagnosticsManager] Resource state collected: OOMKilled=False, MemoryLimit=unlimited, LogDriver=json-file

Phase 6: Container Logs Retrieval (19:13:07Z)

[2025-10-28 19:13:18Z INFO ContainerDiagnosticsManager] Retrieving container logs from time of failure
[2025-10-28 19:13:18Z INFO ContainerDiagnosticsManager] Log Configuration: Driver=json-file, 
... (Process Invoker logs omitted)
[2025-10-28 19:13:18Z INFO ContainerDiagnosticsManager] Container logs are empty. No output was written to stdout or stderr.
[2025-10-28 19:13:18Z INFO ContainerDiagnosticsManager] Possible reasons: Application did not write to stdout/stderr, immediate crash, or output buffering.

Phase 7: Docker Daemon Health Check (19:13:07Z)

[2025-10-28 19:13:18Z INFO ContainerDiagnosticsManager] Checking Docker daemon health
[2025-10-28 19:13:18Z INFO ContainerDiagnosticsManager] Testing Docker daemon connectivity...
[2025-10-28 19:13:18Z VERB ProcessInvokerWrapper] Entering Initialize
... (Process Invoker logs omitted)
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager] Docker Version (Client & Server): Exit Code 0
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]   Client:
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]    Version:           28.5.1
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]    API version:       1.51
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]    Go version:        go1.24.8
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]    Git commit:        e180ab8
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]    Built:             Wed Oct  8 12:19:16 2025
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]    OS/Arch:           windows/amd64
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]    Context:           desktop-linux
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]   Server: Docker Desktop 4.48.0 (207573)
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]    Engine:
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]     Version:          28.5.1
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]     API version:      1.51 (minimum version 1.24)
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]     Go version:       go1.24.8
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]     Git commit:       f8215cc
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]     Built:            Wed Oct  8 12:17:24 2025
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]   ... (11 more lines truncated)
... (Process Invoker logs omitted)
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager] Docker Daemon Status: Exit Code 0
[2025-10-28 19:13:19Z INFO ContainerDiagnosticsManager]   ServerVersion=28.5.1 ContainersRunning=0 MemTotal=33632358400
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager] Docker System Disk Usage: Exit Code 0
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager]   TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager]   Images          2         1         1.624GB   181MB (11%)
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager]   Containers      1         0         32.77kB   32.77kB (100%)
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager]   Local Volumes   0         0         0B        0B
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager]   Build Cache     0         0         0B        0B

Phase 9: Root Cause Analysis (19:13:08Z)

[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager] Phase 2: Analyzing evidence to determine root cause
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager] ROOT CAUSE: CONTAINER NOT RUNNING / EXITED
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager]   Container running: FALSE
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager]   Container status: exited
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager]   Container exit code: 137
[2025-10-28 19:13:20Z INFO ContainerDiagnosticsManager]   Docker exec exit code: 1
[2025-10-28 19:13:20Z ERR  ContainerDiagnosticsManager] Docker exec failure diagnostics completed

Other Exit Code Scenarios

[2025-10-28 19:12:01Z INFO ContainerDiagnosticsManager] Phase 2: Analyzing evidence to determine root cause
[2025-10-28 19:12:01Z INFO ContainerDiagnosticsManager] ROOT CAUSE: OUT OF MEMORY
[2025-10-28 19:12:01Z INFO ContainerDiagnosticsManager]   OOMKilled flag: TRUE 
[2025-10-28 19:12:01Z INFO ContainerDiagnosticsManager]   Memory limit: 25 MB
[2025-10-28 19:12:01Z INFO ContainerDiagnosticsManager]   Docker exec exit code: 137
[2025-10-28 19:12:01Z INFO ContainerDiagnosticsManager]   Container OS: linux
[2025-10-28 19:12:01Z INFO ContainerDiagnosticsManager]   The container exceeded its memory limit and was terminated by the system OOM (Out-Of-Memory) killer. Exit codes vary by OS:
[2025-10-28 19:12:01Z ERR  ContainerDiagnosticsManager] Docker exec failure diagnostics completed

[2025-10-28 19:11:19Z INFO ContainerDiagnosticsManager] Phase 2: Analyzing evidence to determine root cause
[2025-10-28 19:11:19Z INFO ContainerDiagnosticsManager]   Container running: TRUE
[2025-10-28 19:11:19Z INFO ContainerDiagnosticsManager]   Container status: running
[2025-10-28 19:11:19Z INFO ContainerDiagnosticsManager] Likely Cause: COMMAND NOT FOUND
[2025-10-28 19:11:19Z INFO ContainerDiagnosticsManager]  Exit code 127 typically indicates the command or executable was not found in the container.
[2025-10-28 19:11:19Z ERR  ContainerDiagnosticsManager] Docker exec failure diagnostics completed

[2025-10-28 18:56:49Z INFO ContainerDiagnosticsManager] Phase 2: Analyzing evidence to determine root cause
[2025-10-28 18:56:49Z INFO ContainerDiagnosticsManager] LIKELY CAUSE: PROCESS CANCELLATION OR TIMEOUT
[2025-10-28 18:56:49Z INFO ContainerDiagnosticsManager]   Exit code: NULL (no exit code returned)
[2025-10-28 18:56:49Z INFO ContainerDiagnosticsManager]   Container running: True
[2025-10-28 18:56:49Z INFO ContainerDiagnosticsManager]   Container status: running
[2025-10-28 18:56:49Z ERR  ContainerDiagnosticsManager] Docker exec failure diagnostics completed
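
The four scenarios traced above suggest a decision order roughly like the one sketched here. This is a minimal sketch, not the analysis code in this PR: `ExecFailureClassifier` and `Classify` are hypothetical names, the ordering of the checks is an assumption inferred from the logs, and the label for exit code 126 is invented for illustration (126 is the usual shell convention for "found but not executable"). Exit-code heuristics are applied only to Linux containers, in line with the description.

```csharp
using System;

public static class ExecFailureClassifier
{
    // Returns a headline similar to the ones shown in the trace logs.
    public static string Classify(
        int? execExitCode, bool containerRunning, bool oomKilled, string containerOs)
    {
        // No exit code at all: the process was cancelled or timed out.
        if (execExitCode is null)
        {
            return "LIKELY CAUSE: PROCESS CANCELLATION OR TIMEOUT";
        }

        // The OOMKilled flag from the container state is the strongest signal.
        if (oomKilled)
        {
            return "ROOT CAUSE: OUT OF MEMORY";
        }

        // exec cannot succeed against a container that has already exited.
        if (!containerRunning)
        {
            return "ROOT CAUSE: CONTAINER NOT RUNNING / EXITED";
        }

        // Standard shell exit codes are only trusted for Linux containers.
        if (string.Equals(containerOs, "linux", StringComparison.OrdinalIgnoreCase))
        {
            return execExitCode.Value switch
            {
                126 => "LIKELY CAUSE: COMMAND NOT EXECUTABLE", // hypothetical label
                127 => "LIKELY CAUSE: COMMAND NOT FOUND",
                _ => "UNDETERMINED: SEE COLLECTED EVIDENCE",   // hypothetical label
            };
        }

        // Windows containers: present evidence without a diagnosis.
        return "UNDETERMINED: SEE COLLECTED EVIDENCE";
    }
}
```

With this ordering, the first trace (exec exit code 1, container exited, OOMKilled=False) maps to "CONTAINER NOT RUNNING / EXITED", and the 127 trace maps to "COMMAND NOT FOUND" only because the container was still running.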

@MantavyaDh MantavyaDh requested review from a team as code owners October 16, 2025 19:43
@MantavyaDh (Contributor Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@MantavyaDh MantavyaDh added the internal and Containers labels Oct 16, 2025
@MantavyaDh (Contributor Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

string containerId);
}

public class ContainerDiagnosticsManager : AgentService, IContainerDiagnosticsManager

@tarunramsinghani (Contributor) commented Oct 24, 2025

We will need L0/L1 tests for this class, please. We will also need E2E tests, in the form of canary tests, for this feature.

Contributor Author

Added L0 tests; will add canary tests in a new PR.

var resourceState = await GetResourceState(dockerManager, containerId, trace);

trace.Info("Retrieving container logs from time of failure");
await GetContainerLogs(dockerManager, containerId, trace, resourceState);

Contributor

These docker logs would be useful for container initialization failure scenarios as well. Can we add them there too?

@MantavyaDh (Contributor Author) commented Oct 29, 2025

Container initialization failures already have basic log collection. These diagnostics are designed specifically for docker exec failures during task execution, so I don't think the two overlap.

Comment on lines +160 to +174
trace.Info("Checking PATH and available commands...");
if (containerOS == "windows")
{
    // Check PATH and common commands in Windows container
    await ExecuteDiagnosticCommand(dockerManager.DockerPath,
        $"exec {containerId} cmd /c \"echo PATH=%PATH% & where node 2^>nul ^|^| echo node not found & where npm 2^>nul ^|^| echo npm not found & where powershell 2^>nul ^|^| echo powershell not found\"",
        trace, "Windows PATH and Command Availability");
}
else
{
    // Check PATH and common commands in Linux container
    await ExecuteDiagnosticCommand(dockerManager.DockerPath,
        $"exec {containerId} sh -c \"echo PATH=$PATH; which node || echo 'node: not found'; which npm || echo 'npm: not found'; which bash || echo 'bash: not found'; which sh || echo 'sh: not found'\"",
        trace, "Linux PATH and Command Availability", maxLines: 10);
}

Contributor

Is 127 the exit code for command not found? If so, doesn't the output already say which command was not found? Since this only checks for certain tools (node, powershell, sh, etc.), it is not enough.

I would like to avoid running docker exec just for this. If there is a better way to get this information, let's find it; otherwise we can skip this, IMO.

BTW, please attach what the current logs look like and what the new logs look like for the case where the exit code is 127.

Contributor Author

The Azure Pipelines agent uses node to run task handlers inside containers. When docker exec fails with exit code 127, the existing log only shows the full docker command, e.g.:

Failed command: C:\Program Files\Docker\Docker\resources\bin\docker.EXE exec -i   783e59996df5856494f7a42f69f930fbafc12f58c92974d3de590c7a813d5832 node /__w/_temp/containerHandlerInvoker.js

These diagnostics check for the commands (node, bash, etc.) needed to run the task inside the container.
If a command used by the task itself is missing, that is handled using ##vso logging.
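
To make the idea of checking a fixed set of required commands concrete, here is a minimal sketch that generates the Linux probe from a list instead of a hand-written string. It is an illustration only, not the code in this PR: `CommandProbe` and `BuildLinuxProbe` are hypothetical names, and the sketch uses the POSIX `command -v` built-in in place of `which`, which is a substitution on my part. Generating each message from the command name also avoids copy-paste slips in the "not found" text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CommandProbe
{
    // Builds the script passed to `sh -c` inside the container.
    public static string BuildLinuxProbe(IEnumerable<string> commands)
    {
        var checks = commands.Select(name =>
        {
            // Reject anything that is not a plain command name, so the
            // generated script cannot be altered by the input.
            if (name.Length == 0 ||
                !name.All(ch => char.IsLetterOrDigit(ch) || ch == '-' || ch == '_' || ch == '.'))
            {
                throw new ArgumentException($"Invalid command name: '{name}'");
            }

            // `command -v` prints the resolved path, or fails if the name is unknown.
            return $"command -v {name} || echo '{name}: not found'";
        });

        return "echo PATH=$PATH; " + string.Join("; ", checks);
    }
}
```

For example, `CommandProbe.BuildLinuxProbe(new[] { "node", "sh" })` produces a script that echoes PATH and then probes `node` and `sh` in turn. The caller would still need to quote the script when embedding it in the `docker exec ... sh -c` argument string.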

@MantavyaDh (Contributor Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
