diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst
index eb3a8fd7d..be783c74c 100644
--- a/gpu-operator/release-notes.rst
+++ b/gpu-operator/release-notes.rst
@@ -165,7 +165,7 @@ Fixed Issues
 Known Issues
 ------------
 
-* When using cri-o as the container runtime, several of the GPU Operator pods may be stuck in the ``RunContainerError`` state during installation of GPU Operator, upgrade of GPU Operator, or upgrade of the GPU driver daemonset.
+* When using cri-o as the container runtime, several of the GPU Operator pods may be stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state during installation of GPU Operator, upgrade of GPU Operator, or upgrade of the GPU driver daemonset.
   The pods may be in this state for several minutes and restart several times.
   The pods will recover from this state as soon as the container toolkit pod starts running.
 
diff --git a/gpu-operator/troubleshooting.rst b/gpu-operator/troubleshooting.rst
index 73ea364f8..5c2cbb323 100644
--- a/gpu-operator/troubleshooting.rst
+++ b/gpu-operator/troubleshooting.rst
@@ -22,8 +22,17 @@ Troubleshooting the NVIDIA GPU Operator
 
 This page outlines common issues and troubleshooting steps for the NVIDIA GPU Operator.
 
-If you are facing an issue that is not covered by this page, please file an issue in the
-`NVIDIA GPU Operator GitHub repository `_.
+If you are facing a gpu-operator and/or operand(s) issue that is not documented in this guide, it's recommended that you run the ``must-gather`` utility, prepare a bug report, then file an issue in the `NVIDIA GPU Operator GitHub repository `_.
+
+.. code-block:: console
+
+   curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
+   chmod +x must-gather.sh
+   ./must-gather.sh
+
+This utility is used to collect relevant information from your cluster that is needed for diagnosing and debugging issues.
+The final output is an archive file which contains the manifests and logs of all the components managed by gpu-operator.
+
 
 
 **************************************************
@@ -649,16 +658,21 @@ EFI Secure Boot is currently not supported with the GPU Operator
 
 Disable EFI Secure Boot on the server.
 
-File an issue
-=================
+***************************************************************************************
+GPU Operator pods in ``Init:RunContainerError`` or ``Init:CreateContainerError`` state
+***************************************************************************************
+
+.. rubric:: Issue
+   :class: h4
 
-If you are facing a gpu-operator and/or operand(s) issue that is not documented in this guide, you can run the ``must-gather`` utility to prepare a bug report.
+If you are installing, upgrading, or upgrading the GPU driver daemonset to v25.10 or later with CRI-O as the container runtime, you may notice several of the GPU Operator pods are stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state.
 
-.. code-block:: console
+.. rubric:: Root Cause
+   :class: h4
 
-   curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
-   chmod +x must-gather.sh
-   ./must-gather.sh
+Refer to this `GitHub issue `_ for details on the root cause and proposed solution to this known CRI-O limitation.
 
-This utility is used to collect relevant information from your cluster that is needed for diagnosing and debugging issues.
-The final output is an archive file which contains the manifests and logs of all the components managed by gpu-operator.
+.. rubric:: Action
+   :class: h4
+
+The errors will eventually resolve on their own after the driver daemonset is installed or the upgrade is complete.
\ No newline at end of file