2 changes: 1 addition & 1 deletion gpu-operator/release-notes.rst
@@ -165,7 +165,7 @@ Fixed Issues
Known Issues
------------

* When using cri-o as the container runtime, several of the GPU Operator pods may be stuck in the ``RunContainerError`` state during installation of GPU Operator, upgrade of GPU Operator, or upgrade of the GPU driver daemonset.
* When using cri-o as the container runtime, several of the GPU Operator pods may be stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state during installation of GPU Operator, upgrade of GPU Operator, or upgrade of the GPU driver daemonset.
  The pods may be in this state for several minutes and restart several times.
  The pods will recover from this state as soon as the container toolkit pod starts running.

36 changes: 25 additions & 11 deletions gpu-operator/troubleshooting.rst
@@ -22,8 +22,17 @@ Troubleshooting the NVIDIA GPU Operator

This page outlines common issues and troubleshooting steps for the NVIDIA GPU Operator.

If you are facing an issue that is not covered by this page, please file an issue in the
`NVIDIA GPU Operator GitHub repository <https://github.com/NVIDIA/gpu-operator/issues>`_.
If you are facing an issue with the GPU Operator or its operands that is not documented in this guide, it is recommended that you run the ``must-gather`` utility, prepare a bug report, and then file an issue in the `NVIDIA GPU Operator GitHub repository <https://github.com/NVIDIA/gpu-operator/issues>`_.

.. code-block:: console

   curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
   chmod +x must-gather.sh
   ./must-gather.sh

This utility collects the information from your cluster that is needed to diagnose and debug issues.
The final output is an archive file that contains the manifests and logs of all the components managed by the GPU Operator.



**************************************************
@@ -649,16 +658,21 @@ EFI Secure Boot is currently not supported with the GPU Operator

Disable EFI Secure Boot on the server.

File an issue
=================
***************************************************************************************
GPU Operator pods in ``Init:RunContainerError`` or ``Init:CreateContainerError`` state
***************************************************************************************

.. rubric:: Issue
   :class: h4

If you are facing a gpu-operator and/or operand(s) issue that is not documented in this guide, you can run the ``must-gather`` utility to prepare a bug report.
If you are installing or upgrading the GPU Operator, or upgrading the GPU driver daemonset, to v25.10 or later with CRI-O as the container runtime, several of the GPU Operator pods might be stuck in the ``Init:RunContainerError`` or ``Init:CreateContainerError`` state.

.. code-block:: console
.. rubric:: Root Cause
   :class: h4

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
Refer to this `GitHub issue <https://github.com/cri-o/cri-o/issues/9521>`_ for details on the root cause and proposed solution to this known CRI-O limitation.

This utility is used to collect relevant information from your cluster that is needed for diagnosing and debugging issues.
The final output is an archive file which contains the manifests and logs of all the components managed by gpu-operator.
.. rubric:: Action
   :class: h4

The errors will eventually resolve on their own after the driver daemonset is installed or the upgrade is complete.
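
To check whether any pods are still in one of these states, the pod listing can be filtered by its ``STATUS`` column. The following is a minimal sketch, not part of the GPU Operator tooling: ``stuck_pods`` is a hypothetical helper name, and it assumes the default column layout of ``kubectl get pods`` (``NAME``, ``READY``, ``STATUS``, ...).

```shell
# Hypothetical helper (not shipped with the GPU Operator): reads the output of
# `kubectl get pods` on stdin and prints the names of pods whose STATUS column
# is one of the two init-container error states described above.
stuck_pods() {
  # NR > 1 skips the header row; $1 is NAME and $3 is STATUS in the default layout.
  awk 'NR > 1 && ($3 == "Init:RunContainerError" || $3 == "Init:CreateContainerError") { print $1 }'
}
```

For example, ``kubectl get pods -n gpu-operator | stuck_pods`` lists the affected pods, assuming the operator is installed in the ``gpu-operator`` namespace; adjust the namespace to match your installation. An empty result suggests that the pods have recovered.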