
NVIDIA pods for driver, device plugin, DCGM exporter, feature discovery in crash loop after host reboot #197

@tusharrobin

Description

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node? => No, CentOS 7.9
  • [x] Are you running Kubernetes v1.13+? => v1.20.5
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? => Docker 19.03.6
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes? (a quick check is sketched after this list)
  • [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? => Yes, it is deployed via the GPU operator.
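
For the unanswered kernel-module question above, a quick way to check on the node (and, hedged, to load the modules manually if they are missing rather than built into the kernel):

    $ lsmod | grep -E 'i2c_core|ipmi_msghandler'
    $ # if not listed and not built in, load them explicitly
    $ sudo modprobe i2c_core
    $ sudo modprobe ipmi_msghandler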

1. Issue or feature description

On CentOS 7.9, the GPU operator installed successfully and all pods became ready, but after the node was rebooted the pods went into a CrashLoopBackOff state.

gpu-operator-resources gpu-feature-discovery-qcdp4 0/1 Init:CrashLoopBackOff 10 44m
gpu-operator-resources nvidia-container-toolkit-daemonset-rkg4b 0/1 Init:CrashLoopBackOff 10 44m
gpu-operator-resources nvidia-cuda-validator-ssgbh 0/1 Completed 0 42m
gpu-operator-resources nvidia-dcgm-exporter-kj45b 0/1 Init:CrashLoopBackOff 10 44m
gpu-operator-resources nvidia-device-plugin-daemonset-zdc4w 0/1 Init:CrashLoopBackOff 11 44m
gpu-operator-resources nvidia-device-plugin-validator-qbhtk 0/1 Completed 0 42m
gpu-operator-resources nvidia-driver-daemonset-svsmn 0/1 CrashLoopBackOff 10 44m
gpu-operator-resources nvidia-operator-validator-j9m2z 0/1 Init:CrashLoopBackOff 11 44m

2. Steps to reproduce the issue

  1. Install Kubernetes.
  2. Install the GPU operator and make sure all pods are running. Test a sample GPU pod to confirm everything works (see the sketch after this list).
  3. Reboot the node.
  4. Pods go into CrashLoopBackOff.
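
A minimal sketch of steps 2-4 as shell commands, assuming a Helm-based install of the operator (the Helm repository URL, chart name, and CUDA test image are assumptions and may differ from the setup in this report):

    $ helm repo add nvidia https://nvidia.github.io/gpu-operator && helm repo update
    $ helm install --wait --generate-name nvidia/gpu-operator
    $ kubectl get pods -n gpu-operator-resources     # wait until everything is Running/Completed
    $ # GPU smoke test: request one GPU and run nvidia-smi in a throwaway pod
    $ cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda:11.0-base
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
    EOF
    $ kubectl logs gpu-smoke-test                    # should print the nvidia-smi table
    $ sudo reboot
    $ kubectl get pods -n gpu-operator-resources     # after the reboot the pods show Init:CrashLoopBackOff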

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
  • kubernetes daemonset status: kubectl get ds --all-namespaces
  • If a pod/ds is in an error or pending state: kubectl describe pod -n NAMESPACE POD_NAME

    $ kubectl describe pod -n gpu-operator-resources nvidia-driver-daemonset-svsmn

Warning BackOff 3m16s (x143 over 34m) kubelet Back-off restarting failed container

  Reason:       ContainerCannotRun
  **Message:      OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown**

  (A hedged host-level check of this prestart-hook failure is sketched at the end of this section.)
  • If a pod/ds is in an error or pending state: kubectl logs -n NAMESPACE POD_NAME

  • Output of running a container on the GPU machine: docker run -it alpine echo foo

    $ docker run -it alpine echo foo
    foo

  • Docker configuration file: cat /etc/docker/daemon.json

  • Docker runtime configuration: docker info | grep -i runtime

    $ docker info | grep -i runtime
    Runtimes: nvidia runc
    Default Runtime: nvidia

  • NVIDIA shared directory: ls -la /run/nvidia
  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
  • NVIDIA driver directory: ls -la /run/nvidia/driver
  • kubelet logs journalctl -u kubelet > kubelet.logs
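
The failing prestart hook in the "describe pod" message above is the NVIDIA runtime hook invoking nvidia-container-cli. The alpine test only shows that plain containers start; a hedged way to exercise the GPU hook path directly on the host and to confirm that the containerized driver's root at /run/nvidia/driver is populated after the reboot (the CUDA image tag is an assumption):

    $ # goes through the nvidia runtime's prestart hook and should fail the same way if the driver root is broken
    $ docker run --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:11.0-base nvidia-smi
    $ # with the driver daemonset, this directory should contain the driver container's rootfs once the driver pod is healthy
    $ ls -la /run/nvidia/driver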
