Virtual GPU device plugin for Kubernetes

The virtual GPU device plugin for Kubernetes is a DaemonSet that allows you to automatically:

Expose an arbitrary number of virtual GPUs on the GPU nodes of your cluster.
Run ML serving containers backed by an accelerator with low latency and low cost in your Kubernetes cluster.

This repository contains the AWS virtual GPU implementation of the Kubernetes device plugin.

Prerequisites

The prerequisites for running the virtual GPU device plugin are listed below:

NVIDIA drivers ~= 361.93
nvidia-docker version > 2.0 (see how to install it and its prerequisites)
docker configured with nvidia as the default runtime.
Kubernetes version >= 1.10
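
The version bounds above can be checked from a shell before installing the plugin. The following is a minimal sketch, assuming GNU sort (for its -V version sort) is available; version_ge is a hypothetical helper name and is not part of this project.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on GNU sort's -V (version sort). Hypothetical helper, not part of the plugin.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: compare an illustrative Kubernetes version string against the 1.10 minimum.
if version_ge "1.18.2" "1.10"; then
  echo "Kubernetes version OK"
fi
```

On a GPU node, the installed driver version can be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader, and Docker's default runtime with docker info --format '{{.DefaultRuntime}}'.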

Limitations

This solution is built on top of Volta Multi-Process Service (MPS). You can only use it on instance types with Tesla V100 or newer GPUs (currently only Amazon EC2 P3 and Amazon EC2 G4 instances).
By default, the virtual GPU device plugin sets the GPU compute mode to EXCLUSIVE_PROCESS, which means the GPU is assigned to the MPS process and individual process threads submit work to the GPU concurrently via the MPS server. The GPU cannot be used for any other purpose.
The virtual GPU device plugin only works on single physical GPU instances such as P3.2xlarge if a workload requests more than 1 k8s.amazonaws.com/vgpu.
The virtual GPU device plugin cannot run together with the NVIDIA device plugin. You can label nodes and use a node selector to install the virtual GPU device plugin only on selected nodes.

High Level Design

Quick Start

Label GPU node groups

kubectl label node <your_k8s_node_name> k8s.amazonaws.com/accelerator=vgpu
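
To confirm the label was applied, the nodes can be listed by label selector. This is a small sketch; list_vgpu_nodes is a hypothetical wrapper name, while the label key and value are the ones used in the command above.

```shell
# List the nodes that carry the label applied above.
# list_vgpu_nodes is a hypothetical wrapper; extra arguments are passed through to kubectl.
list_vgpu_nodes() {
  kubectl get nodes -l k8s.amazonaws.com/accelerator=vgpu "$@"
}
```

For example, list_vgpu_nodes -o name prints only the node names.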

Enabling virtual GPU Support in Kubernetes

Update the node selector label in the manifest file to match the labels of your GPU node group, then apply it to Kubernetes.

kubectl create -f https://raw.githubusercontent.com/awslabs/aws-virtual-gpu-device-plugin/v0.1.1/manifests/device-plugin.yml
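
Once the plugin is running, each labeled node should advertise the k8s.amazonaws.com/vgpu resource. One way to check this is sketched below; show_vgpu_capacity is a hypothetical wrapper name, and the column headers NAME and VGPU are arbitrary.

```shell
# Show how many vGPUs each node reports as allocatable.
# Dots inside the resource name are escaped so kubectl treats it as a single key.
# show_vgpu_capacity is a hypothetical wrapper, not part of the plugin.
show_vgpu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,VGPU:.status.allocatable.k8s\.amazonaws\.com/vgpu'
}
```

Nodes that do not expose the resource show <none> in the VGPU column.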

Running GPU Jobs

Virtual NVIDIA GPUs can now be consumed via container-level resource requirements using the resource name k8s.amazonaws.com/vgpu:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: resnet-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: resnet-server
  template:
    metadata:
      labels:
        app: resnet-server
    spec:
      # hostIPC is required for MPS communication
      hostIPC: true
      containers:
      - name: resnet-container
        image: seedjeffwan/tensorflow-serving-gpu:resnet
        args:
        # Set this limit based on the vGPU count so the tf-serving process does not occupy all the GPU memory
        - --per_process_gpu_memory_fraction=0.2
        env:
        - name: MODEL_NAME
          value: resnet
        ports:
        - containerPort: 8501
        # Use virtual gpu resource here
        resources:
          limits:
            k8s.amazonaws.com/vgpu: 1
        volumeMounts:
        - name: nvidia-mps
          mountPath: /tmp/nvidia-mps
      volumes:
      - name: nvidia-mps
        hostPath:
          path: /tmp/nvidia-mps

WARNING: if you don't request GPUs when using the device plugin, all the GPUs on the machine will be exposed inside your container.
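
The --per_process_gpu_memory_fraction argument in the Deployment above has to be sized against the vGPU count. As a rough sketch of the arithmetic, assuming each physical GPU is split into a fixed number of vGPUs, the fraction can be taken as requested vGPUs divided by vGPUs per physical GPU. Both inputs are supplied by the caller, vgpu_memory_fraction is a hypothetical helper name, and the value 5 below is illustrative only, not a documented plugin default.

```shell
# vgpu_memory_fraction REQUESTED PER_GPU
# Prints REQUESTED / PER_GPU to two decimals, as a candidate value for
# --per_process_gpu_memory_fraction. Fails if PER_GPU is not positive or
# REQUESTED exceeds PER_GPU. Hypothetical helper, not part of the plugin.
vgpu_memory_fraction() {
  awk -v req="$1" -v per="$2" 'BEGIN {
    if (per <= 0 || req > per) exit 1
    printf "%.2f\n", req / per
  }'
}

# Illustrative values: 1 vGPU requested, 5 vGPUs assumed per physical GPU.
vgpu_memory_fraction 1 5
```

Treat the result as an upper bound and adjust it to the memory your model actually needs.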

Check the full example here
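
After the Deployment is running, the model server can be probed from a workstation. The sketch below assumes the image serves the TensorFlow Serving REST API on container port 8501, as the manifest above suggests; model_status_url is a hypothetical helper name.

```shell
# model_status_url MODEL [PORT]
# Builds the TensorFlow Serving model-status URL for a locally forwarded port.
# Hypothetical helper; PORT defaults to 8501, the containerPort in the example.
model_status_url() {
  printf 'http://localhost:%s/v1/models/%s\n' "${2:-8501}" "$1"
}

# Typical use, with the Deployment's port forwarded in another terminal:
#   kubectl port-forward deployment/resnet-deployment 8501:8501
#   curl "$(model_status_url resnet)"
```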

Development

Please check Development for more details.

Credits

The project idea comes from @RenaudWasTaken's comment in kubernetes/kubernetes#52757 and from Alibaba's solution by @cheyang, GPU Sharing Scheduler Extender Now Supports Fine-Grained Kubernetes Clusters.

Reference

AWS:

28 Nov 2018 - Amazon Elastic Inference - GPU-Powered Deep Learning Inference Acceleration
2 Dec 2018 - Amazon Elastic Inference - Reduce Deep Learning inference costs by 75%
30 Jul 2019 - Running Amazon Elastic Inference Workloads on Amazon ECS
6 Sep 2019 - Optimizing TensorFlow model serving with Kubernetes and Amazon Elastic Inference
3 Dec 2019 - Introducing Amazon EC2 Inf1 Instances, high performance and the lowest cost machine learning inference in the cloud

Community:

License

This project is licensed under the Apache-2.0 License.
