NVGPUFREQ Slurm Plugin

This plugin lifts the restriction on the NVIDIA application clock commands when an exclusive job runs on a node tagged with the gres nvgpufreq.

The nvgpufreq plugin calls nvmlDeviceSetAPIRestriction() to restrict or unrestrict GPU clock frequency changes at user level. Once the application clock commands have been unrestricted, a standard user can change the GPU frequency using the nvidia-smi tool or the NVML APIs.

Plugin checks

The nvgpufreq plugin hooks into the prolog and epilog of each job submitted to the cluster through the SPANK callbacks slurm_spank_job_prolog() and slurm_spank_job_epilog().

Prolog

In the prolog procedure, the plugin performs the following steps in order:

1. Retrieve the node info from slurmctld. If the plugin cannot contact slurmctld, it terminates its execution.
2. Check if the node is tagged with the gres nvgpufreq. If the node is not tagged, the plugin terminates its execution.
3. Retrieve the job info from slurmctld. If the plugin cannot contact slurmctld, it terminates its execution.
4. Check if the job requests the nvgpufreq gres. If the job does not specify the gres nvgpufreq, the plugin terminates its execution.
5. Check if the job runs exclusively on the node. If the node can be shared among multiple jobs, the plugin terminates its execution.
6. Call nvmlDeviceSetAPIRestriction() to unrestrict the GPU frequency clock for regular users.
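
The prolog checks above form a simple chain of early exits. The following is a minimal sketch of that decision logic only, not the plugin's actual code: the struct `prolog_facts`, its fields, and the function `should_unrestrict()` are hypothetical names introduced for illustration, and the real plugin obtains these facts from slurmctld rather than from a prefilled struct.

```c
#include <stdbool.h>

/* Hypothetical snapshot of what the prolog learns from slurmctld.
 * Field names are illustrative, not the plugin's real data structures. */
struct prolog_facts {
    bool node_info_ok;      /* node info was retrieved from slurmctld */
    bool node_has_gres;     /* node is tagged with the nvgpufreq gres */
    bool job_info_ok;       /* job info was retrieved from slurmctld */
    bool job_requests_gres; /* job requested the nvgpufreq gres */
    bool job_exclusive;     /* job runs exclusively on the node */
};

/* Returns true only if every check passes, evaluated in the documented
 * order; any failed check ends the prolog without touching the GPU. */
bool should_unrestrict(const struct prolog_facts *f)
{
    if (!f->node_info_ok)      return false;
    if (!f->node_has_gres)     return false;
    if (!f->job_info_ok)       return false;
    if (!f->job_requests_gres) return false;
    if (!f->job_exclusive)     return false;
    return true; /* the plugin would now call nvmlDeviceSetAPIRestriction() */
}
```

With all five fields set to true the function returns true; clearing any single field, for example `job_exclusive` for a shared node, makes it return false.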

Epilog

In the epilog procedure, the plugin performs the following steps in order:

1. Retrieve the node info from slurmctld. If the plugin cannot contact slurmctld, it terminates its execution.
2. Check if the node is tagged with the gres nvgpufreq. If the node is not tagged, the plugin terminates its execution.
3. Check if the node has been configured by a nvgpufreq job and, if so, restore it. After that, the plugin deletes /var/run/nvgpufreq.run and concludes the epilog procedure.

Evaluation

To verify whether the plugin completed the configuration of the node, users and administrators can check for the existence of the file /var/run/nvgpufreq.run, which records whether something went wrong or the plugin terminated correctly. The plugin should always remove this file in the epilog procedure after restoring the node.
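
As a rough illustration of that check, the sketch below tests only whether the state file exists. The helper `state_file_present()` and the macro `NVGPUFREQ_STATE_FILE` are hypothetical names; the path comes from the text above, and the file's internal format is not described here, so the sketch does not try to parse it.

```c
#include <stdbool.h>
#include <unistd.h>

/* Path as documented above; the macro name is an assumption. */
#define NVGPUFREQ_STATE_FILE "/var/run/nvgpufreq.run"

/* Returns true if something exists at `path`.
 * F_OK tests for existence only, not for read permission. */
bool state_file_present(const char *path)
{
    return access(path, F_OK) == 0;
}
```

Calling `state_file_present(NVGPUFREQ_STATE_FILE)` while a nvgpufreq job is running should return true if the prolog configured the node; after the epilog it should return false.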

Logs

The plugin implements three types of logs:

[SLURM-NVGPUFREQ]: for general information.
[SLURM-NVGPUFREQ][WARN]: for warning information. This includes misconfigurations that do not affect the execution of the plugin.
[SLURM-NVGPUFREQ][ERR]: for error information. This includes problems that terminate the execution of the plugin.
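
To make the three prefixes concrete, here is a minimal sketch of how a message could be tagged with them. It is not the plugin's logging code: the enum, the function `nvgpufreq_format()`, and the single space between prefix and message are assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical severity levels matching the three documented prefixes. */
enum nvgpufreq_level { NVGPUFREQ_INFO, NVGPUFREQ_WARN, NVGPUFREQ_ERR };

/* Writes "<prefix> <msg>" into buf and returns snprintf's result.
 * The separator after the prefix is an assumption. */
int nvgpufreq_format(char *buf, size_t len,
                     enum nvgpufreq_level lvl, const char *msg)
{
    static const char *const prefix[] = {
        "[SLURM-NVGPUFREQ]",
        "[SLURM-NVGPUFREQ][WARN]",
        "[SLURM-NVGPUFREQ][ERR]",
    };
    return snprintf(buf, len, "%s %s", prefix[lvl], msg);
}
```

Because every message starts with `[SLURM-NVGPUFREQ]`, filtering the Slurm logs for that string finds all plugin output, and adding `[WARN]` or `[ERR]` narrows the results to one severity.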

Getting started

Compiling

To compile the code:

Clone this repo onto a node where the SLURM daemon is deployed

```
git clone https://gitlab.hpc.cineca.it/dcesari1/slurm-nvgpufreq.git
```

Create a build directory
```
mkdir build-nvgpufreq
```
Enter the build directory
```
cd build-nvgpufreq
```

Run CMake and specify an install directory

```
cmake -DCMAKE_INSTALL_PREFIX=../install-nvgpufreq ../slurm-nvgpufreq
```

Run make to compile and install the plugin
```
make && make install
```

Configurations

gres.conf

Before deploying the plugin, a gres called nvgpufreq must be defined. The gres lets system administrators limit the plugin to the subset of nodes where users are allowed to use it.

```
NodeName=... Name=nvgpufreq Count=1
```

slurm.conf

Add the gres configurations to the slurm.conf:

```
GresTypes=nvgpufreq
PlugStackConfig=/run/slurm/conf/plugstack.conf
NodeName=... Gres=nvgpufreq:1 ...
```

plugstack.conf

Add the plugin configuration to the plugstack.conf:

```
optional   /path/to/nvgpufreq.so
```

Run

To use the plugin, a user must submit a job that requests the nvgpufreq gres and exclusive access to the node.

```
sbatch $SLURM_CONF --gres=nvgpufreq --exclusive $BIN
```

SLURM Bugs

For SLURM versions between 20.0 and 20.02.7, see the following bug report: https://bugs.schedmd.com/show_bug.cgi?id=9081
