LigateProject/slurm-nvgpufreq

NVGPUFREQ Slurm Plugin

This plugin unrestricts the NVIDIA application clock commands when an exclusive job runs on a node tagged with the gres nvgpufreq.

The nvgpufreq plugin calls nvmlDeviceSetAPIRestriction() to restrict or unrestrict, at user level, the ability to set the GPU application clocks. Once the application clock commands have been unrestricted, a standard user can change the GPU frequency using the nvidia-smi tool or the NVML APIs.
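
As an illustration of what a regular user can do once the restriction is lifted, the sketch below wraps `nvidia-smi -ac`, which sets the application clocks as a `<memory,graphics>` pair in MHz. The function name `set_app_clocks` and the `DRY_RUN` variable are hypothetical helpers introduced here, not part of the plugin. Valid clock pairs depend on the GPU model and can be listed with `nvidia-smi -q -d SUPPORTED_CLOCKS`.

```shell
# Hypothetical helper (not part of the plugin): set application clocks on one GPU.
# Usage: set_app_clocks <gpu_index> <mem_mhz> <graphics_mhz>
# With DRY_RUN=1 the command is printed instead of executed.
set_app_clocks() {
    gpu="$1"; mem="$2"; gfx="$3"
    cmd="nvidia-smi -i $gpu -ac $mem,$gfx"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$cmd"
    else
        # Word splitting of $cmd is intended here
        $cmd
    fi
}
```

For example, `DRY_RUN=1 set_app_clocks 0 877 1380` only prints the command; the clock values are placeholders and must be replaced with a pair supported by the target GPU.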

Plugin checks

The nvgpufreq plugin hooks into the prolog and epilog of every job submitted to the cluster (slurm_spank_job_prolog() and slurm_spank_job_epilog()).

Prolog

In the prolog procedure, the plugin performs the following steps:

  1. Retrieve the node info from slurmctld. If the plugin cannot contact slurmctld, it terminates its execution.
  2. Check whether the node is tagged with the gres nvgpufreq. If the node is not tagged, the plugin terminates its execution.
  3. Retrieve the job info from slurmctld. If the plugin cannot contact slurmctld, it terminates its execution.
  4. Check whether the job requests the nvgpufreq gres. If the job does not specify the gres nvgpufreq, the plugin terminates its execution.
  5. Check whether the job runs exclusively on the node. If the node can be shared among multiple jobs, the plugin terminates its execution.
  6. Call nvmlDeviceSetAPIRestriction() to unrestrict the GPU application clock commands for regular users.
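
The order of these checks can be sketched as a small shell function. This is only an illustration of the decision sequence: the real plugin is a SPANK plugin that obtains the facts from slurmctld, whereas the hypothetical `prolog_decision` function below takes them as yes/no arguments and omits the two slurmctld connectivity checks (steps 1 and 3).

```shell
# Sketch of the prolog decision order only (hypothetical helper, not plugin code).
# Usage: prolog_decision <node_has_gres> <job_requests_gres> <job_is_exclusive>
# Each argument is "yes" or anything else (treated as "no").
prolog_decision() {
    [ "$1" = "yes" ] || { echo "skip: node not tagged with nvgpufreq"; return 0; }
    [ "$2" = "yes" ] || { echo "skip: job did not request nvgpufreq"; return 0; }
    [ "$3" = "yes" ] || { echo "skip: job is not exclusive"; return 0; }
    # Only when all three checks pass are the clocks unrestricted (step 6)
    echo "unrestrict application clocks"
}
```

The checks are evaluated in the same order as the list, so the first failing condition determines why the plugin leaves the node untouched.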

Epilog

In the epilog procedure, the plugin performs the following steps:

  1. Retrieve the node info from slurmctld. If the plugin cannot contact slurmctld, it terminates its execution.
  2. Check whether the node is tagged with the gres nvgpufreq. If the node is not tagged, the plugin terminates its execution.
  3. Check whether the node was configured by an nvgpufreq job and, if so, restore it. After that, the plugin deletes /var/run/nvgpufreq.run and concludes the epilog procedure.

Evaluation

To check whether the plugin finished configuring the node, users and administrators can look for the file /var/run/nvgpufreq.run, which records whether something went wrong or the plugin terminated correctly. The plugin should always remove this file in the epilog procedure, after restoring the node.
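
A minimal way to perform this check from a shell on the compute node is sketched below. The function name `nvgpufreq_state` is a hypothetical helper; the default path is the one given above, and the sketch makes no assumption about the format of the file and simply prints its contents.

```shell
# Hypothetical helper: report whether the nvgpufreq state file is present.
# Usage: nvgpufreq_state [path]   (defaults to /var/run/nvgpufreq.run)
nvgpufreq_state() {
    f="${1:-/var/run/nvgpufreq.run}"
    if [ -e "$f" ]; then
        echo "present: $f"
        # The file format is not assumed; show it verbatim
        cat "$f"
    else
        echo "absent: $f"
    fi
}
```

Seeing the file while an nvgpufreq job is running is expected; seeing it after the job has ended suggests that the epilog did not complete.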

Logs

The plugin emits three types of log messages:

  • [SLURM-NVGPUFREQ]: for general information.
  • [SLURM-NVGPUFREQ][WARN]: for warning information. This includes misconfigurations that do not affect the execution of the plugin.
  • [SLURM-NVGPUFREQ][ERR]: for error information. This includes problems that terminate the execution of the plugin.
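
The prefixes make it easy to filter the plugin's messages out of a larger log. The sketch below reads log text on standard input and keeps only the requested level. The function name `nvgpufreq_logs` is a hypothetical helper, and which log file receives the messages (for example the slurmd log) is an assumption that depends on the site configuration.

```shell
# Hypothetical helper: keep only plugin log lines of a given level from stdin.
# Usage: nvgpufreq_logs <all|info|warn|err> < logfile
nvgpufreq_logs() {
    case "${1:-all}" in
        err)  grep -F '[SLURM-NVGPUFREQ][ERR]' ;;
        warn) grep -F '[SLURM-NVGPUFREQ][WARN]' ;;
        # General messages carry the bare prefix, so drop WARN and ERR lines
        info) grep -F '[SLURM-NVGPUFREQ]' |
              grep -v -F -e '[SLURM-NVGPUFREQ][WARN]' -e '[SLURM-NVGPUFREQ][ERR]' ;;
        *)    grep -F '[SLURM-NVGPUFREQ]' ;;
    esac
}
```

Fixed-string matching (`grep -F`) is used so that the square brackets in the prefixes are not interpreted as a regular expression.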

Getting started

Compiling

To compile the code:

  1. Clone this repo to a node where the SLURM daemon is deployed
    git clone https://gitlab.hpc.cineca.it/dcesari1/slurm-nvgpufreq.git
  2. Create a build directory
    mkdir build-nvgpufreq
  3. Enter the build directory
    cd build-nvgpufreq
  4. Run CMake and specify an install directory
    cmake -DCMAKE_INSTALL_PREFIX=../install-nvgpufreq ../slurm-nvgpufreq
  5. Run make to compile and install the plugin
    make && make install

Configurations

gres.conf

Before deploying the plugin, a gres called nvgpufreq must be defined. The gres lets system administrators limit the plugin to the subset of nodes where users are allowed to use it.

NodeName=... Name=nvgpufreq Count=1

slurm.conf

Add the gres configuration to slurm.conf:

GresTypes=nvgpufreq
PlugStackConfig=/run/slurm/conf/plugstack.conf
NodeName=... Gres=nvgpufreq:1 ...
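
After reconfiguring Slurm, one way to confirm that a node advertises the gres is to inspect the `Gres=` field printed by `scontrol show node`. The sketch below is a hypothetical helper, `node_has_nvgpufreq`, that reads that output on standard input; it assumes the field is printed as `Gres=<list>`, with nvgpufreq possibly listed next to other gres such as GPUs.

```shell
# Hypothetical helper: succeed if `scontrol show node` output on stdin
# lists nvgpufreq in its Gres= field.
# Usage: scontrol show node <nodename> | node_has_nvgpufreq && echo "tagged"
node_has_nvgpufreq() {
    grep -Eq '(^|[[:space:]])Gres=([^[:space:]]*,)?nvgpufreq(:|,|[[:space:]]|$)'
}
```

The function only returns an exit status, so it can be combined with `&&` or used in an `if` statement.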

plugstack.conf

Add the plugin configuration to plugstack.conf:

optional   /path/to/nvgpufreq.so

Run

To use the plugin, a user must submit a job that requests the nvgpufreq gres and exclusive access to the node.

sbatch $SLURM_CONF --gres=nvgpufreq --exclusive $BIN

SLURM Bugs

For SLURM versions between 20.0 and 20.02.7, see the following bug report: https://bugs.schedmd.com/show_bug.cgi?id=9081
