(3.11.0) Job submission failure caused by race condition in Pyxis configuration #6459

@gmarciani

Bug description

We have discovered an issue in the way we configure the Pyxis Slurm plugin in ParallelCluster that can lead to job submission failures. When this issue occurs, the cluster enters an invalid state and every subsequent job fails to run, including jobs that do not require the Pyxis plugin.

If your cluster is affected by this issue, jobs will fail with the following error in their output:

[ec2-user@ip-27-6-21-47 ~]$ cat slurm-1.out
srun: error: spank: Failed to open /opt/slurm/etc/plugstack.conf.d/sed6Yj8Ga: Permission denied
srun: error: Plug-in initialization failed
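
To check many job output files for this failure, the error line above can be matched with a regular expression. The following is a minimal sketch, not part of ParallelCluster: the function name `find_unreadable_plugstack_file` is hypothetical, and the pattern is built only from the error text quoted above, so other Slurm versions may word the message differently.

```python
import re
from typing import Optional

# Pattern derived from the error line quoted in this issue (an assumption:
# the wording may differ in other Slurm versions).
SPANK_OPEN_ERROR = re.compile(
    r"spank: Failed to open (?P<path>\S+): Permission denied"
)


def find_unreadable_plugstack_file(job_output: str) -> Optional[str]:
    """Return the path srun failed to open, or None if the error is absent."""
    match = SPANK_OPEN_ERROR.search(job_output)
    return match.group("path") if match else None


if __name__ == "__main__":
    # Sample text copied from the slurm-1.out output shown above.
    sample = (
        "srun: error: spank: Failed to open "
        "/opt/slurm/etc/plugstack.conf.d/sed6Yj8Ga: Permission denied\n"
        "srun: error: Plug-in initialization failed\n"
    )
    print(find_unreadable_plugstack_file(sample))
```

The returned path names the file that Slurm could not read, which helps confirm that the failure comes from a leftover temporary file rather than from a legitimate plugin configuration file.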

The cluster cannot recover from this state automatically, and all subsequently submitted jobs will fail to run. Jobs that are already running are not affected.

The issue is caused by a race condition during the compute node bootstrap process: multiple processes write temporary files into the shared Slurm configuration directory. While such a temporary file is present, Slurm fails when loading the SPANK plugins. If the temporary files are not removed, the cluster remains inoperable.
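
One way to check whether leftover temporary files are present is to list everything in the plugin configuration directory that does not look like a configuration file. This is a rough sketch under stated assumptions: the directory path is taken from the error message above, the rule that legitimate files end in `.conf` is an assumption and not something this issue states, and `find_stray_files` is a hypothetical helper. It only reports files and does not delete anything; follow the linked mitigation for the actual fix.

```python
from pathlib import Path
from typing import List

# Directory taken from the error message in this issue.
PLUGSTACK_DIR = "/opt/slurm/etc/plugstack.conf.d"


def find_stray_files(plugstack_dir: str, expected_suffix: str = ".conf") -> List[Path]:
    """List regular files whose suffix is not the expected one.

    Assumption: every legitimate file in the directory ends with
    `expected_suffix`; anything else is treated as a leftover temporary file.
    """
    directory = Path(plugstack_dir)
    if not directory.is_dir():
        return []
    return sorted(
        entry
        for entry in directory.iterdir()
        if entry.is_file() and entry.suffix != expected_suffix
    )


if __name__ == "__main__":
    for stray in find_stray_files(PLUGSTACK_DIR):
        print(f"unexpected file: {stray}")
```

A temporary file such as `sed6Yj8Ga` from the error above has no `.conf` suffix, so it would be reported by this check.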

Affected versions (OSes, schedulers)

  • ParallelCluster 3.11.0

Mitigation

You can find a detailed explanation and the mitigation of the problem here.
