async modex issue PMIX v4.2.x #3077

@janjust

Background information

A simple MPI_Init()/MPI_Finalize() program fails to bootstrap when async modex is enabled (-mca pmix_base_async_modex 1).
A workaround is to set -x PMIX_MCA_gds=hash.

What version of the PMIx Reference Library are you using?

+1492c0b3102b02dd854851c458ee68229f35f5a9 3rd-party/openpmix (v4.2.3rc1-1-g1492c0b3)
+4636ea79dce7dea0fe9d27e669a5bfda6b095216 3rd-party/prrte (v3.0.1rc1-1-g4636ea79dc)

Describe how PMIx was installed

Built from source, using the internal PMIx version bundled with OMPI v5.0.x (see above).

Please describe the system on which you are running

  • Operating system/version: RHEL 8.7
  • Computer hardware: x86
  • Network type: IB Connect-X 6

Details of the problem

A simple MPI_Init()/MPI_Finalize() program reproduces the issue.
With async modex enabled, the job fails to bootstrap with either UCX or OB1.

[Wed May  3 20:56:51 2023][1,8]<stdout>: [1683136611.217979] [jazz25:109263:0]        mm_xpmem.c:245  UCX  ERROR   xpmem_get(segid=0x20001aad1) failed: No such file or directory
[Wed May  3 20:56:51 2023][1,8]<stdout>: [1683136611.217995] [jazz25:109263:0]           mm_ep.c:172  UCX  ERROR   mm ep failed to connect to remote FIFO id 0x7f6c61a75000: Shared memory error
[Wed May  3 20:56:51 2023][1,6]<stderr>: [jazz25.swx.labs.mlnx:109262] ../../../../../ompi/ompi/mca/pml/ucx/pml_ucx.c:433  Error: ucp_ep_create(proc=12) failed: Shared memory error
[Wed May  3 20:56:51 2023][1,34]<stderr>: [jazz25.swx.labs.mlnx:109276] ../../../../../ompi/ompi/mca/pml/ucx/pml_ucx.c:433  Error: ucp_ep_create(proc=12) failed: Shared memory error

`mpirun -np 16 -H jazz12:28,jazz13:28 --display-map --tag-output --timestamp-output --mca pml ob1 -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node -x LD_LIBRARY_PATH -x PMIX_MCA_gdsXX=hash ./a.out`

The failure is a function of scale: I can reproduce it with as few as NP=8, PPN=4, but the failure rate there is roughly 50%, as opposed to 100% at NP=16 with PPN=NP/2 or greater.
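To put numbers on failure rates like the ones quoted above, a small helper along these lines could run the reproducer repeatedly and count non-zero exits. This is only a sketch: `count_failures` is a hypothetical name, not part of Open MPI or PMIx, and it assumes a failed bootstrap makes mpirun exit with a non-zero status (a hang would not be detected).

```shell
#!/bin/sh
# count_failures RUNS command [args...]
# Hypothetical helper: runs the command RUNS times and reports how many
# runs exited with a non-zero status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Output is discarded; only the exit status is inspected.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}

# Example invocation (flags taken from the report; host list omitted):
# count_failures 20 mpirun -np 8 -mca pmix_base_async_modex 1 ./a.out
```

Running the same loop with and without -x PMIX_MCA_gds=hash would show whether the workaround brings the count to zero at a given NP and PPN.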
