Closed
Description
Background information
A simple MPI_Init()/MPI_Finalize() program fails to bootstrap when async modex is enabled (`-mca pmix_base_async_modex 1`).
A workaround is to set `-x PMIX_MCA_gds=hash`.
What version of the PMIx Reference Library are you using?
+1492c0b3102b02dd854851c458ee68229f35f5a9 3rd-party/openpmix (v4.2.3rc1-1-g1492c0b3)
+4636ea79dce7dea0fe9d27e669a5bfda6b095216 3rd-party/prrte (v3.0.1rc1-1-g4636ea79dc)
Describe how PMIx was installed
From source, OMPI v5.0.x internal pmix version (see above)
Please describe the system on which you are running
- Operating system/version: RHEL 8.7
- Computer hardware: x86
- Network type: IB ConnectX-6
Details of the problem
A simple MPI_Init()/MPI_Finalize() program reproduces the issue.
With async modex enabled, bootstrap fails with either UCX or OB1:
[Wed May 3 20:56:51 2023][1,8]<stdout>: [1683136611.217979] [jazz25:109263:0] mm_xpmem.c:245 UCX ERROR xpmem_get(segid=0x20001aad1) failed: No such file or directory
[Wed May 3 20:56:51 2023][1,8]<stdout>: [1683136611.217995] [jazz25:109263:0] mm_ep.c:172 UCX ERROR mm ep failed to connect to remote FIFO id 0x7f6c61a75000: Shared memory error
[Wed May 3 20:56:51 2023][1,6]<stderr>: [jazz25.swx.labs.mlnx:109262] ../../../../../ompi/ompi/mca/pml/ucx/pml_ucx.c:433 Error: ucp_ep_create(proc=12) failed: Shared memory error
[Wed May 3 20:56:51 2023][1,34]<stderr>: [jazz25.swx.labs.mlnx:109276] ../../../../../ompi/ompi/mca/pml/ucx/pml_ucx.c:433 Error: ucp_ep_create(proc=12) failed: Shared memory error
`mpirun -np 16 -H jazz12:28,jazz13:28 --display-map --tag-output --timestamp-output --mca pml ob1 -mca pmix_base_async_modex 1 -mca mpi_add_procs_cutoff 0 -mca pmix_base_collect_data 0 --map-by node -x LD_LIBRARY_PATH -x PMIX_MCA_gdsXX=hash ./a.out`
The failure is a function of scale. I can reproduce it with as little as NP=8, PPN=4, but there the failure rate is roughly 50%, as opposed to 100% at NP=16, PPN=NP/2 or greater.
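Since the failure is intermittent at small scale, a repeat-and-count loop can help put a number on the failure rate. The following is only a sketch: `failure_rate` is a hypothetical helper, not part of Open MPI or PMIx, and it assumes a failed bootstrap makes the launched command exit non-zero (a hang would need an external timeout, which is not handled here).

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly and count non-zero exits.
# Usage: failure_rate <runs> <command> [args...]
failure_rate() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard output; only the exit status is inspected.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs runs failed"
}
```

For example, `failure_rate 20 mpirun -np 8 -mca pmix_base_async_modex 1 ./a.out` (host and mapping options omitted here) would report how many of 20 launches failed.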