Dynamic host management/communicators still broken  #4665

@CalugaruVaxile

Hi Dev team,

With OpenMPI 3.0.0 the code crashes when I issue two successive MPI_Comm_spawn calls, while a single spawn works perfectly. I have also downloaded and built the head of the OpenMPI master branch and see the same issue. OS = Debian 7.

The full code is intended to dynamically add hosts to, and remove hosts from, an "uber" intracomm (which will be used by the main computation). I am therefore trying to spawn and merge new communicators and to separate/detach nodes in various sequences that will hopefully cover all situations.

"Prototype" Master code:

```cpp
#include <iostream>
#include <cstring>
#include <mpi.h>
#include <unistd.h>
#include <limits.h>

using namespace std;

int main() {
    char slavejobtospawn[500];
    strcpy(slavejobtospawn, "./bug_slave");
    char localhost[HOST_NAME_MAX];
    gethostname(localhost, HOST_NAME_MAX);

    // == comms (one intercommunicator per spawn)
    MPI_Comm worker_comm1_;
    MPI_Comm worker_comm2_;

    // == init MPI
    int provided;
    MPI_Init_thread(0, 0, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        cout << "ERROR: The MPI library does not have full thread support" << endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // == MPI_info obj (reused for both spawns)
    MPI_Info minfo;
    MPI_Info_create(&minfo);

    // == first spawn: two slaves on two added hosts
    MPI_Info_set(minfo, "add-host", "houprg118070,houprg118071");
    MPI_Comm_spawn(slavejobtospawn, MPI_ARGV_NULL,
                   2, minfo, 0, MPI_COMM_WORLD, &worker_comm1_, MPI_ERRCODES_IGNORE);

    // == second spawn: one slave on a third added host
    MPI_Info_set(minfo, "add-host", "houprg118072");
    MPI_Comm_spawn(slavejobtospawn, MPI_ARGV_NULL,
                   1, minfo, 0, MPI_COMM_WORLD, &worker_comm2_, MPI_ERRCODES_IGNORE);

    usleep(5000000);

    // == stop MPI
    MPI_Finalize();
    cout << "MASTER " << localhost << " SHUTTING DOWN" << endl;
    return 0;
}
```
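
Both spawns in the master code pass "add-host" a comma-separated list of host names. When hosts are added in varying batches, that string has to be built at run time. The following is a minimal sketch of one way to do that; `join_hosts` is a hypothetical helper introduced here for illustration and is not part of MPI or OpenMPI.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not an MPI or OpenMPI function): joins host names
// into the comma-separated form used for the "add-host" values above.
std::string join_hosts(const std::vector<std::string>& hosts) {
    std::string value;
    for (const std::string& host : hosts) {
        if (host.empty()) continue;         // skip blank entries
        if (!value.empty()) value += ',';   // separator between hosts only
        value += host;
    }
    return value;
}
```

Under these assumptions, the first spawn could set its value with `MPI_Info_set(minfo, "add-host", join_hosts({"houprg118070", "houprg118071"}).c_str());`. This only changes how the string is assembled and is not expected to affect the crash itself.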

"Prototype" Slave code
```cpp
#include <iostream>
#include <cstring>
#include <mpi.h>
#include <unistd.h>
#include <limits.h>

using namespace std;

int main() {
    char localhost[HOST_NAME_MAX];
    gethostname(localhost, HOST_NAME_MAX);

    // == comms (intercommunicator to the spawning master)
    MPI_Comm slave_Comm_;

    // == init MPI, get parent
    int provided;
    MPI_Init_thread(0, 0, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        cout << "ERROR: The MPI library does not have full thread support" << endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_get_parent(&slave_Comm_);

    // == test spawning (prints the parent communicator handle)
    cout << "SLAVE " << localhost << " GETS PARENT" << slave_Comm_ << endl;

    usleep(5000000);

    // == stop MPI
    MPI_Finalize();
    cout << "SLAVE " << localhost << " SHUTTING DOWN" << endl;
    return 0;
}
```

Error

```
%mpirun -np 1 ./bug_master

SLAVE houprg118071 GETS PARENT0x715b40
SLAVE houprg118070 GETS PARENT0x715f40
[houprg118061:23984] PACK-ORTE-ATTR: UNSUPPORTED TYPE
[houprg118061:23984] [[60201,0],0] ORTE_ERROR_LOG: Error in file runtime/data_type_support/orte_dt_unpacking_fns.c at line 109
[houprg118061:23984] [[60201,0],0] ORTE_ERROR_LOG: Error in file base/odls_base_default_fns.c at line 416
[houprg118070:07813] PACK-ORTE-ATTR: UNSUPPORTED TYPE
[houprg118070:07813] [[60201,0],1] ORTE_ERROR_LOG: Error in file runtime/data_type_support/orte_dt_unpacking_fns.c at line 109
[houprg118070:07813] [[60201,0],1] ORTE_ERROR_LOG: Error in file base/odls_base_default_fns.c at line 416
[houprg118071:09950] PACK-ORTE-ATTR: UNSUPPORTED TYPE
[houprg118071:09950] [[60201,0],2] ORTE_ERROR_LOG: Error in file runtime/data_type_support/orte_dt_unpacking_fns.c at line 109
[houprg118071:09950] [[60201,0],2] ORTE_ERROR_LOG: Error in file base/odls_base_default_fns.c at line 416
```

If I comment out the second MPI_Comm_spawn call, or if I replace the MPI_Info object with MPI_INFO_NULL, everything works well.

I hope this can be fixed.

Cheers,
George

Labels: RTE (Issue likely is in RTE or PMIx areas)
