Description
Hi Dev team,
With OpenMPI 3.0.0, the code crashes when I issue two successive MPI_Comm_spawn calls, while a single spawn works perfectly. I have also downloaded and built the current master branch of OpenMPI and see the same issue. OS: Debian 7.
The full code is intended to dynamically add/remove hosts to/from a "uber" intracomm (which will be used by the main computation). Therefore I am trying to spawn+merge new communicators and separate/detach nodes in various sequences which will hopefully cover all the situations.
"Prototype" Master code:
```cpp
#include <iostream>
#include <cstring>
#include <mpi.h>
#include <unistd.h>
#include <limits.h>
using namespace std;

int main() {
    char slavejobtospawn[500];
    strcpy(slavejobtospawn, "./bug_slave");
    char localhost[HOST_NAME_MAX];
    gethostname(localhost, HOST_NAME_MAX);

    // == comms
    MPI_Comm worker_comm1_;
    MPI_Comm worker_comm2_;

    // == init MPI
    int provided;
    MPI_Init_thread(0, 0, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        cout << "ERROR: The MPI library does not have full thread support" << endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // == MPI_info obj
    MPI_Info minfo;
    MPI_Info_create(&minfo);

    // == first spawn
    MPI_Info_set(minfo, "add-host", "houprg118070,houprg118071");
    MPI_Comm_spawn(slavejobtospawn, MPI_ARGV_NULL,
                   2, minfo, 0, MPI_COMM_WORLD, &worker_comm1_, MPI_ERRCODES_IGNORE);

    // == second spawn
    MPI_Info_set(minfo, "add-host", "houprg118072");
    MPI_Comm_spawn(slavejobtospawn, MPI_ARGV_NULL,
                   1, minfo, 0, MPI_COMM_WORLD, &worker_comm2_, MPI_ERRCODES_IGNORE);
    usleep(5000000);

    // == stop MPI
    MPI_Finalize();
    cout << "MASTER " << localhost << " SHUTTING DOWN" << endl;
    return 0;
}
```
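Since the intended test sequences add a different set of hosts on each spawn, it may be convenient to build the comma-separated `add-host` value from a list rather than hard-coding it. The following is only a sketch: `join_hosts` is a hypothetical helper name, not part of MPI or OpenMPI, and it assumes the value format is exactly the `host1,host2` form used in the master code above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an MPI function): joins host names with commas,
// matching the "host1,host2" form passed as the "add-host" info value.
std::string join_hosts(const std::vector<std::string>& hosts) {
    std::string joined;
    for (std::size_t i = 0; i < hosts.size(); ++i) {
        if (i > 0) {
            joined += ',';  // separator only between entries, none trailing
        }
        joined += hosts[i];
    }
    return joined;
}
```

The result could then be passed as `MPI_Info_set(minfo, "add-host", join_hosts(hosts).c_str());` before each spawn. This does not change the behavior reported here; it only keeps the host lists in one place.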
"Prototype" Slave code
```cpp
#include <iostream>
#include <cstring>
#include <mpi.h>
#include <unistd.h>
#include <limits.h>
using namespace std;

int main() {
    char localhost[HOST_NAME_MAX];
    gethostname(localhost, HOST_NAME_MAX);

    // == comms
    MPI_Comm slave_Comm_;

    // == init MPI, get parent
    int provided;
    MPI_Init_thread(0, 0, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        cout << "ERROR: The MPI library does not have full thread support" << endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_get_parent(&slave_Comm_);

    // == test spawning
    cout << "SLAVE " << localhost << " GETS PARENT" << slave_Comm_ << endl;
    usleep(5000000);

    // == stop MPI
    MPI_Finalize();
    cout << "SLAVE " << localhost << " SHUTTING DOWN" << endl;
    return 0;
}
```
Error
```
%mpirun -np 1 ./bug_master
SLAVE houprg118071 GETS PARENT0x715b40
SLAVE houprg118070 GETS PARENT0x715f40
[houprg118061:23984] PACK-ORTE-ATTR: UNSUPPORTED TYPE
[houprg118061:23984] [[60201,0],0] ORTE_ERROR_LOG: Error in file runtime/data_type_support/orte_dt_unpacking_fns.c at line 109
[houprg118061:23984] [[60201,0],0] ORTE_ERROR_LOG: Error in file base/odls_base_default_fns.c at line 416
[houprg118070:07813] PACK-ORTE-ATTR: UNSUPPORTED TYPE
[houprg118070:07813] [[60201,0],1] ORTE_ERROR_LOG: Error in file runtime/data_type_support/orte_dt_unpacking_fns.c at line 109
[houprg118070:07813] [[60201,0],1] ORTE_ERROR_LOG: Error in file base/odls_base_default_fns.c at line 416
[houprg118071:09950] PACK-ORTE-ATTR: UNSUPPORTED TYPE
[houprg118071:09950] [[60201,0],2] ORTE_ERROR_LOG: Error in file runtime/data_type_support/orte_dt_unpacking_fns.c at line 109
[houprg118071:09950] [[60201,0],2] ORTE_ERROR_LOG: Error in file base/odls_base_default_fns.c at line 416
```
If I comment out the second MPI_Comm_spawn, or if I replace the MPI_Info object with MPI_INFO_NULL, everything works well.
I hope this can be fixed.
Cheers,
George