Bug description
I am trying to run a very simple training script on 2 nodes, and I always get this error:
Output:

```
(ve) root@442a8ba5c0c6:~/ptl# . wr.sh
Start fitting...
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[W Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function getCvarBool)
442a8ba5c0c6:126284:126284 [0] NCCL INFO cudaDriverVersion 12030
442a8ba5c0c6:126284:126284 [0] NCCL INFO Bootstrap : Using eth0:172.18.0.2<0>
442a8ba5c0c6:126284:126284 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
442a8ba5c0c6:126284:126284 [0] NCCL INFO init.cc:1627 Cuda Host Alloc Size 4 pointer 0x7f1756e00000
442a8ba5c0c6:126284:126391 [0] NCCL INFO Failed to open libibverbs.so[.1]
442a8ba5c0c6:126284:126391 [0] NCCL INFO NET/Socket : Using [0]eth0:172.18.0.2<0>
442a8ba5c0c6:126284:126391 [0] NCCL INFO Using non-device net plugin version 0
442a8ba5c0c6:126284:126391 [0] NCCL INFO Using network Socket
442a8ba5c0c6:126284:126391 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
442a8ba5c0c6:126284:126391 [0] NCCL INFO misc/socket.cc:565 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO misc/socket.cc:619 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO bootstrap.cc:274 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO init.cc:1388 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
442a8ba5c0c6:126284:126284 [0] NCCL INFO group.cc:418 -> 2
442a8ba5c0c6:126284:126284 [0] NCCL INFO group.cc:95 -> 2
[rank1]:[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
Traceback (most recent call last):
File "/root/ptl/wrain.py", line 68, in <module>
main()
File "/root/ptl/wrain.py", line 62, in main
trainer.fit(model, train_dataloaders=train_data)
File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 943, in _run
self.__setup_profiler()
File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1078, in __setup_profiler
self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1248, in log_dir
dirpath = self.strategy.broadcast(dirpath)
File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2411, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
Exception raised from getNCCLComm at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691 (most recent call first):
C++ CapturedTraceback:
#4 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
#5 c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) [clone .cold] from ProcessGroupNCCL.cpp:0
#6 c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from ??:0
#7 c10d::ops::(anonymous namespace)::broadcast_CUDA(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from Ops.cpp:0
#8 std::decay<c10::guts::infer_function_traits<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> > >::type::return_type>::type c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long>(c10::OperatorKernel*, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul>, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long>*) from :0
#9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#10 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#11 void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableFallbackKernel.cpp:0
#12 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from :0
#13 c10d::ProcessGroup::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from :0
#14 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#15 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#16 PyObject_CallFunctionObjArgs from ??:0
#17 _PyObject_MakeTpCall from ??:0
#18 PyMethod_New from ??:0
#19 _PyEval_EvalFrameDefault from ??:0
#20 _PyFunction_Vectorcall from ??:0
#21 PyObject_Call from ??:0
#22 _PyEval_EvalFrameDefault from ??:0
#23 _PyFunction_Vectorcall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyFunction_Vectorcall from ??:0
#26 PyObject_Call from ??:0
#27 _PyEval_EvalFrameDefault from ??:0
#28 _PyFunction_Vectorcall from ??:0
#29 _PyEval_EvalFrameDefault from ??:0
#30 _PyFunction_Vectorcall from ??:0
#31 _PyEval_EvalFrameDefault from ??:0
#32 PyObject_IsTrue from ??:0
#33 _PyObject_GenericGetAttrWithDict from ??:0
#34 PyObject_GetAttr from ??:0
#35 _PyEval_EvalFrameDefault from ??:0
#36 _PyFunction_Vectorcall from ??:0
#37 _PyEval_EvalFrameDefault from ??:0
#38 PyMethod_New from ??:0
#39 _PyEval_EvalFrameDefault from ??:0
#40 PyMethod_New from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyMethod_New from ??:0
#43 PyObject_Call from ??:0
#44 _PyEval_EvalFrameDefault from ??:0
#45 _PyFunction_Vectorcall from ??:0
#46 _PyEval_EvalFrameDefault from ??:0
#47 PyMethod_New from ??:0
#48 _PyEval_EvalFrameDefault from ??:0
#49 _PyFunction_Vectorcall from ??:0
#50 _PyEval_EvalFrameDefault from ??:0
#51 PyEval_EvalCode from ??:0
#52 PyEval_EvalCode from ??:0
#53 PyUnicode_Tailmatch from ??:0
#54 PyInit__collections from ??:0
#55 PyUnicode_Tailmatch from ??:0
#56 _PyRun_SimpleFileObject from ??:0
#57 _PyRun_AnyFileObject from ??:0
#58 Py_RunMain from ??:0
#59 Py_BytesMain from ??:0
#60 __libc_init_first from ??:0
#61 __libc_start_main from ??:0
#62 _start from ??:0
```
What version are you seeing the problem on?
v2.2
How to reproduce the bug
I'm trying to run a very simple script on 2 nodes; this is the launch for node 1 (NODE_RANK=1):

```bash
export MASTER_ADDR=194.26.196.214
export MASTER_PORT=31421
export WORLD_SIZE=2
export NODE_RANK=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_BLOCKING_WAIT=1
export NCCL_P2P_DISABLE=1
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_GID_INDEX=3
python train.py
```
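For context, with `devices=1` per node these exports are expected to resolve roughly as sketched below (an assumption about the standard `env://` rendezvous used by `strategy="ddp"`, not Lightning's actual source); in particular, `MASTER_ADDR:MASTER_PORT` must be reachable from both nodes:

```python
# Rough sketch (assumption) of how the exports above map onto the rendezvous with
# devices=1 per node: the global rank is NODE_RANK * devices + LOCAL_RANK, and every
# process connects to MASTER_ADDR:MASTER_PORT to join the group of WORLD_SIZE ranks.
import os

devices_per_node = 1
node_rank = int(os.environ["NODE_RANK"])                 # 1 in the launch above
local_rank = int(os.environ.get("LOCAL_RANK", "0"))      # set per process by the launcher

global_rank = node_rank * devices_per_node + local_rank  # matches "GLOBAL_RANK: 1" in the log
world_size = int(os.environ["WORLD_SIZE"])               # 2
master = f'{os.environ["MASTER_ADDR"]}:{os.environ["MASTER_PORT"]}'
print(f"rank {global_rank}/{world_size} will rendezvous at {master}")
```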
The train.py script is:
```python
import os
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        print("batch_idx=", batch_idx)
        loss = self(batch).sum()
        self.log(
            "train_loss",
            loss,
            on_step=True,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def main():
    train_data = DataLoader(RandomDataset(32, 64000), batch_size=8000)
    model = BoringModel()
    trainer = Trainer(
        devices=1,
        num_nodes=2,
        accelerator="gpu",
        strategy="ddp",
        limit_train_batches=100000,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        log_every_n_steps=1,
    )
    print(f"Start fitting...")
    trainer.fit(model, train_dataloaders=train_data)
    print("DONE")


if __name__ == "__main__":
    main()
```
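To narrow this down outside of Lightning, a minimal sketch (not part of the original report) of the same collective that fails in `strategy.broadcast` (the `broadcast_object_list` call in the traceback) could look like the following; it assumes the same exports as above, with the global rank taken from NODE_RANK since there is one process per node. Run one copy per node, with NODE_RANK=0 on the master node.

```python
# Minimal repro sketch outside Lightning (an assumption, not part of the original
# report): it issues the same broadcast_object_list call that the traceback shows
# failing inside strategy.broadcast, using the exports above.
import os

import torch
import torch.distributed as dist


def main():
    # With devices=1 per node, the global rank equals NODE_RANK.
    rank = int(os.environ["NODE_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group("nccl", init_method="env://", rank=rank, world_size=world_size)
    torch.cuda.set_device(0)

    obj = ["log_dir"] if rank == 0 else [None]
    dist.broadcast_object_list(obj, src=0)  # raises the same DistBackendError if NCCL cannot connect
    print(f"rank {rank} received {obj}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```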
Error messages and logs

```
# Error messages and logs here please
```
Environment
Current environment
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```
More info
No response