Status: Closed
Labels: P0 (issues that should be fixed in short order), bug (something that is supposed to be working, but isn't)
What is the problem?
While transitioning Modin to Ray 1.3.0, several of our tests crash in GitHub Actions CI. The crashes could not be reproduced in a development environment until I created a VM with the same specs as the GitHub Actions hosted runners: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners
Running such a VM with 2 CPU cores and 7 GB of RAM reliably reproduces these crashes.
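To confirm that a guest actually matches the runner specs (2 CPU cores, 7 GB of RAM), a quick check inside the VM can be used (a sketch; assumes a Linux guest with `/proc/meminfo` available):

```shell
# Verify the guest matches the GitHub Actions runner specs (2 vCPUs, ~7 GB RAM).
nproc                                                               # CPU core count
awk '/^MemTotal:/ {printf "%.1f GiB\n", $2 / 1048576}' /proc/meminfo  # total RAM
```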
Ray version and other system information (Python version, TensorFlow version, OS):
Ray 1.3.0, Python 3.8 (Miniconda), Ubuntu 20.04.2 LTS.
Reproduction (REQUIRED)
I created a Vagrantfile to easily create and provision a reproducer VM:
Vagrantfile.gz
To use it, follow these steps:
1. You need to have virtualization enabled in your BIOS: https://bce.berkeley.edu/enabling-virtualization-in-your-pc-bios.html
2. Install Vagrant from https://www.vagrantup.com/downloads
3. Install a VM provider if one is not already installed. By default Vagrant uses VirtualBox.
   3.1. Install VirtualBox from https://www.virtualbox.org/wiki/Linux_Downloads
   3.2. Alternatively, you can use KVM; I checked that both VMs produce the same result. To use KVM you need to install the `vagrant-libvirt` plugin via `vagrant plugin install vagrant-libvirt`. It requires a number of dependencies, which can be found at https://github.com/vagrant-libvirt/vagrant-libvirt. Also, since my /var/lib filesystem is not large enough, I set up the VMs to use an `images` pool, which can be created and activated like this:

   ```
   virsh pool-define-as images dir --target /localdisk/libvirt
   virsh pool-start images
   ```

4. Add your user to the `vboxusers` or `libvirt` group and make sure the new group membership is effective.
5. Run `vagrant up` in the same directory as the Vagrantfile.
6. To get into the VM, run `vagrant ssh ubuntu2004-7gb`.
7. On the VM, activate the conda environment and run the tests from the command line.
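The VirtualBox path of the steps above can be condensed into a single shell session (a sketch; it assumes the attached Vagrantfile.gz has been downloaded into the current directory, and the exact test command inside the VM is left as a placeholder):

```shell
# Sketch of the reproduction workflow above (VirtualBox provider).
gunzip -k Vagrantfile.gz            # unpack the attached Vagrantfile
sudo usermod -aG vboxusers "$USER"  # join the provider group (re-login to apply)
vagrant up                          # create and provision the VM
vagrant ssh ubuntu2004-7gb          # enter the VM, then run the tests there
```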
This is the stack trace I am getting from the crash:

```
Thread 37 "worker.io" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffd55fa700 (LWP 16109)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7c6d859 in __GI_abort () at abort.c:79
#2 0x00007fffdd32bb05 in ray::SpdLogMessage::Flush() ()
from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#3 0x00007fffdd32bb3d in ray::RayLog::~RayLog() () from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#4 0x00007fffdcf48b6c in ray::CoreWorkerDirectTaskSubmitter::RequestNewWorkerIfNeeded(std::tuple<int, std::vector<ray::ObjectID, std::allocator<ray::ObjectID> >, ray::ActorID> const&, ray::rpc::Address const*)::{lambda(ray::Status const&, ray::rpc::RequestWorkerLeaseReply const&)#1}::operator()(ray::Status const&, ray::rpc::RequestWorkerLeaseReply const&) const ()
from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#5 0x00007fffdcf93dd5 in ray::rpc::ClientCallImpl<ray::rpc::RequestWorkerLeaseReply>::OnReplyReceived() ()
from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#6 0x00007fffdce9dacb in std::_Function_handler<void (), ray::rpc::ClientCallManager::PollEventsFromCompletionQueue(int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#7 0x00007fffdd2daa08 in boost::asio::detail::completion_handler<std::function<void ()> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) ()
from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#8 0x00007fffdd3e09a1 in boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) ()
from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#9 0x00007fffdd3e0ad1 in boost::asio::detail::scheduler::run(boost::system::error_code&) ()
from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#10 0x00007fffdd3e25d0 in boost::asio::io_context::run() ()
from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#11 0x00007fffdce9b895 in ray::CoreWorker::RunIOService() ()
from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#12 0x00007fffdd685d10 in execute_native_thread_routine ()
from /home/vagrant/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_raylet.so
#13 0x00007ffff7fa8609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#14 0x00007ffff7d6a293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
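For reference, a backtrace like this can be captured by attaching gdb to the worker process before it aborts (a sketch; `<worker-pid>` is a placeholder for the PID of the crashing worker.io process, and gdb must be installed in the VM):

```shell
# Attach gdb to the worker, wait for the SIGABRT, and dump the backtrace.
gdb -p <worker-pid> \
    -ex 'handle SIGABRT stop' \
    -ex continue \
    -ex bt \
    -batch
```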
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.