Skip to content

Conversation

@robertnishihara
Copy link
Collaborator

No description provided.

pcmoritz and others added 30 commits August 13, 2016 15:20
Python library for plasma
Basic Python unit tests.
Remove directories as well with make clean.
Test Python 3 in Travis.
Remove C struct duplication and python plasma manager
* refactor plasma to use an event loop

* unify comment style

* Clean up Makefile flags.

* Randomize socket names in tests so multiple copies of the tests can be run in parallel without conflict.
* Run clang-format and add pre-commit hook for it.

* Modify .travis.yml to check

* Try to fix problems with .travis.yml

* Try to fix .travis.yml yet again

* Update .clang-format to Philipp's preferences

* Don't allow lint to fail in Travis

* Remove git-hooks directory

* Improve clang-format failure output

* Fix clang-format error

* Report which commit clang-format is comparing against, and add whitespace error

* Handle non-PR Travis in clang-format, and add another error

* Check $TRAVIS_PULL_REQUEST correctly and add another error

* Fix syntax error in check-git-clang-format-output.sh

* Add whitespace error

* Remove extra whitespace, add clang-format to README
* Use dlmalloc to manage shared memory

* add stresstest
sven1977 added a commit that referenced this pull request Apr 2, 2024
edoakes pushed a commit that referenced this pull request Aug 21, 2025
… condition (#55367)

## Why are these changes needed?

Workers crash with a fatal `RAY_CHECK` failure when the plasma store
connection is broken during shutdown, causing the following error:
```
RAY_CHECK failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
```
Stacktrace:
```
core_worker.cc:720 C  Check failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe 
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x141789a) [0x7924dd2c689a] ray::operator<<()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7924dd2c9319] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x95cc8a) [0x7924dc80bc8a] ray::core::CoreWorker::CoreWorker()::{lambda()#13}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager27MarkTaskReturnObjectsFailedERKNS_17TaskSpecificationENS_3rpc9ErrorTypeEPKNS5_12RayErrorInfoERKN4absl12lts_2023080213flat_hash_setINS_8ObjectIDENSB_13hash_internal4HashISD_EESt8equal_toISD_ESaISD_EEE+0x679) [0x7924dc868f29] ray::core::TaskManager::MarkTaskReturnObjectsFailed()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager15FailPendingTaskERKNS_6TaskIDENS_3rpc9ErrorTypeEPKNS_6StatusEPKNS5_12RayErrorInfoE+0x416) [0x7924dc86f186] ray::core::TaskManager::FailPendingTask()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x9a90e6) [0x7924dc8580e6] ray::core::NormalTaskSubmitter::RequestNewWorkerIfNeeded()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_23RequestWorkerLeaseReplyEE15OnReplyReceivedEv+0x68) [0x7924dc94aa48] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7924dc79e285] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd9b4c8) [0x7924dcc4a4c8] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd4648e) [0x7924dcbf548e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd46906) [0x7924dcbf5906] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f417b) [0x7924dd2a317b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f5af9) [0x7924dd2a4af9] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f6202) [0x7924dd2a5202] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x7924dc793a61] ray::core::CoreWorker::RunIOService()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xcba0b0) [0x7924dcb690b0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7924dde71ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7924ddf03850]
```

Stack trace flow:
1. Task lease request fails ->
`NormalTaskSubmitter::RequestNewWorkerIfNeeded()` callback.
2. Triggers `TaskManager::FailPendingTask()` ->
`MarkTaskReturnObjectsFailed()`.
3. System attempts to store error objects in plasma via
`put_in_local_plasma_callback_`.
4. Plasma connection is broken (raylet/plasma store already shut down).
5. `RAY_CHECK_OK()` in the callback causes fatal crash instead of
graceful handling.

Root Cause:

This is a shutdown ordering race condition:
1. Raylet shuts down first: The raylet stops its IO context
([main_service_.stop()](https://github.com/ray-project/ray/blob/77c5475195e56a26891d88460973198391d20edf/src/ray/object_manager/plasma/store_runner.cc#L146))
which closes plasma store connections.
2. Worker still processes callbacks: Core worker continues processing
pending callbacks on separate threads.
3. Broken connection: When the callback tries to store error objects in
plasma, the connection is already closed.
4. Fatal crash: The `RAY_CHECK_OK()` treats this as an unexpected error
and crashes the process.

Fix:

1. Shutdown-aware plasma operations
- Add `CoreWorker::IsShuttingDown()` method to check shutdown state.
- Skip plasma operations entirely when shutdown is in progress.
- Prevents attempting operations on already-closed connections.

2. Targeted error handling for connection failures
- Replace blanket `RAY_CHECK_OK()` with specific error type checking.
- Handle connection errors (Broken pipe, Connection reset, Bad file
descriptor) as warnings during shutdown scenarios.
- Maintain `RAY_CHECK_OK()` for other error types to catch real issues.

---------

Signed-off-by: Sagar Sumit <[email protected]>
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
… condition (ray-project#55367)

## Why are these changes needed?

Workers crash with a fatal `RAY_CHECK` failure when the plasma store
connection is broken during shutdown, causing the following error:
```
RAY_CHECK failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
```
Stacktrace:
```
core_worker.cc:720 C  Check failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x141789a) [0x7924dd2c689a] ray::operator<<()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7924dd2c9319] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x95cc8a) [0x7924dc80bc8a] ray::core::CoreWorker::CoreWorker()::{lambda()ray-project#13}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager27MarkTaskReturnObjectsFailedERKNS_17TaskSpecificationENS_3rpc9ErrorTypeEPKNS5_12RayErrorInfoERKN4absl12lts_2023080213flat_hash_setINS_8ObjectIDENSB_13hash_internal4HashISD_EESt8equal_toISD_ESaISD_EEE+0x679) [0x7924dc868f29] ray::core::TaskManager::MarkTaskReturnObjectsFailed()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager15FailPendingTaskERKNS_6TaskIDENS_3rpc9ErrorTypeEPKNS_6StatusEPKNS5_12RayErrorInfoE+0x416) [0x7924dc86f186] ray::core::TaskManager::FailPendingTask()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x9a90e6) [0x7924dc8580e6] ray::core::NormalTaskSubmitter::RequestNewWorkerIfNeeded()::{lambda()ray-project#1}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_23RequestWorkerLeaseReplyEE15OnReplyReceivedEv+0x68) [0x7924dc94aa48] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7924dc79e285] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd9b4c8) [0x7924dcc4a4c8] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd4648e) [0x7924dcbf548e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd46906) [0x7924dcbf5906] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f417b) [0x7924dd2a317b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f5af9) [0x7924dd2a4af9] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f6202) [0x7924dd2a5202] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x7924dc793a61] ray::core::CoreWorker::RunIOService()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xcba0b0) [0x7924dcb690b0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7924dde71ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7924ddf03850]
```

Stack trace flow:
1. Task lease request fails ->
`NormalTaskSubmitter::RequestNewWorkerIfNeeded()` callback.
2. Triggers `TaskManager::FailPendingTask()` ->
`MarkTaskReturnObjectsFailed()`.
3. System attempts to store error objects in plasma via
`put_in_local_plasma_callback_`.
4. Plasma connection is broken (raylet/plasma store already shut down).
5. `RAY_CHECK_OK()` in the callback causes fatal crash instead of
graceful handling.

Root Cause:

This is a shutdown ordering race condition:
1. Raylet shuts down first: The raylet stops its IO context
([main_service_.stop()](https://github.com/ray-project/ray/blob/77c5475195e56a26891d88460973198391d20edf/src/ray/object_manager/plasma/store_runner.cc#L146))
which closes plasma store connections.
2. Worker still processes callbacks: Core worker continues processing
pending callbacks on separate threads.
3. Broken connection: When the callback tries to store error objects in
plasma, the connection is already closed.
4. Fatal crash: The `RAY_CHECK_OK()` treats this as an unexpected error
and crashes the process.

Fix:

1. Shutdown-aware plasma operations
- Add `CoreWorker::IsShuttingDown()` method to check shutdown state.
- Skip plasma operations entirely when shutdown is in progress.
- Prevents attempting operations on already-closed connections.

2. Targeted error handling for connection failures
- Replace blanket `RAY_CHECK_OK()` with specific error type checking.
- Handle connection errors (Broken pipe, Connection reset, Bad file
descriptor) as warnings during shutdown scenarios.
- Maintain `RAY_CHECK_OK()` for other error types to catch real issues.

---------

Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: jugalshah291 <[email protected]>
kyuds added a commit to kyuds/ray that referenced this pull request Sep 28, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
… condition (#55367)

## Why are these changes needed?

Workers crash with a fatal `RAY_CHECK` failure when the plasma store
connection is broken during shutdown, causing the following error:
```
RAY_CHECK failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
```
Stacktrace:
```
core_worker.cc:720 C  Check failed: PutInLocalPlasmaStore(object, object_id, true) Status not OK: IOError: Broken pipe
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x141789a) [0x7924dd2c689a] ray::operator<<()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7924dd2c9319] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x95cc8a) [0x7924dc80bc8a] ray::core::CoreWorker::CoreWorker()::{lambda()#13}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager27MarkTaskReturnObjectsFailedERKNS_17TaskSpecificationENS_3rpc9ErrorTypeEPKNS5_12RayErrorInfoERKN4absl12lts_2023080213flat_hash_setINS_8ObjectIDENSB_13hash_internal4HashISD_EESt8equal_toISD_ESaISD_EEE+0x679) [0x7924dc868f29] ray::core::TaskManager::MarkTaskReturnObjectsFailed()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core11TaskManager15FailPendingTaskERKNS_6TaskIDENS_3rpc9ErrorTypeEPKNS_6StatusEPKNS5_12RayErrorInfoE+0x416) [0x7924dc86f186] ray::core::TaskManager::FailPendingTask()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x9a90e6) [0x7924dc8580e6] ray::core::NormalTaskSubmitter::RequestNewWorkerIfNeeded()::{lambda()#1}::operator()()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray3rpc14ClientCallImplINS0_23RequestWorkerLeaseReplyEE15OnReplyReceivedEv+0x68) [0x7924dc94aa48] ray::rpc::ClientCallImpl<>::OnReplyReceived()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E9_M_invokeERKSt9_Any_data+0x15) [0x7924dc79e285] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd9b4c8) [0x7924dcc4a4c8] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd4648e) [0x7924dcbf548e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xd46906) [0x7924dcbf5906] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f417b) [0x7924dd2a317b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f5af9) [0x7924dd2a4af9] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0x13f6202) [0x7924dd2a5202] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0x91) [0x7924dc793a61] ray::core::CoreWorker::RunIOService()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/_raylet.so(+0xcba0b0) [0x7924dcb690b0] thread_proxy
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7924dde71ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7924ddf03850]
```

Stack trace flow:
1. Task lease request fails ->
`NormalTaskSubmitter::RequestNewWorkerIfNeeded()` callback.
2. Triggers `TaskManager::FailPendingTask()` ->
`MarkTaskReturnObjectsFailed()`.
3. System attempts to store error objects in plasma via
`put_in_local_plasma_callback_`.
4. Plasma connection is broken (raylet/plasma store already shut down).
5. `RAY_CHECK_OK()` in the callback causes fatal crash instead of
graceful handling.

Root Cause:

This is a shutdown ordering race condition:
1. Raylet shuts down first: The raylet stops its IO context
([main_service_.stop()](https://github.com/ray-project/ray/blob/77c5475195e56a26891d88460973198391d20edf/src/ray/object_manager/plasma/store_runner.cc#L146))
which closes plasma store connections.
2. Worker still processes callbacks: Core worker continues processing
pending callbacks on separate threads.
3. Broken connection: When the callback tries to store error objects in
plasma, the connection is already closed.
4. Fatal crash: The `RAY_CHECK_OK()` treats this as an unexpected error
and crashes the process.

Fix:

1. Shutdown-aware plasma operations
- Add `CoreWorker::IsShuttingDown()` method to check shutdown state.
- Skip plasma operations entirely when shutdown is in progress.
- Prevents attempting operations on already-closed connections.

2. Targeted error handling for connection failures
- Replace blanket `RAY_CHECK_OK()` with specific error type checking.
- Handle connection errors (Broken pipe, Connection reset, Bad file
descriptor) as warnings during shutdown scenarios.
- Maintain `RAY_CHECK_OK()` for other error types to catch real issues.

---------

Signed-off-by: Sagar Sumit <[email protected]>
Signed-off-by: Douglas Strodtman <[email protected]>
kyuds added a commit to kyuds/ray that referenced this pull request Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants