-
Notifications
You must be signed in to change notification settings - Fork 6.9k
[core] Fix python 3.12 asyncio RecursionError leading to objects_valid check #57253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…d check Signed-off-by: dayshah <[email protected]>
Signed-off-by: dayshah <[email protected]>
| cdef increase_recursion_limit(): | ||
| """Double the recusion limit if current depth is close to the limit""" | ||
| """ | ||
| Ray does some weird things with asio fibers and asyncio to run asyncio actors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Killing boost fiber is on the @edoakes's wishlist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya ik I talked to him about it, he said he's gonna try prioritizing next quarter 👀
| ctypedef struct CPyThreadState "PyThreadState": | ||
| int recursion_limit | ||
| int recursion_remaining | ||
| int c_recursion_remaining |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field exists in all Python versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no it doesn't but having it here doesn't cause any issues even if used with old python versions
recursion_limit and recursion_remaining also don't exist for all python versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, interesting
python/ray/_raylet.pyx
Outdated
| """ | ||
| #if PY_VERSION_HEX >= 0x30C0000 | ||
| #define CURRENT_DEPTH(x) ((x)->py_recursion_limit - (x)->py_recursion_remaining) | ||
| std::pair<bool, int> DoOrGetRecursionMadness(PyThreadState *x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does Do mean here? I thought this function is supposed to be read-only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for 3.12 it has to increase c_recursion_remaining now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Py_SetRecursionLimit is just a C function so it can be called inside cdef extern from *: as well so we can do all the adjustment inside the C function we define? (Just trying to see how to make the code cleaner)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ya nice suggestion, didn't think of that.
Cleaned up code to just have everything happen in the cdef extern
Signed-off-by: dayshah <[email protected]>
Signed-off-by: dayshah <[email protected]>
python/ray/_raylet.pyx
Outdated
| """ | ||
| #if PY_VERSION_HEX >= 0x30C0000 | ||
| #define CURRENT_DEPTH(x) ((x)->py_recursion_limit - (x)->py_recursion_remaining) | ||
| std::pair<bool, int> DoOrGetRecursionMadness(PyThreadState *x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Py_SetRecursionLimit is just a C function so it can be called inside cdef extern from *: as well so we can do all the adjustment inside the C function we define? (Just trying to see how to make the code cleaner)
Signed-off-by: dayshah <[email protected]>
edoakes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work piecing this together. I think we have enough datapoints to indicate that we really need to get away from boost fiber...
Signed-off-by: dayshah <[email protected]>
cherrypick #57247 #57253 #57138 Signed-off-by: Lonnie Liu <[email protected]>
cherrypick #57247 #57253 #57138 Signed-off-by: Lonnie Liu <[email protected]>
…d check (ray-project#57253) Signed-off-by: dayshah <[email protected]>
…d check (ray-project#57253) Signed-off-by: dayshah <[email protected]> Signed-off-by: Josh Kodi <[email protected]>
…d check (ray-project#57253) Signed-off-by: dayshah <[email protected]>
…d check (ray-project#57253) Signed-off-by: dayshah <[email protected]> Signed-off-by: xgui <[email protected]>
…d check (ray-project#57253) Signed-off-by: dayshah <[email protected]>
…d check (ray-project#57253) Signed-off-by: dayshah <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
Why are these changes needed?
Currently if the
task_execution_handlerin _raylet.pyx throws an exception before we can get into the actual task handling code that catches all exceptions, cython will end up just continuing execution to return Status::OK. This means that we could finishExecuteTaskwith Status::OK and a return objects vector that has empty objects leading to the dreadedobjects_validcheck failure.One of the primary reasons this can happen is due to a RecursionError that can be thrown pretty easily (and in an unexpected way to users) due to the way we do asyncio through boost fibers which makes the Python interpreter think our recursion depth is very high. We try to handle this with this code
ray/python/ray/_raylet.pyx
Lines 681 to 701 in 534b0e4
but it doesn't actually work in the Python 3.12 (PY_VERSION_HEX >= 0x30C0000) case. This is because CPython changed to having separate variables for c recursion and py recursion in 3.12 here python/cpython#96510.
CPython just sets the
c_recursion_remainingof the thread state toC_RECURSION_LIMITat thread state initialization and then only usesc_recursion_remaining, so we can just to increasec_recursion_remainingby 1k once it drops below 1k to keep it over 1k.https://github.com/python/cpython/blob/bfb9e2f4a4e690099ec2ec53c08b90f4d64fde36/Python/pystate.c#L1353
Repro
Note that there is a test that does something very similar to this and triggers the same condition:
test_asyncio.py::test_asyncio_actor_high_concurrency, but it seems like it might not running on Python 3.12 on CI now, trying to get to the bottom of that.Related issue number
#57173