-
Notifications
You must be signed in to change notification settings - Fork 6.9k
Revert "Revert "[core] Fix wrong local resource view in raylet (#1991… #19996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…roject#19911)" (ray-project#19992)" This reverts commit f1eedb1.
| << ". Updating resource map. skip=" << (node_id == self_node_id_); | ||
| } | ||
|
|
||
| if (node_id == self_node_id_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only this line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, what is the mechanism that processing resource deletion for this node slows down the test and makes it flaky?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, actually it's not slowing down the test. it's hanging and timeout eventually.
Basically, the local resource was deleted before scheduling and raylet found no resource to run the job.
Luckily, I can always reproduce it locally.
|
windows test failure is flaky. mac build hangs forever. everything else looks good. |
| // Updating local node could result in a inconsistence view in cluster resource | ||
| // scheduler which could make task hang. | ||
| if (node_id == self_node_id_) { | ||
| return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding cluster_task_manager_->ScheduleAndDispatchTasks(); here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like some tests are still failing after this PR, and this is the only behavior change I can imagine (since changes below seem to be only applied to grpc resource broadcast). I think calling the function one more time here won't hurt the performance
This reverts commit f1eedb1.
Why are these changes needed?
Self node should avoid reading any updates from gcs for node resource change since it'll maintain local view by itself.
Related issue number
Checks
scripts/format.shto lint the changes in this PR.