Skip to content

Conversation

@fishbone
Copy link
Contributor

@fishbone fishbone commented Nov 2, 2021

This reverts commit f1eedb1.

Why are these changes needed?

Self node should avoid reading any updates from gcs for node resource change since it'll maintain local view by itself.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone fishbone marked this pull request as ready for review November 3, 2021 01:26
<< ". Updating resource map. skip=" << (node_id == self_node_id_);
}

if (node_id == self_node_id_) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only this line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, what is the mechanism that processing resource deletion for this node slows down the test and makes it flaky?

Copy link
Contributor Author

@fishbone fishbone Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, actually it's not slowing down the test. it's hanging and timeout eventually.

Basically, the local resource was deleted before scheduling and raylet found no resource to run the job.

Luckily, I can always reproduce it locally.

@fishbone
Copy link
Contributor Author

fishbone commented Nov 3, 2021

windows test failure is flaky. mac build hangs forever. everything else looks good.
I want nightly to cover this PR. merge it now.

@fishbone fishbone merged commit 99034f5 into ray-project:master Nov 3, 2021
// Updating local node could result in a inconsistence view in cluster resource
// scheduler which could make task hang.
if (node_id == self_node_id_) {
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding cluster_task_manager_->ScheduleAndDispatchTasks(); here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like some tests are still failing after this PR, and this is the only behavior change I can imagine (since changes below seem to be only applied to grpc resource broadcast). I think calling the function one more time here won't hurt the performance

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants