[SPARK-19803][TEST] flaky BlockManagerReplicationSuite test failure #17144
Conversation
|
Test build #73800 has finished for PR 17144 at commit
|
|
one more flaky test? |
|
Test build #73814 has started for PR 17144 at commit |
|
test crash. retest this please. |
|
Test build #73833 has finished for PR 17144 at commit
|
|
I'm not really the right person to review this code, but that being said, I'm not crazy about this fix, because 1s is kind of a long time to consistently wait. It's better for tests to continually check a condition and then timeout after some time -- this allows the test to complete quickly in the normal case, but still give some leeway for when Jenkins is busy. What about instead wrapping the condition in the second part of the test in an eventually block, and then giving that a more generous timeout (e.g., a few seconds)? |
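A minimal sketch of the difference between a fixed sleep and polling with a timeout. This is not the code in the PR and not ScalaTest's `eventually`; `PollingSketch` and `pollUntil` are hypothetical names used only for illustration. The helper returns as soon as the condition holds, so the normal case finishes quickly, and only a slow run waits for the full timeout.

```scala
import scala.annotation.tailrec

object PollingSketch {
  /** Hypothetical helper: re-evaluates `condition` every `intervalMs`
    * milliseconds until it holds or `timeoutMs` milliseconds have elapsed.
    * Returns true if the condition held before the deadline. */
  def pollUntil(timeoutMs: Long, intervalMs: Long)(condition: => Boolean): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L

    @tailrec
    def loop(): Boolean =
      if (condition) true                               // done early in the normal case
      else if (System.nanoTime() - deadline >= 0) false // gave up after the timeout
      else { Thread.sleep(intervalMs); loop() }         // wait a little and check again

    loop()
  }
}
```

For example, `PollingSketch.pollUntil(5000, 10)(flag.get())` on an `AtomicBoolean` set by another thread returns shortly after the flag flips rather than always waiting a fixed amount of time.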
|
cc @shubhamchopra who wrote the original code and @JoshRosen who did the main review |
|
Also @uncleGen would you mind filing a JIRA for the second failed test case? |
|
@kayousterhout |
eventually(timeout(5 seconds), interval(10 millis)) {
  assert(newLocations.size === replicationFactor)
}
// there should only be one common block manager between initial and new locations
continually check a condition and then timeout after 5 seconds
|
Test build #74058 has finished for PR 17144 at commit
|
|
cc @srowen |
logInfo(s"New locations : $newLocations")
assert(newLocations.size === replicationFactor)
eventually(timeout(5 seconds), interval(10 millis)) {
  assert(newLocations.size === replicationFactor)
line 495 needs to be in here too -- otherwise you're continually checking the same set of locations
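A small sketch of why the lookup has to sit inside the retried block. Everything here is hypothetical and stands in for the real test: `locations` plays the role of the master's view of a block, and `retry` is a hand-rolled loop, not ScalaTest's `eventually`. If the value is fetched once outside the block, every retry compares the same stale snapshot.

```scala
import java.util.concurrent.atomic.AtomicReference

object RefetchSketch {
  // Hypothetical stand-in for the block locations known to the master.
  val locations = new AtomicReference[Set[String]](Set("bm-1"))

  /** Hypothetical retry loop: evaluates `check` up to `attempts` times,
    * sleeping `intervalMs` milliseconds between attempts. */
  def retry(attempts: Int, intervalMs: Long)(check: => Boolean): Boolean =
    (1 to attempts).exists { i =>
      if (i > 1) Thread.sleep(intervalMs)
      check
    }

  /** Fetches the locations once, so every retry sees the same snapshot. */
  def staleCheck(expected: Int): Boolean = {
    val snapshot = locations.get()
    retry(20, 10)(snapshot.size == expected)
  }

  /** Fetches the locations on every retry, so later updates are observed. */
  def freshCheck(expected: Int): Boolean =
    retry(20, 10)(locations.get().size == expected)
}
```

With an update arriving from another thread shortly after the check starts, `staleCheck` can never succeed, while `freshCheck` passes as soon as the new location is visible.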
Also can you remove the two sleeps above now?
IMHO, we cannot remove the first sleep. For example, suppose there are three block managers A, B, and C. When we start to remove BM-A, all blocks in BM-A will be replicated to BM-B and BM-C. We cannot remove BM-B immediately or too quickly, as there may not be enough time to do the replication, and the new block info may not be registered with the master properly. But it is OK to remove the second sleep.
@kayousterhout Tell me if I am missing something.
@srowen Please view the discussion here. Maybe we should keep the first sleep.
Yes, I'm not asking if it should be removed, but be restored to 200ms at least.
Ahhh, got it
|
OK but should the Thread.sleep change be reverted entirely then? |
|
Test build #74085 has finished for PR 17144 at commit
|
|
Test build #74089 has finished for PR 17144 at commit
|
Looks OK w.r.t. Kay's comment
|
Ok this LGTM and I merged to master. I tested this a bunch because in theory, it seems like the check that the block has been properly re-replicated should / could happen inside the loop (after each block is removed), which would also avoid the sleep. But there seem to be various race conditions in the code that mean this doesn't work, and this PR remains an incremental improvement to make this more reliable. |
What changes were proposed in this pull request?
200ms may be too short. Give more time for replication to happen and for the new block to be reported to the master.
How was this patch tested?
Tested manually.