[SPARK-19803][TEST] flaky BlockManagerReplicationSuite test failure #17144

uncleGen · 2017-03-03T03:04:42Z

What changes were proposed in this pull request?

200ms may be too short. Give replication more time to complete and the new block more time to be reported to the master.

How was this patch tested?

Tested manually.

SparkQA · 2017-03-03T05:07:24Z

Test build #73800 has finished for PR 17144 at commit 9ec5caf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

uncleGen · 2017-03-03T05:30:28Z

One more flaky test? org.apache.spark.streaming.CheckpointSuite.recovery with map and reduceByKey operations. I will check it later. retest this please.

SparkQA · 2017-03-03T05:32:35Z

Test build #73814 has started for PR 17144 at commit 9ec5caf.

uncleGen · 2017-03-03T08:09:04Z

The test run crashed. retest this please.

uncleGen · 2017-03-03T09:37:12Z

cc @kayousterhout

SparkQA · 2017-03-03T10:41:04Z

Test build #73833 has finished for PR 17144 at commit 9ec5caf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kayousterhout · 2017-03-03T19:26:45Z

I'm not really the right person to review this code, but that being said, I'm not crazy about this fix, because 1s is kind of a long time to consistently wait. It's better for tests to continually check a condition and then time out after some time -- this allows the test to complete quickly in the normal case, but still gives some leeway for when Jenkins is busy. What about instead wrapping the condition in the second part of the test in an eventually block, and then giving that a more generous timeout (e.g., a few seconds)?
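The difference between a fixed sleep and a polled condition can be sketched in plain Scala. This is only an illustration of the idea, not ScalaTest's eventually and not code from this PR; PollingSketch and pollUntil are hypothetical names introduced here.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration._

object PollingSketch {
  /**
   * Re-evaluates `condition` until it holds or `timeout` elapses,
   * sleeping `interval` between attempts. Returns true as soon as the
   * condition holds, so the common fast case does not pay the full timeout.
   */
  def pollUntil(timeout: FiniteDuration, interval: FiniteDuration)(condition: => Boolean): Boolean = {
    val deadline = timeout.fromNow

    @tailrec
    def loop(): Boolean =
      if (condition) true                 // success: return immediately
      else if (deadline.isOverdue()) false // give up once the timeout has passed
      else {
        Thread.sleep(interval.toMillis)   // wait briefly before the next attempt
        loop()
      }

    loop()
  }
}
```

With a helper like this, a call such as PollingSketch.pollUntil(5.seconds, 10.millis)(someCheck()) returns within roughly one interval of the condition becoming true, and only a slow run (for example on a busy CI machine) waits for the whole timeout. Here someCheck() is a placeholder for whatever the test needs to verify.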

kayousterhout · 2017-03-03T19:27:55Z

cc @shubhamchopra who wrote the original code and @JoshRosen who did the main review

kayousterhout · 2017-03-03T19:28:38Z

Also @uncleGen would you mind filing a JIRA for the second failed test case?

uncleGen · 2017-03-05T05:09:43Z

@kayousterhout ~~Sure, I was fixing that second flaky test.~~
It has been fixed in PR #17167.

uncleGen · 2017-03-07T01:35:04Z

core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala

+    eventually(timeout(5 seconds), interval(10 millis)) {
+      assert(newLocations.size === replicationFactor)
+    }
    // there should only be one common block manager between initial and new locations


Continually check the condition and time out after 5 seconds.

SparkQA · 2017-03-07T04:08:34Z

Test build #74058 has finished for PR 17144 at commit 9a6cc92.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

uncleGen · 2017-03-07T04:09:34Z

cc @srowen

kayousterhout · 2017-03-07T07:19:35Z

core/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala

    logInfo(s"New locations : $newLocations")
-    assert(newLocations.size === replicationFactor)
+    eventually(timeout(5 seconds), interval(10 millis)) {
+      assert(newLocations.size === replicationFactor)


line 495 needs to be in here too -- otherwise you're continually checking the same set of locations

Also can you remove the two sleeps above now?
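The concern about checking the same set of locations can be illustrated with a small self-contained sketch. It assumes a stand-in FakeMaster rather than Spark's real block manager master, and a hand-written pollUntil rather than ScalaTest's eventually; every name in it is hypothetical.

```scala
import scala.concurrent.duration._

object StaleSnapshotSketch {
  // Hypothetical stand-in for the master: reports the current locations of one block.
  final class FakeMaster(initial: Set[String]) {
    @volatile private var locations: Set[String] = initial
    def getLocations: Set[String] = locations
    def addLocation(id: String): Unit = synchronized { locations += id }
  }

  // Minimal polling helper: re-evaluates `condition` until it holds or time runs out.
  def pollUntil(timeout: FiniteDuration, interval: FiniteDuration)(condition: => Boolean): Boolean = {
    val deadline = timeout.fromNow
    var ok = condition
    while (!ok && deadline.hasTimeLeft()) {
      Thread.sleep(interval.toMillis)
      ok = condition
    }
    ok
  }

  /** Returns (result when polling a stale snapshot, result when re-reading each attempt). */
  def demo(): (Boolean, Boolean) = {
    val master = new FakeMaster(Set("bm-b"))
    val replicationFactor = 2

    // Snapshot taken once, outside the polled block: it can never change.
    val snapshot = master.getLocations

    // Simulate replication completing a little later on another thread.
    val replicator = new Thread(new Runnable {
      def run(): Unit = { Thread.sleep(100); master.addLocation("bm-c") }
    })
    replicator.start()

    // Polling the snapshot just repeats the same failing comparison.
    val stale = pollUntil(300.millis, 10.millis)(snapshot.size == replicationFactor)

    // Re-reading the locations on every attempt lets the check see the update.
    val fresh = pollUntil(2.seconds, 10.millis)(master.getLocations.size == replicationFactor)

    replicator.join()
    (stale, fresh)
  }
}
```

In this sketch the first poll times out because the captured set never grows, while the second succeeds because the lookup sits inside the retried block. The same reasoning applies to any retry wrapper: whatever is expected to change has to be read inside it.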

IMHO, we cannot remove the first sleep. For example, suppose there are three block managers: A, B, and C. When we start to remove BM-A, all blocks in BM-A will be replicated to BM-B and BM-C. We cannot remove BM-B immediately or too quickly, because there may not be enough time to finish replication, and the new block info may not be registered with the master properly. But it is OK to remove the second sleep.
@kayousterhout Tell me if I am missing something.

@srowen Please view the discussion here. Maybe we should keep the first sleep.

Yes, I'm not asking whether it should be removed, but whether it should at least be restored to 200ms.

Ahhh, got it

srowen · 2017-03-07T10:00:13Z

OK but should the Thread.sleep change be reverted entirely then?

SparkQA · 2017-03-07T11:26:43Z

Test build #74085 has finished for PR 17144 at commit 9c182ef.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-03-07T13:11:07Z

Test build #74089 has finished for PR 17144 at commit 09e8879.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Looks OK w.r.t. Kay's comment

kayousterhout · 2017-03-07T20:25:41Z

Ok this LGTM and I merged to master.

I tested this a bunch because in theory, it seems like the check that the block has been properly re-replicated should / could happen inside the loop (after each block is removed), which would also avoid the sleep. But there seem to be various race conditions in the code that mean this doesn't work, and this PR remains an incremental improvement to make this more reliable.

srowen approved these changes Mar 3, 2017

uncleGen added 2 commits March 7, 2017 09:26

flaky BlockManagerReplicationSuite test failure

a132b51

address comments

9a6cc92

uncleGen force-pushed the SPARK-19803 branch from 9ec5caf to 9a6cc92 Compare March 7, 2017 01:31

uncleGen commented Mar 7, 2017

kayousterhout reviewed Mar 7, 2017

bug fix

9c182ef

revert

09e8879

srowen approved these changes Mar 7, 2017

asfgit closed this in 49570ed Mar 7, 2017
