
Conversation

@uncleGen (Contributor) commented Mar 3, 2017

What changes were proposed in this pull request?

200ms may be too short. Give replication more time to happen and give the new block more time to be reported to the master.
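In other words, the fix is just a longer fixed pause in the replication test, roughly (the exact call site is not quoted here):

    // give replication and the block report to the master more time to finish
    Thread.sleep(1000)  // previously Thread.sleep(200)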

How was this patch tested?

Tested manually.

@SparkQA commented Mar 3, 2017

Test build #73800 has finished for PR 17144 at commit 9ec5caf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uncleGen (Contributor Author) commented Mar 3, 2017

One more flaky test? org.apache.spark.streaming.CheckpointSuite.recovery with map and reduceByKey operations. I will check it later. Retest this please.

@SparkQA commented Mar 3, 2017

Test build #73814 has started for PR 17144 at commit 9ec5caf.

@uncleGen (Contributor Author) commented Mar 3, 2017

The test crashed. Retest this please.

@uncleGen (Contributor Author) commented Mar 3, 2017

cc @kayousterhout

@SparkQA commented Mar 3, 2017

Test build #73833 has finished for PR 17144 at commit 9ec5caf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout (Contributor)

I'm not really the right person to review this code, but that being said, I'm not crazy about this fix, because 1s is kind of a long time to consistently wait. It's better for tests to continually check a condition and then time out after some period -- this allows the test to complete quickly in the normal case, but still gives some leeway for when Jenkins is busy. What about instead wrapping the condition in the second part of the test in an eventually block, and then giving that a more generous timeout (e.g., a few seconds)?
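A minimal sketch of that pattern using ScalaTest's Eventually trait (the imports and the 5 s / 10 ms values are illustrative; newLocations and replicationFactor are the identifiers from the test under discussion):

    import org.scalatest.concurrent.Eventually._
    import org.scalatest.time.SpanSugar._

    // Re-run the block every 10 ms until the assertion passes, failing only if it
    // still hasn't passed after 5 seconds. Anything the condition depends on should
    // be recomputed inside the block so each retry sees fresh state.
    eventually(timeout(5 seconds), interval(10 millis)) {
      assert(newLocations.size === replicationFactor)
    }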

@kayousterhout (Contributor)

cc @shubhamchopra who wrote the original code and @JoshRosen who did the main review

@kayousterhout (Contributor)

Also @uncleGen would you mind filing a JIRA for the second failed test case?

@uncleGen (Contributor Author) commented Mar 5, 2017

@kayousterhout Sure, I was fixing that second flaky test.
It has been fixed in PR #17167.

eventually(timeout(5 seconds), interval(10 millis)) {
  assert(newLocations.size === replicationFactor)
}
// there should only be one common block manager between initial and new locations
Contributor Author

Continually check the condition and then time out after 5 seconds.

@SparkQA commented Mar 7, 2017

Test build #74058 has finished for PR 17144 at commit 9a6cc92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uncleGen (Contributor Author) commented Mar 7, 2017

cc @srowen

logInfo(s"New locations : $newLocations")
assert(newLocations.size === replicationFactor)
eventually(timeout(5 seconds), interval(10 millis)) {
assert(newLocations.size === replicationFactor)
Contributor

line 495 needs to be in here too -- otherwise you're continually checking the same set of locations
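In other words, the location lookup itself has to happen inside the block so each retry re-queries the master; a sketch, assuming the lookup at that line is roughly a master.getLocations call (the exact code is not quoted in this thread):

    eventually(timeout(5 seconds), interval(10 millis)) {
      // re-fetch the locations on every attempt instead of asserting on a stale snapshot
      val newLocations = master.getLocations(blockId).toSet
      assert(newLocations.size === replicationFactor)
    }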

Contributor

Also can you remove the two sleeps above now?

@uncleGen (Contributor Author), Mar 7, 2017

IMHO, we cannot remove the first sleep. For example, suppose there are three block managers A, B, and C. When we start to remove BM-A, all blocks in BM-A will be replicated to BM-B and BM-C. We cannot remove BM-B immediately or too quickly, as there may not be enough time for the replication to finish, and the new block info may not get registered with the master properly. But it is OK to remove the second sleep.
@kayousterhout Tell me if I am missing something.
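A rough sketch of the ordering I mean (the names here are illustrative, not the suite's actual helpers):

    // Remove block managers one at a time. Pause after each removal so the victim's
    // blocks can be re-replicated to the surviving block managers and the new
    // locations reported to the master before the next removal starts.
    executorsToRemove.foreach { execId =>
      master.removeExecutor(execId)
      Thread.sleep(200)
    }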

Contributor Author

@srowen Please view the discussion here. Maybe we should keep the first sleep.

Member

Yes, I'm not asking for it to be removed, but for it to be restored to 200ms at least.

Contributor Author

Ahhh, got it

@srowen (Member) commented Mar 7, 2017

OK but should the Thread.sleep change be reverted entirely then?

@SparkQA commented Mar 7, 2017

Test build #74085 has finished for PR 17144 at commit 9c182ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 7, 2017

Test build #74089 has finished for PR 17144 at commit 09e8879.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

Looks OK w.r.t. Kay's comment

@kayousterhout (Contributor)

OK, this LGTM and I merged it to master.

I tested this a bunch because, in theory, it seems like the check that the block has been properly re-replicated should (or could) happen inside the loop, after each block is removed, which would also avoid the sleep. But there seem to be various race conditions in the code that mean this doesn't work, so this PR remains an incremental improvement that makes the test more reliable.

@asfgit closed this in 49570ed on Mar 7, 2017
