
Conversation

@attilapiros (Contributor) commented Feb 15, 2019

What changes were proposed in this pull request?

The test "RequestExecutors reflects node blacklist and is serializable" is flaky because of multi-threaded access of the mocked task scheduler. For details check the Mockito FAQ (occasional exceptions like WrongTypeOfReturnValue). So instead of mocking the task scheduler, the test simply subclasses TaskSchedulerImpl (a minimal sketch of that approach follows the list below).

This multi-threaded access of the nodeBlacklist() method comes from:

  1. the unit test thread via calling of the method prepareRequestExecutors()
  2. the DriverEndpoint.onStart which runs a periodic task that ends up calling this method
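
A minimal sketch of that subclassing approach, assuming the TaskSchedulerImpl(sc) constructor and the nodeBlacklist() signature that appear in the review diff later in this conversation; the class name TestTaskSchedulerImpl is illustrative, not necessarily the exact code in this PR:

    import java.util.concurrent.atomic.AtomicReference

    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.TaskSchedulerImpl

    // A real TaskSchedulerImpl whose blacklist lives in an AtomicReference, so the
    // driver endpoint's periodic task can read nodeBlacklist() concurrently without
    // any Mockito stubbing happening on another thread.
    class TestTaskSchedulerImpl(sc: SparkContext) extends TaskSchedulerImpl(sc) {
      private val blacklistedNodes = new AtomicReference[Set[String]](Set.empty)

      def setNodeBlacklist(nodeBlacklist: Set[String]): Unit =
        blacklistedNodes.set(nodeBlacklist)

      override def nodeBlacklist(): Set[String] = blacklistedNodes.get()
    }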

How was this patch tested?

Existing unit test.

@attilapiros changed the title from "[SPARK-26891][YARN] Fixing flaky test YarnSchedulerBackendSuite" to "[SPARK-26891][YARN] Fixing flaky test in YarnSchedulerBackendSuite" on Feb 15, 2019
@SparkQA commented Feb 15, 2019

Test build #102395 has finished for PR 23801 at commit 90caf6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Feb 15, 2019

Hmm, I think I understand where this is coming from. But if that's the case, then the following test in the same suite is also flaky, for the same reason.

I think the problem is here:

val driverEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint())

When the test instantiates a YarnSchedulerBackend, that line is overwriting the endpoint in the running SparkContext, so that any messages now are sent to the test scheduler instead. That's where the multi-threaded access comes from.

So I think the best thing here would be to avoid having a live SparkContext if possible. If not, then figure out another way to avoid the situation above.

@vanzin (Contributor) commented Feb 15, 2019

So, just remember that I changed that endpoint initialization recently... I still think the best way is to avoid the live SparkContext, but if that's not possible, one way to fix this would be:

private var _driverEndpoint: RpcEndpointRef = _
def driverEndpoint: RpcEndpointRef = _driverEndpoint

And then initialize the endpoint in start(), which is not called by the test.
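
For illustration, a self-contained toy sketch of that "register only in start()" idea; FakeRpcEnv, FakeEndpointRef and ToyBackend are stand-ins invented for this example, not Spark classes:

    // The endpoint is registered in start() rather than in the constructor, so a
    // test that merely instantiates the backend never overwrites a live endpoint.
    final class FakeEndpointRef(val name: String)

    final class FakeRpcEnv {
      private var registered = Map.empty[String, FakeEndpointRef]
      def setupEndpoint(name: String): FakeEndpointRef = {
        val ref = new FakeEndpointRef(name)
        registered += name -> ref
        ref
      }
      def isRegistered(name: String): Boolean = registered.contains(name)
    }

    class ToyBackend(rpcEnv: FakeRpcEnv) {
      private var _driverEndpoint: FakeEndpointRef = _
      def driverEndpoint: FakeEndpointRef = _driverEndpoint

      def start(): Unit = {
        _driverEndpoint = rpcEnv.setupEndpoint("CoarseGrainedScheduler")
      }
    }

    object DelayedInitDemo extends App {
      val env = new FakeRpcEnv
      val backend = new ToyBackend(env)                  // construction: nothing registered
      assert(!env.isRegistered("CoarseGrainedScheduler"))
      backend.start()                                    // start(): endpoint registered
      assert(env.isRegistered("CoarseGrainedScheduler"))
    }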

@attilapiros (Contributor, Author) commented:

@vanzin, in your PR the failed test's exception stack trace was:

at org.apache.spark.scheduler.cluster.YarnSchedulerBackendSuite.$anonfun$new$4(YarnSchedulerBackendSuite.scala:54)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.cluster.YarnSchedulerBackendSuite.$anonfun$new$3(YarnSchedulerBackendSuite.scala:48)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
	at org.apache.spark.scheduler.cluster.YarnSchedulerBackendSuite.$anonfun$new$2(YarnSchedulerBackendSuite.scala:47)

Based on the exception, the problematic line (YarnSchedulerBackendSuite.scala:54) is:

      when(sched.nodeBlacklist()).thenReturn(blacklist)	

I see there is SparkContext access in the 2nd test as well, but I do not see failures for that one. On the other hand, I have seen this one fail more than once.

@vanzin (Contributor) commented Feb 15, 2019

Yes, but that's what flaky means. It may fail or not.

Your PR description doesn't explain what the problem is. It just says "multi-threaded access", but you haven't explained how that happens; there's no multi-threaded code in this test at all.

So given that you haven't explained the problem, neither here nor in the bug, I was curious about how this was happening, and figured it out (see my explanation). Which means that the problem also exists in the other test.

@attilapiros (Contributor, Author) commented:

You are right, I did not explain it in detail, as I thought the error text and the stack trace made the problem clear.

I ran the old code 60 times and the error occurred 4 times. After my change it ran successfully 600 times, so I stopped there and created the PR.

But of course I will take a look at your suggestions and modify the PR accordingly.

@vanzin (Contributor) commented Feb 15, 2019

Actually, I'm not so sure my analysis is correct either, after looking more... the SparkContext is running in local mode, so there's no second CoarseGrainedSchedulerBackend (otherwise there would be an exception when the test registered a second one).

But I'm still not happy with the "this is a multi-threading problem" explanation, because this test is not multi-threaded. So we should at least understand where the call that's causing the problem is coming from.

@vanzin (Contributor) commented Feb 15, 2019

Ok, found it. It's not multiple schedulers being active; it's because of that code starting the scheduler endpoint: DriverEndpoint.onStart runs a periodic task that ends up calling the method that is being complained about.

If it's called at the same time as the mock is being modified in L54 of the test, things blow up.

So the "delayed initialization" of the driver endpoint that I suggested would fix it.
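
To make the race concrete, here is a self-contained sketch (not taken from the suite; Blacklist and MockitoRaceDemo are invented for this example) of one thread re-stubbing a Mockito mock while another thread keeps calling the stubbed method. It may or may not fail on any given run, which is exactly what makes the original test flaky:

    import org.mockito.Mockito.{mock, when}

    // Stand-in for the scheduler; only the racy method matters here.
    trait Blacklist {
      def nodeBlacklist(): Set[String]
    }

    object MockitoRaceDemo extends App {
      val sched = mock(classOf[Blacklist])

      val reader = new Thread(new Runnable {
        // Plays the role of the periodic task started by DriverEndpoint.onStart.
        override def run(): Unit = (1 to 10000).foreach(_ => sched.nodeBlacklist())
      })
      reader.start()

      // Plays the role of the test thread stubbing the mock (L54 of the suite).
      (1 to 10000).foreach { i =>
        when(sched.nodeBlacklist()).thenReturn(Set(s"host$i"))
      }
      reader.join()
    }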

@vanzin (Contributor) commented Feb 15, 2019

BTW, this fix would be fine too except for one issue: each instantiation of YarnSchedulerBackend creates a single-thread pool that is never shut down (since the test doesn't call YarnSchedulerBackend.stop()), and there will be a task scheduled on that thread pool for the duration of the test run (you can see a thread-leak warning in the test logs).

So your fix would be fine if you also fix that problem by calling stop() in the tests, and update the PR description to explain the underlying problem.

@SparkQA commented Feb 16, 2019

Test build #102414 has finished for PR 23801 at commit 765f7e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment:

Looks fine to me, just a few comments

override def nodeBlacklist(): Set[String] = blacklistedNodes.get()
}

val yarnSchedulerBackendExtended = new YarnSchedulerBackend(sched, sc) {
Member:

Do you need the extra variable here, vs assigning to yarnSchedulerBackend? I don't see that they are used separately.

@attilapiros (Contributor, Author):

It is needed because of the different types: yarnSchedulerBackend is declared as YarnSchedulerBackend, but yarnSchedulerBackendExtended is an anonymous subclass of YarnSchedulerBackend with the extra def setNodeBlacklist. On yarnSchedulerBackend I cannot call this extra method.

Member:

If so, then how is it assigned in the next line? A subclass of YarnSchedulerBackend is still assignable to YarnSchedulerBackend. I might be missing something obvious here.

@attilapiros (Contributor, Author):

It is assignable because yarnSchedulerBackendExtended is an instance of YarnSchedulerBackend too, although not a direct one (see the small illustration below).
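
For readers following along, a small self-contained illustration of this point; Backend and TypeDemo are toy stand-ins, not the real YarnSchedulerBackend:

    import scala.language.reflectiveCalls

    class Backend {
      def nodeBlacklist(): Set[String] = Set.empty
    }

    object TypeDemo extends App {
      // The inferred type of `extended` is a refinement of Backend that still
      // exposes the extra setNodeBlacklist method (as a structural call).
      val extended = new Backend {
        private var blacklist: Set[String] = Set.empty
        def setNodeBlacklist(nodes: Set[String]): Unit = { blacklist = nodes }
        override def nodeBlacklist(): Set[String] = blacklist
      }
      extended.setNodeBlacklist(Set("host1"))

      // Upcasting is fine (it is still a Backend), but the supertype-typed
      // reference no longer sees the extra method.
      val plain: Backend = extended
      // plain.setNodeBlacklist(Set("host2"))   // would not compile
      println(plain.nodeBlacklist())            // prints Set(host1)
    }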


override def afterEach() {
  try {
    yarnSchedulerBackend.stop()
Member:

Should this check if it's null, in case some tests don't set it? They might all do so now.

@attilapiros (Contributor, Author):

You are right, it is better to have it, so I will add it soon.
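
A sketch of the guarded cleanup being discussed, in the style of the snippet quoted above (the field name yarnSchedulerBackend comes from that snippet; the rest is an assumption, not necessarily the merged code):

    override def afterEach(): Unit = {
      try {
        // Guard against tests that never assign yarnSchedulerBackend, and always
        // stop() the backend so its scheduler thread pool does not leak.
        if (yarnSchedulerBackend != null) {
          yarnSchedulerBackend.stop()
        }
      } finally {
        super.afterEach()
      }
    }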

@SparkQA commented Feb 17, 2019

Test build #102435 has finished for PR 23801 at commit 1ff92bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Feb 19, 2019

Merging to master.

@vanzin closed this in e4e4e2b on Feb 19, 2019
@dongjoon-hyun (Member) commented:

Hi, all.
Can we have this in branch-2.4 too? branch-2.4 is the last branch in the 2.x line and we should keep it maintained for the long term.

@attilapiros (Contributor, Author) commented Apr 26, 2019

If it is needed I am happy to open a new PR for the backport.

@dongjoon-hyun (Member) commented:

+1, @attilapiros . Please make a new PR for branch-2.4.

@attilapiros (Contributor, Author) commented:

ok :)

attilapiros added a commit to attilapiros/spark that referenced this pull request Apr 26, 2019
The test "RequestExecutors reflects node blacklist and is serializable" is flaky because of multi-threaded access of the mock task scheduler. For details check [Mockito FAQ (occasional exceptions like: WrongTypeOfReturnValue)](https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe). So instead of mocking the task scheduler in the test, TaskSchedulerImpl is simply subclassed.

This multithreaded access of the `nodeBlacklist()` method is coming from:
1) the unit test thread via calling of the method `prepareRequestExecutors()`
2) the `DriverEndpoint.onStart` which runs a periodic task that ends up calling this method

Existing unit test.

Closes apache#23801 from attilapiros/SPARK-26891.

Authored-by: “attilapiros” <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit e4e4e2b)
@attilapiros (Contributor, Author) commented:

Ready: #24474
