
Conversation

@attilapiros (Contributor) commented Feb 15, 2019

What changes were proposed in this pull request?

The test "RequestExecutors reflects node blacklist and is serializable" is flaky because of multi-threaded access of the mocked task scheduler. For details check the Mockito FAQ (occasional exceptions like WrongTypeOfReturnValue). So instead of mocking the task scheduler, the test simply subclasses TaskSchedulerImpl (a minimal sketch of that approach follows the list below).

This multi-threaded access of the nodeBlacklist() method comes from:

  1. the unit test thread via calling of the method prepareRequestExecutors()
  2. the DriverEndpoint.onStart which runs a periodic task that ends up calling this method
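
A minimal sketch of that subclassing approach, assuming the TaskSchedulerImpl(sc) constructor and the nodeBlacklist() signature that appear in the review diff later in this conversation; the class name TestTaskSchedulerImpl is illustrative, not necessarily the exact code in this PR:

    import java.util.concurrent.atomic.AtomicReference

    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.TaskSchedulerImpl

    // A real TaskSchedulerImpl whose blacklist lives in an AtomicReference, so the
    // driver endpoint's periodic task can read nodeBlacklist() concurrently without
    // any Mockito stubbing happening on another thread.
    class TestTaskSchedulerImpl(sc: SparkContext) extends TaskSchedulerImpl(sc) {
      private val blacklistedNodes = new AtomicReference[Set[String]](Set.empty)

      def setNodeBlacklist(nodeBlacklist: Set[String]): Unit =
        blacklistedNodes.set(nodeBlacklist)

      override def nodeBlacklist(): Set[String] = blacklistedNodes.get()
    }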

How was this patch tested?

Existing unit test.

@attilapiros changed the title from "[SPARK-26891][YARN] Fixing flaky test YarnSchedulerBackendSuite" to "[SPARK-26891][YARN] Fixing flaky test in YarnSchedulerBackendSuite" on Feb 15, 2019
@SparkQA commented Feb 15, 2019

Test build #102395 has finished for PR 23801 at commit 90caf6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Feb 15, 2019

Hmm, I think I understand where this is coming from. But if that's the case, then the following test in the same suite is also flaky, for the same reason.

I think the problem is here:

val driverEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint())

When the test instantiates a YarnSchedulerBackend, that line is overwriting the endpoint in the running SparkContext, so that any messages now are sent to the test scheduler instead. That's where the multi-threaded access comes from.

So I think the best thing here would be to avoid having a live SparkContext if possible. If not, then figure out another way to avoid the situation above.

@vanzin (Contributor) commented Feb 15, 2019

So, just remember that I changed that endpoint initialization recently... I still think the best way is to avoid the live SparkContext, but if that's not possible, one way to fix this would be:

private var _driverEndpoint: RpcEndpointRef = _
def driverEndpoint: RpcEndpointRef = _driverEndpoint

And then initialize the endpoint in start(), which is not called by the test.
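
For illustration, a self-contained toy sketch of that "register only in start()" idea; FakeRpcEnv, FakeEndpointRef and ToyBackend are stand-ins invented for this example, not Spark classes:

    // The endpoint is registered in start() rather than in the constructor, so a
    // test that merely instantiates the backend never overwrites a live endpoint.
    final class FakeEndpointRef(val name: String)

    final class FakeRpcEnv {
      private var registered = Map.empty[String, FakeEndpointRef]
      def setupEndpoint(name: String): FakeEndpointRef = {
        val ref = new FakeEndpointRef(name)
        registered += name -> ref
        ref
      }
      def isRegistered(name: String): Boolean = registered.contains(name)
    }

    class ToyBackend(rpcEnv: FakeRpcEnv) {
      private var _driverEndpoint: FakeEndpointRef = _
      def driverEndpoint: FakeEndpointRef = _driverEndpoint

      def start(): Unit = {
        _driverEndpoint = rpcEnv.setupEndpoint("CoarseGrainedScheduler")
      }
    }

    object DelayedInitDemo extends App {
      val env = new FakeRpcEnv
      val backend = new ToyBackend(env)                  // construction: nothing registered
      assert(!env.isRegistered("CoarseGrainedScheduler"))
      backend.start()                                    // start(): endpoint registered
      assert(env.isRegistered("CoarseGrainedScheduler"))
    }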

@attilapiros (Contributor, Author) commented:

@vanzin, in your PR the failed test's exception stack trace was:

at org.apache.spark.scheduler.cluster.YarnSchedulerBackendSuite.$anonfun$new$4(YarnSchedulerBackendSuite.scala:54)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.cluster.YarnSchedulerBackendSuite.$anonfun$new$3(YarnSchedulerBackendSuite.scala:48)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
	at org.apache.spark.scheduler.cluster.YarnSchedulerBackendSuite.$anonfun$new$2(YarnSchedulerBackendSuite.scala:47)

Based on the exception, the problematic line (YarnSchedulerBackendSuite.scala:54) is:

      when(sched.nodeBlacklist()).thenReturn(blacklist)	

I see there is SparkContext access in the 2nd test as well, but I do not see failures for that one. On the other hand, I have seen this one fail more than once.

@vanzin (Contributor) commented Feb 15, 2019

Yes, but that's what flaky means. It may fail or not.

Your PR description doesn't explain what the problem is. It just says "multi-threaded access", but you haven't explained how that happens; there's no multi-threaded code in this test at all.

So given that you haven't explained the problem, neither here nor in the bug, I was curious about how this was happening, and figured it out (see my explanation). Which means that the problem also exists in the other test.

@attilapiros (Contributor, Author) commented:

You are right, I did not explain it in detail, as I thought the error text and the stack trace made the problem clear.

I ran the old code 60 times and the error occurred 4 times. After my change it ran successfully 600 times, so I stopped there and created the PR.

But of course I will take a look at your suggestions and modify the PR accordingly.

@vanzin (Contributor) commented Feb 15, 2019

Actually, I'm not so sure my analysis is correct either, after looking more... the SparkContext is running in local mode, so there's no second CoarseGrainedSchedulerBackend (otherwise there would be an exception when the test registered a second one).

But I'm still not happy with the "this is a multi-threading problem" explanation, because this test is not multi-threaded. So we should at least understand where the call that's causing the problem is coming from.

@vanzin (Contributor) commented Feb 15, 2019

Ok, found it. It's not multiple schedulers being active; it's because of that code starting the scheduler endpoint: DriverEndpoint.onStart runs a periodic task that ends up calling the method that is being complained about.

If it's called at the same time as the mock is being modified in L54 of the test, things blow up.

So the "delayed initialization" of the driver endpoint that I suggested would fix it.
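
To make the race concrete, here is a self-contained sketch (not taken from the suite; Blacklist and MockitoRaceDemo are invented for this example) of one thread re-stubbing a Mockito mock while another thread keeps calling the stubbed method. It may or may not fail on any given run, which is exactly what makes the original test flaky:

    import org.mockito.Mockito.{mock, when}

    // Stand-in for the scheduler; only the racy method matters here.
    trait Blacklist {
      def nodeBlacklist(): Set[String]
    }

    object MockitoRaceDemo extends App {
      val sched = mock(classOf[Blacklist])

      val reader = new Thread(new Runnable {
        // Plays the role of the periodic task started by DriverEndpoint.onStart.
        override def run(): Unit = (1 to 10000).foreach(_ => sched.nodeBlacklist())
      })
      reader.start()

      // Plays the role of the test thread stubbing the mock (L54 of the suite).
      (1 to 10000).foreach { i =>
        when(sched.nodeBlacklist()).thenReturn(Set(s"host$i"))
      }
      reader.join()
    }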

@vanzin (Contributor) commented Feb 15, 2019

BTW, this fix would be fine too except for one issue: each instantiation of YarnSchedulerBackend creates a single-thread pool that is never shut down (since the test doesn't call YarnSchedulerBackend.stop()), and there will be a task scheduled on that thread pool for the duration of the test run (you can see a thread-leak warning in the test logs).

So your fix would be fine if you also fix that problem by calling stop() in the tests, and update the PR description to explain the underlying problem.

@SparkQA commented Feb 16, 2019

Test build #102414 has finished for PR 23801 at commit 765f7e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment:

Looks fine to me, just a few comments

override def nodeBlacklist(): Set[String] = blacklistedNodes.get()
}

val yarnSchedulerBackendExtended = new YarnSchedulerBackend(sched, sc) {
Member:

Do you need the extra variable here, vs assigning to yarnSchedulerBackend? I don't see that they are used separately.

@attilapiros (Contributor, Author):

It is needed because of the different types: yarnSchedulerBackend is declared as YarnSchedulerBackend, but yarnSchedulerBackendExtended is an anonymous subclass of YarnSchedulerBackend with the extra def setNodeBlacklist. On yarnSchedulerBackend I cannot call this extra method.

Member:

If so, then how is it assigned in the next line? A subclass of YarnSchedulerBackend is still assignable to YarnSchedulerBackend. I might be missing something obvious here.

@attilapiros (Contributor, Author):

It is assignable because yarnSchedulerBackendExtended is an instance of YarnSchedulerBackend too, although not a direct one (see the small illustration below).
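
For readers following along, a small self-contained illustration of this point; Backend and TypeDemo are toy stand-ins, not the real YarnSchedulerBackend:

    import scala.language.reflectiveCalls

    class Backend {
      def nodeBlacklist(): Set[String] = Set.empty
    }

    object TypeDemo extends App {
      // The inferred type of `extended` is a refinement of Backend that still
      // exposes the extra setNodeBlacklist method (as a structural call).
      val extended = new Backend {
        private var blacklist: Set[String] = Set.empty
        def setNodeBlacklist(nodes: Set[String]): Unit = { blacklist = nodes }
        override def nodeBlacklist(): Set[String] = blacklist
      }
      extended.setNodeBlacklist(Set("host1"))

      // Upcasting is fine (it is still a Backend), but the supertype-typed
      // reference no longer sees the extra method.
      val plain: Backend = extended
      // plain.setNodeBlacklist(Set("host2"))   // would not compile
      println(plain.nodeBlacklist())            // prints Set(host1)
    }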


override def afterEach() {
  try {
    yarnSchedulerBackend.stop()
Member:

Should this check if it's null, in case some tests don't set it? They might all do so now.

@attilapiros (Contributor, Author):

You are right, it is better to have it, so I will add it soon.
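
A sketch of the guarded cleanup being discussed, in the style of the snippet quoted above (the field name yarnSchedulerBackend comes from that snippet; the rest is an assumption, not necessarily the merged code):

    override def afterEach(): Unit = {
      try {
        // Guard against tests that never assign yarnSchedulerBackend, and always
        // stop() the backend so its scheduler thread pool does not leak.
        if (yarnSchedulerBackend != null) {
          yarnSchedulerBackend.stop()
        }
      } finally {
        super.afterEach()
      }
    }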

@SparkQA commented Feb 17, 2019

Test build #102435 has finished for PR 23801 at commit 1ff92bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Feb 19, 2019

Merging to master.

@vanzin closed this in e4e4e2b on Feb 19, 2019
@dongjoon-hyun (Member) commented:

Hi, all.
Can we have this in branch-2.4 too? branch-2.4 is the last branch in the 2.x line and we should keep it maintained for the long term.

@attilapiros (Contributor, Author) commented Apr 26, 2019

If it is needed I am happy to open a new PR for the backport.

@dongjoon-hyun (Member) commented:

+1, @attilapiros . Please make a new PR for branch-2.4.

@attilapiros (Contributor, Author) commented:

ok :)

attilapiros added a commit to attilapiros/spark that referenced this pull request Apr 26, 2019
The test "RequestExecutors reflects node blacklist and is serializable" is flaky because of multi-threaded access of the mock task scheduler. For details check [Mockito FAQ (occasional exceptions like: WrongTypeOfReturnValue)](https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe). So instead of mocking the task scheduler in the test, TaskSchedulerImpl is simply subclassed.

This multithreaded access of the `nodeBlacklist()` method is coming from:
1) the unit test thread via calling of the method `prepareRequestExecutors()`
2) the `DriverEndpoint.onStart` which runs a periodic task that ends up calling this method

Existing unit test.

Closes apache#23801 from attilapiros/SPARK-26891.

Authored-by: “attilapiros” <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit e4e4e2b)
@attilapiros (Contributor, Author) commented:

Ready: #24474
