[SPARK-26891][YARN] Fixing flaky test in YarnSchedulerBackendSuite #23801
Conversation
Test build #102395 has finished for PR 23801 at commit
Hmm, I think I understand where this is coming from. But if that's the case, then the following test in the same suite is also flaky, for the same reason. I think the problem is here: when the test instantiates a […]. So I think the best thing here would be to avoid having a live […].
So, just remember that I changed that endpoint initialization recently... I still think the best way is to avoid the live […], and then initialize the endpoint in […].
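The "delayed initialization" idea can be sketched roughly as follows. This is a hypothetical sketch, not the actual Spark patch; the names `createDriverEndpoint`, `ENDPOINT_NAME`, and `rpcEnv` are assumptions taken from context:

```scala
// Sketch: instead of creating the driver endpoint in the constructor,
// defer it to start(), so a backend merely constructed in a test does
// not spin up a live endpoint (and its periodic task) until started.
private var driverEndpoint: RpcEndpointRef = _

override def start(): Unit = {
  driverEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint())
  super.start()
}
```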
@vanzin in your PR the failed test's exception stacktrace was: […] Based on the exception, the problematic line is (YarnSchedulerBackendSuite.scala:54): […] I see there is SparkContext access in the 2nd test as well, but I do not see failures for that one. On the other hand, I have seen this fail more than once.
Yes, but that's what flaky means. It may fail or not. Your PR description doesn't explain what the problem is. It just says "multi threaded access" but you haven't explained how that happens - there's no multi-threaded code in this test at all. So given that you haven't explained the problem, neither here nor in the bug, I was curious about how this was happening, and figured it out (see my explanation). Which means that the problem also exists in the other test.
You are right, I have not explained it in a detailed way, as I thought the problem was clear from the error text and the stack trace. I ran the old code 60 times and the error occurred 4 times. After my change it ran successfully 600 times, so I stopped there and created the PR. But of course I will take a look at your suggestions and modify the PR accordingly.
Actually I'm not so sure my analysis is correct either, after looking more... the ctx is running in local mode, so there's no second […]. But I'm still not happy with the "this is a multi threading problem" explanation, because this test is not multi-threaded. So we should understand where the call that's causing the problem is coming from, at least.
Ok, found it. It's not multiple schedulers being active; it's the code starting the scheduler endpoint. If it's called at the same time as the mock is being modified in L54 of the test, things blow up. So the "delayed initialization" of the driver endpoint that I suggested would fix it.
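Schematically, the race being described is the following interleaving (an illustrative sketch, not the actual test code; `scheduler` stands for the mocked task scheduler):

```scala
// Thread 1 -- the test body (L54 of the suite), re-stubbing the mock:
when(scheduler.nodeBlacklist()).thenReturn(Set("host1"))

// Thread 2 -- the driver endpoint's periodic task, started as a side
// effect of constructing the backend, calls the same mock concurrently:
val blacklist = scheduler.nodeBlacklist()

// If the real invocation lands while the stubbing is only partially
// registered, Mockito reports WrongTypeOfReturnValue: the flaky failure.
```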
BTW this fix would be fine too, except for one issue: each instantiation of […]. So your fix would be fine if you also fix that problem by calling stop() in the tests, and update the PR description to explain the underlying problem.
Test build #102414 has finished for PR 23801 at commit
srowen left a comment
Looks fine to me, just a few comments
override def nodeBlacklist(): Set[String] = blacklistedNodes.get()
}
val yarnSchedulerBackendExtended = new YarnSchedulerBackend(sched, sc) {
Do you need the extra variable here, vs assigning to yarnSchedulerBackend? I don't see that they are used separately.
It is needed because of the different type: the yarnSchedulerBackend type is YarnSchedulerBackend, but the yarnSchedulerBackendExtended type is an anonymous subclass of YarnSchedulerBackend with the extra def setNodeBlacklist. On yarnSchedulerBackend I cannot call this extra method.
If so, then how is it assigned in the next line? A subclass of YarnSchedulerBackend is still assignable to YarnSchedulerBackend. I might be missing something obvious here.
It is assignable as yarnSchedulerBackendExtended is an instance of YarnSchedulerBackend too, although not a direct one.
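The typing point in this exchange can be shown with a small sketch (illustrative only; based on the declarations visible in the diff fragments above):

```scala
// The anonymous subclass adds a member the declared type lacks:
val yarnSchedulerBackendExtended = new YarnSchedulerBackend(sched, sc) {
  def setNodeBlacklist(nodes: Set[String]): Unit = blacklistedNodes.set(nodes)
}

// Upcasting is fine -- the anonymous subclass is-a YarnSchedulerBackend:
yarnSchedulerBackend = yarnSchedulerBackendExtended

// ...but through the supertype the extra member is not visible:
// yarnSchedulerBackend.setNodeBlacklist(Set("host1"))  // does not compile
```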
override def afterEach() {
  try {
    yarnSchedulerBackend.stop()
Should this check if it's null, in case some tests don't set it? they might all do so now.
You are right, it is better to have it, so I will add it soon.
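The guarded cleanup being agreed on here could look roughly like this (a sketch; whether `super.afterEach()` should run in a `finally` depends on the suite's base traits):

```scala
override def afterEach(): Unit = {
  try {
    if (yarnSchedulerBackend != null) {
      yarnSchedulerBackend.stop()  // shut down the live driver endpoint
    }
  } finally {
    super.afterEach()
  }
}
```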
Test build #102435 has finished for PR 23801 at commit
Merging to master.
Hi, All.
If it is needed, I am happy to open a new PR for the backport.
+1, @attilapiros. Please make a new PR for
ok :)
The test "RequestExecutors reflects node blacklist and is serializable" is flaky because of multi-threaded access of the mock task scheduler. For details check [Mockito FAQ (occasional exceptions like: WrongTypeOfReturnValue)](https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe). So instead of mocking the task scheduler in the test, TaskSchedulerImpl is simply subclassed.

This multithreaded access of the `nodeBlacklist()` method is coming from:
1) the unit test thread, via calling of the method `prepareRequestExecutors()`
2) the `DriverEndpoint.onStart` hook, which runs a periodic task that ends up calling this method

Existing unittest.

Closes apache#23801 from attilapiros/SPARK-26891.
Authored-by: "attilapiros" <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit e4e4e2b)
Ready: #24474 |
What changes were proposed in this pull request?
The test "RequestExecutors reflects node blacklist and is serializable" is flaky because of multi-threaded access of the mock task scheduler. For details check the Mockito FAQ (occasional exceptions like: WrongTypeOfReturnValue). So instead of mocking the task scheduler in the test, TaskSchedulerImpl is simply subclassed.

This multithreaded access of the `nodeBlacklist()` method is coming from:
1) the unit test thread, via calling of the method `prepareRequestExecutors()`
2) the `DriverEndpoint.onStart` hook, which runs a periodic task that ends up calling this method

How was this patch tested?
Existing unittest.
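Putting the diff fragments quoted in the review together, the subclassing approach reads roughly like this (a reconstruction, not the verbatim patch):

```scala
// A thread-safe stand-in for the Mockito mock: a real TaskSchedulerImpl
// whose nodeBlacklist() is backed by an AtomicReference, so the test
// thread and the driver endpoint's periodic task can race safely.
val blacklistedNodes = new AtomicReference[Set[String]](Set.empty)
val sched = new TaskSchedulerImpl(sc) {
  override def nodeBlacklist(): Set[String] = blacklistedNodes.get()
}
```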