This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

@mccheah commented Aug 24, 2017

Requires #454, which in turn requires #452.

@mccheah mccheah changed the title Unit Tests for KubernetesClusterSchedulerBackend [WIP] Unit Tests for KubernetesClusterSchedulerBackend Aug 24, 2017
@ifilonenko (Member) commented:

Thank you for this 👍

@mccheah (Author) commented Aug 28, 2017

@varunkatta I changed some things in the scheduler backend related to #244. Some of them are style things and variable renames, but there was a case I identified where we didn't call removeExecutor when I think we should. The unit tests reflect what I thought should be the expected behavior. It would be much appreciated if you could take a look at the changes and the tests to verify whether or not we're doing the right thing here.

@mccheah (Author) commented Aug 29, 2017

Integration test failure is legit from #454.

@mccheah mccheah force-pushed the separate-external-shuffle-management branch from 9f9b432 to 65496d2 Compare August 29, 2017 18:47
@mccheah mccheah force-pushed the cluster-scheduler-backend-unit-tests branch from 91d5415 to d7453c4 Compare August 29, 2017 18:49
@mccheah mccheah changed the title [WIP] Unit Tests for KubernetesClusterSchedulerBackend Unit Tests for KubernetesClusterSchedulerBackend Aug 29, 2017
@mccheah (Author) commented Aug 29, 2017

This is ready for review, but not strictly complete. Some corner cases are likely still missing from the scheduler backend tests, and we don't yet unit test the components that were factored out in #454 and #452. I'll probably revisit these, but I would like to keep these PRs small so that we can expand the covered cases incrementally.


import KubernetesClusterSchedulerBackend._

private val EXECUTOR_ID_COUNTER = new AtomicLong(0L)
Review comment: moved this here so there's less static state?

mccheah (Author):
Correct - it's much more difficult to unit test if the counter is global, because between different tests one needs to know what the counter is set to.
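As a minimal sketch of why the instance-level counter helps, consider the following simplified backend (the class and method names here are hypothetical illustrations, not the actual Spark code):

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical, simplified backend: the executor ID counter lives on the
// instance rather than in a companion object, so every test that constructs
// a fresh backend starts from a known counter value of 0.
class SchedulerBackendSketch {
  private val executorIdCounter = new AtomicLong(0L)

  // Each new executor gets the next ID, starting from "1".
  def nextExecutorId(): String = executorIdCounter.incrementAndGet().toString
}
```

With a static (companion-object) counter, each test would have to account for how many IDs earlier tests had already consumed; with instance state, a fresh backend per test always yields "1", "2", and so on.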

}
}

def deleteExecutorFromApiAndDataStructures(executorId: String): Unit = {
Review comment: rename to deleteExecutorFromClusterAndDataStructures.

val exitReason = ExecutorExited(getExecutorExitStatus(pod), exitCausedByApp = false,
"Pod " + pod.getMetadata.getName + " deleted or lost.")
failedPods.put(pod.getMetadata.getName, exitReason)
val alreadyReleased = isPodAlreadyReleased(pod)
Review comment: inline alreadyReleased.

executorsToRemove.add(executorId)
RUNNING_EXECUTOR_PODS_LOCK.synchronized {
runningExecutorsToPods.get(executorId).foreach { pod =>
disconnectedPodsByExecutorIdPendingRemoval(executorId) = pod
Review comment:
can you explain a bit why it's safe to remove the executor directly here, rather than going through the executorsToRemove set first?

mccheah (Author):

We don't remove the executor directly here. Some of the logic has changed and variables were renamed, so executorsToRemove no longer exists in quite that form anyway.

@mccheah (Author) commented Aug 30, 2017:
Here we are marking the executor as disconnected and the allocator thread will clean it up once the exit reason is known. I believe this is the same semantics as before.

Member reply:
Yup. No change in semantics just that the marking is done after verifying that the executor is still running/tracked.
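The two-phase removal discussed above can be sketched as follows (all names here are illustrative simplifications, not the actual backend code; pods are stood in for by plain strings):

```scala
import scala.collection.mutable

// Hypothetical sketch of the two-phase removal: on disconnect the pod is only
// *marked* as pending removal (and only if the executor is still tracked as
// running); a separate allocator thread later removes it once an exit reason
// is known.
class RemovalSketch {
  private val runningExecutorsToPods = mutable.Map[String, String]()
  private val disconnectedPodsPendingRemoval = mutable.Map[String, String]()

  def registerExecutor(id: String, pod: String): Unit =
    runningExecutorsToPods(id) = pod

  // Phase 1: mark as disconnected, but only if still running/tracked.
  def onDisconnect(executorId: String): Unit =
    runningExecutorsToPods.get(executorId).foreach { pod =>
      disconnectedPodsPendingRemoval(executorId) = pod
    }

  // Phase 2: the allocator thread cleans up once an exit reason arrives,
  // returning the pod that was pending removal (if any).
  def onExitReasonKnown(executorId: String): Option[String] = {
    runningExecutorsToPods.remove(executorId)
    disconnectedPodsPendingRemoval.remove(executorId)
  }
}
```

Note that a disconnect for an executor that is no longer tracked is a no-op, which is the guard the reviewer is pointing at.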

@mccheah (Author) commented Aug 31, 2017

Assigning @varunkatta and @foxish to verify correctness of the nuanced but important logic changes done here.

val exitReason = ExecutorExited(getExecutorExitStatus(pod), exitCausedByApp = false,
"Pod " + pod.getMetadata.getName + " deleted or lost.")
failedPods.put(pod.getMetadata.getName, exitReason)
val exitMessage = if (isPodAlreadyReleased(pod)) {
Member comment:
Thanks for this change. It makes this consistent with the ErroredPod handling case.

val reasonCheckCount = executorReasonCheckAttemptCounts.getOrElse(executorId, 0)
if (reasonCheckCount >= MAX_EXECUTOR_LOST_REASON_CHECKS) {
removeExecutor(executorId, SlaveLost("Executor lost for unknown reasons."))
deleteExecutorFromClusterAndDataStructures(executorId)
@varunkatta (Member) commented Sep 6, 2017:
+1 for the change. Thanks for catching the omission here.
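The bounded reason-check logic above (give up after MAX_EXECUTOR_LOST_REASON_CHECKS failed attempts, report the executor lost, and clean up its tracking state) can be sketched like this. Names and the limit value are illustrative, not the real constants:

```scala
import scala.collection.mutable

// Hypothetical sketch: if an executor's exit reason cannot be determined
// after a bounded number of checks, stop retrying, record it as lost for
// unknown reasons, and delete its tracking state.
object ReasonCheckSketch {
  val MaxExecutorLostReasonChecks = 10
  private val attemptCounts = mutable.Map[String, Int]()
  val removedForUnknownReasons = mutable.Set[String]()

  def checkForExitReason(executorId: String, reasonKnown: Boolean): Unit = {
    if (reasonKnown) {
      // Reason found: no further checks needed for this executor.
      attemptCounts.remove(executorId)
    } else {
      val count = attemptCounts.getOrElse(executorId, 0)
      if (count >= MaxExecutorLostReasonChecks) {
        // Give up: this is where the real code calls removeExecutor(...)
        // and deleteExecutorFromClusterAndDataStructures(...).
        removedForUnknownReasons += executorId
        attemptCounts.remove(executorId)
      } else {
        attemptCounts(executorId) = count + 1
      }
    }
  }
}
```

The omission the PR fixes corresponds to the give-up branch: previously the tracking-state deletion was missing there, so an executor lost for unknown reasons was never fully cleaned up.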

@@ -0,0 +1,383 @@
/*
@varunkatta (Member) commented Sep 6, 2017:
Huge thanks for this. This unit test is one of the most important additions for keeping the cluster backend healthy, correct, and maintainable.

@varunkatta (Member) left a review:

Changes look good. Makes the code more readable. Most importantly, this change fixes a subtle bug and adds unit tests, which is a huge win! Thanks Matt!

@varunkatta (Member) commented:
Also, I think this should probably go in before #392. I can merge changes from this PR into #392 once this PR is accepted.

@mccheah mccheah changed the base branch from separate-external-shuffle-management to branch-2.2-kubernetes September 7, 2017 22:23
@mccheah mccheah force-pushed the cluster-scheduler-backend-unit-tests branch from 458082f to 2dcaa52 Compare September 7, 2017 22:23
@mccheah (Author) commented Sep 7, 2017

Rebased

@mccheah (Author) commented Sep 7, 2017

rerun unit tests please

@ash211 left a review:

One minor nit that might not be worth a change, but otherwise really happy to see tests coming into this class!

val allocatorExecutor = ThreadUtils
.newDaemonSingleThreadScheduledExecutor("kubernetes-pod-allocator")
val requestExecutorsService = ThreadUtils.newDaemonCachedThreadPool(
"kubernetes-request-executors")
Review comment: did you mean to change this name from kubernetes-executor-requests?

mccheah (Author):
Not particularly.

@ash211 commented Sep 8, 2017

OK, reverted it -- will merge when builds are green!

@ash211 ash211 merged commit 6053455 into branch-2.2-kubernetes Sep 8, 2017
@ash211 commented Sep 15, 2017

It's merged -- was there more you thought should be done here?

@duyanghao commented Sep 15, 2017

@ash211 Sorry, I have only just seen it. Will it be closed soon?
I think there may still be some unknown problems in #244, since it lacks large-scale use.
In any case, I will test this functionality in practical use and report any problems.

@ash211 commented Sep 15, 2017

I've seen some problems myself with executor recovery, but haven't dug into why yet. Please do open new issues with any observations you see of bad behavior!

ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
…#459)

* Start unit tests for the scheduler backend.

* More tests for the scheduler backend.

* Unit tests and possible preemptive corrections to failover logic.

* Address PR comments.

* Resolve merge conflicts.

Move MiB change to ExecutorPodFactory.

* Revert accidental thread pool name change
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019

7 participants