This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

@mccheah commented Aug 24, 2017

Requires #454, which in turn requires #452.

@mccheah mccheah changed the title Unit Tests for KubernetesClusterSchedulerBackend [WIP] Unit Tests for KubernetesClusterSchedulerBackend Aug 24, 2017
@ifilonenko (Member) commented:

Thank you for this 👍

@mccheah (Author) commented Aug 28, 2017

@varunkatta I changed some things in the scheduler backend related to #244. Some of them are style things and variable renames, but there was a case I identified where we didn't call removeExecutor when I think we should. The unit tests reflect what I thought should be the expected behavior. It would be much appreciated if you could take a look at the changes and the tests to verify whether or not we're doing the right thing here.

@mccheah (Author) commented Aug 29, 2017

Integration test failure is legit from #454.

@mccheah mccheah force-pushed the separate-external-shuffle-management branch from 9f9b432 to 65496d2 Compare August 29, 2017 18:47
@mccheah mccheah force-pushed the cluster-scheduler-backend-unit-tests branch from 91d5415 to d7453c4 Compare August 29, 2017 18:49
@mccheah mccheah changed the title [WIP] Unit Tests for KubernetesClusterSchedulerBackend Unit Tests for KubernetesClusterSchedulerBackend Aug 29, 2017
@mccheah (Author) commented Aug 29, 2017

This is ready for review, but not strictly complete. Some corner cases are likely still missing from the scheduler backend tests, and we don't yet unit test the components that were factored out in #454 and #452. I'll probably revisit these, but I would like to keep these PRs small so that we can expand the covered cases incrementally.


import KubernetesClusterSchedulerBackend._

private val EXECUTOR_ID_COUNTER = new AtomicLong(0L)
Review comment: moved this here so there's less static state?

mccheah (Author):
Correct - it's much more difficult to unit test if the counter is global, because between different tests one needs to know what the counter is set to.
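As a minimal sketch of why the instance-level counter helps, consider the following simplified backend (the class and method names here are hypothetical illustrations, not the actual Spark code):

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical, simplified backend: the executor ID counter lives on the
// instance rather than in a companion object, so every test that constructs
// a fresh backend starts from a known counter value of 0.
class SchedulerBackendSketch {
  private val executorIdCounter = new AtomicLong(0L)

  // Each new executor gets the next ID, starting from "1".
  def nextExecutorId(): String = executorIdCounter.incrementAndGet().toString
}
```

With a static (companion-object) counter, each test would have to account for how many IDs earlier tests had already consumed; with instance state, a fresh backend per test always yields "1", "2", and so on.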

}
}

def deleteExecutorFromApiAndDataStructures(executorId: String): Unit = {
Review comment: rename to deleteExecutorFromClusterAndDataStructures.

val exitReason = ExecutorExited(getExecutorExitStatus(pod), exitCausedByApp = false,
"Pod " + pod.getMetadata.getName + " deleted or lost.")
failedPods.put(pod.getMetadata.getName, exitReason)
val alreadyReleased = isPodAlreadyReleased(pod)
Review comment: inline alreadyReleased.

executorsToRemove.add(executorId)
RUNNING_EXECUTOR_PODS_LOCK.synchronized {
runningExecutorsToPods.get(executorId).foreach { pod =>
disconnectedPodsByExecutorIdPendingRemoval(executorId) = pod
Review comment:
can you explain a bit why it's safe to remove the executor directly here, rather than going through the executorsToRemove set first?

mccheah (Author):

We don't remove the executor directly here. Some of the logic has changed and variables were renamed, so executorsToRemove no longer exists in quite that form anyway.

@mccheah (Author) commented Aug 30, 2017:
Here we are marking the executor as disconnected and the allocator thread will clean it up once the exit reason is known. I believe this is the same semantics as before.

Member reply:
Yup. No change in semantics just that the marking is done after verifying that the executor is still running/tracked.
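The two-phase removal discussed above can be sketched as follows (all names here are illustrative simplifications, not the actual backend code; pods are stood in for by plain strings):

```scala
import scala.collection.mutable

// Hypothetical sketch of the two-phase removal: on disconnect the pod is only
// *marked* as pending removal (and only if the executor is still tracked as
// running); a separate allocator thread later removes it once an exit reason
// is known.
class RemovalSketch {
  private val runningExecutorsToPods = mutable.Map[String, String]()
  private val disconnectedPodsPendingRemoval = mutable.Map[String, String]()

  def registerExecutor(id: String, pod: String): Unit =
    runningExecutorsToPods(id) = pod

  // Phase 1: mark as disconnected, but only if still running/tracked.
  def onDisconnect(executorId: String): Unit =
    runningExecutorsToPods.get(executorId).foreach { pod =>
      disconnectedPodsPendingRemoval(executorId) = pod
    }

  // Phase 2: the allocator thread cleans up once an exit reason arrives,
  // returning the pod that was pending removal (if any).
  def onExitReasonKnown(executorId: String): Option[String] = {
    runningExecutorsToPods.remove(executorId)
    disconnectedPodsPendingRemoval.remove(executorId)
  }
}
```

Note that a disconnect for an executor that is no longer tracked is a no-op, which is the guard the reviewer is pointing at.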

@mccheah (Author) commented Aug 31, 2017

Assigning @varunkatta and @foxish to verify correctness of the nuanced but important logic changes done here.

val exitReason = ExecutorExited(getExecutorExitStatus(pod), exitCausedByApp = false,
"Pod " + pod.getMetadata.getName + " deleted or lost.")
failedPods.put(pod.getMetadata.getName, exitReason)
val exitMessage = if (isPodAlreadyReleased(pod)) {
Member comment:
Thanks for this change. It makes this consistent with the ErroredPod handling case.

val reasonCheckCount = executorReasonCheckAttemptCounts.getOrElse(executorId, 0)
if (reasonCheckCount >= MAX_EXECUTOR_LOST_REASON_CHECKS) {
removeExecutor(executorId, SlaveLost("Executor lost for unknown reasons."))
deleteExecutorFromClusterAndDataStructures(executorId)
@varunkatta (Member) commented Sep 6, 2017:
+1 for the change. Thanks for catching the omission here.
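The bounded reason-check logic above (give up after MAX_EXECUTOR_LOST_REASON_CHECKS failed attempts, report the executor lost, and clean up its tracking state) can be sketched like this. Names and the limit value are illustrative, not the real constants:

```scala
import scala.collection.mutable

// Hypothetical sketch: if an executor's exit reason cannot be determined
// after a bounded number of checks, stop retrying, record it as lost for
// unknown reasons, and delete its tracking state.
object ReasonCheckSketch {
  val MaxExecutorLostReasonChecks = 10
  private val attemptCounts = mutable.Map[String, Int]()
  val removedForUnknownReasons = mutable.Set[String]()

  def checkForExitReason(executorId: String, reasonKnown: Boolean): Unit = {
    if (reasonKnown) {
      // Reason found: no further checks needed for this executor.
      attemptCounts.remove(executorId)
    } else {
      val count = attemptCounts.getOrElse(executorId, 0)
      if (count >= MaxExecutorLostReasonChecks) {
        // Give up: this is where the real code calls removeExecutor(...)
        // and deleteExecutorFromClusterAndDataStructures(...).
        removedForUnknownReasons += executorId
        attemptCounts.remove(executorId)
      } else {
        attemptCounts(executorId) = count + 1
      }
    }
  }
}
```

The omission the PR fixes corresponds to the give-up branch: previously the tracking-state deletion was missing there, so an executor lost for unknown reasons was never fully cleaned up.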

@@ -0,0 +1,383 @@
/*
@varunkatta (Member) commented Sep 6, 2017:
Huge thanks for this. This unit test is one of the most important additions for keeping the cluster backend healthy, correct, and maintainable.

@varunkatta (Member) left a review:

Changes look good. Makes the code more readable. Most importantly, this change fixes a subtle bug and adds unit tests, which is a huge win! Thanks Matt!

@varunkatta (Member) commented:
Also, I think this should probably go in before #392. I can merge changes from this PR into #392 once this PR is accepted.

@mccheah mccheah changed the base branch from separate-external-shuffle-management to branch-2.2-kubernetes September 7, 2017 22:23
@mccheah mccheah force-pushed the cluster-scheduler-backend-unit-tests branch from 458082f to 2dcaa52 Compare September 7, 2017 22:23
@mccheah (Author) commented Sep 7, 2017

Rebased

@mccheah (Author) commented Sep 7, 2017

rerun unit tests please

@ash211 left a review:

One minor nit that might not be worth a change, but otherwise really happy to see tests coming into this class!

val allocatorExecutor = ThreadUtils
.newDaemonSingleThreadScheduledExecutor("kubernetes-pod-allocator")
val requestExecutorsService = ThreadUtils.newDaemonCachedThreadPool(
"kubernetes-request-executors")
Review comment: did you mean to change this name from kubernetes-executor-requests?

mccheah (Author):
Not particularly.

@ash211 commented Sep 8, 2017

OK, reverted it -- will merge when builds are green!

@ash211 ash211 merged commit 6053455 into branch-2.2-kubernetes Sep 8, 2017
@ash211 commented Sep 15, 2017

It's merged -- was there more you thought should be done here?

@duyanghao commented Sep 15, 2017

@ash211 Sorry, I have only just seen it. Will it be closed soon?
I think there may still be some unknown problems in #244, since it lacks large-scale use.
In any case, I will test this functionality in practical use and report any problems.

@ash211 commented Sep 15, 2017

I've seen some problems myself with executor recovery, but haven't dug into why yet. Please do open new issues with any observations you see of bad behavior!

ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
…#459)

* Start unit tests for the scheduler backend.

* More tests for the scheduler backend.

* Unit tests and possible preemptive corrections to failover logic.

* Address PR comments.

* Resolve merge conflicts.

Move MiB change to ExecutorPodFactory.

* Revert accidental thread pool name change
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019

7 participants