Monitor pod status in submission v2. #283
Conversation
Note that this approach differs in some key ways from #282. The statuses of all containers are logged without assuming that the driver container is the only container, although it is more than likely that there will only be one container here. The statuses of the containers are also described without assuming they are in the "FINISHED" state. Finally, the logging itself is built into the new LoggingPodStatusWatcher. It would be good to compare this with #282 and combine the best aspects of each.
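For illustration, here is a minimal sketch of describing every container in the pod regardless of how many there are or what state each is in. The helper object, method name, and formatting are assumptions for this sketch, not the PR's actual code.

```scala
import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.Pod

// Hypothetical helper: describe every container in the pod, whatever state it is in.
private[spark] object ContainerStatusLogging {
  def describeContainers(pod: Pod): String = {
    pod.getStatus.getContainerStatuses.asScala.map { status =>
      val state = status.getState
      val stateDescription =
        if (state.getRunning != null) {
          s"Running since ${state.getRunning.getStartedAt}"
        } else if (state.getWaiting != null) {
          s"Waiting (reason: ${state.getWaiting.getReason})"
        } else if (state.getTerminated != null) {
          s"Terminated (exit code: ${state.getTerminated.getExitCode})"
        } else {
          "Unknown"
        }
      s"Container name: ${status.getName}\n" +
        s"Container image: ${status.getImage}\n" +
        s"Container state: $stateDescription"
    }.mkString("\n\n")
  }
}
```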
+1 for encapsulating the latch here
Is this necessary? I see YARN also uses this: https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala#L134
Good point - reverted it.
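For reference, the settings in that linked YARN file are defined with Spark's ConfigBuilder. A Kubernetes analogue of a report/logging interval might look like the sketch below; the key name, default, and wrapping object are illustrative assumptions, not values from this PR.

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.internal.config.ConfigBuilder

// Illustrative only: key name and default are assumptions, not the PR's values.
private[spark] object KubernetesSubmissionConfig {
  val REPORT_INTERVAL = ConfigBuilder("spark.kubernetes.report.interval")
    .doc("Interval between reports of the current driver pod status.")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("1s")
}
```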
This exceeds 100 characters and would fail the Scala style check.
Scalastyle doesn't check imports for line length.
| ("Container state", "Terminated"), | ||
| ("Exit code", terminated.getExitCode.toString)) | ||
| case unknown => | ||
| throw new SparkException(s"Unexpected container status type ${unknown.getClass}.") |
Where would this exception get caught? I don't think we want to throw an exception that could block application flow just for logging purposes -- maybe instead log the container state as "Unknown" so as not to break code flow.
This happens in the logging thread, so I don't think it will crash the application.
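For what it's worth, here is a sketch of the non-throwing fallback suggested above. The identifiers mirror the diff snippet but are not copied from the PR; the wrapping object and method are hypothetical.

```scala
import io.fabric8.kubernetes.api.model.{ContainerStateRunning, ContainerStateTerminated, ContainerStateWaiting}

// Sketch only: an unrecognized status type is reported as "Unknown" instead of
// throwing, so a surprise from the API server cannot interrupt the logging path.
private[spark] object ContainerStateDescriptions {
  def describe(state: Any): Seq[(String, String)] = state match {
    case running: ContainerStateRunning =>
      Seq(
        ("Container state", "Running"),
        ("Container started at", String.valueOf(running.getStartedAt)))
    case waiting: ContainerStateWaiting =>
      Seq(
        ("Container state", "Waiting"),
        ("Pending reason", waiting.getReason))
    case terminated: ContainerStateTerminated =>
      Seq(
        ("Container state", "Terminated"),
        ("Exit code", terminated.getExitCode.toString))
    case unknown =>
      Seq(
        ("Container state", "Unknown"),
        ("Status type", unknown.getClass.getName))
  }
}
```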
    logInfo(s"Starting application $kubernetesAppId in Kubernetes...")
    loggingInterval.foreach { interval =>
      require(interval > 0, "Logging interval must be a positive number.")
    }
It seems like this check should be done inside LoggingPodStatusWatcherImpl -- otherwise every client of that API needs to repeat it.
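Roughly what that could look like, assuming the watcher takes the interval in its constructor; the class and parameter names are based on this thread, not necessarily the final signature.

```scala
// Sketch: validating in the watcher itself means every caller of the API gets the
// check for free.
private[spark] class LoggingPodStatusWatcherImpl(
    appId: String,
    maybeLoggingInterval: Option[Long]) {

  maybeLoggingInterval.foreach { interval =>
    require(interval > 0, s"Logging interval must be a positive number, got $interval.")
  }

  // ... Watcher callbacks and periodic status logging elided ...
}
```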
      throw e
    }
    if (waitForAppCompletion) {
      loggingPodStatusWatcher.awaitCompletion()
Should we put similar logging for application finished / launched / waiting on completion here, as we have in V1? I rather like the progress logging for debugging purposes (and for building an intuition for where time is spent in the process).
One difference here is that we create all of the resources with fire-and-forget semantics if SparkSubmit doesn't wait for the application to complete: the pod is created and SparkSubmit exits without checking whether the pod has started running. Unlike the first version of submission, which requires the driver to be running in order to bootstrap it, the bootstrap process here needs no interaction from the submission client beyond creating the resources. However, it's not clear whether the submission client should still wait for the driver pod to start running before exiting even when fire-and-forget mode is turned on.
I think the semantics of fire-and-forget are that you have to make sure it's actually fired before forgetting. Do you know whether, in yarn-cluster mode, the submission fails if e.g. the job gets stuck in the ACCEPTED state and the AM never gets resources?
This is probably something we can do in a follow-up commit and needn't block this one.
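One possible shape for such a follow-up, sketched here only to illustrate the idea: block until the driver pod reports a Running phase (or a timeout elapses) before exiting, even when not waiting for completion. The class and method names are hypothetical.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

// Hypothetical watcher that makes "fire" observable before "forget".
private[spark] class DriverStartedWatcher extends Watcher[Pod] {
  private val started = new CountDownLatch(1)

  override def eventReceived(action: Action, pod: Pod): Unit = {
    if (pod.getStatus != null && pod.getStatus.getPhase == "Running") {
      started.countDown()
    }
  }

  override def onClose(cause: KubernetesClientException): Unit = started.countDown()

  // Returns false if the driver pod did not start within the timeout.
  def awaitDriverStart(timeoutSeconds: Long): Boolean =
    started.await(timeoutSeconds, TimeUnit.SECONDS)
}
```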
LGTM, will merge when build passes
  }

  override def awaitCompletion(): Unit = {
    podCompletedFuture.countDown()
Shouldn't this be podCompletedFuture.await? My client exits right away without waiting for the driver pod to complete. I wonder if it's because of this line?
2017-06-02 14:48:36 INFO Client:54 - Waiting for application spark-hdfstest-1496440110123 to finish...
2017-06-02 14:48:36 INFO LoggingPodStatusWatcherImpl:54 - Container final statuses:
Container name: spark-kubernetes-driver
Container image: docker:5000/spark-driver:kimoon-0602-1
Container state: Waiting
Pending reason: PodInitializing
2017-06-02 14:48:36 INFO Client:54 - Application spark-hdfstest-1496440110123 finished.
Good catch, missed this. Can you send a fix?
Never mind, I see the fix now.
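For anyone following along, a minimal sketch of the intended latch usage: the watch callback counts the latch down when the driver pod reaches a terminal state, and awaitCompletion() blocks on await() rather than counting down itself. Names follow this thread; the real class does more.

```scala
import java.util.concurrent.CountDownLatch

private[spark] class PodCompletionLatch {
  private val podCompletedFuture = new CountDownLatch(1)

  // Invoked from the pod watch callback once the pod has succeeded or failed.
  def markPodCompleted(): Unit = podCompletedFuture.countDown()

  // Invoked by the submission client when waitForAppCompletion is true; blocks
  // until the pod completes.
  def awaitCompletion(): Unit = podCompletedFuture.await()
}
```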
* Monitor pod status in submission v2. * Address comments
Allows the submission client to remain running until the driver pod completes. This comprises a number of refactors that allow the pod monitoring to be mocked in a unit test, and also includes a variant of #282 so that both V1 and V2 log the containers' exit codes at the end of the running application.
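A rough sketch of the submission-side flow this describes: create the driver resources, then either return immediately (fire-and-forget) or block on the pod status watcher until the driver pod finishes. The trait and class below are inferred from the diff snippets and log output above, not copied from the final code.

```scala
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.Watcher

import org.apache.spark.internal.Logging

// Inferred interface: a fabric8 pod watcher that can also be awaited.
private[spark] trait LoggingPodStatusWatcher extends Watcher[Pod] {
  def awaitCompletion(): Unit
}

// Simplified submission client: the resource-creation step is passed in as a thunk.
private[spark] class Client(
    kubernetesAppId: String,
    waitForAppCompletion: Boolean,
    loggingPodStatusWatcher: LoggingPodStatusWatcher) extends Logging {

  def run(createDriverResources: () => Unit): Unit = {
    logInfo(s"Starting application $kubernetesAppId in Kubernetes...")
    createDriverResources()
    if (waitForAppCompletion) {
      logInfo(s"Waiting for application $kubernetesAppId to finish...")
      loggingPodStatusWatcher.awaitCompletion()
      logInfo(s"Application $kubernetesAppId finished.")
    } else {
      logInfo(s"Deployed Spark application $kubernetesAppId into Kubernetes.")
    }
  }
}
```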