Monitor pod status in submission v2. #283
Conversation
Note that this approach differs in some key ways from #282. The statuses of all containers are logged without assuming that the driver container is the only container, although it is more than likely that there will only be one container here. The statuses of the containers are also described without assuming they are in the "FINISHED" state. Finally, the logging itself is built into the new LoggingPodStatusWatcher. It would be good to compare this with #282 and combine the best aspects of each.
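For illustration, here is a minimal sketch of describing every container in the pod regardless of how many there are or what state each is in. The helper object, method name, and formatting are assumptions for this sketch, not the PR's actual code.

```scala
import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.Pod

// Hypothetical helper: describe every container in the pod, whatever state it is in.
private[spark] object ContainerStatusLogging {
  def describeContainers(pod: Pod): String = {
    pod.getStatus.getContainerStatuses.asScala.map { status =>
      val state = status.getState
      val stateDescription =
        if (state.getRunning != null) {
          s"Running since ${state.getRunning.getStartedAt}"
        } else if (state.getWaiting != null) {
          s"Waiting (reason: ${state.getWaiting.getReason})"
        } else if (state.getTerminated != null) {
          s"Terminated (exit code: ${state.getTerminated.getExitCode})"
        } else {
          "Unknown"
        }
      s"Container name: ${status.getName}\n" +
        s"Container image: ${status.getImage}\n" +
        s"Container state: $stateDescription"
    }.mkString("\n\n")
  }
}
```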
+1 for encapsulating the latch here
Is this necessary? I see YARN also uses this: https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala#L134
Good point - reverted it.
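For reference, the settings in that linked YARN file are defined with Spark's ConfigBuilder. A Kubernetes analogue of a report/logging interval might look like the sketch below; the key name, default, and wrapping object are illustrative assumptions, not values from this PR.

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.internal.config.ConfigBuilder

// Illustrative only: key name and default are assumptions, not the PR's values.
private[spark] object KubernetesSubmissionConfig {
  val REPORT_INTERVAL = ConfigBuilder("spark.kubernetes.report.interval")
    .doc("Interval between reports of the current driver pod status.")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("1s")
}
```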
This exceeds 100 characters and would fail the Scala style check.
Scalastyle doesn't check imports for line length.
| ("Container state", "Terminated"), | ||
| ("Exit code", terminated.getExitCode.toString)) | ||
| case unknown => | ||
| throw new SparkException(s"Unexpected container status type ${unknown.getClass}.") |
Where would this exception get caught? I don't think we want to throw an exception that could block application flow just for logging purposes -- maybe instead log the container state as "Unknown" so as not to break code flow.
This happens in the logging thread, so I don't think it will crash the application.
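For what it's worth, here is a sketch of the non-throwing fallback suggested above. The identifiers mirror the diff snippet but are not copied from the PR; the wrapping object and method are hypothetical.

```scala
import io.fabric8.kubernetes.api.model.{ContainerStateRunning, ContainerStateTerminated, ContainerStateWaiting}

// Sketch only: an unrecognized status type is reported as "Unknown" instead of
// throwing, so a surprise from the API server cannot interrupt the logging path.
private[spark] object ContainerStateDescriptions {
  def describe(state: Any): Seq[(String, String)] = state match {
    case running: ContainerStateRunning =>
      Seq(
        ("Container state", "Running"),
        ("Container started at", String.valueOf(running.getStartedAt)))
    case waiting: ContainerStateWaiting =>
      Seq(
        ("Container state", "Waiting"),
        ("Pending reason", waiting.getReason))
    case terminated: ContainerStateTerminated =>
      Seq(
        ("Container state", "Terminated"),
        ("Exit code", terminated.getExitCode.toString))
    case unknown =>
      Seq(
        ("Container state", "Unknown"),
        ("Status type", unknown.getClass.getName))
  }
}
```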
    logInfo(s"Starting application $kubernetesAppId in Kubernetes...")
    loggingInterval.foreach { interval =>
      require(interval > 0, "Logging interval must be a positive number.")
    }
It seems like this check should be done inside LoggingPodStatusWatcherImpl -- otherwise every client of that API needs to repeat it.
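Roughly what that could look like, assuming the watcher takes the interval in its constructor; the class and parameter names are based on this thread, not necessarily the final signature.

```scala
// Sketch: validating in the watcher itself means every caller of the API gets the
// check for free.
private[spark] class LoggingPodStatusWatcherImpl(
    appId: String,
    maybeLoggingInterval: Option[Long]) {

  maybeLoggingInterval.foreach { interval =>
    require(interval > 0, s"Logging interval must be a positive number, got $interval.")
  }

  // ... Watcher callbacks and periodic status logging elided ...
}
```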
      throw e
    }
    if (waitForAppCompletion) {
      loggingPodStatusWatcher.awaitCompletion()
Should we put similar logging for application finished / launched / waiting on completion here, as we have in V1? I rather like the progress logging for debugging purposes (and for building an intuition for where time is spent in the process).
One difference here is that we create all of the resources with fire-and-forget semantics if SparkSubmit doesn't wait for the application to complete: the pod is created and SparkSubmit exits without checking whether the pod has started running. Unlike the first version of submission, which requires the driver to be running in order to bootstrap it, the bootstrap process here needs no interaction from the submission client beyond creating the resources. However, it's not clear whether the submission client should still wait for the driver pod to start running before exiting even when fire-and-forget mode is turned on.
I think the semantics of fire-and-forget are that you have to make sure it's actually fired before forgetting. Do you know whether, in yarn-cluster mode, the submission fails if e.g. the job gets stuck in the ACCEPTED state and the AM never gets resources?
This is probably something we can do in a follow-up commit and needn't block this one.
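One possible shape for such a follow-up, sketched here only to illustrate the idea: block until the driver pod reports a Running phase (or a timeout elapses) before exiting, even when not waiting for completion. The class and method names are hypothetical.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

// Hypothetical watcher that makes "fire" observable before "forget".
private[spark] class DriverStartedWatcher extends Watcher[Pod] {
  private val started = new CountDownLatch(1)

  override def eventReceived(action: Action, pod: Pod): Unit = {
    if (pod.getStatus != null && pod.getStatus.getPhase == "Running") {
      started.countDown()
    }
  }

  override def onClose(cause: KubernetesClientException): Unit = started.countDown()

  // Returns false if the driver pod did not start within the timeout.
  def awaitDriverStart(timeoutSeconds: Long): Boolean =
    started.await(timeoutSeconds, TimeUnit.SECONDS)
}
```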
LGTM, will merge when build passes
  }

  override def awaitCompletion(): Unit = {
    podCompletedFuture.countDown()
Shouldn't this be podCompletedFuture.await? My client exits right away without waiting for the driver pod to complete. I wonder if it's because of this line?
2017-06-02 14:48:36 INFO Client:54 - Waiting for application spark-hdfstest-1496440110123 to finish...
2017-06-02 14:48:36 INFO LoggingPodStatusWatcherImpl:54 - Container final statuses:
Container name: spark-kubernetes-driver
Container image: docker:5000/spark-driver:kimoon-0602-1
Container state: Waiting
Pending reason: PodInitializing
2017-06-02 14:48:36 INFO Client:54 - Application spark-hdfstest-1496440110123 finished.
Good catch, missed this. Can you send a fix?
Never mind, I see the fix now.
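For anyone following along, a minimal sketch of the intended latch usage: the watch callback counts the latch down when the driver pod reaches a terminal state, and awaitCompletion() blocks on await() rather than counting down itself. Names follow this thread; the real class does more.

```scala
import java.util.concurrent.CountDownLatch

private[spark] class PodCompletionLatch {
  private val podCompletedFuture = new CountDownLatch(1)

  // Invoked from the pod watch callback once the pod has succeeded or failed.
  def markPodCompleted(): Unit = podCompletedFuture.countDown()

  // Invoked by the submission client when waitForAppCompletion is true; blocks
  // until the pod completes.
  def awaitCompletion(): Unit = podCompletedFuture.await()
}
```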
* Monitor pod status in submission v2. * Address comments
Allows the submission client to remain running until the driver pod completes. This comprises a number of refactors that allow the pod monitoring to be mocked in a unit test, and also includes a variant of #282 so that both V1 and V2 log the containers' exit codes at the end of the running application.
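A rough sketch of the submission-side flow this describes: create the driver resources, then either return immediately (fire-and-forget) or block on the pod status watcher until the driver pod finishes. The trait and class below are inferred from the diff snippets and log output above, not copied from the final code.

```scala
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.Watcher

import org.apache.spark.internal.Logging

// Inferred interface: a fabric8 pod watcher that can also be awaited.
private[spark] trait LoggingPodStatusWatcher extends Watcher[Pod] {
  def awaitCompletion(): Unit
}

// Simplified submission client: the resource-creation step is passed in as a thunk.
private[spark] class Client(
    kubernetesAppId: String,
    waitForAppCompletion: Boolean,
    loggingPodStatusWatcher: LoggingPodStatusWatcher) extends Logging {

  def run(createDriverResources: () => Unit): Unit = {
    logInfo(s"Starting application $kubernetesAppId in Kubernetes...")
    createDriverResources()
    if (waitForAppCompletion) {
      logInfo(s"Waiting for application $kubernetesAppId to finish...")
      loggingPodStatusWatcher.awaitCompletion()
      logInfo(s"Application $kubernetesAppId finished.")
    } else {
      logInfo(s"Deployed Spark application $kubernetesAppId into Kubernetes.")
    }
  }
}
```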