This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation


@mccheah mccheah commented May 19, 2017

Allows the submission client to remain running until the driver pod completes. This comprises a number of refactors that allow the pod monitoring to be mocked in a unit test, and also includes a variant of #282 so that both V1 and V2 log the containers' exit codes at the end of the running application.


mccheah commented May 19, 2017

@lins05 @ash211


mccheah commented May 19, 2017

Note that this approach differs in some key ways from #282. The statuses of all containers are logged without assuming the driver container is the only container, although it's more than likely that there will be only one container here. The statuses of the containers are also described without assuming they are necessarily in the "FINISHED" state. Finally, the logging itself is built into the new LoggingPodStatusWatcher#awaitCompletion API, so callers neither need to maintain their own CountDownLatch nor worry about separately asking the watcher to log the final status.

It would be good to compare this with #282 and combine the best aspects of each.
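
For illustration, a minimal sketch of what the encapsulated API could look like, assuming the driver pod is watched via the fabric8 Kubernetes client; the class body, field names, and use of println (standing in for Spark's logInfo) are hypothetical, not the exact code in this PR:

import java.util.concurrent.CountDownLatch

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
import io.fabric8.kubernetes.client.Watcher.Action

trait LoggingPodStatusWatcher extends Watcher[Pod] {
  def awaitCompletion(): Unit
}

class LoggingPodStatusWatcherSketch(appId: String) extends LoggingPodStatusWatcher {

  // The watcher owns its own latch, so callers only invoke awaitCompletion() and never
  // manage a CountDownLatch or separately ask for the final status to be logged.
  private val podCompletedLatch = new CountDownLatch(1)

  override def eventReceived(action: Action, pod: Pod): Unit = {
    val phase = pod.getStatus.getPhase
    println(s"Application $appId driver pod is now: $phase")
    // Release the latch once the pod reaches a terminal phase.
    if (phase == "Succeeded" || phase == "Failed") {
      podCompletedLatch.countDown()
    }
  }

  override def onClose(cause: KubernetesClientException): Unit = {
    podCompletedLatch.countDown()
  }

  override def awaitCompletion(): Unit = {
    // Block the submission client until the driver pod finishes, then log the outcome.
    podCompletedLatch.await()
    println(s"Application $appId driver pod has completed.")
  }
}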

+1 for encapsulating the latch here

mccheah (author):

Good point - reverted it.

This exceeds 100 chars and would fail the Scala style check.

mccheah (author):

Scalastyle doesn't check imports for line length.

@mccheah mccheah force-pushed the external-vs-internal-uri-tls branch from 377ef96 to b5733f7 on May 19, 2017 19:34
@mccheah mccheah force-pushed the logging-pod-status-v2 branch from 1ab7603 to 73659c4 on May 19, 2017 19:38
@mccheah mccheah force-pushed the logging-pod-status-v2 branch from 73659c4 to b030f2f on May 19, 2017 23:12
@mccheah mccheah changed the base branch from external-vs-internal-uri-tls to branch-2.1-kubernetes on May 19, 2017 23:12
("Container state", "Terminated"),
("Exit code", terminated.getExitCode.toString))
case unknown =>
throw new SparkException(s"Unexpected container status type ${unknown.getClass}.")

Where would this exception get caught? I don't think we want to throw an exception that could block application flow just for logging purposes. Maybe instead log Container state "Unknown" so as not to break code flow.

mccheah (author):

This should be happening in the logging thread, so I don't think this will crash the application.
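
For reference, the alternative suggested above (describing the state as "Unknown" instead of throwing) might look roughly like the sketch below; describeContainerState and ContainerStateSketch are hypothetical names, not necessarily what the PR uses:

import io.fabric8.kubernetes.api.model.ContainerState

object ContainerStateSketch {
  // Describe whichever state field is populated, falling back to "Unknown" instead of
  // throwing, so that a malformed status can never interrupt the logging thread.
  def describeContainerState(state: ContainerState): Seq[(String, String)] = {
    Option(state.getRunning).map(_ => Seq(("Container state", "Running")))
      .orElse(Option(state.getWaiting).map { waiting =>
        Seq(("Container state", "Waiting"), ("Pending reason", waiting.getReason))
      })
      .orElse(Option(state.getTerminated).map { terminated =>
        Seq(("Container state", "Terminated"), ("Exit code", terminated.getExitCode.toString))
      })
      .getOrElse(Seq(("Container state", "Unknown")))
  }
}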

logInfo(s"Starting application $kubernetesAppId in Kubernetes...")
loggingInterval.foreach { interval =>
require(interval > 0, "Logging interval must be a positive number.")
}

Seems like this check should be done inside LoggingPodStatusWatcherImpl; otherwise all clients of that API need to do this check.
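
A minimal sketch of pushing that validation into the watcher itself, so every caller gets the check for free (the constructor signature here is assumed, not taken from the PR):

class LoggingPodStatusWatcherImpl(
    appId: String,
    maybeLoggingInterval: Option[Long]) {

  // Validate once at construction time instead of at every call site.
  maybeLoggingInterval.foreach { interval =>
    require(interval > 0, "Logging interval must be a positive number.")
  }

  // ... watching and periodic status logging elided ...
}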

  throw e
}
if (waitForAppCompletion) {
  loggingPodStatusWatcher.awaitCompletion()

Should we put similar logging for application finished / launched / waiting on completion here, as we have in V1? I rather like the progress logging for debugging purposes (and for building an intuition for where time is spent in the process).

@mccheah mccheah May 22, 2017

One difference here is that we create all of the resources with fire-and-forget semantics if SparkSubmit doesn't wait for the application to complete: the pod is created and SparkSubmit exits without checking whether the pod has started running. This is because, unlike the first version of submission, which requires the driver to be running in order to bootstrap it, the bootstrap process here needs no interaction from the submission client beyond creating the resources. However, it's not clear whether the submission client should wait for the driver pod to start running before exiting even when fire-and-forget mode is turned on.

I think the semantics of fire-and-forget are that you have to make sure it's actually fired before forgetting. Do you know whether, in yarn-cluster mode, the submission fails if e.g. the job gets stuck in the ACCEPTED state and the AM never gets resources?

This is probably something we can do in a follow-up commit and needn't block this one.
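
As a rough illustration of the two submission modes under discussion, using the names from the snippet above (the log messages mirror the ones visible later in this thread, but this is a sketch rather than the merged code):

// After the driver pod and its dependent resources are created, either block until the
// driver completes, or return immediately in fire-and-forget mode.
if (waitForAppCompletion) {
  logInfo(s"Waiting for application $kubernetesAppId to finish...")
  loggingPodStatusWatcher.awaitCompletion()
  logInfo(s"Application $kubernetesAppId finished.")
} else {
  logInfo(s"Deployed Spark application $kubernetesAppId into Kubernetes.")
}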

@ash211 ash211 left a comment

LGTM, will merge when build passes

@ash211 ash211 merged commit 408c65f into branch-2.1-kubernetes May 22, 2017
@ash211 ash211 deleted the logging-pod-status-v2 branch May 22, 2017 23:18
}

override def awaitCompletion(): Unit = {
  podCompletedFuture.countDown()
A project member commented:

Shouldn't this be podCompletedFuture.await? My client exits right away without waiting for the driver pod to complete. I wonder if it's because of this line?

2017-06-02 14:48:36 INFO Client:54 - Waiting for application spark-hdfstest-1496440110123 to finish...
2017-06-02 14:48:36 INFO LoggingPodStatusWatcherImpl:54 - Container final statuses:
Container name: spark-kubernetes-driver
Container image: docker:5000/spark-driver:kimoon-0602-1
Container state: Waiting
Pending reason: PodInitializing
2017-06-02 14:48:36 INFO Client:54 - Application spark-hdfstest-1496440110123 finished.

mccheah (author):

Good catch, missed this. Can you send a fix?

mccheah (author):

Never mind, I see the fix now.
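
For completeness, the fix being referred to would presumably block on the latch rather than release it; a sketch, assuming podCompletedFuture is a java.util.concurrent.CountDownLatch as the surrounding code suggests:

override def awaitCompletion(): Unit = {
  // Block until the watcher counts the latch down when the driver pod reaches a
  // terminal state, instead of counting it down here and returning immediately.
  podCompletedFuture.await()
}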

foxish pushed a commit that referenced this pull request Jul 24, 2017
* Monitor pod status in submission v2.

* Address comments
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
* upgrade hadoop to 2.9.0-palantir.1-rc9

* run test-dependencies.sh --replace-manifest

* missed one

* no more rc for deps

* and the poms

* fix the test

* bump to 2.9.0-palantir.2
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
* Revert "Bump Hadoop to 2.9.0-palantir.3 (apache-spark-on-k8s#288)"

This reverts commit bb010b8.

* Revert "Hadoop 2.9.0-palantir.2 (apache-spark-on-k8s#283)"

This reverts commit 65956b7.
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
* Monitor pod status in submission v2.

* Address comments