[SPARK-37331][K8S] Add the ability to create resources before driverPod creating #34599
Conversation
Test build #145228 has finished for PR 34599 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #145231 has finished for PR 34599 at commit

Kubernetes integration test starting

Kubernetes integration test status failure
Integration test failure is unrelated; ready for review.

Cool, I'll take a look. I've also got a PR to clean up the integration test failures ( #34636 )
Some minor nits, but LGTM if @dongjoon-hyun doesn't have any concerns.
...s/core/src/main/scala/org/apache/spark/deploy/k8s/features/KubernetesFeatureConfigStep.scala
...s/core/src/main/scala/org/apache/spark/deploy/k8s/features/KubernetesFeatureConfigStep.scala
...tes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala
@holdenk Thanks for your review!

Test build #145362 has finished for PR 34599 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

@dongjoon-hyun Would you mind giving some suggestions on this? Sorry for pinging the wrong person in #34599 (comment). : )

Sorry for being late. I'll review now.
...ce-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala
...ce-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala
...s/core/src/main/scala/org/apache/spark/deploy/k8s/features/KubernetesFeatureConfigStep.scala
...tes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala
I have two comments.
- In theory, this feature looks unsafe because there is a chance of leaking the pre-populated resources, since they have no owner yet.
- Please minimize the change. For example, we had better keep the existing resource management code for the non-pre-populated path.

Yes, that's a good point: cleanup of the pre-populated resources is required. I will add cleanup logic for them.
Thanks for your suggestion; I will address it in the next commit.
Test build #145816 has finished for PR 34599 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #146272 has finished for PR 34599 at commit

Sure, thanks for the reminder. I have added the test case in the latest commit.
These resources are usually necessary for pod creation and scheduling: they should be ready before the pod is created and be deleted when the user deletes the pod, so their lifecycle is the same as the pod's. There are some scenarios:
@dongjoon-hyun Many thanks for your patience and guidance.

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #146280 has finished for PR 34599 at commit

Kubernetes integration test unable to build dist. exiting with code: 1

Retest this please

Test build #146285 has finished for PR 34599 at commit

Test build #146286 has finished for PR 34599 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Thank you for the update, @Yikun!

Test build #146294 has finished for PR 34599 at commit
kubernetesClient.resourceList(preKubernetesResources: _*).createOrReplace()
} catch {
  case NonFatal(e) =>
    kubernetesClient.resourceList(preKubernetesResources: _*).delete()
Since this is the first error causing Job failure, could you add a user-friendly error message with logError?
Sure, how about:

logError("Please check \"kubectl auth can-i create [resource]\" first." +
  " It should be yes. And please also check your feature step implementation.")

} catch {
  case NonFatal(e) =>
    kubernetesClient.pods().delete(createdDriverPod)
    kubernetesClient.resourceList(preKubernetesResources: _*).delete()
What happens when the previous line deletes some of the resources of preKubernetesResources? For the non-existing resource, delete() API is okay?
Yes, it's okay: delete() returns false here instead of raising an exception. We can find the answer in the kubernetes-client delete implementation and the resource list delete test.
I also wrote a simple test to delete non-existing CRDs in a real environment, and it passed as expected.
test("Deleting nonExisting crds") {
  val crd = new CustomResourceDefinitionBuilder()
    .withNewMetadata()
    .withName("nonExisting1")
    .endMetadata()
    .build()
  val crd2 = new CustomResourceDefinitionBuilder()
    .withNewMetadata()
    .withName("nonExisting")
    .endMetadata()
    .build()
  val crds = Seq(crd, crd2)
  val client = new DefaultKubernetesClient("https://127.0.0.1:52878")
  assert(client.inNamespace("default").resourceList(crd).delete() === false)
  assert(client.inNamespace("default").resourceList(crds: _*).delete() === false)
}
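For context, the overall error handling discussed in this thread can be sketched as follows. This is a minimal, self-contained illustration: FakeClient and submit are hypothetical stand-ins, not the real kubernetes-client or Spark submission APIs, and logging is omitted. The point is the ordering: pre-resources are created first, and on any failure (before or after the driver pod is created) they are deleted, which is safe even for resources that no longer exist.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in for the kubernetes-client resource operations.
class FakeClient {
  var created: List[String] = Nil
  var deleted: List[String] = Nil
  def createOrReplace(names: Seq[String]): Unit = { created = created ++ names }
  def delete(names: Seq[String]): Boolean = {
    deleted = deleted ++ names
    // Like fabric8's delete(), report false when nothing existed,
    // instead of raising an exception.
    names.exists(created.contains)
  }
}

// Mirrors the PR's flow: create pre-resources before the driver pod;
// on any failure, delete whatever may have been created, then rethrow.
def submit(client: FakeClient, preResources: Seq[String], createPod: () => Unit): Unit = {
  try {
    client.createOrReplace(preResources)
  } catch {
    case NonFatal(e) =>
      client.delete(preResources)
      throw e
  }
  try {
    createPod()
  } catch {
    case NonFatal(e) =>
      client.delete(preResources) // safe even if some were already deleted
      throw e
  }
}
```
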
...managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala
Kubernetes integration test starting

Kubernetes integration test status failure

Test build #146312 has finished for PR 34599 at commit

@dongjoon-hyun Thanks for your review; I addressed your comments in the latest commit.

Kubernetes integration test starting

Kubernetes integration test status failure
@dongjoon-hyun Many thanks for your help!
What changes were proposed in this pull request?

This patch adds a new method, getAdditionalPreKubernetesResources, to KubernetesFeatureConfigStep. It returns any additional Kubernetes resources that should be added to support this feature; these resources are set up before the driver pod is created.

After this patch:
- getAdditionalPreKubernetesResources: developers should return resources here when they need the resources created before the pod is created.
- getAdditionalKubernetesResources: developers should return resources here when the resources can be created after the pod, and Spark will also refresh the owner reference after the resources are created. That means any resource that is expected to have its owner reference point to the pod should be added here, even if it already appears in getAdditionalPreKubernetesResources.

Why are the changes needed?
We need to set up K8S resources or extension resources before the driver pod is created, and then create the pod; after the pod is created, the owner reference of these resources is set to this pod.
These pre-resources are usually necessary for pod creation and scheduling: they should be ready before pod creation and be deleted when the user deletes the pod, so their lifecycle is the same as the pod's. For example, in the customized batch scheduler scenario, the user needs to create additional pre-resources (such as a PodGroup in Volcano) to help the specific Spark job complete batch scheduling.
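The split between the two hooks can be illustrated with a minimal, self-contained sketch. Resource, FeatureConfigStep, and VolcanoLikeStep below are simplified stand-ins for illustration, not the real Spark or fabric8 types; a real step would return fabric8 HasMetadata objects from KubernetesFeatureConfigStep.

```scala
// Hypothetical stand-ins for fabric8's HasMetadata and Spark's
// KubernetesFeatureConfigStep, kept minimal for illustration.
case class Resource(kind: String, name: String)

trait FeatureConfigStep {
  // Existing hook: resources created after the driver pod; Spark then
  // refreshes their owner reference to point at the pod.
  def getAdditionalKubernetesResources(): Seq[Resource] = Seq.empty
  // New hook from this PR: resources created before the driver pod.
  def getAdditionalPreKubernetesResources(): Seq[Resource] = Seq.empty
}

// A scheduler-style step: the PodGroup must exist before the pod can be
// scheduled, and should become owned by the pod once the pod exists.
class VolcanoLikeStep extends FeatureConfigStep {
  private val podGroup = Resource("PodGroup", "spark-job-pod-group")

  // Created before the driver pod so scheduling can rely on it.
  override def getAdditionalPreKubernetesResources(): Seq[Resource] = Seq(podGroup)

  // Also returned here so its owner reference is refreshed to the driver pod.
  override def getAdditionalKubernetesResources(): Seq[Resource] = Seq(podGroup)
}
```

Note that the same resource is returned from both hooks: the first controls when it is created, the second opts it into the post-creation owner-reference refresh.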
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT