Skip to content

Conversation

@Yikun
Copy link
Member

@Yikun Yikun commented Nov 15, 2021

What changes were proposed in this pull request?

This patch adds a new method getAdditionalPreKubernetesResources for KubernetesFeatureConfigStep. It returns any additional Kubernetes resources that should be added to support this feature and resources would be setup before driver pod
creating.

After this patch:

  • getAdditionalPreKubernetesResources: Devs should return resources in here when they want to create resources before pod creating
  • getAdditionalKubernetesResources: Devs should return resources in here when they can accept the resources create after pod, and spark will also help to refresh owner reference after resources created, that means if any resource is expected to refresh the owner pod reference, it should be added it here, even if it already in the getAdditionalPreKubernetesResources as same.

Why are the changes needed?

We need to setup K8S resources or extension resources before driver pod creating, and then create pod, after the pod created, the owner refernce would be owner to this Pod.

These pre-resources are usually necessary in pod creation and scheduling, they should be ready before pod creation, should be deleted when user delete pod, the lifecycle of these resource is same with pod, such as the customized batch scheduler scenario, the user need to create the addtional pre resources (such as pod group in Volcano) to help the specific spark job complete batch scheduling.

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

@Yikun
Copy link
Member Author

Yikun commented Nov 15, 2021

cc @dongjoon-hyun @holdenk

@SparkQA
Copy link

SparkQA commented Nov 15, 2021

Test build #145228 has finished for PR 34599 at commit 68b5cf3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49698/

@SparkQA
Copy link

SparkQA commented Nov 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49698/

@SparkQA
Copy link

SparkQA commented Nov 15, 2021

Test build #145231 has finished for PR 34599 at commit f8f2dd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49701/

@SparkQA
Copy link

SparkQA commented Nov 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49701/

@Yikun
Copy link
Member Author

Yikun commented Nov 17, 2021

Intergration test failed is unrelated, ready for review.

@holdenk
Copy link
Contributor

holdenk commented Nov 17, 2021

Cool, I'll take a look. I've also got a PR to cleanup the integration test failures ( #34636 )

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits, but LGTM if @dongjoon-hyun doesn't have any concerns.

@Yikun
Copy link
Member Author

Yikun commented Nov 18, 2021

@holdenk Thanks for your review!

@SparkQA
Copy link

SparkQA commented Nov 18, 2021

Test build #145362 has finished for PR 34599 at commit 7d7d98d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49833/

@SparkQA
Copy link

SparkQA commented Nov 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49833/

@Yikun
Copy link
Member Author

Yikun commented Nov 19, 2021

@dongjoon-hyun Would you mind giving some suggestion on this? Sorry to ping wrong person in #34599 (comment) . : )

@dongjoon-hyun
Copy link
Member

Sorry for being late. I'll review now.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two comments.

  • In theory, this feature looks unsafe because there is a chance to leak the pre-populated resources because they have no owner yet.
  • Please minimize the change. For example, we had better keep the existing resource management code for the non-pre-polulated code.

@Yikun
Copy link
Member Author

Yikun commented Dec 1, 2021

In theory, this feature looks unsafe because there is a chance to leak the pre-populated resources because they have no owner yet.

Yes, that's a good point, and the cleanup operations of pre-populated resource is required, will add cleaup pre-populated resource logic

Please minimize the change. For example, we had better keep the existing resource management code for the non-pre-polulated code.

Thanks for your suggestion, will address in next commit.

@SparkQA
Copy link

SparkQA commented Dec 1, 2021

Test build #145816 has finished for PR 34599 at commit a48e2fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50291/

@SparkQA
Copy link

SparkQA commented Dec 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50291/

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Test build #146272 has finished for PR 34599 at commit 7e3c0ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Yikun
Copy link
Member Author

Yikun commented Dec 16, 2021

This PR doesn't have a test case. Could you add a test coverage for your code path please?

Sure, thanks for reminder, I have added the testcase in latest commit.

could you elaborate why we need to create some resource before driver pod creation?

These resources are usually necessary in pod creation and scheduling, they should be ready before pod creation, should be deleted when user delete pod, the lifecycle of these resource is same with pod, there are some scenarioes:

  • Extension custom resources, they are usually created in k8s or cncf extension compoent, such as the customized batch scheduler scenario, the user need to create the addtional pre resources (such as pod group in Volcano) to help the specific spark job complete batch scheduling.
  • The temporary priority class, the priority class is created in custom feature step for specific job and also is epected to be deleted when job deleted.

@dongjoon-hyun Much thank for your patient and guidance.

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50747/

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50747/

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Test build #146280 has finished for PR 34599 at commit 5aba6eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50754/

@Yikun
Copy link
Member Author

Yikun commented Dec 16, 2021

Retest this please

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Test build #146285 has finished for PR 34599 at commit 5aba6eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Test build #146286 has finished for PR 34599 at commit 5aba6eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50759/

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50759/

@dongjoon-hyun
Copy link
Member

Thank you for update, @Yikun !

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Test build #146294 has finished for PR 34599 at commit ebbfea6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kubernetesClient.resourceList(preKubernetesResources: _*).createOrReplace()
} catch {
case NonFatal(e) =>
kubernetesClient.resourceList(preKubernetesResources: _*).delete()
Copy link
Member

@dongjoon-hyun dongjoon-hyun Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is the first error causing Job failure, could you add a user-friendly error message with logError?

Copy link
Member Author

@Yikun Yikun Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, how about:

logError("Please check \"kubectl auth can-i create [resource]\" first." +
  " It should be yes. And please also check your feature step implementation.")

} catch {
case NonFatal(e) =>
kubernetesClient.pods().delete(createdDriverPod)
kubernetesClient.resourceList(preKubernetesResources: _*).delete()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when the previous line deletes some of the resources of preKubernetesResources? For the non-existing resource, delete() API is okay?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when the previous line deletes some of the resources of preKubernetesResources? For the non-existing resource, delete() API is okay?

Yes, it's okay, false return in here instead of exception rasing. We could find answer in kubernetes-client delete implementation and resource list delete test.

I also write a simple test to delete non-existing crd in real env, it passed as expected.

  test("Deleting nonExisting crds") {
    val crd = new CustomResourceDefinitionBuilder()
      .withNewMetadata()
      .withName("nonExisting1")
      .endMetadata()
      .build()
    val crd2 = new CustomResourceDefinitionBuilder()
      .withNewMetadata()
      .withName("nonExisting")
      .endMetadata()
      .build()
    val crds = Seq(crd, crd2)
    val client = new DefaultKubernetesClient("https://127.0.0.1:52878")
    assert(client.inNamespace("default").resourceList(crd).delete() === false)
    assert(client.inNamespace("default").resourceList(crds: _*).delete() == false)
  }

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50766/

@SparkQA
Copy link

SparkQA commented Dec 16, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50766/

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Test build #146312 has finished for PR 34599 at commit 2613a92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Yikun
Copy link
Member Author

Yikun commented Dec 17, 2021

@dongjoon-hyun Thanks for your review, addressed your comments in latest commit.

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50784/

@SparkQA
Copy link

SparkQA commented Dec 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50784/

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you for your patience, @Yikun and @holdenk .
Merged to master for Apache Spark 3.3.

@Yikun
Copy link
Member Author

Yikun commented Dec 18, 2021

@dongjoon-hyun Much thanks for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants