Conversation

@tanyatik
Contributor

The issue happens when Spark runs standalone on a cluster.
When the master and the driver fail simultaneously on the same node, the master tries to recover its state and restart the Spark driver.
While restarting the driver, the master crashes with a NullPointerException (stack trace is below).
After crashing, it restarts, tries to recover its state, and restarts the Spark driver again, over and over in an infinite cycle.
Specifically, Spark reads the DriverInfo state from ZooKeeper, but after deserialization DriverInfo.worker turns out to be null.

https://issues.apache.org/jira/browse/SPARK-3150
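To make the failure mode concrete, here is a minimal, hypothetical sketch (a stand-in class, not Spark's actual DriverInfo) showing how Java serialization leaves a transient Option field as null after a round trip:

```scala
import java.io._

// Hypothetical stand-in for DriverInfo: the worker reference is transient
// because it is runtime state that should not be persisted.
class DriverInfoLike(val id: String) extends Serializable {
  @transient var worker: Option[String] = None
}

object TransientNullDemo {
  def main(args: Array[String]): Unit = {
    val bos = new ByteArrayOutputStream()
    new ObjectOutputStream(bos).writeObject(new DriverInfoLike("driver-1"))
    val in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    val restored = in.readObject().asInstanceOf[DriverInfoLike]
    // The transient field was skipped during serialization, so it comes back
    // as null rather than None; a call such as restored.worker.isDefined
    // here would throw a NullPointerException.
    println(restored.worker == null) // prints "true"
  }
}
```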

@AmplabJenkins

Can one of the admins verify this patch?

@JoshRosen
Contributor

Jenkins, this is ok to test.

@SparkQA

SparkQA commented Aug 20, 2014

QA tests have started for PR 2062 at commit 9936043.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 20, 2014

QA tests have finished for PR 2062 at commit 9936043.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Not sure if I understand your intention. These are all transient and won't actually be serialized, so why do we need to set them every time we readObject?

Contributor Author

When the object is deserialized, transient fields come back as null, not None; that's why the NullPointerException happens.
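The fix the patch title describes can be sketched as follows. This is a hypothetical class (Spark's actual DriverInfo differs): a custom readObject re-runs the default initialization so transient fields hold None again instead of null after deserialization:

```scala
import java.io._

// Hypothetical sketch of the fix: re-initialize transient fields after
// deserialization so readers see None instead of null.
class DriverInfoFixed(val id: String) extends Serializable {
  @transient var worker: Option[String] = _
  @transient var state: String = _
  init()

  private def init(): Unit = {
    worker = None
    state = "SUBMITTED"
  }

  // Java serialization invokes this exact private method, if present,
  // when deserializing an instance of this class.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    init() // restore default values for the transient fields
  }
}
```

After a serialize/deserialize round trip, worker is None and state is "SUBMITTED" rather than null, so callers can safely pattern-match on them.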

Contributor Author

A similar approach is used in WorkerInfo.

Contributor

Ah I see, thanks for the context.

@JoshRosen
Contributor

FWIW, we don't have any Jenkins tests for ZooKeeper-based multi-master fault tolerance. There's a Docker-based set of integration tests in FaultToleranceTest.scala, so maybe we could add a test case there.

We should create a JIRA for proper automated testing of this, including a sub-task to add a regression test for this issue.

@andrewor14
Contributor

@tanyatik Have you verified that this fixes the NPE you ran into? If so, this LGTM.

@tanyatik
Contributor Author

Yes, I did; this patch fixes the NPE and Spark restarts successfully.

@JoshRosen
Contributor

This looks good to me, too. I'm merging this into master, branch-1.1, and branch-1.0. Thanks!

asfgit pushed a commit that referenced this pull request Aug 28, 2014
…alizing default values in DriverInfo.init()

Author: Tatiana Borisova <[email protected]>

Closes #2062 from tanyatik/spark-3150 and squashes the following commits:

9936043 [Tatiana Borisova] Add initializing default values in DriverInfo.init()

(cherry picked from commit 70d8146)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in 70d8146 Aug 28, 2014
asfgit pushed a commit that referenced this pull request Aug 28, 2014
…alizing default values in DriverInfo.init()

Author: Tatiana Borisova <[email protected]>

Closes #2062 from tanyatik/spark-3150 and squashes the following commits:

9936043 [Tatiana Borisova] Add initializing default values in DriverInfo.init()

(cherry picked from commit 70d8146)
Signed-off-by: Josh Rosen <[email protected]>
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…alizing default values in DriverInfo.init()

Author: Tatiana Borisova <[email protected]>

Closes apache#2062 from tanyatik/spark-3150 and squashes the following commits:

9936043 [Tatiana Borisova] Add initializing default values in DriverInfo.init()
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Sep 24, 2024
…he#2062)

Implement DataFrame handling, matching the existing Catalog/Table handling in uc-spark-authz.

apple-cloud-services/[email protected]