Conversation

@tanyatik
Contributor

The issue happens when Spark runs standalone on a cluster.
When the master and the driver fail simultaneously on the same node, the master tries to recover its state and restart the Spark driver.
While restarting the driver, the master crashes with a NullPointerException (stack trace is below).
After crashing, it restarts, tries to recover its state, and restarts the Spark driver again, over and over in an infinite cycle.
Specifically, Spark reads the DriverInfo state from ZooKeeper, but after deserialization DriverInfo.worker turns out to be null.

https://issues.apache.org/jira/browse/SPARK-3150
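To make the failure mode concrete, here is a minimal, hypothetical sketch (a stand-in class, not Spark's actual DriverInfo) showing how Java serialization leaves a transient Option field as null after a round trip:

```scala
import java.io._

// Hypothetical stand-in for DriverInfo: the worker reference is transient
// because it is runtime state that should not be persisted.
class DriverInfoLike(val id: String) extends Serializable {
  @transient var worker: Option[String] = None
}

object TransientNullDemo {
  def main(args: Array[String]): Unit = {
    val bos = new ByteArrayOutputStream()
    new ObjectOutputStream(bos).writeObject(new DriverInfoLike("driver-1"))
    val in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    val restored = in.readObject().asInstanceOf[DriverInfoLike]
    // The transient field was skipped during serialization, so it comes back
    // as null rather than None; a call such as restored.worker.isDefined
    // here would throw a NullPointerException.
    println(restored.worker == null) // prints "true"
  }
}
```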

@AmplabJenkins

Can one of the admins verify this patch?

@JoshRosen
Contributor

Jenkins, this is ok to test.

@SparkQA

SparkQA commented Aug 20, 2014

QA tests have started for PR 2062 at commit 9936043.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 20, 2014

QA tests have finished for PR 2062 at commit 9936043.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Not sure if I understand your intention. These are all transient and won't actually be serialized, so why do we need to set them every time we readObject?

Contributor Author

When the object is deserialized, transient fields come back as null, not None; that's why the NullPointerException happens.
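The fix the patch title describes can be sketched as follows. This is a hypothetical class (Spark's actual DriverInfo differs): a custom readObject re-runs the default initialization so transient fields hold None again instead of null after deserialization:

```scala
import java.io._

// Hypothetical sketch of the fix: re-initialize transient fields after
// deserialization so readers see None instead of null.
class DriverInfoFixed(val id: String) extends Serializable {
  @transient var worker: Option[String] = _
  @transient var state: String = _
  init()

  private def init(): Unit = {
    worker = None
    state = "SUBMITTED"
  }

  // Java serialization invokes this exact private method, if present,
  // when deserializing an instance of this class.
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    init() // restore default values for the transient fields
  }
}
```

After a serialize/deserialize round trip, worker is None and state is "SUBMITTED" rather than null, so callers can safely pattern-match on them.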

Contributor Author

A similar approach is used in WorkerInfo.

Contributor

Ah I see, thanks for the context.

@JoshRosen
Contributor

FWIW, we don't have any Jenkins tests for ZooKeeper-based multi-master fault tolerance. There's a Docker-based set of integration tests in FaultToleranceTest.scala, so maybe we could add a test case there.

We should create a JIRA for proper automated testing of this, including a sub-task to add a regression test for this issue.

@andrewor14
Contributor

@tanyatik Have you verified that this fixes the NPE you ran into? If so, this LGTM.

@tanyatik
Contributor Author

Yes, I did; this patch fixes the NPE and Spark restarts successfully.

@JoshRosen
Contributor

This looks good to me, too. I'm merging this into master, branch-1.1, and branch-1.0. Thanks!

asfgit pushed a commit that referenced this pull request Aug 28, 2014
…alizing default values in DriverInfo.init()

Author: Tatiana Borisova <[email protected]>

Closes #2062 from tanyatik/spark-3150 and squashes the following commits:

9936043 [Tatiana Borisova] Add initializing default values in DriverInfo.init()

(cherry picked from commit 70d8146)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in 70d8146 Aug 28, 2014
asfgit pushed a commit that referenced this pull request Aug 28, 2014
…alizing default values in DriverInfo.init()

Author: Tatiana Borisova <[email protected]>

Closes #2062 from tanyatik/spark-3150 and squashes the following commits:

9936043 [Tatiana Borisova] Add initializing default values in DriverInfo.init()

(cherry picked from commit 70d8146)
Signed-off-by: Josh Rosen <[email protected]>
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…alizing default values in DriverInfo.init()

Author: Tatiana Borisova <[email protected]>

Closes apache#2062 from tanyatik/spark-3150 and squashes the following commits:

9936043 [Tatiana Borisova] Add initializing default values in DriverInfo.init()
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Sep 24, 2024
…he#2062)

Implement DataFrame handling, matching the existing Catalog/Table handling in uc-spark-authz.

apple-cloud-services/[email protected]