Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Mar 31, 2017

What changes were proposed in this pull request?

Currently, when we infer the types for vaild JSON strings but object or array, we are producing empty schemas regardless of parse modes as below:

scala> spark.read.option("mode", "DROPMALFORMED").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
root
scala> spark.read.option("mode", "FAILFAST").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
root

This PR proposes to handle parse modes in type inference.

After this PR,

scala> spark.read.option("mode", "DROPMALFORMED").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
root
 |-- a: long (nullable = true)
scala> spark.read.option("mode", "FAILFAST").json(Seq("""{"a": 1}""", """"a"""").toDS).printSchema()
java.lang.RuntimeException: Failed to infer a common schema. Struct types are expected but string was found.

This PR is based on NathanHowell@e233fd0 and I and @NathanHowell talked about this in https://issues.apache.org/jira/browse/SPARK-19641

How was this patch tested?

Unit tests in JsonSuite for both DROPMALFORMED and FAILFAST modes.

@HyukjinKwon
Copy link
Member Author

cc @NathanHowell and @cloud-fan.

@SparkQA
Copy link

SparkQA commented Mar 31, 2017

Test build #75421 has started for PR 17492 at commit dc24522.

@HyukjinKwon
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 31, 2017

Test build #75425 has finished for PR 17492 at commit dc24522.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

private def withParseMode(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we embed this method in withCorruptField?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@SparkQA
Copy link

SparkQA commented Apr 2, 2017

Test build #75458 has finished for PR 17492 at commit 1c1c054.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

struct
}

case FailFastMode =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will we hit this branch? Seems never?

Copy link
Member Author

@HyukjinKwon HyukjinKwon Apr 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks possible to run this line. I added a test in 1936L in JsonSuite. In more details, if the json is a valid one but not a object of an areay of object, it will infer not StructType per record. If one of the types is not a struct type and failfast mode is enabled, we will hit this line.

I just checked as below:

2017-04-02 10 46 39

.collect()
}
assert(exceptionOne.getMessage.contains("Failed to parse a value"))
assert(exceptionOne.getMessage.contains("Failed to infer a common schema"))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to my knowledge, this test case throws an exception when actually parsing the data before. Now, I intended to check the exception in schema inference as it looks parsing data case is covered below.

spark.read
.option("mode", "FAILFAST")
.json(corruptRecords)
.collect()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended to remove this line to make sure it happens in schrma inference. I think this is not related with the change here.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 4fa1a43 Apr 3, 2017
@HyukjinKwon
Copy link
Member Author

Thank you @cloud-fan.

@HyukjinKwon HyukjinKwon deleted the SPARK-19641 branch January 2, 2018 03:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants