[SPARK-25159][SQL] json schema inference should only trigger one job #22152
Conversation
```scala
// the fold functions in the scheduler event loop thread.
val existingConf = SQLConf.get
var rootType: DataType = StructType(Nil)
val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
```
Need to do sc.clean(typeMerger) manually here?
This closure is defined by us and I don't think we leak outer reference here. If we do, it's a bug and we should fix it.
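For context on why no manual `sc.clean` call is needed here: `SparkContext#clean` strips accidental outer-object references from closures before serialization, and the concern only arises when a lambda references a field of an enclosing object. A minimal Spark-free sketch of the capture problem (the `SchemaInferrer` class and its members are hypothetical, for illustration only):

```scala
class SchemaInferrer(val options: Map[String, String]) extends Serializable {
  // Referencing the field directly makes the lambda capture `this`,
  // dragging the whole enclosing SchemaInferrer into the serialized closure.
  def leakyMerger: (Int, Int) => Int =
    (a, b) => a + b + options.size

  // Copying the field to a local val first keeps the closure self-contained,
  // so closures built this way need no manual clean() call.
  def cleanMerger: (Int, Int) => Int = {
    val optionCount = options.size
    (a, b) => a + b + optionCount
  }
}
```

The closures in the quoted diff follow the second pattern: they only capture local vals, not an outer object.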
Yeah, agreed.
Test build #94952 has finished for PR 22152 at commit
```scala
val schedulerEventLoopThread =
  SparkContext.getActive.get.dagScheduler.eventProcessLoop.eventThread
if (schedulerEventLoopThread.getId == Thread.currentThread().getId) {
// will return `fallbackConf` which is unexpected. Here we requires the caller to get the
```
nit: we require
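For readers following along: the quoted diff guards against calling `SQLConf.get` from the scheduler event loop thread, where no session is attached and the call would silently fall back to `fallbackConf`. A rough, Spark-free sketch of the thread-local override idea behind `SQLConf.withExistingConf` (simplified types and names; the real implementation lives in Spark's `SQLConf`):

```scala
object ConfSketch {
  private val existing = new ThreadLocal[Map[String, String]]

  // Run `f` with `conf` visible to get() on this thread, restoring the
  // previous value afterwards even if `f` throws.
  def withExistingConf[T](conf: Map[String, String])(f: => T): T = {
    val old = existing.get()
    existing.set(conf)
    try f finally existing.set(old)
  }

  // Falls back to an empty conf when nothing was set on this thread.
  def get: Map[String, String] = Option(existing.get()).getOrElse(Map.empty)
}
```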
```scala
var rootType: DataType = StructType(Nil)
val foldPartition = (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger)
val mergeResult = (index: Int, taskResult: DataType) => {
  rootType = SQLConf.withExistingConf(existingConf) {
```
just a question, wouldn't:

```scala
val partitionsResult = json.sparkContext.runJob(mergedTypesFromPartitions, foldPartition)
partitionsResult.fold(typeMerger)
```

do the same without requiring these changes?
This can work, but then we have to keep a large result array, which can cause GC problems.
It would contain one result per partition; do you think that is enough to cause GC problems?
The schema can be very complex (e.g. a very wide and deep schema).
yes, makes sense, thanks.
The same question was on my mind. Thanks for the clarification.
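The memory tradeoff discussed in this thread can be sketched as follows (a sketch only, using the variable names from the quoted diff and assuming `sc` is the active `SparkContext`):

```scala
// Option A (collect-then-fold): the driver materializes one inferred
// DataType per partition before merging. With many partitions and a
// wide/deep schema, this array can pressure the driver heap.
val perPartition: Array[DataType] = sc.runJob(mergedTypesFromPartitions, foldPartition)
val rootA = perPartition.fold(StructType(Nil))(typeMerger)

// Option B (the PR's approach): a result handler merges each partition's
// type into the accumulator as it arrives, so at most one extra DataType
// is alive on the driver at a time.
var rootB: DataType = StructType(Nil)
sc.runJob(
  mergedTypesFromPartitions,
  foldPartition,
  mergedTypesFromPartitions.partitions.indices,
  (_: Int, taskResult: DataType) => rootB = typeMerger(rootB, taskResult))
```

Both options run a single Spark job; they differ only in how much per-partition state the driver retains while merging.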
```scala
// triggers one Spark job per RDD partition.
Seq(1 -> "a", 2 -> "b").toDF("i", "p")
// The data set has 2 partitions, so Spark will write at least 2 json files.
// Use a non-splittable compression (gzip), to make sure the json scan RDD has at lease 2
```
nit: at least
Test build #95022 has finished for PR 22152 at commit
retest this please.
Test build #95068 has finished for PR 22152 at commit
gatorsmile left a comment
All the tests pass. The last commit is a typo fix. It should be fine.
LGTM
Thanks! Merged to master.
Test build #95077 has finished for PR 22152 at commit
What changes were proposed in this pull request?

This fixes a perf regression caused by #21376.

We should not use `RDD#toLocalIterator`, which triggers one Spark job per RDD partition. This is very bad for RDDs with a lot of small partitions. To fix it, this PR introduces a way to access `SQLConf` in the scheduler event loop thread, so that we don't need to use `RDD#toLocalIterator` anymore in `JsonInferSchema`.

How was this patch tested?

A new test.
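A condensed before/after of the change, reconstructed from the diff snippets quoted in the review above (treat this as a sketch, not the exact merged code):

```scala
// Before (#21376): toLocalIterator schedules one Spark job per partition,
// which is slow for RDDs with many small partitions.
val rootTypeBefore = mergedTypesFromPartitions
  .toLocalIterator
  .fold(StructType(Nil))(typeMerger)

// After: a single runJob. Per-partition types are folded on the executors,
// and the driver-side result handler merges them under the SQLConf captured
// before the job was submitted (the event loop thread has no session conf).
val existingConf = SQLConf.get
var rootType: DataType = StructType(Nil)
json.sparkContext.runJob(
  mergedTypesFromPartitions,
  (iter: Iterator[DataType]) => iter.fold(StructType(Nil))(typeMerger),
  mergedTypesFromPartitions.partitions.indices,
  (_: Int, taskResult: DataType) => rootType = SQLConf.withExistingConf(existingConf) {
    typeMerger(rootType, taskResult)
  })
```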