[SPARK-27418][SQL] Migrate Parquet to File Data Source V2 #24327

gengliangwang · 2019-04-09T17:22:05Z

What changes were proposed in this pull request?

Migrate Parquet to File Data Source V2

How was this patch tested?

Unit test

SparkQA · 2019-04-09T19:21:15Z

Test build #104442 has finished for PR 24327 at commit 27e602f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class ParquetLogRedirector implements Serializable
class ParquetFilters(
class ParquetOutputWriter(path: String, context: TaskAttemptContext)
class ParquetReadSupport(val convertTz: Option[TimeZone],
case class FileTypes(
class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging
class ParquetDataSourceV2 extends FileDataSourceV2
case class ParquetPartitionReaderFactory(
case class ParquetScan(
case class ParquetScanBuilder(
case class ParquetTable(
class ParquetWriteBuilder(

SparkQA · 2019-04-11T10:53:07Z

Test build #104503 has finished for PR 24327 at commit 336ac92.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-12T18:47:15Z

Test build #104550 has finished for PR 24327 at commit 99b8575.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2019-04-17T18:18:05Z

Test build #104662 has finished for PR 24327 at commit cf6837c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala

SparkQA · 2019-04-23T01:33:00Z

Test build #104816 has finished for PR 24327 at commit ab70c37.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-23T07:05:03Z

Test build #104823 has finished for PR 24327 at commit b3b04b0.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-23T21:17:28Z

Test build #104841 has finished for PR 24327 at commit 138344e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-04-26T22:25:00Z

This is ready for review. @cloud-fan @HyukjinKwon @dongjoon-hyun
I will update test cases related to sameResult after #24475 is merged.

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

...c/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetDataSourceV2.scala

SparkQA · 2019-04-27T00:37:12Z

Test build #104948 has finished for PR 24327 at commit 30d88cb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

SparkQA · 2019-04-29T23:00:40Z

Test build #105005 has finished for PR 24327 at commit c6e602f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

...e/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala

SparkQA · 2019-04-30T22:22:39Z

Test build #105040 has finished for PR 24327 at commit 18fda4f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-30T23:08:32Z

Test build #105042 has finished for PR 24327 at commit 104aaa1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-05-01T00:02:58Z

The test case "org.apache.spark.sql.streaming.FileStreamSinkSuite.writing with aggregation" becomes flaky with this PR.
@cloud-fan @jose-torres Any idea about that?

gengliangwang · 2019-05-01T00:03:10Z

retest this please.

rdblue · 2019-06-06T19:35:45Z

...la/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala

+  }
+
+  private def createRowBaseReader(file: PartitionedFile): ParquetRecordReader[UnsafeRow] = {
+    buildReaderBase(file, createRowBaseReader0).asInstanceOf[ParquetRecordReader[UnsafeRow]]


buildReaderBase is parameterized, but the result is still casted. Why not parameterize so that ti retursn ParquetRecordReader[UnsafeRow] to avoid the cast? I think these casts should be removed.

rdblue · 2019-06-06T19:41:07Z

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScanBuilder.scala

+
+  // The actual filter push down happens in [[ParquetPartitionReaderFactory]].
+  // It requires the Parquet physical schema to determine whether a filter is convertible.
+  // So here we simply mark that all the filters are pushed down.


This comment isn't correct. All filters that can be converted to Parquet are pushed down.

rdblue · 2019-06-06T19:45:11Z

...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetTable.scala

+    paths: Seq[String],
+    userSpecifiedSchema: Option[StructType],
+    fallbackFileFormat: Class[_ <: FileFormat])
+  extends FileTable(sparkSession, options, paths, userSpecifiedSchema) {


Looks like this will also hit SPARK-27960. I think this is okay for now. No need to block Parquet to fix it.

However, it would be good to follow up with a suite of SQL tests for each v2 implementation that validates overall behavior, like reporting the metastore schema after a table is created.

rdblue · 2019-06-06T19:45:42Z

...c/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWriteBuilder.scala

+
+    val committerClass =
+      conf.getClass(
+        SQLConf.PARQUET_OUTPUT_COMMITTER_CLASS.key,


Does v2 also use Parquet _metadata files?

I think it is disabled by default

rdblue · 2019-06-06T19:48:00Z

...c/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWriteBuilder.scala

+    // Sets compression scheme
+    conf.set(ParquetOutputFormat.COMPRESSION, parquetOptions.compressionCodecClassName)
+
+    // SPARK-15719: Disables writing Parquet summary files by default.


If they are disabled by default in v1, why allow writing them in v2?

I think the behavior in V1 and V2 are the same: by default set "parquet.summary.metadata.level" as "NONE" and don't write the summary file. If the conf "parquet.summary.metadata.level" is set by user and spark.sql.parquet.output.committer.class is set correctly, then it will write the summary file.
See: https://issues.apache.org/jira/browse/SPARK-15719

Why should v2 support deprecated metadata files?

I think it is consistent with V1 here.
The value of parquet.summary.metadata.level is ALL by default. As per SPARK-15719, we should set it as NONE by default in Spark.
If users set the conf parquet.summary.metadata.level as ALL or COMMON_ONLY explicitly, Spark should write metadata files.

SparkQA · 2019-06-07T19:54:29Z

Test build #106277 has finished for PR 24327 at commit 7d3a568.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-06-08T05:50:58Z

@rdblue Thanks for the review. I have addressed all your comments. Any other concerns?

gengliangwang · 2019-06-11T06:44:24Z

retest this please.

SparkQA · 2019-06-11T07:05:02Z

Test build #106380 has finished for PR 24327 at commit 7d3a568.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-06-11T15:10:00Z

retest this please.

SparkQA · 2019-06-11T18:26:29Z

Test build #106391 has finished for PR 24327 at commit 7d3a568.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-06-11T22:28:19Z

@dongjoon-hyun Would you help do a final review and merge this one? Thanks!

rdblue · 2019-06-12T00:07:59Z

...la/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala

+
+  private def createVectorizedReader(file: PartitionedFile): VectorizedParquetRecordReader = {
+    val vectorizedReader =
+      buildReaderBase(file, createVectorizedReader0).asInstanceOf[VectorizedParquetRecordReader]


@gengliangwang, why is this cast here? I expected it to be removed when the one in createRowBaseReader was removed.

This is because here we need to call the method initBatch and enableReturningBatches of VectorizedParquetRecordReader. We can't just change the returned type as RecordReader[Void, Object] here.

HyukjinKwon · 2019-06-14T01:03:54Z

...la/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala

+    vectorizedReader
+  }
+
+  private def createVectorizedReader0(


no biggie but I'd name it to createParquetVectorizedReader

HyukjinKwon · 2019-06-14T01:06:23Z

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScanBuilder.scala

+    sparkSession.sessionState.newHadoopConfWithOptions(caseSensitiveMap)
+  }
+
+  lazy val _pushedFilters = {


not a big deal here too but I'd name it to pushedParquetFilters

HyukjinKwon · 2019-06-14T07:55:08Z

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetScanBuilder.scala

+      new SparkToParquetSchemaConverter(sparkSession.sessionState.conf).convert(schema)
+    val parquetFilters = new ParquetFilters(parquetSchema, pushDownDate, pushDownTimestamp,
+      pushDownDecimal, pushDownStringStartWith, pushDownInFilterThreshold, isCaseSensitive)
+    parquetFilters.convertibleFilters(this.filters).toArray


Sorry if I missed some context. What's diff between ParquetFilters.convertibleFilters and ParquetFilters.createFilters? Seems like logic is duplicated.

ParquetFilters.convertibleFilters returns Seq[org.apache.spark.sql.sources.Filter]
ParquetFilters.createFilters returns org.apache.parquet.filter2.predicate.FilterPredicate

The overlap of the two methods is only on the And/Or/Not operator.

SparkQA · 2019-06-14T20:26:46Z

Test build #106522 has finished for PR 24327 at commit f658e92.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-06-15T11:52:20Z

Merged to master.

For #24327 (comment), I think it's confusing. Let me take a look and see if I can make it simpler separately.

HyukjinKwon · 2019-06-15T11:53:58Z

@dongjoon-hyun, @rdblue, @cloud-fan, let me know if there are any major comments to address that I missed. If that's not easily fixable, I don't mind reverting it as well.

gengliangwang · 2019-06-15T13:35:00Z

@dongjoon-hyun @rdblue @cloud-fan @mallman @HyukjinKwon @gatorsmile @jaceklaskowski Thanks for the review!

cloud-fan · 2019-06-17T07:14:36Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala

-    assertDF(df)
+    // TODO: fix file source V2 as well.
+    withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> "parquet") {
+      val df = spark.readStream.format(classOf[FakeDefaultSource].getName).load()


how is this related to parquet?

[info] Decoded objects do not match expected objects: [info] expected: WrappedArray(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) [info] actual: WrappedArray(9, 0, 10, 1, 2, 8, 3, 6, 7, 5, 4) [info] assertnotnull(upcast(getcolumnbyordinal(0, LongType), LongType, - root class: "scala.Long")) [info] +- upcast(getcolumnbyordinal(0, LongType), LongType, - root class: "scala.Long") [info] +- getcolumnbyordinal(0, LongType) (QueryTest.scala:70)

We need to fix the read path for steaming output.

gengliangwang force-pushed the parquetV2 branch from 29a28f7 to 27e602f Compare April 9, 2019 17:23

gengliangwang force-pushed the parquetV2 branch from 336ac92 to 99b8575 Compare April 12, 2019 13:54

gengliangwang force-pushed the parquetV2 branch from 99b8575 to cf6837c Compare April 17, 2019 14:11

gengliangwang force-pushed the parquetV2 branch from cf6837c to ab70c37 Compare April 22, 2019 23:12

gengliangwang commented Apr 22, 2019

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala Outdated Show resolved Hide resolved

gengliangwang force-pushed the parquetV2 branch from b3b04b0 to 138344e Compare April 23, 2019 18:09

gengliangwang changed the title ~~[WIP][SPARK-27418][SQL] Migrate Parquet to File Data Source V2~~ [SPARK-27418][SQL] Migrate Parquet to File Data Source V2 Apr 26, 2019