Conversation

@sunchao
Member

@sunchao sunchao commented Sep 12, 2021

What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:

  • PARQUET-2064: Make Range public accessible in RowRanges
  • PARQUET-2022: ZstdDecompressorStream should close zstdInputStream
  • PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
  • PARQUET-1633: Fix integer overflow
  • PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
  • PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
  • PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
  • PARQUET-2078: Failed to read parquet file after writing with the same

In particular, PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

@github-actions github-actions bot added the BUILD label Sep 12, 2021
@SparkQA

SparkQA commented Sep 12, 2021

Test build #143177 has finished for PR 33969 at commit bdbba44.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47680/

@SparkQA

SparkQA commented Sep 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47680/

@srowen
Member

srowen commented Sep 12, 2021

This blocks 3.2.0 right?
Maybe it's not 100% published just yet?
[error] sbt.librarymanagement.ResolveException: Error downloading org.apache.parquet:parquet-column:1.12.1

@github-actions github-actions bot added the SQL label Sep 12, 2021
@sunchao
Member Author

sunchao commented Sep 12, 2021

Oops I missed a few places where the Apache staging repo needs to be added. Trying it again.
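For reference, a staging-repository entry of the kind being added here looks roughly like the following in pom.xml. This is a sketch based on the snippet quoted later in the review; the `<releases>` settings are an assumption, not necessarily what this PR used:

```xml
<!-- Temporary: resolve Parquet 1.12.1 RC artifacts from the Apache staging repository.
     This should be removed before merging, once the release reaches Maven Central. -->
<repository>
  <id>staged</id>
  <name>parquet-staged-releases</name>
  <url>https://repository.apache.org/content/repositories/staging/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
</repository>
```

Without an entry like this in every module that resolves Parquet, the build fails with the `ResolveException` shown above, since the RC artifacts are not yet on Maven Central.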

@sunchao
Member Author

sunchao commented Sep 12, 2021

blocks 3.2.0 right?

Yes this blocks 3.2.0 - I'm testing RC0 here

@dongjoon-hyun
Member

Thank you, @sunchao !

Member

@viirya viirya left a comment

Thanks for this work! We also need to update dependency manifest.

@sunchao
Member Author

sunchao commented Sep 12, 2021

Thanks, will do in the next run. I verified this PR with the problematic file in SPARK-36696, and it now works:

  test("SPARK-36726: test incorrect Parquet row group file offset") {
    readParquetFile("/tmp/example.parquet") { df =>
      assert(df.count() == 3650)
    }
  }

pom.xml Outdated
<repository>
<id>staged</id>
<name>parquet-staged-releases</name>
<url>https://repository.apache.org/content/repositories/staging/</url>
Member

This is OK to test - we'd want to wait until it's officially released and then remove it before merging

Member

+1

@SparkQA

SparkQA commented Sep 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47682/

@SparkQA

SparkQA commented Sep 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47682/

@dongjoon-hyun
Member

cc @gengliangwang

@SparkQA

SparkQA commented Sep 12, 2021

Test build #143179 has finished for PR 33969 at commit 96a7269.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

@sunchao Thank you for making this PR right after the Parquet 1.12.1 RC0!

Member

@dongjoon-hyun dongjoon-hyun left a comment

Please regenerate the dependency manifests.

dev/test-dependencies.sh --replace-manifest

@holdenk
Contributor

holdenk commented Sep 13, 2021

I know this is covered by UT in Parquet, but would it make sense to cover the behaviour which was a regression inside of Spark UT as well?

@sunchao
Member Author

sunchao commented Sep 13, 2021

@holdenk yes this is a good point, just added. Also updated the dependency manifest files.

@SparkQA

SparkQA commented Sep 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47725/

@SparkQA

SparkQA commented Sep 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47725/

@SparkQA

SparkQA commented Sep 13, 2021

Test build #143222 has finished for PR 33969 at commit de50e4b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


  test("SPARK-36726: test incorrect Parquet row group file offset") {
    readParquetFile("test-data/malformed-file-offset.parquet") { df =>
      assert(df.count() == 3650)
    }
  }
Member

should we test for both vec/non vec code path?

Member Author

I think readParquetFile already takes care of that, see FileBasedDataSourceTest.readFile.

Member

I thought vectorized is enabled by default and the non-vectorized path would not be tested. Given `protected def readFile(path: String, testVectorized: Boolean = true)`, we would need:

  Seq(true, false).foreach { vec =>
    readParquetFile(testFile("file.parquet"), vec) { df =>
      ...
    }
  }

Member Author

This is how FileBasedDataSourceTest.readFile is defined:

  protected def readFile(path: String, testVectorized: Boolean = true)
      (f: DataFrame => Unit): Unit = {
    withSQLConf(vectorizedReaderEnabledKey -> "false") {
      f(spark.read.format(dataSourceName).load(path.toString))
    }
    if (testVectorized) {
      withSQLConf(vectorizedReaderEnabledKey -> "true") {
        f(spark.read.format(dataSourceName).load(path.toString))
      }
    }
  }

so to me it tests both paths. Did I miss anything?

Member

I am surprised that the non-vectorized reader is always tested, regardless of whether testVectorized is enabled...

  protected def readFile(path: String, testVectorized: Boolean = true)
      (f: DataFrame => Unit): Unit = {
    withSQLConf(vectorizedReaderEnabledKey -> "false") {
      f(spark.read.format(dataSourceName).load(path.toString))
    }
    if (testVectorized) {
      withSQLConf(vectorizedReaderEnabledKey -> "true") {
        f(spark.read.format(dataSourceName).load(path.toString))
      }
    }
  }

Member Author

Yes, I guess it makes sense since in most cases we'd want to test both code paths and this is convenient for that.

Member

You are right. I'm surprised that the non-vectorized path is always tested. Many tests specifically exercise both paths, which is unnecessary given that the non-vectorized path is always covered.

@SparkQA

SparkQA commented Sep 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47730/

@SparkQA

SparkQA commented Sep 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47730/

@SparkQA

SparkQA commented Sep 14, 2021

Test build #143227 has finished for PR 33969 at commit 7888295.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

BTW, I assumed that the last tests passed with RC1 instead of RC0.
@sunchao . Is there a way to check that?

@sunchao
Member Author

sunchao commented Sep 14, 2021

@dongjoon-hyun yes, I manually checked https://repository.apache.org/content/repositories/staging/org/apache/parquet/ and verified it has the latest bits before kicking off the CI here. I also synced with @shangxinli, who is the release manager on this.

@SparkQA

SparkQA commented Sep 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47776/

@SparkQA

SparkQA commented Sep 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47776/

@SparkQA

SparkQA commented Sep 14, 2021

Test build #143273 has finished for PR 33969 at commit ba93e35.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao marked this pull request as ready for review September 15, 2021 16:07
@dbtsai dbtsai self-requested a review September 15, 2021 16:24
Member

@dbtsai dbtsai left a comment

LGTM. Pending CI.

Contributor

@holdenk holdenk left a comment

+1 (pending ci)

@SparkQA

SparkQA commented Sep 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47819/

@SparkQA

SparkQA commented Sep 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47819/

@SparkQA

SparkQA commented Sep 15, 2021

Test build #143316 has finished for PR 33969 at commit 0097af6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Sep 15, 2021

Merged into master/branch-3.2. Thanks.

cc @gengliangwang

@dbtsai dbtsai closed this in a927b08 Sep 15, 2021
dbtsai pushed a commit that referenced this pull request Sep 15, 2021

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit a927b08)
Signed-off-by: DB Tsai <[email protected]>
@gengliangwang
Member

Late LGTM. Thank you, @sunchao !

domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022

Closes apache#33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit a927b08)
Signed-off-by: DB Tsai <[email protected]>
wangyum added a commit that referenced this pull request May 26, 2023
* [CARMEL-5873] Upgrade Parquet to 1.12.2 (#896)

* [SPARK-36726] Upgrade Parquet to 1.12.1


Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>

(cherry picked from commit a927b08)

* [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

### What changes were proposed in this pull request?

Parquet 1.12.0 New Feature
- PARQUET-41 - Add bloom filters to parquet statistics
- PARQUET-1373 - Encryption key management tools
- PARQUET-1396 - Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
- PARQUET-1622 - Add BYTE_STREAM_SPLIT encoding
- PARQUET-1784 - Column-wise configuration
- PARQUET-1817 - Crypto Properties Factory
- PARQUET-1854 - Properties-Driven Interface to Parquet Encryption

Parquet 1.12.0 release notes:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md

### Why are the changes needed?

- Bloom filters to improve filter performance
- ZSTD enhancement

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test.

Closes #31649 from wangyum/SPARK-34542.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

(cherry picked from commit cbffc12)

Co-authored-by: Chao Sun <[email protected]>

* [HADP-44647] Parquet file based kms client for encryption keys (#897)

* [HADP-44647] Parquet file based kms client for encryption keys (#82)

create/write parquet encryption table.
```
set spark.sql.parquet.encryption.key.file=/path/to/key/file;

create table parquet_encryption(a int, b int, c int)
using parquet
options (
'parquet.encryption.column.keys' 'columnKey1: a, b; columnKey2: c',
'parquet.encryption.footer.key' 'footerKey');
```

read parquet encryption table;

```
set spark.sql.parquet.encryption.key.file=/path/to/key/file;

select ... from parquet_encryption ...
```

Will raise another pr for default footerKey.

* [HADP-44647][FOLLOWUP] Reuse the kms instance for same key file (#84)

* Fix

Co-authored-by: fwang12 <[email protected]>

Co-authored-by: Chao Sun <[email protected]>
Co-authored-by: fwang12 <[email protected]>