Conversation

@sunchao
Member

@sunchao sunchao commented Sep 12, 2021

What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:

  • PARQUET-2064: Make Range public accessible in RowRanges
  • PARQUET-2022: ZstdDecompressorStream should close zstdInputStream
  • PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
  • PARQUET-1633: Fix integer overflow
  • PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
  • PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
  • PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
  • PARQUET-2078: Failed to read parquet file after writing with the same

In particular, PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

@github-actions github-actions bot added the BUILD label Sep 12, 2021
@SparkQA

SparkQA commented Sep 12, 2021

Test build #143177 has finished for PR 33969 at commit bdbba44.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47680/

@SparkQA

SparkQA commented Sep 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47680/

@srowen
Member

srowen commented Sep 12, 2021

This blocks 3.2.0 right?
Maybe it's not 100% published just yet?
[error] sbt.librarymanagement.ResolveException: Error downloading org.apache.parquet:parquet-column:1.12.1

@github-actions github-actions bot added the SQL label Sep 12, 2021
@sunchao
Member Author

sunchao commented Sep 12, 2021

Oops I missed a few places where the Apache staging repo needs to be added. Trying it again.
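For reference, a staging-repository entry of the kind being added here looks roughly like the following in pom.xml. This is a sketch based on the snippet quoted later in the review; the `<releases>` settings are an assumption, not necessarily what this PR used:

```xml
<!-- Temporary: resolve Parquet 1.12.1 RC artifacts from the Apache staging repository.
     This should be removed before merging, once the release reaches Maven Central. -->
<repository>
  <id>staged</id>
  <name>parquet-staged-releases</name>
  <url>https://repository.apache.org/content/repositories/staging/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
</repository>
```

Without an entry like this in every module that resolves Parquet, the build fails with the `ResolveException` shown above, since the RC artifacts are not yet on Maven Central.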

@sunchao
Member Author

sunchao commented Sep 12, 2021

blocks 3.2.0 right?

Yes this blocks 3.2.0 - I'm testing RC0 here

@dongjoon-hyun
Member

Thank you, @sunchao !

Member

@viirya viirya left a comment

Thanks for this work! We also need to update dependency manifest.

@sunchao
Member Author

sunchao commented Sep 12, 2021

Thanks, will do in the next run. I verified this PR with the problematic file in SPARK-36696, and it now works:

  test("SPARK-36726: test incorrect Parquet row group file offset") {
    readParquetFile("/tmp/example.parquet") { df =>
      assert(df.count() == 3650)
    }
  }

pom.xml Outdated
<repository>
<id>staged</id>
<name>parquet-staged-releases</name>
<url>https://repository.apache.org/content/repositories/staging/</url>
Member

This is OK to test - we'd want to wait until it's officially released and then remove it before merging

Member

+1

@SparkQA

SparkQA commented Sep 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47682/

@SparkQA

SparkQA commented Sep 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47682/

@dongjoon-hyun
Member

cc @gengliangwang

@SparkQA

SparkQA commented Sep 12, 2021

Test build #143179 has finished for PR 33969 at commit 96a7269.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

@sunchao Thank you for making this PR right after the Parquet 1.12.1 RC0!

Member

@dongjoon-hyun dongjoon-hyun left a comment

Please regenerate the dependency manifests.

dev/test-dependencies.sh --replace-manifest

@holdenk
Contributor

holdenk commented Sep 13, 2021

I know this is covered by UT in Parquet, but would it make sense to cover the behaviour which was a regression inside of Spark UT as well?

@sunchao
Member Author

sunchao commented Sep 13, 2021

@holdenk yes this is a good point, just added. Also updated the dependency manifest files.

@SparkQA

SparkQA commented Sep 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47725/

@SparkQA

SparkQA commented Sep 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47725/

@SparkQA

SparkQA commented Sep 13, 2021

Test build #143222 has finished for PR 33969 at commit de50e4b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


  test("SPARK-36726: test incorrect Parquet row group file offset") {
    readParquetFile("test-data/malformed-file-offset.parquet") { df =>
      assert(df.count() == 3650)
    }
  }
Member

should we test for both vec/non vec code path?

Member Author

I think readParquetFile already takes care of that, see FileBasedDataSourceTest.readFile.

Member

I thought vectorized is enabled by default and the non-vectorized path would not be tested. Given `protected def readFile(path: String, testVectorized: Boolean = true)`, we would need:

  Seq(true, false).foreach { vec =>
    readParquetFile(testFile("file.parquet"), vec) { df =>
      ...
    }
  }

Member Author

This is how FileBasedDataSourceTest.readFile is defined:

  protected def readFile(path: String, testVectorized: Boolean = true)
      (f: DataFrame => Unit): Unit = {
    withSQLConf(vectorizedReaderEnabledKey -> "false") {
      f(spark.read.format(dataSourceName).load(path.toString))
    }
    if (testVectorized) {
      withSQLConf(vectorizedReaderEnabledKey -> "true") {
        f(spark.read.format(dataSourceName).load(path.toString))
      }
    }
  }

so to me it tests both paths. Did I miss anything?

Member

I am surprised that the non-vectorized reader is always tested, regardless of whether testVectorized is enabled...

  protected def readFile(path: String, testVectorized: Boolean = true)
      (f: DataFrame => Unit): Unit = {
    withSQLConf(vectorizedReaderEnabledKey -> "false") {
      f(spark.read.format(dataSourceName).load(path.toString))
    }
    if (testVectorized) {
      withSQLConf(vectorizedReaderEnabledKey -> "true") {
        f(spark.read.format(dataSourceName).load(path.toString))
      }
    }
  }

Member Author

Yes, I guess it makes sense since in most cases we'd want to test both code paths and this is convenient for that.

Member

You are right. I'm surprised that the non-vectorized path is always tested. Many tests specifically exercise both paths, which is unnecessary given that the non-vectorized path is always covered.

@SparkQA

SparkQA commented Sep 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47730/

@SparkQA

SparkQA commented Sep 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47730/

@SparkQA

SparkQA commented Sep 14, 2021

Test build #143227 has finished for PR 33969 at commit 7888295.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

BTW, I assumed that the last tests passed with RC1 instead of RC0.
@sunchao . Is there a way to check that?

@sunchao
Member Author

sunchao commented Sep 14, 2021

@dongjoon-hyun yes, I manually checked https://repository.apache.org/content/repositories/staging/org/apache/parquet/ and verified it has the latest bits before kicking off the CI here. I also synced with @shangxinli, who is the release manager on this.

@SparkQA

SparkQA commented Sep 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47776/

@SparkQA

SparkQA commented Sep 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47776/

@SparkQA

SparkQA commented Sep 14, 2021

Test build #143273 has finished for PR 33969 at commit ba93e35.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao marked this pull request as ready for review September 15, 2021 16:07
@dbtsai dbtsai self-requested a review September 15, 2021 16:24
Member

@dbtsai dbtsai left a comment

LGTM. Pending CI.

Contributor

@holdenk holdenk left a comment

+1 (pending ci)

@SparkQA

SparkQA commented Sep 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47819/

@SparkQA

SparkQA commented Sep 15, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47819/

@SparkQA

SparkQA commented Sep 15, 2021

Test build #143316 has finished for PR 33969 at commit 0097af6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Sep 15, 2021

Merged into master/branch-3.2. Thanks.

cc @gengliangwang

@dbtsai dbtsai closed this in a927b08 Sep 15, 2021
dbtsai pushed a commit that referenced this pull request Sep 15, 2021

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit a927b08)
Signed-off-by: DB Tsai <[email protected]>
@gengliangwang
Member

Late LGTM. Thank you, @sunchao !

domybest11 pushed a commit to domybest11/spark that referenced this pull request Jun 15, 2022

Closes apache#33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit a927b08)
Signed-off-by: DB Tsai <[email protected]>
wangyum added a commit that referenced this pull request May 26, 2023
* [CARMEL-5873] Upgrade Parquet to 1.12.2 (#896)

* [SPARK-36726] Upgrade Parquet to 1.12.1


Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>

(cherry picked from commit a927b08)

* [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

### What changes were proposed in this pull request?

Parquet 1.12.0 New Feature
- PARQUET-41 - Add bloom filters to parquet statistics
- PARQUET-1373 - Encryption key management tools
- PARQUET-1396 - Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
- PARQUET-1622 - Add BYTE_STREAM_SPLIT encoding
- PARQUET-1784 - Column-wise configuration
- PARQUET-1817 - Crypto Properties Factory
- PARQUET-1854 - Properties-Driven Interface to Parquet Encryption

Parquet 1.12.0 release notes:
https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/CHANGES.md

### Why are the changes needed?

- Bloom filters to improve filter performance
- ZSTD enhancement

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test.

Closes #31649 from wangyum/SPARK-34542.

Lead-authored-by: Yuming Wang <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

(cherry picked from commit cbffc12)

Co-authored-by: Chao Sun <[email protected]>

* [HADP-44647] Parquet file based kms client for encryption keys (#897)

* [HADP-44647] Parquet file based kms client for encryption keys (#82)

create/write parquet encryption table.
```
set spark.sql.parquet.encryption.key.file=/path/to/key/file;

create table parquet_encryption(a int, b int, c int)
using parquet
options (
'parquet.encryption.column.keys' 'columnKey1: a, b; columnKey2: c',
'parquet.encryption.footer.key' 'footerKey');
```

read parquet encryption table;

```
set spark.sql.parquet.encryption.key.file=/path/to/key/file;

select ... from parquet_encryption ...
```

Will raise another pr for default footerKey.

* [HADP-44647][FOLLOWUP] Reuse the kms instance for same key file (#84)

* Fix

Co-authored-by: fwang12 <[email protected]>

Co-authored-by: Chao Sun <[email protected]>
Co-authored-by: fwang12 <[email protected]>