[SPARK-36726] Upgrade Parquet to 1.12.1 #33969
Conversation
Test build #143177 has finished for PR 33969 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
This blocks 3.2.0, right?
Oops, I missed a few places where the Apache staging repo needs to be added. Trying it again.
Yes, this blocks 3.2.0 - I'm testing RC0 here.
Thank you, @sunchao !
Thanks for this work! We also need to update the dependency manifest.
Thanks, will do in the next run. I verified this PR with the problematic file from SPARK-36696, and it now works:

```scala
test("SPARK-36726: test incorrect Parquet row group file offset") {
  readParquetFile("/tmp/example.parquet") { df =>
    assert(df.count() == 3650)
  }
}
```
pom.xml (Outdated)

```xml
<repository>
  <id>staged</id>
  <name>parquet-staged-releases</name>
  <url>https://repository.apache.org/content/repositories/staging/</url>
</repository>
```
This is OK to test - we'd want to wait until it's officially released and then remove it before merging
+1
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #143179 has finished for PR 33969 at commit
@sunchao Thank you for making this PR right after the Parquet 1.12.1 RC0!
Please regenerate the dependency manifests:

```
dev/test-dependencies.sh --replace-manifest
```
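For context, a sketch of how that fits into the usual update flow (the `dev/deps/` path reflects the Spark repo layout; the `git diff` step is purely illustrative):

```bash
# Regenerate the dependency manifests; this rewrites the files under dev/deps/.
./dev/test-dependencies.sh --replace-manifest

# Review the result - the parquet-* entries should now point at 1.12.1 -
# and commit the updated manifests together with the pom.xml change.
git diff dev/deps/
```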
I know this is covered by UT in Parquet, but would it make sense to cover the behaviour which was a regression in a Spark UT as well?
@holdenk yes, this is a good point - just added. Also updated the dependency manifest files.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #143222 has finished for PR 33969 at commit
| test("SPARK-36726: test incorrect Parquet row group file offset") { | ||
| readParquetFile("test-data/malformed-file-offset.parquet") { df => | ||
| assert(df.count() == 3650) |
should we test both the vec/non-vec code paths?
I think readParquetFile already takes care of that, see FileBasedDataSourceTest.readFile.
I thought vectorized is enabled by default and the non-vectorized path will not be tested, given the signature `protected def readFile(path: String, testVectorized: Boolean = true)`, so we need:

```scala
Seq(true, false).foreach { vec =>
  readParquetFile(testFile("file.parquet"), vec) { df =>
    ...
  }
}
```
This is how FileBasedDataSourceTest.readFile is defined:

```scala
protected def readFile(path: String, testVectorized: Boolean = true)
    (f: DataFrame => Unit): Unit = {
  withSQLConf(vectorizedReaderEnabledKey -> "false") {
    f(spark.read.format(dataSourceName).load(path.toString))
  }
  if (testVectorized) {
    withSQLConf(vectorizedReaderEnabledKey -> "true") {
      f(spark.read.format(dataSourceName).load(path.toString))
    }
  }
}
```

so to me it tests both paths. Did I miss anything?
I am surprised that the non-vectorized reader is always tested, whether or not testVectorized is enabled...

```scala
protected def readFile(path: String, testVectorized: Boolean = true)
    (f: DataFrame => Unit): Unit = {
  withSQLConf(vectorizedReaderEnabledKey -> "false") {
    f(spark.read.format(dataSourceName).load(path.toString))
  }
  if (testVectorized) {
    withSQLConf(vectorizedReaderEnabledKey -> "true") {
      f(spark.read.format(dataSourceName).load(path.toString))
    }
  }
}
```
Yes, I guess it makes sense since in most cases we'd want to test both code paths and this is convenient for that.
You are right. I'm surprised that the non-vectorized path is always tested. Many tests explicitly loop over both paths, which isn't necessary given that the non-vectorized path is always exercised.
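To make the point concrete, a small sketch (the file name and row count are hypothetical; `readParquetFile`/`testFile` as in the snippets above): the explicit loop adds nothing over the default, since `readFile` already runs the body on both paths:

```scala
// Redundant: readFile already runs the body with vectorization off,
// and then (with the default testVectorized = true) again with it on.
Seq(true, false).foreach { vec =>
  readParquetFile(testFile("file.parquet"), vec) { df =>
    assert(df.count() == 3650)
  }
}

// Equivalent coverage from a single call with the default:
readParquetFile(testFile("file.parquet")) { df =>
  assert(df.count() == 3650)
}
```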
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #143227 has finished for PR 33969 at commit
BTW, I assumed that the last tests passed with RC1 instead of RC0.
@dongjoon-hyun yes, I manually checked https://repository.apache.org/content/repositories/staging/org/apache/parquet/ and verified it has the latest bits before kicking off the CI here. I also synced with @shangxinli, who is the release manager on this.
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #143273 has finished for PR 33969 at commit
|
LGTM. Pending CI.
+1 (pending ci)
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #143316 has finished for PR 33969 at commit
Merged into master/branch-3.2. Thanks.
### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:

- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit a927b08)
Signed-off-by: DB Tsai <[email protected]>
Late LGTM. Thank you, @sunchao !