Commit a7dc824

sunchao authored and dbtsai committed
[SPARK-36726] Upgrade Parquet to 1.12.1
### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <[email protected]>
Signed-off-by: DB Tsai <[email protected]>

(cherry picked from commit a927b08)
Signed-off-by: DB Tsai <[email protected]>
1 parent 017bce7 commit a7dc824

5 files changed, +19 -13 lines changed

dev/deps/spark-deps-hadoop-2.7-hive-2.3

Lines changed: 6 additions & 6 deletions
@@ -201,12 +201,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
 py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar

dev/deps/spark-deps-hadoop-3.2-hive-2.3

Lines changed: 6 additions & 6 deletions
@@ -172,12 +172,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
 py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar
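
Both dependency manifests above have to move all six `parquet-*` entries together. A minimal sketch of a consistency check follows, assuming each manifest line has the shape `artifact/version/classifier/jar` with an empty classifier, as in the entries shown. `ParquetManifestCheck`, `Dep`, `parse` and `parquetVersions` are hypothetical names for illustration and are not part of Spark's build tooling.

```scala
// Hypothetical helper, not part of Spark: reports which versions the
// parquet-* entries of a dependency manifest are pinned to.
object ParquetManifestCheck {
  // One manifest entry; the classifier field is empty in the lines shown above.
  final case class Dep(artifact: String, version: String, classifier: String, jar: String)

  // Split on '/', keeping empty fields so the empty classifier is preserved.
  def parse(line: String): Option[Dep] =
    line.trim.split("/", -1) match {
      case Array(artifact, version, classifier, jar) =>
        Some(Dep(artifact, version, classifier, jar))
      case _ => None // not a four-field manifest line
    }

  // Distinct versions across all parquet-* artifacts; lines that do not parse are skipped.
  def parquetVersions(lines: Seq[String]): Set[String] =
    lines.flatMap(parse).filter(_.artifact.startsWith("parquet-")).map(_.version).toSet

  def main(args: Array[String]): Unit = {
    val manifest = Seq(
      "paranamer/2.8//paranamer-2.8.jar",
      "parquet-column/1.12.1//parquet-column-1.12.1.jar",
      "parquet-common/1.12.1//parquet-common-1.12.1.jar",
      "parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar",
      "protobuf-java/2.5.0//protobuf-java-2.5.0.jar"
    )
    val versions = parquetVersions(manifest)
    // A partially applied upgrade would leave more than one version in the set.
    require(versions.size == 1, s"mixed Parquet versions: $versions")
    println(s"Parquet artifacts pinned to: ${versions.mkString(", ")}")
  }
}
```

A manifest that still carried one `1.12.0` entry would produce a two-element set and fail the `require`, which is the kind of partial bump this check is meant to catch.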

pom.xml

Lines changed: 1 addition & 1 deletion
@@ -136,7 +136,7 @@
     <kafka.version>2.8.0</kafka.version>
     <!-- After 10.15.1.3, the minimum required version is JDK9 -->
     <derby.version>10.14.2.0</derby.version>
-    <parquet.version>1.12.0</parquet.version>
+    <parquet.version>1.12.1</parquet.version>
     <orc.version>1.6.10</orc.version>
     <jetty.version>9.4.43.v20210629</jetty.version>
     <jakartaservlet.version>4.0.3</jakartaservlet.version>

Binary file not shown.

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala

Lines changed: 6 additions & 0 deletions
@@ -855,6 +855,12 @@ class ParquetIOSuite extends QueryTest with ParquetTest with SharedSparkSession
     }
   }
 
+  test("SPARK-36726: test incorrect Parquet row group file offset") {
+    readParquetFile(testFile("test-data/malformed-file-offset.parquet")) { df =>
+      assert(df.count() == 3650)
+    }
+  }
+
   test("VectorizedParquetRecordReader - direct path read") {
     val data = (0 to 10).map(i => (i, (i + 'a').toChar.toString))
     withTempPath { dir =>
