Description
Describe the bug, including details regarding any error messages, version, and platform.
Noticed when upgrading parquet-java from 1.13.1 to 1.14.1:
java.lang.ClassCastException: class org.apache.parquet.column.values.dictionary.DictionaryValuesReader cannot be cast to class org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader (org.apache.parquet.column.values.dictionary.DictionaryValuesReader and org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader are in unnamed module of loader 'app')
at org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader.setPreviousReader(DeltaByteArrayReader.java:92)
at org.apache.parquet.column.impl.ColumnReaderBase.initDataReader(ColumnReaderBase.java:734)
at org.apache.parquet.column.impl.ColumnReaderBase.readPageV2(ColumnReaderBase.java:766)
at org.apache.parquet.column.impl.ColumnReaderBase.access$400(ColumnReaderBase.java:56)
at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:695)
at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:686)
at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:232)
at org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:686)
at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:660)
at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:802)
at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:427)
This appears to be due to PARQUET-2431 - https://github.com/apache/parquet-java/pull/1274/files#diff-362b7d44b24283c1bb1f6ca3e124cb72706a33ed96d86b58bf3339f20aafb4e9R732
Looking into how my code hit this, it seems that CorruptDeltaByteArrays.requiresSequentialReads was previously doing the job of a dataColumn instanceof RequiresPreviousReader check: CorruptDeltaByteArrays.requiresSequentialReads can only return true when encoding == Encoding.DELTA_BYTE_ARRAY, and org.apache.parquet.column.values.RequiresPreviousReader is only implemented by the *DeltaByteArrayReader classes.
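With no check on previousReader instanceof RequiresPreviousReader, the ClassCastException above becomes possible: a DictionaryValuesReader left over from the previous page is handed to DeltaByteArrayReader.setPreviousReader, which casts it to DeltaByteArrayReader.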
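To make the failure mode concrete, here is a minimal, self-contained sketch. The classes below are hypothetical stand-ins that only borrow names from the stack trace; they are not the parquet-java implementations, and the guarded variant is one possible shape of a check, not necessarily the fix the project would adopt.

```java
public class PreviousReaderSketch {
  // Hypothetical stand-ins for the parquet-java types; NOT the real classes.
  static class ValuesReader {}

  interface RequiresPreviousReader {
    void setPreviousReader(ValuesReader reader);
  }

  // Stand-in for a dictionary-encoded page reader.
  static class DictionaryReader extends ValuesReader {}

  // Stand-in for a DELTA_BYTE_ARRAY page reader.
  static class DeltaReader extends ValuesReader implements RequiresPreviousReader {
    String previous = "";

    @Override
    public void setPreviousReader(ValuesReader reader) {
      if (reader != null) {
        // Unchecked cast: fails if the previous page used another encoding.
        this.previous = ((DeltaReader) reader).previous;
      }
    }
  }

  // Checks only the current reader, as described in the report.
  static boolean handOffUnguarded(ValuesReader previous, ValuesReader current) {
    if (current instanceof RequiresPreviousReader) {
      ((RequiresPreviousReader) current).setPreviousReader(previous);
      return true;
    }
    return false;
  }

  // Assumed guard: also require the previous reader to be of the expected kind.
  static boolean handOffGuarded(ValuesReader previous, ValuesReader current) {
    if (current instanceof RequiresPreviousReader
        && previous instanceof RequiresPreviousReader) {
      ((RequiresPreviousReader) current).setPreviousReader(previous);
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    ValuesReader dictionaryPage = new DictionaryReader();
    ValuesReader deltaPage = new DeltaReader();
    try {
      handOffUnguarded(dictionaryPage, deltaPage);
      System.out.println("unguarded: no exception");
    } catch (ClassCastException e) {
      System.out.println("unguarded: ClassCastException");
    }
    System.out.println("guarded handed off: " + handOffGuarded(dictionaryPage, deltaPage));
  }
}
```

In this sketch the unguarded hand-off throws when a dictionary-encoded page precedes a delta-encoded one, while the guarded version skips the hand-off for that pair.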
The exception is more likely when using the no-argument constructor org.apache.parquet.io.ColumnIOFactory#ColumnIOFactory(), which reads files without a createdBy value. In my case I was able to fix this by passing createdBy: I know that all of my Parquet files were written after PARQUET-246, so with createdBy available CorruptDeltaByteArrays.requiresSequentialReads no longer returns true.
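import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.io.{ColumnIOFactory, MessageColumnIO}

val reader: ParquetFileReader = ...
val fileMetadata = reader.getFooter.getFileMetaData
// Take the writer's createdBy string from the footer and pass it to the factory
val createdBy = fileMetadata.getCreatedBy
val columnIO: MessageColumnIO = new ColumnIOFactory(createdBy)...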
Component(s)
No response