ClassCastException possible in DeltaByteArrayReader after PARQUET-2431 #3013

@bwjoh

Description

Describe the bug, including details regarding any error messages, version, and platform.

Noticed when upgrading from 1.13.1 to 1.14.1:

java.lang.ClassCastException: class org.apache.parquet.column.values.dictionary.DictionaryValuesReader cannot be cast to class org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader (org.apache.parquet.column.values.dictionary.DictionaryValuesReader and org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader are in unnamed module of loader 'app')
	at org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader.setPreviousReader(DeltaByteArrayReader.java:92)
	at org.apache.parquet.column.impl.ColumnReaderBase.initDataReader(ColumnReaderBase.java:734)
	at org.apache.parquet.column.impl.ColumnReaderBase.readPageV2(ColumnReaderBase.java:766)
	at org.apache.parquet.column.impl.ColumnReaderBase.access$400(ColumnReaderBase.java:56)
	at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:695)
	at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:686)
	at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:232)
	at org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:686)
	at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:660)
	at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:802)
	at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
	at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:427)

This appears to be due to PARQUET-2431 - https://github.com/apache/parquet-java/pull/1274/files#diff-362b7d44b24283c1bb1f6ca3e124cb72706a33ed96d86b58bf3339f20aafb4e9R732

Looking into how my code hit this, it seems that CorruptDeltaByteArrays.requiresSequentialReads was previously, in effect, doing the dataColumn instanceof RequiresPreviousReader check: CorruptDeltaByteArrays.requiresSequentialReads can only return true when encoding == Encoding.DELTA_BYTE_ARRAY, and org.apache.parquet.column.values.RequiresPreviousReader is only implemented by the *DeltaByteArrayReader classes.

With no check that previousReader instanceof RequiresPreviousReader, the ClassCastException above is possible.
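To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the parquet-java source: the nested classes are hypothetical stand-ins that only borrow the names from the stack trace, and the unconditional cast inside setPreviousReader is an assumption inferred from that trace. The guarded variant shows one possible check, not necessarily the fix the maintainers would choose.

```java
// Hypothetical stand-ins for the parquet-java types named in this issue.
// Only the names are borrowed; the bodies are assumptions for illustration.
class PreviousReaderGuardSketch {

    static class ValuesReader {}

    interface RequiresPreviousReader {
        void setPreviousReader(ValuesReader reader);
    }

    // Stand-in for a dictionary-encoded page reader: does NOT implement
    // RequiresPreviousReader.
    static class DictionaryValuesReader extends ValuesReader {}

    // Stand-in for the delta reader: assumes the previous reader is of the
    // same type and casts it unconditionally (inferred from the stack trace).
    static class DeltaByteArrayReader extends ValuesReader implements RequiresPreviousReader {
        String previous = "";

        @Override
        public void setPreviousReader(ValuesReader reader) {
            if (reader != null) {
                this.previous = ((DeltaByteArrayReader) reader).previous;
            }
        }
    }

    // Checks only the current page's reader, so a dictionary-encoded
    // previous page reaches the cast and fails.
    static void unguarded(ValuesReader dataColumn, ValuesReader previousReader) {
        if (dataColumn instanceof RequiresPreviousReader) {
            ((RequiresPreviousReader) dataColumn).setPreviousReader(previousReader);
        }
    }

    // Also checks the previous reader before handing it over.
    static void guarded(ValuesReader dataColumn, ValuesReader previousReader) {
        if (dataColumn instanceof RequiresPreviousReader
                && previousReader instanceof RequiresPreviousReader) {
            ((RequiresPreviousReader) dataColumn).setPreviousReader(previousReader);
        }
    }

    // Helper: reports whether running the action raises ClassCastException.
    static boolean throwsClassCast(Runnable action) {
        try {
            action.run();
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

In this sketch, calling unguarded with a DeltaByteArrayReader as the current reader and a DictionaryValuesReader as the previous one raises ClassCastException, while guarded skips the hand-off.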

This is more likely to happen when using org.apache.parquet.io.ColumnIOFactory#ColumnIOFactory() to read files without createdBy. In my case I was able to fix this by passing createdBy: I know that all the Parquet files I have were written after PARQUET-246, which prevents CorruptDeltaByteArrays.requiresSequentialReads from returning true.

val reader: ParquetFileReader = ...
val fileMetadata = reader.getFooter.getFileMetaData
val createdBy = fileMetadata.getCreatedBy
val columnIO: MessageColumnIO = new ColumnIOFactory(createdBy)...

Component(s)

No response
