Support Parquet v2 Spark vectorized read #7162

@anthonysgro

Description

Feature Request / Improvement

As of v1.1.0, if you want to use both Spark and AWS Athena with the same Iceberg tables, you must disable Iceberg's vectorized reader in Spark. Athena writes some columns with Parquet delta encodings, which the vectorized reader does not support.
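For anyone who needs the workaround in the meantime, here is a minimal sketch of how the vectorized reader can be switched off per table. It assumes the Iceberg table property `read.parquet.vectorization.enabled` (please verify the name against the configuration docs for your Iceberg version); the helper function and the table name are hypothetical and only for illustration.

```python
# Sketch: build the Spark SQL statement that turns off Iceberg's vectorized
# Parquet reader for a single table. Assumption: the table property below is
# the one documented by Iceberg; the table name used here is made up.

ICEBERG_VECTORIZATION_PROPERTY = "read.parquet.vectorization.enabled"


def disable_vectorization_sql(table_name: str) -> str:
    """Return an ALTER TABLE statement that disables vectorized Parquet reads."""
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('{ICEBERG_VECTORIZATION_PROPERTY}'='false')"
    )


if __name__ == "__main__":
    # With a live SparkSession, this string would be passed to spark.sql(...).
    print(disable_vectorization_sql("my_catalog.db.events"))
```

The cost of this workaround is that every read of the table falls back to the non-vectorized path, which is why native support for the encoding would be preferable.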

If you have ever hit the following error for a primitive type (complex types can be solved by #521), you have probably been impacted by this issue:

java.lang.UnsupportedOperationException: Cannot support vectorized reads for column [email] optional binary email (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file
	at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:96)
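For context on the encoding named in the error, here is a simplified sketch of the idea behind DELTA_BYTE_ARRAY: each value is stored as the length of the prefix it shares with the previous value plus the remaining suffix. This is an illustration only, not Iceberg's or Athena's code. In the actual Parquet format the prefix lengths and suffixes are additionally packed with DELTA_BINARY_PACKED and DELTA_LENGTH_BYTE_ARRAY, which this sketch leaves out. The function names are hypothetical.

```python
from typing import List, Tuple


def delta_byte_array_encode(values: List[bytes]) -> Tuple[List[int], List[bytes]]:
    """Split each value into (shared prefix length, suffix) relative to the previous value."""
    prefix_lengths: List[int] = []
    suffixes: List[bytes] = []
    previous = b""
    for value in values:
        # Count how many leading bytes match the previous value.
        n = 0
        limit = min(len(previous), len(value))
        while n < limit and previous[n] == value[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(value[n:])
        previous = value
    return prefix_lengths, suffixes


def delta_byte_array_decode(prefix_lengths: List[int], suffixes: List[bytes]) -> List[bytes]:
    """Rebuild the values; each one depends on the value decoded just before it."""
    values: List[bytes] = []
    previous = b""
    for n, suffix in zip(prefix_lengths, suffixes):
        value = previous[:n] + suffix
        values.append(value)
        previous = value
    return values


if __name__ == "__main__":
    emails = [b"alice@example.com", b"alice@example.org", b"bob@example.com"]
    prefixes, rest = delta_byte_array_encode(emails)
    print(prefixes, rest)
    print(delta_byte_array_decode(prefixes, rest) == emails)
```

Because every value is reconstructed from the one before it, a reader needs a dedicated decoding path for this encoding, and cannot reuse the plain or dictionary paths.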

Spark added support for these encodings to its own vectorized reader in 2022: apache/spark#35262. However, Iceberg uses a separate vectorized reader, so it does not benefit from that change.

Would it be possible to implement support for these encodings in Iceberg's vectorized reader? It would solve a significant interoperability problem between Athena, Spark, and possibly other query engines that use these encodings.

Query engine

Athena + Spark
