Closed as not planned
Feature Request / Improvement
As it stands today, if you want to use both Spark and AWS Athena with your Iceberg tables in v1.1.0, you must disable the vectorized reader. The reason is that Athena writes fields using delta encodings (for example DELTA_BYTE_ARRAY), which the vectorized reader does not support.
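For reference, the workaround mentioned above can be sketched as follows. This is a minimal illustration, not an official recipe: the helper name and the table name `my_catalog.db.events` are hypothetical, and the sketch assumes the Iceberg table property `read.parquet.vectorization.enabled` is what controls vectorized Parquet reads for the table.

```python
def disable_vectorization_sql(table: str) -> str:
    """Build an ALTER TABLE statement that turns off vectorized Parquet reads.

    Assumes the Iceberg table property 'read.parquet.vectorization.enabled'.
    """
    return (
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false')"
    )


# Hypothetical table name, for illustration only.
statement = disable_vectorization_sql("my_catalog.db.events")
print(statement)
```

With an active Spark session the statement could be run as `spark.sql(statement)`. A session-wide alternative should be setting the Spark conf `spark.sql.iceberg.vectorization.enabled` to `false`. Either way, reads fall back to the slower non-vectorized path, which is the cost this request is trying to avoid.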
If you have ever hit the following error for a primitive type (complex types can be solved by #521), you have probably been impacted by this issue:
java.lang.UnsupportedOperationException: Cannot support vectorized reads for column [email] optional binary email (STRING) = 1 with encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this table/file
at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:96)
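For readers unfamiliar with the encoding named in the error, here is a simplified sketch of the idea behind DELTA_BYTE_ARRAY: each value is stored as the number of leading bytes it shares with the previous value, plus the remaining suffix. This is only the logical scheme, written with hypothetical helper names. The real Parquet format additionally encodes the prefix lengths with DELTA_BINARY_PACKED and the suffixes with DELTA_LENGTH_BYTE_ARRAY, and none of this is Iceberg's or Athena's actual code.

```python
def encode_prefix_suffix(values):
    """Split each value into (shared prefix length, suffix) relative to the previous value."""
    prefix_lengths, suffixes = [], []
    previous = b""
    for value in values:
        n = 0
        limit = min(len(previous), len(value))
        # Count the leading bytes shared with the previous value.
        while n < limit and previous[n] == value[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(value[n:])
        previous = value
    return prefix_lengths, suffixes


def decode_prefix_suffix(prefix_lengths, suffixes):
    """Rebuild the values by reusing the prefix of the previously decoded value."""
    values = []
    previous = b""
    for prefix_len, suffix in zip(prefix_lengths, suffixes):
        current = previous[:prefix_len] + suffix
        values.append(current)
        previous = current
    return values


emails = [b"alice@example.com", b"alicia@example.com", b"bob@example.com"]
lengths, tails = encode_prefix_suffix(emails)
assert decode_prefix_suffix(lengths, tails) == emails
```

Because each value depends on the one decoded before it, a reader has to track state across values within a page, so a vectorized reader needs a dedicated decoder for this encoding.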
Spark implemented this support in 2022 (apache/spark#35262). However, Iceberg uses its own vectorized reader, so it does not benefit from that change.
Is it possible to implement support for these encodings in Iceberg's vectorized reader? It would solve a significant interoperability problem between Athena, Spark, and possibly other query engines that use these encodings.
Query engine
Athena + Spark