Description
Is your feature request related to a problem or challenge?
Processing semi-structured data (basically think anything that can be represented in JSON) efficiently is becoming more and more important.
As @wjones127 says in https://github.com/apache/datafusion/issues/10987:
This would be a high-performance data type for semi-structured data, designed for better OLAP performance than JSON or BSON (discussed in #7845).
While it is certainly possible to implement semi-structured, JSON, and even Variant support today using the DataFusion extension APIs (e.g. https://github.com/datafusion-contrib/datafusion-functions-json), this ticket tracks adding such support to DataFusion itself.
Parquet recently adopted the Variant type: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
We see adoption of this in other systems as well such as Iceberg and Spark.
I think Databricks did a good job describing its rationale:
Without Variant, customers had to choose between flexibility and performance. To maintain flexibility, customers would store JSON in single columns as strings. To see better performance, customers would apply strict schematizing approaches with structs, which requires separate processes to maintain and update with schema changes. With Variant, customers can retain flexibility (there's no need to define an explicit schema) and receive vastly improved performance compared to querying the JSON as a string.
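To make the tradeoff in that quote concrete, here is a toy sketch in Python. It is not the Parquet Variant encoding and not anything DataFusion provides: the rows, the `get_path` helper, and the parse-once list are all hypothetical stand-ins. It only contrasts re-parsing JSON text on every query with paying the parsing cost once at write time, while keeping the schema flexible in both cases.

```python
import json

# Hypothetical rows: each record stored as a JSON string in a single column.
rows = [
    '{"user": {"id": 1, "country": "US"}, "clicks": 3}',
    '{"user": {"id": 2, "country": "DE"}, "clicks": 5, "referrer": "ad"}',
    '{"user": {"id": 3, "country": "US"}, "clicks": 7}',
]


def get_path(value, path):
    """Walk nested dicts along `path`; return None if any key is missing."""
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def total_clicks_from_strings(json_rows, country):
    # JSON-as-string: every query has to re-parse the text of every row.
    total = 0
    for text in json_rows:
        record = json.loads(text)
        if get_path(record, ["user", "country"]) == country:
            total += get_path(record, ["clicks"]) or 0
    return total


# Parse-once stand-in for a binary encoding: parsing happens at write time,
# and no schema has to be declared up front (row 2 has an extra field).
parsed_rows = [json.loads(text) for text in rows]


def total_clicks_from_parsed(records, country):
    # Queries navigate the already-decoded values directly.
    total = 0
    for record in records:
        if get_path(record, ["user", "country"]) == country:
            total += get_path(record, ["clicks"]) or 0
    return total


us_from_strings = total_clicks_from_strings(rows, "US")
us_from_parsed = total_clicks_from_parsed(parsed_rows, "US")
```

Both functions return the same answer; the difference is where the parsing cost lands. A real Variant implementation goes further by using a compact binary layout, which this sketch does not attempt to model.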
Describe the solution you'd like
No response
Describe alternatives you've considered
This will be a big project. Here are some of the related prerequisites:
- [EPIC] [Parquet] Implement Variant type support in Parquet arrow-rs#6736
- [EPIC][Parquet] Finalize Variant Type support in Parquet arrow-rs#8480
- Support Push down expression evaluation in TableProviders #14993
- Support Extension Types / User Defined Types in DataFusion #12644
It is not clear to me whether Variant should be "built in" or whether it should be an add-on (for example, add a variant feature and a datafusion-variant crate).
Additional context
Related tickets