Add VariantArray and VariantArrayBuilder for constructing Arrow Arrays of Variants
#7905
Conversation
@scovich @friendlymatthew and @harshmotw-db -- do you have time and interest to review this PR?
```rust
if !inner.fields().iter().any(|f| f.name() == "value") {
    return Err(ArrowError::InvalidArgumentError(
        "Invalid VariantArray: StructArray must contain a 'value' field".to_string(),
    ));
}
```
More of a note to myself for the shredding work, but I don't think we should necessarily error when the value column is not found?

^ Disregard. The value and typed_value columns should exist; it is rather the elements that may be null.
I agree there are cases when value is not required (when the variant is fully shredded)
The only requirement for a shredded variant is that at least one of the two must exist.
BTW, does the shredded variant spec forbid the existence of other (unknown) columns?
(I thought it did, but I can't find any obvious wording)
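The "at least one of the two must exist" rule could be sketched roughly like this. This is a toy stand-in, not the PR's code: field names are plain strings rather than `StructArray` fields, and errors are `String`s rather than `ArrowError`.

```rust
// Sketch: a shredded variant struct must contain at least one of
// `value` and `typed_value` (it may contain both).
fn validate_shredded_fields(field_names: &[&str]) -> Result<(), String> {
    let has_value = field_names.contains(&"value");
    let has_typed_value = field_names.contains(&"typed_value");
    if !has_value && !has_typed_value {
        return Err(
            "Invalid VariantArray: StructArray must contain at least one of \
             'value' or 'typed_value' fields"
                .to_string(),
        );
    }
    Ok(())
}
```

For example, `["metadata", "typed_value"]` would pass this check while `["metadata"]` alone would not.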
I don't know either
friendlymatthew
left a comment
The VariantArray and VariantArrayBuilder look great to me. Thank you!
Personally, it would be great to get this merged ASAP. I plan on further iterating on it in #7895
Samyak2
left a comment
This API looks great! Hoping to see it merged soon, even if it's not fully done - it will unblock things like #7893
```rust
/// NOTE: It is also permissible for the metadata field to be
/// Dictionary-Encoded, preferably (but not required) with an index type of
/// int8.
inner: StructArray,
```
Would it be worth it to store the index of the metadata and value arrays? This would speed up the `metadata_field` and `value_field` functions
Yes, that would be a good optimization -- maybe we can do it as a follow on PR (I believe @zeroshade did it in the Go implementation, in fact)
Yup, that's precisely how the Go version works. We store the indices of the metadata/value/typed_value arrays in the extension type itself
I filed a ticket to track
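The optimization discussed above might look something like this. It is a hypothetical sketch, not the arrow-rs code: field lookup is modeled with a slice of names, and the struct/field names (`VariantArrayIndices`, `resolve_indices`) are invented for illustration.

```rust
// Resolve the child positions once at construction time, so accessors
// can index directly instead of scanning field names on every call.
struct VariantArrayIndices {
    metadata_idx: usize,
    value_idx: usize,
}

fn resolve_indices(field_names: &[&str]) -> Result<VariantArrayIndices, String> {
    let find = |name: &str| {
        field_names
            .iter()
            .position(|f| *f == name)
            .ok_or_else(|| format!("missing '{name}' field"))
    };
    Ok(VariantArrayIndices {
        metadata_idx: find("metadata")?,
        value_idx: find("value")?,
    })
}
```

An accessor like `metadata_field` could then be a constant-time `self.inner.column(self.indices.metadata_idx)` rather than a name scan.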
```rust
    return Err(ArrowError::InvalidArgumentError(
        "Invalid VariantArray: StructArray must contain a 'value' field".to_string(),
    ));
}
```
I'm not very familiar with arrow-rs, so this question may not make sense. But what's the rationale for not validating that the value and metadata fields are of binary type (one of binary, large_binary, or binary_view)?
My thinking is that the proposed arrow spec allows a VariantArray to use BinaryArray, LargeBinaryArray, or BinaryViewArray.
Though since this PR makes assumptions about BinaryView elsewhere, it would make sense to return a not-implemented error for those other cases.
Update: I will update this PR to validate BinaryView
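The validation described above might be shaped like this. This is a hedged sketch with `DataType` reduced to a minimal stand-in enum: Binary and LargeBinary are legal per the proposed spec but unsupported by this PR, so they get a "not yet implemented" error rather than an "invalid" one.

```rust
// Minimal stand-in for arrow's DataType, just enough for the sketch.
#[derive(Debug, PartialEq)]
enum DataType {
    Binary,
    LargeBinary,
    BinaryView,
    Int64,
}

// Only BinaryView is accepted for now; other binary types are spec-legal
// but unimplemented, and everything else is invalid outright.
fn validate_value_type(dt: &DataType) -> Result<(), String> {
    match dt {
        DataType::BinaryView => Ok(()),
        DataType::Binary | DataType::LargeBinary => Err(
            "Not yet implemented: VariantArray 'value' field of type Binary/LargeBinary".into(),
        ),
        other => Err(format!("Invalid VariantArray 'value' field type: {other:?}")),
    }
}
```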
```rust
}

/// Return a reference to the metadata field of the [`StructArray`]
pub fn metadata_field(&self) -> &ArrayRef {
```
In `value(..)`, we're doing `self.metadata_field().as_binary_view()`. What's the reason for not returning a `&BinaryViewArray` here directly?
The same for `value_field`.
I was thinking ahead to when this could be binary or large binary as well
```diff
 /// STRUCT<metadata: BINARY, value: BINARY> where nulls are preserved. The JSON strings in the input
 /// must be valid.
-pub fn batch_json_string_to_variant(input: &ArrayRef) -> Result<StructArray, ArrowError> {
+pub fn batch_json_string_to_variant(input: &ArrayRef) -> Result<VariantArray, ArrowError> {
```
Change 1: a new VariantArray that wraps a StructArray and does basic validation
```diff
-    value_offsets.push(value_current_offset);
-
-    let mut validity = BooleanBufferBuilder::new(input.len());
+    let mut variant_array_builder = VariantArrayBuilder::new(input_string_array.len());
```
This PR refactors the details of creating the output arrays into VariantArrayBuilder so it can be reused
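The builder's role can be illustrated with a simplified stand-in. This is not the PR's implementation: byte handling is modeled with `Vec<u8>` rather than arrow buffer builders, and only the method names (`new`, `append_null`, `append_variant_buffers`) mirror the PR.

```rust
// Toy model of VariantArrayBuilder: accumulates per-row (metadata, value)
// byte buffers plus a validity flag for each row.
struct VariantArrayBuilder {
    metadata: Vec<Vec<u8>>,
    value: Vec<Vec<u8>>,
    validity: Vec<bool>,
}

impl VariantArrayBuilder {
    fn new(capacity: usize) -> Self {
        Self {
            metadata: Vec::with_capacity(capacity),
            value: Vec::with_capacity(capacity),
            validity: Vec::with_capacity(capacity),
        }
    }

    // A null row still occupies a slot so row indices stay aligned.
    fn append_null(&mut self) {
        self.metadata.push(Vec::new());
        self.value.push(Vec::new());
        self.validity.push(false);
    }

    // Append one variant's metadata and value buffers as a valid row.
    fn append_variant_buffers(&mut self, metadata: &[u8], value: &[u8]) {
        self.metadata.push(metadata.to_vec());
        self.value.push(value.to_vec());
        self.validity.push(true);
    }

    fn len(&self) -> usize {
        self.validity.len()
    }
}
```

The point of the refactor is that any code producing (metadata, value) pairs, not just the JSON conversion, can drive this one builder.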
```diff
-    value_buffer.extend(value);
-    value_offsets.push(value_current_offset);
-    value_validity.append(true);
+    variant_array_builder.append_variant_buffers(&metadata, &value);
```
I have ideas on how to make this more efficient by writing directly into the target buffers. I will work on this in a follow on PR
This is what I came up with
```diff
     ]);
     let array_ref: ArrayRef = Arc::new(input);
-    let output = batch_json_string_to_variant(&array_ref).unwrap();
+    let variant_array = batch_json_string_to_variant(&array_ref).unwrap();
```
This test uses the new VariantArray APIs to access the relevant fields / value / etc
```diff
-    assert!(struct_array.is_null(4));
+    // Compare row 0
+    assert!(!variant_array.is_null(0));
+    assert_eq!(variant_array.value(0), Variant::Int8(1));
```
I think `VariantArray::value()` returning a `Variant` is particularly nice to use!
That's also how the Go VariantArray works: `Value()` on the VariantArray returns a Variant (performing the consolidation if the underlying VariantArray is shredded).
I haven't fully worked out in my head how this will work for shredded Variants -- specifically how to avoid copying stuff around when accessing a shredded variant. We'll have to figure that out in follow on PRs
BTW I hacked up a follow on PR that writes into VariantArrayBuilder without any copying. I am quite pleased that it works
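The `value()` ergonomics being discussed can be sketched with toy types. This is purely illustrative: the real accessor decodes the binary variant encoding from the metadata/value columns, whereas here rows are modeled as `Option<i8>`.

```rust
// Stand-in Variant enum: the real one covers the full variant type system.
#[derive(Debug, PartialEq)]
enum Variant {
    Null,
    Int8(i8),
}

// Stand-in array: one Option<i8> per row instead of binary columns.
struct VariantArray {
    rows: Vec<Option<i8>>,
}

impl VariantArray {
    fn is_null(&self, i: usize) -> bool {
        self.rows[i].is_none()
    }

    // Row accessor that decodes straight to a Variant, mirroring the
    // `value(i) -> Variant` shape praised in the thread.
    fn value(&self, i: usize) -> Variant {
        match self.rows[i] {
            Some(v) => Variant::Int8(v),
            None => Variant::Null,
        }
    }
}
```

The appeal is that callers work with `Variant` values directly and never see the underlying buffer layout, which is also what makes a future shredded representation swappable behind the same accessor.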
scovich
left a comment
very nice!
```rust
/// If typed_value is a nested type, its elements must be required and must
/// be a struct containing only one of the following:
///
/// 1. A single required field, of type binary, large_binary, or binary_view named value
///
/// 2. An optional field named value of type binary, large_binary, or
/// binary_view AND an optional field named typed_value which follows these
/// same rules
```
I don't understand this part? typed_value is the arbitrary type of a shredded column. That could be a primitive type (e.g. INT), or it could be a struct (with fields that are themselves either strongly typed or variant), or it could even be variant (technically, a writer could choose to shred out a commonly-used field, even if that field doesn't have a strong type that could shred in structured form).
Any nested variants must have only value and typed_value columns though -- the top-level metadata applies to everything. Readers who encounter nested variants inside the typed_value column need to be prepared to use the top-level metadata when reading them, which may require some plumbing, given that it's not actually materialized locally?
I agree -- I think I did not read the proposal correctly. I will correct the wording
Update: It seems I just copy/pasted the text from https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?tab=t.0#heading=h.ru2zdn2szuc4 and, looking at the documentation, there is a bit of confusion about what the text means.
So for now I removed the section about the contents of shredded columns, since the code doesn't support this yet. We can add the comments back in as we start to implement shredding for real.
Still very early. Opening this PR to get some early feedback on the approach.
The approach is roughly like this:
- Allocate a new offset array (all zeroes) and a new nulls array (copy of variant's null buffer).
- For every variant path access, we iterate through all the values and increment the offsets to the desired object field's offset or array index's offset.
- If the value isn't an object/array, we set a null for this row.
- We then extract all the values at the new offsets into an array
- I have currently only done it for u64. For this PR, I can make it generic for all primitive types.
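The offset-rewriting step above can be illustrated with a toy encoding. Everything here is invented for the sketch (a flat `u64` buffer where an array is stored as a length followed by its elements); the real implementation would walk the binary variant encoding, but the per-row logic is the same: advance the offset to the child, or null out the row.

```rust
// One vectorized path step `[index]`: for every still-valid row, move its
// offset to the index-th element of the array it points at; rows whose
// array is too short become null. Toy encoding: values[off] is the array
// length, followed inline by that many elements.
fn get_array_element(
    values: &[u64],
    offsets: &mut [usize],
    validity: &mut [bool],
    index: usize,
) {
    for (off, valid) in offsets.iter_mut().zip(validity.iter_mut()) {
        if !*valid {
            continue; // row already null from an earlier path step
        }
        let len = values[*off] as usize;
        if index < len {
            // skip the length prefix, then jump to the requested element
            *off = *off + 1 + index;
        } else {
            *valid = false; // path does not exist for this row
        }
    }
}
```

After the final path step, the values at the surviving offsets would be gathered into the output array, which matches the "extract all the values at the new offsets" step in the description.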
Some open questions:
- This seems like a good vectorized approach to me, but it comes at the cost of allocating new buffers for every variant_get invocation.
- Would it be worth it to try a row-wise approach instead? It would be something like: do the whole path access for each row and append into the appropriate ArrayBuilder.
- This offset-based approach works quite well for extracting complex types too (mainly arrays). I have not implemented it here yet but I have done it elsewhere before.
- Databricks has two variations of this function: `variant_get` and `try_variant_get`.
- The only difference in `try_variant_get` is that cast errors are ignored.
- I'm guessing this is covered by `CastOptions`? I haven't looked at it yet.
- Perhaps extracting complex types can be a separate PR? I can do it here, but the PR might become too large to review.
I'm new to this codebase, please let me know if I missed anything! :)
I will be rebasing this PR once apache#7905 is merged. I'll be using VariantArray instead of StructArray once that is merged.
|
Given this PR seems to block several others, I plan to merge it once CI is complete |
|
Thanks again for the quick reviews everyone! |
|
I am happy to make PRs / address comments from anyone who might not have had a chance to comment -- just let me know |
# Which issue does this PR close?
- Part of #7911
- Part of #6736
- Follow on to #7905

# Rationale for this change
I wrote benchmarks for some changes to the json decoder in #7911, but they are non-trivial. To keep #7911 easier to review, I have pulled the benchmarks out into their own PR.

# What changes are included in this PR?
1. Add a new json benchmark
2. Include the `variant_get` benchmark added in #7919 by @Samyak2

# Are these changes tested?
I tested them manually, and clippy CI coverage ensures they compile.

# Are there any user-facing changes?
No, these are only benchmarks.
# Which issue does this PR close?
- Part of #6736
- Closes #7964
- Follow on to #7905

# Rationale for this change
In a quest to have the fastest and most efficient Variant implementation, I would like to avoid copies if at all possible.

Right now, making a VariantArray first requires completing an individual buffer and then appending it to the array. Let's make that faster by having the VariantBuilder append directly into the array's buffer.

# What changes are included in this PR?
1. Add `VariantBuilder::new_from_existing`
2. Add a `VariantArrayBuilder::variant_builder` that reuses the buffers

# Are these changes tested?
1. New unit tests
2. Yes, by existing tests

# Are there any user-facing changes?
Hopefully faster performance.

---------

Co-authored-by: Congxian Qiu <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
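The copy-avoidance idea can be sketched as a child builder that borrows the parent's buffer. This is a hedged stand-in, not the PR's API surface: the child here just appends raw bytes (the real `VariantBuilder` writes the variant encoding), and the helper `finish_row` is invented for the sketch.

```rust
// Parent builder owns one shared value buffer plus row offsets.
struct VariantArrayBuilder {
    value_buffer: Vec<u8>,
    offsets: Vec<usize>,
}

// Child builder borrows the parent's buffer and writes into it directly,
// so finishing a variant requires no copy of a temporary buffer.
struct VariantBuilder<'a> {
    buffer: &'a mut Vec<u8>,
}

impl VariantArrayBuilder {
    fn variant_builder(&mut self) -> VariantBuilder<'_> {
        VariantBuilder {
            buffer: &mut self.value_buffer,
        }
    }

    // After the child builder is dropped, record where the row ended.
    fn finish_row(&mut self) {
        self.offsets.push(self.value_buffer.len());
    }
}

impl<'a> VariantBuilder<'a> {
    fn append_bytes(&mut self, bytes: &[u8]) {
        self.buffer.extend_from_slice(bytes);
    }
}
```

The borrow checker enforces the protocol for free: the parent cannot be touched while a child builder is alive, so rows cannot interleave.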
Which issue does this PR close?
Rationale for this change
As we begin to add operations on Variants stored in arrays, we need some better abstractions for working with those arrays.
This PR builds on the great work of @harshmotw-db in #7884 to start adding t
What changes are included in this PR?
- `VariantArray` that wraps a `StructArray` and adds useful accessors
- `VariantArrayBuilder`, as described in [Variant] API to construct Shredded Variant Arrays #7895, to construct `VariantArray`s
- Updated `batch_json_string_to_variant` to use the new builder and array wrapper

Note: while these APIs have no shredding support yet, I think shredding can be added in a straightforward way
Are these changes tested?
Yes, unit tests and doc examples are included
Are there any user-facing changes?
New VariantArray and VariantArrayBuilder