
Conversation

@alamb
Contributor

@alamb alamb commented Jul 11, 2025

Which issue does this PR close?

Rationale for this change

As we begin to add operations on Variants stored in arrays, we need better abstractions for working with those arrays.

This PR builds on the great work of @harshmotw-db in #7884 to start adding those abstractions.

What changes are included in this PR?

  1. Add VariantArray that wraps a StructArray and adds useful accessors
  2. Add VariantArrayBuilder, as described in #7895 ("[Variant] API to construct Shredded Variant Arrays"), to construct VariantArrays (see the usage sketch below)
  3. Rework batch_json_string_to_variant to use the new builder and array wrapper
    Note: while these APIs have no shredding support yet, I think shredding can be added in a straightforward way
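
A minimal usage sketch of how the two new types fit together (illustrative only: `append_null` and `build` are assumed method names, and crate paths may differ from the merged code):

```rust
use parquet_variant::VariantBuilder;
// (VariantArrayBuilder / VariantArray imports omitted; their crate path may differ)

// Sketch, not the merged API: encode one value, then append rows.
fn example() {
    // Encode a single variant into (metadata, value) buffers
    let mut vb = VariantBuilder::new();
    vb.append_value(1i8);
    let (metadata, value) = vb.finish();

    let mut builder = VariantArrayBuilder::new(2);
    builder.append_variant_buffers(&metadata, &value); // method from this PR's diff
    builder.append_null(); // assumed helper for null rows
    let variant_array = builder.build(); // assumed finalizer

    assert!(!variant_array.is_null(0));
    assert!(variant_array.is_null(1));
}
```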

Are these changes tested?

Yes, unit tests and doc examples are included

Are there any user-facing changes?

New VariantArray and VariantArrayBuilder

@alamb
Contributor Author

alamb commented Jul 11, 2025

@scovich @friendlymatthew and @harshmotw-db -- do you have time and interest to review this PR?

Comment on lines 89 to 93
if !inner.fields().iter().any(|f| f.name() == "value") {
    return Err(ArrowError::InvalidArgumentError(
        "Invalid VariantArray: StructArray must contain a 'value' field".to_string(),
    ));
}
Contributor

@friendlymatthew friendlymatthew Jul 11, 2025

More of a note to myself for the shredding work, but I don't think we should necessarily error when the value column is not found?

^ Disregard. The value and typed_value columns should exist; it is rather the elements that may be null.

Contributor Author

I agree there are cases when value is not required (when the variant is fully shredded)

Contributor

The only requirement for a shredded variant is that at least one of the two must exist.

BTW, does the shredded variant spec forbid the existence of other (unknown) columns?
(I thought it did, but I can't find any obvious wording)

Contributor Author

I don't know either

Contributor

@friendlymatthew friendlymatthew left a comment

The VariantArray and VariantArrayBuilder look great to me. Thank you!

Personally, it would be great to get this merged ASAP. I plan on further iterating on it in #7895

Contributor

@Samyak2 Samyak2 left a comment

This API looks great! Hoping to see it merged soon, even if it's not fully done - it will unblock things like #7893

/// NOTE: It is also permissible for the metadata field to be
/// Dictionary-Encoded, preferably (but not required) with an index type of
/// int8.
inner: StructArray,
Contributor

Would it be worth it to store the indices of the metadata and value arrays? This would speed up the metadata_field and value_field functions.

Contributor Author

Yes, that would be a good optimization -- maybe we can do it as a follow-on PR (I believe @zeroshade did it in the Go implementation, in fact).

Member

Yup, that's precisely how the Go version works. We store the indices of the metadata/value/typed_value arrays in the extension type itself
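
For reference, a minimal sketch of that index caching (struct layout and field names here are assumptions, not the merged code):

```rust
use arrow::array::{ArrayRef, StructArray};

// Resolve the column positions once at construction time...
pub struct VariantArray {
    inner: StructArray,
    metadata_idx: usize, // cached position of the "metadata" column
    value_idx: usize,    // cached position of the "value" column
}

impl VariantArray {
    // ...so the accessors become O(1) column lookups instead of
    // scanning the field names on every call.
    pub fn metadata_field(&self) -> &ArrayRef {
        self.inner.column(self.metadata_idx)
    }

    pub fn value_field(&self) -> &ArrayRef {
        self.inner.column(self.value_idx)
    }
}
```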


    return Err(ArrowError::InvalidArgumentError(
        "Invalid VariantArray: StructArray must contain a 'value' field".to_string(),
    ));
}
Contributor

I'm not very familiar with arrow-rs, so this question may not make sense. But what's the rationale for not validating that the value and metadata fields are of binary type (one of binary, large_binary, or binary_view)?

Contributor Author

My thinking is that the proposed Arrow spec allows a VariantArray to use BinaryArray, LargeBinaryArray, or BinaryViewArray.

Though since this PR makes assumptions about BinaryView elsewhere, it would make sense to return a not-implemented error for those other cases.

Contributor Author

Update: I will update this PR to validate BinaryView
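
A sketch of what that validation might look like (helper name and error wording are mine, not necessarily the PR's):

```rust
use arrow_schema::{ArrowError, DataType};

// Accept only BinaryView for now; the spec also allows Binary and
// LargeBinary, so those get a not-implemented error rather than an
// invalid-argument error.
fn validate_binary_view(name: &str, dt: &DataType) -> Result<(), ArrowError> {
    match dt {
        DataType::BinaryView => Ok(()),
        DataType::Binary | DataType::LargeBinary => Err(ArrowError::NotYetImplemented(
            format!("VariantArray '{name}' field must currently be BinaryView"),
        )),
        other => Err(ArrowError::InvalidArgumentError(format!(
            "Invalid VariantArray: '{name}' field must be binary, got {other}"
        ))),
    }
}
```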

}

/// Return a reference to the metadata field of the [`StructArray`]
pub fn metadata_field(&self) -> &ArrayRef {
Contributor

In value(..), we're doing self.metadata_field().as_binary_view(). What's the reason for not returning a &BinaryViewArray here directly?
The same for value_field.

Contributor Author

I was thinking ahead to when this could be binary or large binary as well
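
In other words, the generic &ArrayRef return leaves the downcast to the caller. A small sketch using arrow's AsArray helpers (the function name here is mine):

```rust
use arrow::array::{AsArray, BinaryViewArray};

// Mirrors the value() usage quoted above; as_binary_view panics if the
// field is not actually a BinaryViewArray.
fn metadata_as_view(array: &VariantArray) -> &BinaryViewArray {
    array.metadata_field().as_binary_view()
}
```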

 /// STRUCT<metadata: BINARY, value: BINARY> where nulls are preserved. The JSON strings in the input
 /// must be valid.
-pub fn batch_json_string_to_variant(input: &ArrayRef) -> Result<StructArray, ArrowError> {
+pub fn batch_json_string_to_variant(input: &ArrayRef) -> Result<VariantArray, ArrowError> {
Contributor Author

Change 1: a new VariantArray that wraps a StructArray and does basic validation

-value_offsets.push(value_current_offset);

-let mut validity = BooleanBufferBuilder::new(input.len());
+let mut variant_array_builder = VariantArrayBuilder::new(input_string_array.len());
Contributor Author

This PR refactors the details of creating the output arrays into VariantArrayBuilder so it can be reused
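
Roughly, the builder now owns the state the old code threaded through by hand. A sketch of the assumed internals (not the exact merged fields):

```rust
use arrow::array::BooleanBufferBuilder;

// Orientation only: the per-row buffers the JSON kernel used to manage
// as locals now live inside the builder.
struct VariantArrayBuilder {
    nulls: BooleanBufferBuilder,  // row validity
    metadata_buffer: Vec<u8>,     // concatenated metadata bytes
    metadata_offsets: Vec<usize>, // per-row offsets into metadata_buffer
    value_buffer: Vec<u8>,        // concatenated value bytes
    value_offsets: Vec<usize>,    // per-row offsets into value_buffer
}
```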

-value_buffer.extend(value);
-value_offsets.push(value_current_offset);
-value_validity.append(true);
+variant_array_builder.append_variant_buffers(&metadata, &value);
Contributor Author

I have ideas on how to make this more efficient by writing directly into the target buffers. I will work on this in a follow-on PR.


 ]);
 let array_ref: ArrayRef = Arc::new(input);
-let output = batch_json_string_to_variant(&array_ref).unwrap();
+let variant_array = batch_json_string_to_variant(&array_ref).unwrap();
Contributor Author

This test uses the new VariantArray APIs to access the relevant fields / value / etc

-assert!(struct_array.is_null(4));
+// Compare row 0
+assert!(!variant_array.is_null(0));
+assert_eq!(variant_array.value(0), Variant::Int8(1));
Contributor Author

I think VariantArray::value(), which returns a Variant, is particularly nice to use :bowtie:

Member

That's how the Go VariantArray works as well: Value() on the VariantArray returns a Variant (performing the consolidation if the underlying VariantArray is shredded).

Contributor Author

I haven't fully worked out in my head how this will work for shredded Variants -- specifically how to avoid copying stuff around when accessing a shredded variant. We'll have to figure that out in follow on PRs

@alamb
Contributor Author

alamb commented Jul 11, 2025

BTW I hacked up a follow-on PR that writes into VariantArrayBuilder without any copying. I am quite pleased that it works.
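
A hedged sketch of that zero-copy flow (names taken from the follow-on commit at the end of this thread; exact signatures are assumptions):

```rust
// Borrow a VariantBuilder that writes directly into the array builder's
// buffers, so finishing a row requires no copy from a temporary buffer.
fn append_zero_copy(array_builder: &mut VariantArrayBuilder) {
    let mut vb = array_builder.variant_builder(); // assumed borrowing API
    vb.append_value(42i64);
    vb.finish(); // commits the row in place
}
```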

Contributor

@scovich scovich left a comment

very nice!

Comment on lines 51 to 58
/// If typed_value is a nested type, its elements must be required and must
/// be a struct containing only one of the following:
///
/// 1. A single required field, of type binary, large_binary, or binary_view named value
///
/// 2. An optional field named value of type binary, large_binary, or
/// binary_view AND an optional field named typed_value which follows these
/// same rules
Contributor

I don't understand this part? typed_value is the arbitrary type of a shredded column. That could be a primitive type (e.g. INT), or it could be a struct (with fields that are themselves either strongly typed or variant), or it could even be variant (technically, a writer could choose to shred out a commonly-used field, even if that field doesn't have a strong type that could shred in structured form).

Any nested variants must have only value and typed_value columns though -- the top-level metadata applies to everything. Readers who encounter nested variants inside the typed_value column need to be prepared to use the top-level metadata when reading them, which may require some plumbing, given that it's not actually materialized locally?
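
To illustrate the layout being described, here is my own example (not from this PR) of a variant shredded on a single field `a`, with one top-level metadata shared by all nested value columns:

```rust
use arrow_schema::{DataType, Field, Fields};

// Illustrative only; nullability and types here are assumptions.
fn example_shredded_variant_type() -> DataType {
    // Shredded field `a`: typed_value holds the strong type; value holds
    // residual variant bytes for rows that don't match that type.
    let shredded_a = DataType::Struct(Fields::from(vec![
        Field::new("value", DataType::BinaryView, true),
        Field::new("typed_value", DataType::Int64, true),
    ]));
    DataType::Struct(Fields::from(vec![
        // One metadata column; nested variants reuse it rather than
        // materializing their own
        Field::new("metadata", DataType::BinaryView, false),
        Field::new("value", DataType::BinaryView, true), // unshredded residue
        Field::new(
            "typed_value",
            DataType::Struct(Fields::from(vec![Field::new("a", shredded_a, true)])),
            true,
        ),
    ]))
}
```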

Contributor Author

I agree -- I think I did not read the proposal correctly. I will correct the wording

https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?tab=t.0#heading=h.ru2zdn2szuc4

Contributor Author

Update: It seems I just copy/pasted the text from https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?tab=t.0#heading=h.ru2zdn2szuc4 and, looking at the documentation, there is a bit of confusion about what the text means.

So for now I removed the section about the contents of shredded columns, since the code doesn't support this yet. We can add the comments in as we start to implement shredding for real

Samyak2 added a commit to Samyak2/arrow-rs that referenced this pull request Jul 12, 2025
Still very early. Opening this PR to get some early feedback on the approach.

The approach is roughly like this:
- Allocate a new offset array (all zeroes) and a new nulls array (copy of variant's null buffer).
- For every variant path access, we iterate through all the values and increment the offsets to the desired object field's offset or array index's offset.
    - If the value isn't an object/array, we set a null for this row.
- We then extract all the values at the new offsets into an array
    - I have currently only done it for u64. For this PR, I can make it generic for all primitive types.

Some open questions:
- This seems like a good vectorized approach to me, but it comes at the cost of allocating new buffers for every variant_get invocation.
    - Would it be worth it to try a row-wise approach instead? It would be something like: do the whole path access for each row and append into the appropriate ArrayBuilder.
    - This offset-based approach works quite well for extracting complex types too (mainly arrays). I have not implemented it here yet but I have done it elsewhere before.
- Databricks has two variations of this function: `variant_get` and `try_variant_get`.
    - The only difference in `try_variant_get` is that cast errors are ignored.
    - I'm guessing this is covered by `CastOptions`? I haven't looked at it yet.
- Perhaps extracting complex types can be a separate PR? I can do it here, but the PR might become too large to review.

I'm new to this codebase, please let me know if I missed anything! :)

I will be rebasing this PR once apache#7905 is merged. I'll be using VariantArray instead of StructArray once that is merged.
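
A conceptual sketch of the per-step offset update described in that commit message (hypothetical names; not the code from the commit):

```rust
// Advance each row's offset into the shared value buffer by one path
// step, or mark the row null when the step cannot be taken.
fn step_offsets(
    offsets: &mut [usize],
    validity: &mut [bool],
    // Resolves (row, current offset) to the child's offset, if the value
    // at that offset is an object/array containing the requested element
    child_offset: impl Fn(usize, usize) -> Option<usize>,
) {
    for row in 0..offsets.len() {
        if !validity[row] {
            continue;
        }
        match child_offset(row, offsets[row]) {
            Some(next) => offsets[row] = next,
            None => validity[row] = false, // not an object/array: null row
        }
    }
}
```
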
@alamb
Contributor Author

alamb commented Jul 12, 2025

Given this PR seems to block several others, I plan to merge it once CI is complete

@alamb alamb self-assigned this Jul 12, 2025
@alamb alamb merged commit fe77f2f into apache:main Jul 12, 2025
9 checks passed
@alamb
Contributor Author

alamb commented Jul 12, 2025

Thanks again for the quick reviews everyone!

@alamb alamb deleted the alamb/variant_array_builder branch July 12, 2025 19:56
@alamb
Contributor Author

alamb commented Jul 12, 2025

I am happy to make PRs / address comments from anyone who might not have had a chance to comment -- just let me know

alamb added a commit that referenced this pull request Jul 18, 2025
# Which issue does this PR close?


- Part of #7911
- Part of #6736
- Follow-on to #7905


# Rationale for this change

I wrote benchmarks for some changes to the JSON decoder in #7911,
but they are non-trivial. To keep #7911 easier to review, I have
pulled the benchmarks out into their own PR.

# What changes are included in this PR?

1. Add a new JSON benchmark
2. Include the `variant_get` benchmark added in #7919 by @Samyak2

# Are these changes tested?

I tested them manually and clippy CI coverage ensures they compile

# Are there any user-facing changes?

No, these are only benchmarks.
alamb added a commit that referenced this pull request Jul 23, 2025
# Which issue does this PR close?

- Part of #6736
- Closes #7964
- Follow-on to #7905

# Rationale for this change

In a quest to have the fastest and most efficient Variant
implementation, I would like to avoid copies if at all possible.

Right now, making a VariantArray requires completing an individual
buffer and then appending it to the array.

Let's make that faster by having the VariantBuilder append directly
into the buffer.


# What changes are included in this PR?

1. Add `VariantBuilder::new_from_existing`
2. Add a `VariantArrayBuilder::variant_builder` that reuses the buffers

# Are these changes tested?

1. New unit tests
2. Yes, by existing tests

# Are there any user-facing changes?

Hopefully faster performance

---------

Co-authored-by: Congxian Qiu <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>