Add VariantArray and VariantArrayBuilder for constructing Arrow Arrays of Variants
#7905
Conversation
@scovich @friendlymatthew and @harshmotw-db -- do you have time and interest to review this PR?
```rust
if !inner.fields().iter().any(|f| f.name() == "value") {
    return Err(ArrowError::InvalidArgumentError(
        "Invalid VariantArray: StructArray must contain a 'value' field".to_string(),
    ));
}
```
More of a note to myself for the shredding work, but I don't think we should necessarily error when the value column is not found?

^ Disregard. The value and typed_value columns should exist; it is rather the elements that may be null.
I agree there are cases when value is not required (when the variant is fully shredded)
The only requirement for a shredded variant is that at least one of the two must exist.
BTW, does the shredded variant spec forbid the existence of other (unknown) columns?
(I thought it did, but I can't find any obvious wording)
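The "at least one of the two must exist" rule could be sketched roughly like this. This is a toy stand-in, not the PR's code: field names are plain strings rather than `StructArray` fields, and errors are `String`s rather than `ArrowError`.

```rust
// Sketch: a shredded variant struct must contain at least one of
// `value` and `typed_value` (it may contain both).
fn validate_shredded_fields(field_names: &[&str]) -> Result<(), String> {
    let has_value = field_names.contains(&"value");
    let has_typed_value = field_names.contains(&"typed_value");
    if !has_value && !has_typed_value {
        return Err(
            "Invalid VariantArray: StructArray must contain at least one of \
             'value' or 'typed_value' fields"
                .to_string(),
        );
    }
    Ok(())
}
```

For example, `["metadata", "typed_value"]` would pass this check while `["metadata"]` alone would not.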
I don't know either
friendlymatthew
left a comment
The VariantArray and VariantArrayBuilder look great to me. Thank you!
Personally, it would be great to get this merged ASAP. I plan on further iterating on it in #7895
Samyak2
left a comment
This API looks great! Hoping to see it merged soon, even if it's not fully done - it will unblock things like #7893
```rust
/// NOTE: It is also permissible for the metadata field to be
/// Dictionary-Encoded, preferably (but not required) with an index type of
/// int8.
inner: StructArray,
```
Would it be worth it to store the index of the metadata and value arrays? This would speed up the `metadata_field` and `value_field` functions
Yes, that would be a good optimization -- maybe we can do it as a follow on PR (I believe @zeroshade did it in the Go implementation, in fact)
Yup, that's precisely how the Go version works. We store the indices of the metadata/value/typed_value arrays in the extension type itself
I filed a ticket to track
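The optimization discussed above might look something like this. It is a hypothetical sketch, not the arrow-rs code: field lookup is modeled with a slice of names, and the struct/field names (`VariantArrayIndices`, `resolve_indices`) are invented for illustration.

```rust
// Resolve the child positions once at construction time, so accessors
// can index directly instead of scanning field names on every call.
struct VariantArrayIndices {
    metadata_idx: usize,
    value_idx: usize,
}

fn resolve_indices(field_names: &[&str]) -> Result<VariantArrayIndices, String> {
    let find = |name: &str| {
        field_names
            .iter()
            .position(|f| *f == name)
            .ok_or_else(|| format!("missing '{name}' field"))
    };
    Ok(VariantArrayIndices {
        metadata_idx: find("metadata")?,
        value_idx: find("value")?,
    })
}
```

An accessor like `metadata_field` could then be a constant-time `self.inner.column(self.indices.metadata_idx)` rather than a name scan.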
```rust
    return Err(ArrowError::InvalidArgumentError(
        "Invalid VariantArray: StructArray must contain a 'value' field".to_string(),
    ));
}
```
I'm not very familiar with arrow-rs, so this question may not make sense. But what's the rationale for not validating that the value and metadata fields are of binary type (one of binary, large_binary, or binary_view)?
My thinking is that the proposed arrow spec allows a VariantArray to use BinaryArray, LargeBinaryArray, or BinaryViewArray.
Though since this PR makes assumptions about BinaryView elsewhere, it would make sense to return a not-implemented error for those other cases.
Update: I will update this PR to validate BinaryView
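The validation described above might be shaped like this. This is a hedged sketch with `DataType` reduced to a minimal stand-in enum: Binary and LargeBinary are legal per the proposed spec but unsupported by this PR, so they get a "not yet implemented" error rather than an "invalid" one.

```rust
// Minimal stand-in for arrow's DataType, just enough for the sketch.
#[derive(Debug, PartialEq)]
enum DataType {
    Binary,
    LargeBinary,
    BinaryView,
    Int64,
}

// Only BinaryView is accepted for now; other binary types are spec-legal
// but unimplemented, and everything else is invalid outright.
fn validate_value_type(dt: &DataType) -> Result<(), String> {
    match dt {
        DataType::BinaryView => Ok(()),
        DataType::Binary | DataType::LargeBinary => Err(
            "Not yet implemented: VariantArray 'value' field of type Binary/LargeBinary".into(),
        ),
        other => Err(format!("Invalid VariantArray 'value' field type: {other:?}")),
    }
}
```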
```rust
}

/// Return a reference to the metadata field of the [`StructArray`]
pub fn metadata_field(&self) -> &ArrayRef {
```
In `value(..)`, we're doing `self.metadata_field().as_binary_view()`. What's the reason for not returning a `&BinaryViewArray` here directly?
The same for `value_field`.
I was thinking ahead to when this could be binary or large binary as well
```diff
 /// STRUCT<metadata: BINARY, value: BINARY> where nulls are preserved. The JSON strings in the input
 /// must be valid.
-pub fn batch_json_string_to_variant(input: &ArrayRef) -> Result<StructArray, ArrowError> {
+pub fn batch_json_string_to_variant(input: &ArrayRef) -> Result<VariantArray, ArrowError> {
```
Change 1: a new VariantArray that wraps a StructArray and does basic validation
```diff
-    value_offsets.push(value_current_offset);
-
-    let mut validity = BooleanBufferBuilder::new(input.len());
+    let mut variant_array_builder = VariantArrayBuilder::new(input_string_array.len());
```
This PR refactors the details of creating the output arrays into VariantArrayBuilder so it can be reused
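The builder's role can be illustrated with a simplified stand-in. This is not the PR's implementation: byte handling is modeled with `Vec<u8>` rather than arrow buffer builders, and only the method names (`new`, `append_null`, `append_variant_buffers`) mirror the PR.

```rust
// Toy model of VariantArrayBuilder: accumulates per-row (metadata, value)
// byte buffers plus a validity flag for each row.
struct VariantArrayBuilder {
    metadata: Vec<Vec<u8>>,
    value: Vec<Vec<u8>>,
    validity: Vec<bool>,
}

impl VariantArrayBuilder {
    fn new(capacity: usize) -> Self {
        Self {
            metadata: Vec::with_capacity(capacity),
            value: Vec::with_capacity(capacity),
            validity: Vec::with_capacity(capacity),
        }
    }

    // A null row still occupies a slot so row indices stay aligned.
    fn append_null(&mut self) {
        self.metadata.push(Vec::new());
        self.value.push(Vec::new());
        self.validity.push(false);
    }

    // Append one variant's metadata and value buffers as a valid row.
    fn append_variant_buffers(&mut self, metadata: &[u8], value: &[u8]) {
        self.metadata.push(metadata.to_vec());
        self.value.push(value.to_vec());
        self.validity.push(true);
    }

    fn len(&self) -> usize {
        self.validity.len()
    }
}
```

The point of the refactor is that any code producing (metadata, value) pairs, not just the JSON conversion, can drive this one builder.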
```diff
-    value_buffer.extend(value);
-    value_offsets.push(value_current_offset);
-    value_validity.append(true);
+    variant_array_builder.append_variant_buffers(&metadata, &value);
```
I have ideas on how to make this more efficient by writing directly into the target buffers. I will work on this in a follow on PR
This is what I came up with
```diff
     ]);
     let array_ref: ArrayRef = Arc::new(input);
-    let output = batch_json_string_to_variant(&array_ref).unwrap();
+    let variant_array = batch_json_string_to_variant(&array_ref).unwrap();
```
This test uses the new VariantArray APIs to access the relevant fields / value / etc
```diff
-    assert!(struct_array.is_null(4));
+    // Compare row 0
+    assert!(!variant_array.is_null(0));
+    assert_eq!(variant_array.value(0), Variant::Int8(1));
```
I think `VariantArray::value()` returning a `Variant` is particularly nice to use!
That's also how the Go VariantArray works: `Value()` on the VariantArray returns a Variant (performing the consolidation if the underlying VariantArray is shredded).
I haven't fully worked out in my head how this will work for shredded Variants -- specifically how to avoid copying stuff around when accessing a shredded variant. We'll have to figure that out in follow on PRs
BTW I hacked up a follow on PR that writes into VariantArrayBuilder without any copying. I am quite pleased that it works
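The `value()` ergonomics being discussed can be sketched with toy types. This is purely illustrative: the real accessor decodes the binary variant encoding from the metadata/value columns, whereas here rows are modeled as `Option<i8>`.

```rust
// Stand-in Variant enum: the real one covers the full variant type system.
#[derive(Debug, PartialEq)]
enum Variant {
    Null,
    Int8(i8),
}

// Stand-in array: one Option<i8> per row instead of binary columns.
struct VariantArray {
    rows: Vec<Option<i8>>,
}

impl VariantArray {
    fn is_null(&self, i: usize) -> bool {
        self.rows[i].is_none()
    }

    // Row accessor that decodes straight to a Variant, mirroring the
    // `value(i) -> Variant` shape praised in the thread.
    fn value(&self, i: usize) -> Variant {
        match self.rows[i] {
            Some(v) => Variant::Int8(v),
            None => Variant::Null,
        }
    }
}
```

The appeal is that callers work with `Variant` values directly and never see the underlying buffer layout, which is also what makes a future shredded representation swappable behind the same accessor.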
scovich
left a comment
very nice!
```rust
/// If typed_value is a nested type, its elements must be required and must
/// be a struct containing only one of the following:
///
/// 1. A single required field, of type binary, large_binary, or binary_view named value
///
/// 2. An optional field named value of type binary, large_binary, or
/// binary_view AND an optional field named typed_value which follows these
/// same rules
```
I don't understand this part? typed_value is the arbitrary type of a shredded column. That could be a primitive type (e.g. INT), or it could be a struct (with fields that are themselves either strongly typed or variant), or it could even be variant (technically, a writer could choose to shred out a commonly-used field, even if that field doesn't have a strong type that could shred in structured form).
Any nested variants must have only value and typed_value columns though -- the top-level metadata applies to everything. Readers who encounter nested variants inside the typed_value column need to be prepared to use the top-level metadata when reading them, which may require some plumbing, given that it's not actually materialized locally?
I agree -- I think I did not read the proposal correctly. I will correct the wording
Update: It seems I just copy/pasted the text from https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?tab=t.0#heading=h.ru2zdn2szuc4 and, looking at the documentation, there is a bit of confusion about what the text means.
So for now I removed the section about the contents of shredded columns, since the code doesn't support this yet. We can add the comments back in as we start to implement shredding for real.
Still very early. Opening this PR to get some early feedback on the approach.
The approach is roughly like this:
- Allocate a new offset array (all zeroes) and a new nulls array (copy of variant's null buffer).
- For every variant path access, we iterate through all the values and increment the offsets to the desired object field's offset or array index's offset.
- If the value isn't an object/array, we set a null for this row.
- We then extract all the values at the new offsets into an array
- I have currently only done it for u64. For this PR, I can make it generic for all primitive types.
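The offset-rewriting step above can be illustrated with a toy encoding. Everything here is invented for the sketch (a flat `u64` buffer where an array is stored as a length followed by its elements); the real implementation would walk the binary variant encoding, but the per-row logic is the same: advance the offset to the child, or null out the row.

```rust
// One vectorized path step `[index]`: for every still-valid row, move its
// offset to the index-th element of the array it points at; rows whose
// array is too short become null. Toy encoding: values[off] is the array
// length, followed inline by that many elements.
fn get_array_element(
    values: &[u64],
    offsets: &mut [usize],
    validity: &mut [bool],
    index: usize,
) {
    for (off, valid) in offsets.iter_mut().zip(validity.iter_mut()) {
        if !*valid {
            continue; // row already null from an earlier path step
        }
        let len = values[*off] as usize;
        if index < len {
            // skip the length prefix, then jump to the requested element
            *off = *off + 1 + index;
        } else {
            *valid = false; // path does not exist for this row
        }
    }
}
```

After the final path step, the values at the surviving offsets would be gathered into the output array, which matches the "extract all the values at the new offsets" step in the description.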
Some open questions:
- This seems like a good vectorized approach to me, but it comes at the cost of allocating new buffers for every variant_get invocation.
- Would it be worth it to try a row-wise approach instead? It would be something like: do the whole path access for each row and append into the appropriate ArrayBuilder.
- This offset-based approach works quite well for extracting complex types too (mainly arrays). I have not implemented it here yet but I have done it elsewhere before.
- Databricks has two variations of this function: `variant_get` and `try_variant_get`.
- The only difference in `try_variant_get` is that cast errors are ignored.
- I'm guessing this is covered by `CastOptions`? I haven't looked at it yet.
- Perhaps extracting complex types can be a separate PR? I can do it here, but the PR might become too large to review.
I'm new to this codebase, please let me know if I missed anything! :)
I will be rebasing this PR once apache#7905 is merged. I'll be using VariantArray instead of StructArray once that is merged.
|
Given this PR seems to block several others, I plan to merge it once CI is complete |
|
Thanks again for the quick reviews everyone! |
|
I am happy to make PRs / address comments from anyone who might not have had a chance to comment -- just let me know |
# Which issue does this PR close?
- Part of #7911
- Part of #6736
- Follow on to #7905

# Rationale for this change
I wrote benchmarks for some changes to the json decoder in #7911, but they are non-trivial. To keep #7911 easier to review, I have pulled the benchmarks out into their own PR.

# What changes are included in this PR?
1. Add a new json benchmark
2. Include the `variant_get` benchmark added in #7919 by @Samyak2

# Are these changes tested?
I tested them manually, and clippy CI coverage ensures they compile.

# Are there any user-facing changes?
No, these are only benchmarks.
# Which issue does this PR close?
- Part of #6736
- Closes #7964
- Follow on to #7905

# Rationale for this change
In a quest to have the fastest and most efficient Variant implementation, I would like to avoid copies if at all possible.

Right now, making a VariantArray first requires completing an individual buffer and then appending it to the array. Let's make that faster by having the VariantBuilder append directly into the array's buffer.

# What changes are included in this PR?
1. Add `VariantBuilder::new_from_existing`
2. Add a `VariantArrayBuilder::variant_builder` that reuses the buffers

# Are these changes tested?
1. New unit tests
2. Yes, by existing tests

# Are there any user-facing changes?
Hopefully faster performance.

---------

Co-authored-by: Congxian Qiu <[email protected]>
Co-authored-by: Liang-Chi Hsieh <[email protected]>
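The copy-avoidance idea can be sketched as a child builder that borrows the parent's buffer. This is a hedged stand-in, not the PR's API surface: the child here just appends raw bytes (the real `VariantBuilder` writes the variant encoding), and the helper `finish_row` is invented for the sketch.

```rust
// Parent builder owns one shared value buffer plus row offsets.
struct VariantArrayBuilder {
    value_buffer: Vec<u8>,
    offsets: Vec<usize>,
}

// Child builder borrows the parent's buffer and writes into it directly,
// so finishing a variant requires no copy of a temporary buffer.
struct VariantBuilder<'a> {
    buffer: &'a mut Vec<u8>,
}

impl VariantArrayBuilder {
    fn variant_builder(&mut self) -> VariantBuilder<'_> {
        VariantBuilder {
            buffer: &mut self.value_buffer,
        }
    }

    // After the child builder is dropped, record where the row ended.
    fn finish_row(&mut self) {
        self.offsets.push(self.value_buffer.len());
    }
}

impl<'a> VariantBuilder<'a> {
    fn append_bytes(&mut self, bytes: &[u8]) {
        self.buffer.extend_from_slice(bytes);
    }
}
```

The borrow checker enforces the protocol for free: the parent cannot be touched while a child builder is alive, so rows cannot interleave.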
Which issue does this PR close?
Rationale for this change
As we begin to add operations on Variants stored in arrays, we need some better abstractions for working with those arrays.
This PR builds on the great work of @harshmotw-db in #7884 to start adding t
What changes are included in this PR?
- `VariantArray` that wraps a `StructArray` and adds useful accessors
- `VariantArrayBuilder`, as described in [Variant] API to construct Shredded Variant Arrays #7895, to construct `VariantArray`s
- Updated `batch_json_string_to_variant` to use the new builder and array wrapper

Note: while these APIs have no shredding support yet, I think shredding can be added in a straightforward way
Are these changes tested?
Yes, unit tests and doc examples are included
Are there any user-facing changes?
New VariantArray and VariantArrayBuilder