- 
                Notifications
    
You must be signed in to change notification settings  - Fork 1k
 
Initial Builder API for Creating Variant Values #7653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 
           It also looks like there are some CI failures on this PR: 
  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @PinkCrow007 -- this is looking like a great start
Overall
I tried out the APIs and wrote some docs and I think this API works well.
Suggested next steps
In my opinion, since we are early in this project we should "move fast and bias to action" -- thus, what I would suggest is:
- Get this PR so it passes the CI tests (so we can merge it in)
 - Address any comments you have tome to
 - Merge it in
 
Then we can iterate / add additional tests / features / coverage as needed.
Suggestions for Docs:
One thing that helps me with these early days of API design / development is to try and write examples out of common things people want to do with the builders.
That often helps clarify API design points and has the added bonus of better code.
I took a shot at creating an example in the following PR (that targets your branch, so merging it will update this PR):
We can also merge this PR without the change and I'll make a new PR targeting this repo
| } | ||
| 
               | 
          ||
| #[test] | ||
| fn variant_primitive_builder() -> Result<(), ArrowError> { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is neat
        
          
                parquet-variant/src/variant.rs
              
                Outdated
          
        
      | 
               | 
          ||
| impl<'m, 'v> From<()> for Variant<'m, 'v> { | ||
| fn from(_: ()) -> Self { | ||
| Variant::Null | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this should not return Null but the actual variant value
I think with the change in this PR you could implement this API directly
Add documentation and examples to VariantBuilder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial pass
| /// object_builder.append_value("last_name", "Li"); | ||
| /// object_builder.finish(); | ||
| /// // Finish the builder to get the metadata and value | ||
| /// let (metadata, value) = builder.finish(); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to deal with nested data efficiently, we will need a separate metadata builder that multiple object builders can make use of. Otherwise, we would need to merge multiple metadata dictionaries, which would also require rewriting the corresponding field ids. Very complex and error-prone.
We should also give some thought into how to build sorted metadata dictionaries. It probably would require making two passes -- one to identify the string values and sort them into a variant metadata; and another to build the variant value that references the string ids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the solution that @zeroshade suggested:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supporting nested data is tracked here:
| fn append_decimal4(&mut self, integer: i32, scale: u8) { | ||
| self.buffer | ||
| .push(primitive_header(VariantPrimitiveType::Decimal4)); | ||
| self.buffer.push(scale); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we validate the scale before pushing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should (or at least have optional checking 🤔 )
| self.buffer.extend_from_slice(µs.to_le_bytes()); | ||
| } | ||
| 
               | 
          ||
| fn append_decimal4(&mut self, integer: i32, scale: u8) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, to match the variant spec:
| fn append_decimal4(&mut self, integer: i32, scale: u8) { | |
| fn append_decimal4(&mut self, unscaled_value: i32, scale: u8) { | 
(more below)
| Entry::Vacant(entry) => { | ||
| let id = self.dict_keys.len() as u32; | ||
| entry.insert(id); | ||
| self.dict_keys.push(key.to_string()); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit unfortunate we have to double-allocate strings. Ideally the dictionary could track &str backed by dict_keys, but I don't know how to manage mutability and lifetimes to achieve that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the dictionary could be a map to u32 (an index into the relevant entry in the dict_keys Vec) instead of a String
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could work. Also, we ultimately need a single packed array of utf8 bytes; so we might want to consider storing the same kinds of offsets that the actual variant metadata stores, where each string's bytes are found in the range offset[i]..offset[i+1]. That shouldn't be any more expensive to access (sort) vs. an array of (heap-allocated) strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing offset pairs directly would also work really nicely if we decide not to sort+dedup the metadata field entries (**), because then the Vec<u8> can simply be copied as-is to its final home.
(**) Sort+dedup is actually quite some work, because it requires finding and rewriting all field ids scattered through the value bytes. So it would almost certainly be a second step (either a preliminary pass before we actually generate the variant value bytes, or as a fix-up pass afterward to rewrite field ids)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(**) Sort+dedup is actually quite some work, because it requires finding and rewriting all field ids scattered through the value bytes. So it would almost certainly be a second step (either a preliminary pass before we actually generate the variant value bytes, or as a fix-up pass afterward to rewrite field ids)
@zeroshade described a pretty elegant solution in the Go variant builder which was to write the dictionary fields in the order they are encountered., and then decides if the dictionary is sorted when writing it out. The builder has an additional API that you can use to directly add a dictionary key
One benefit of this approach is that different writers can decide if they care about sorted dictionaries and if they do they can do a pre-pass to figure them out and add them directly to the builder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
        
          
                parquet-variant/src/builder.rs
              
                Outdated
          
        
      | (metadata, self.buffer) | ||
| } | ||
| 
               | 
          ||
| pub fn append_value<T: Into<Variant<'static, 'static>>>(&mut self, value: T) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we adjust lifetimes, so normal strings can be appended and not just string constants?
| pub fn append_value<T: Into<Variant<'static, 'static>>>(&mut self, value: T) { | |
| pub fn append_value<'v, T: Into<Variant<'_, 'v>>>(&mut self, value: T) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did so in d7d1449
        
          
                parquet-variant/src/builder.rs
              
                Outdated
          
        
      | pub fn new_array(&mut self) -> ArrayBuilder { | ||
| ArrayBuilder::new(self) | ||
| } | ||
| 
               | 
          ||
| /// Create an [`ObjectBuilder`] for creating [`Variant::Object`] values. | ||
| /// | ||
| /// See the examples on [`VariantBuilder`] for usage. | ||
| pub fn new_object(&mut self) -> ObjectBuilder { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrays and objects are tricky to build incrementally. The value offset and field id arrays require either knowing the number of elements/fields to be created in advance (and then worrying about what happens if the caller builds too many/few entries afterward), or building the arrays in separate storage and then moving an arbitrarily large number of buffered bytes to make room for the them after the fact.
I guess this PR takes the latter approach, but for deeply nested data that will lead to something that looks like quadratic cost -- O(num_bytes * depth) -- as the deepest fields get moved repeatedly while finishing each of the parent builders in turn.
Not sure the best solution here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the builder can leave room for the list length and then append the values, and then go back and update the length when the list is finsihed
This would get tricky for building "large" lists as the length field may not be known upfront.
we could also introduce potentially a functon like new_large_object() or something for callers to hint up front their object has many fields, and if they use new_object but push too many values fallback to copying
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ObjectBuilder::new(self) | ||
| } | ||
| 
               | 
          ||
| pub fn finish(self) -> (Vec<u8>, Vec<u8>) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this approach can be easily extended to allow nested variants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: Never mind -- I missed the fact that VariantBuilder is the truly top-level builder, and the array/object builders both nest around it. There's still the problem that variant can't impl From arrays/objects, but I guess we could add a call to create a new object/array builder from an existing one, in order to achieve nesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still the problem that variant can't impl From arrays/objects, but I guess we could add a call to create a new object/array builder from an existing one, in order to achieve nesting?
I am not sure we will be able to implement a general purpose From impl for native rust slices or objects. We could potentially offer a serde style thing (like serde_json but serde_variant 🤔 )
| Variant::Float(v) => self.append_float(v), | ||
| Variant::Double(v) => self.append_double(v), | ||
| Variant::Binary(v) => self.append_binary(v), | ||
| Variant::String(s) | Variant::ShortString(s) => self.append_string(s), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will make three times deciding between short vs. long string? one to make the input, another here, and a third to consume it? Any way we could make this more direct/efficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to have a self.append_short_string and self.append_string and then call the appropriate ones. We could have self.append_short_string return an error potentially as well if the string was too long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the liberty of merging up from main to resolve conflicts and updating the API t get the CI passing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through this PR and comments again and I think it looks good enough to merge
There are several outstanding questions, that I think we can address as follow on PRs (I will file tickets)
- Supporting nested objects / lists
 - Error checking for variants that have restricted values (e.g. binary that is longer than u32::MAX, valid scale for decimals, etc)
 - improving performance of dictionary creation / handling sorted dictionaries
 
But all in all I think this is a great step forward if I do say so my self.
Thank you @scovich and @PinkCrow007
🚀
| .extend_from_slice(&(value.len() as u32).to_le_bytes()); | ||
| self.buffer.extend_from_slice(value); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should throw an error if value.len() > u32::max
        
          
                parquet-variant/src/builder.rs
              
                Outdated
          
        
      | pub fn new_array(&mut self) -> ArrayBuilder { | ||
| ArrayBuilder::new(self) | ||
| } | ||
| 
               | 
          ||
| /// Create an [`ObjectBuilder`] for creating [`Variant::Object`] values. | ||
| /// | ||
| /// See the examples on [`VariantBuilder`] for usage. | ||
| pub fn new_object(&mut self) -> ObjectBuilder { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the builder can leave room for the list length and then append the values, and then go back and update the length when the list is finsihed
This would get tricky for building "large" lists as the length field may not be known upfront.
we could also introduce potentially a functon like new_large_object() or something for callers to hint up front their object has many fields, and if they use new_object but push too many values fallback to copying
| fn append_decimal4(&mut self, integer: i32, scale: u8) { | ||
| self.buffer | ||
| .push(primitive_header(VariantPrimitiveType::Decimal4)); | ||
| self.buffer.push(scale); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should (or at least have optional checking 🤔 )
| ObjectBuilder::new(self) | ||
| } | ||
| 
               | 
          ||
| pub fn finish(self) -> (Vec<u8>, Vec<u8>) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still the problem that variant can't impl From arrays/objects, but I guess we could add a call to create a new object/array builder from an existing one, in order to achieve nesting?
I am not sure we will be able to implement a general purpose From impl for native rust slices or objects. We could potentially offer a serde style thing (like serde_json but serde_variant 🤔 )
        
          
                parquet-variant/src/builder.rs
              
                Outdated
          
        
      | Variant::Binary(v) => self.append_binary(v), | ||
| Variant::String(s) | Variant::ShortString(s) => self.append_string(s), | ||
| Variant::Object(_) | Variant::List(_) => { | ||
| unreachable!("Object and Array variants cannot be created through Into<Variant>") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I think you can append a variant object. I will write a test / file a ticket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Variant::Float(v) => self.append_float(v), | ||
| Variant::Double(v) => self.append_double(v), | ||
| Variant::Binary(v) => self.append_binary(v), | ||
| Variant::String(s) | Variant::ShortString(s) => self.append_string(s), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to have a self.append_short_string and self.append_string and then call the appropriate ones. We could have self.append_short_string return an error potentially as well if the string was too long
        
          
                parquet-variant/src/builder.rs
              
                Outdated
          
        
      | (metadata, self.buffer) | ||
| } | ||
| 
               | 
          ||
| pub fn append_value<T: Into<Variant<'static, 'static>>>(&mut self, value: T) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did so in d7d1449
| 
           I also updated the naming to match the change to rename  The more I see that the less I like it -- I will file a separate ticket  | 
    
| 
           All right -- I have filed a bunch of follow on tickets, and attached them to Once I get the CI to pass I'll plan to merge this PR Thanks again @PinkCrow007 and @scovich  | 
    
Which issue does this PR close?
Rationale for this change
This PR introduces a basic builder API for creating Variant values, building on the foundation laid by @mkarbo. The builder provides a user-friendly nested API while maintaining performance through a single-buffer design. The design was shaped with huge help from @alamb, @scovich and @Weijun-H ’s feedback, and draws much inspiration from the excellent work by @zeroshade
This is an initial version and does not yet support nested values, metadata key sorting, and so on
What changes are included in this PR?
Are there any user-facing changes?
The new API's added in parquet-variant will be user facing.