Skip to content

Conversation

@PinkCrow007
Copy link
Contributor

@PinkCrow007 PinkCrow007 commented Jun 13, 2025

Which issue does this PR close?

Rationale for this change

This PR introduces a basic builder API for creating Variant values, building on the foundation laid by @mkarbo. The builder provides a user-friendly nested API while maintaining performance through a single-buffer design. The design was shaped with huge help from @alamb, @scovich and @Weijun-H ’s feedback, and draws much inspiration from the excellent work by @zeroshade

This is an initial version and does not yet support nested values, metadata key sorting, and so on

What changes are included in this PR?

  • Adds VariantBuilder, ObjectBuilder, ArrayBuilder

Are there any user-facing changes?

The new API's added in parquet-variant will be user facing.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 13, 2025
@alamb
Copy link
Contributor

alamb commented Jun 13, 2025

It also looks like there are some CI failures on this PR:

  • RAT means you have to put the apache license at the start of the file
  • Clippy means run cargo clippy -p parquet-variant
  • The test seems to be failing when you run cargo test -p parquet-variant

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @PinkCrow007 -- this is looking like a great start

Overall

I tried out the APIs and wrote some docs and I think this API works well.

Suggested next steps

In my opinion, since we are early in this project we should "move fast and bias to action" -- thus, what I would suggest is:

  1. Get this PR so it passes the CI tests (so we can merge it in)
  2. Address any comments you have tome to
  3. Merge it in

Then we can iterate / add additional tests / features / coverage as needed.

Suggestions for Docs:

One thing that helps me with these early days of API design / development is to try and write examples out of common things people want to do with the builders.

That often helps clarify API design points and has the added bonus of better code.

I took a shot at creating an example in the following PR (that targets your branch, so merging it will update this PR):

We can also merge this PR without the change and I'll make a new PR targeting this repo

FYI @scovich @mkarbo @superserious-dev

}

#[test]
fn variant_primitive_builder() -> Result<(), ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is neat


impl<'m, 'v> From<()> for Variant<'m, 'v> {
fn from(_: ()) -> Self {
Variant::Null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this should not return Null but the actual variant value

I think with the change in this PR you could implement this API directly

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial pass

/// object_builder.append_value("last_name", "Li");
/// object_builder.finish();
/// // Finish the builder to get the metadata and value
/// let (metadata, value) = builder.finish();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to deal with nested data efficiently, we will need a separate metadata builder that multiple object builders can make use of. Otherwise, we would need to merge multiple metadata dictionaries, which would also require rewriting the corresponding field ids. Very complex and error-prone.

We should also give some thought into how to build sorted metadata dictionaries. It probably would require making two passes -- one to identify the string values and sort them into a variant metadata; and another to build the variant value that references the string ids?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supporting nested data is tracked here:

fn append_decimal4(&mut self, integer: i32, scale: u8) {
self.buffer
.push(primitive_header(VariantPrimitiveType::Decimal4));
self.buffer.push(scale);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we validate the scale before pushing it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should (or at least have optional checking 🤔 )

self.buffer.extend_from_slice(&micros.to_le_bytes());
}

fn append_decimal4(&mut self, integer: i32, scale: u8) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, to match the variant spec:

Suggested change
fn append_decimal4(&mut self, integer: i32, scale: u8) {
fn append_decimal4(&mut self, unscaled_value: i32, scale: u8) {

(more below)

Entry::Vacant(entry) => {
let id = self.dict_keys.len() as u32;
entry.insert(id);
self.dict_keys.push(key.to_string());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit unfortunate we have to double-allocate strings. Ideally the dictionary could track &str backed by dict_keys, but I don't know how to manage mutability and lifetimes to achieve that...

Copy link
Contributor

@alamb alamb Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the dictionary could be a map to u32 (an index into the relevant entry in the dict_keys Vec) instead of a String

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could work. Also, we ultimately need a single packed array of utf8 bytes; so we might want to consider storing the same kinds of offsets that the actual variant metadata stores, where each string's bytes are found in the range offset[i]..offset[i+1]. That shouldn't be any more expensive to access (sort) vs. an array of (heap-allocated) strings?

Copy link
Contributor

@scovich scovich Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing offset pairs directly would also work really nicely if we decide not to sort+dedup the metadata field entries (**), because then the Vec<u8> can simply be copied as-is to its final home.

(**) Sort+dedup is actually quite some work, because it requires finding and rewriting all field ids scattered through the value bytes. So it would almost certainly be a second step (either a preliminary pass before we actually generate the variant value bytes, or as a fix-up pass afterward to rewrite field ids)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(**) Sort+dedup is actually quite some work, because it requires finding and rewriting all field ids scattered through the value bytes. So it would almost certainly be a second step (either a preliminary pass before we actually generate the variant value bytes, or as a fix-up pass afterward to rewrite field ids)

@zeroshade described a pretty elegant solution in the Go variant builder which was to write the dictionary fields in the order they are encountered., and then decides if the dictionary is sorted when writing it out. The builder has an additional API that you can use to directly add a dictionary key

One benefit of this approach is that different writers can decide if they care about sorted dictionaries and if they do they can do a pre-pass to figure them out and add them directly to the builder

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(metadata, self.buffer)
}

pub fn append_value<T: Into<Variant<'static, 'static>>>(&mut self, value: T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we adjust lifetimes, so normal strings can be appended and not just string constants?

Suggested change
pub fn append_value<T: Into<Variant<'static, 'static>>>(&mut self, value: T) {
pub fn append_value<'v, T: Into<Variant<'_, 'v>>>(&mut self, value: T) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did so in d7d1449

Comment on lines 322 to 329
pub fn new_array(&mut self) -> ArrayBuilder {
ArrayBuilder::new(self)
}

/// Create an [`ObjectBuilder`] for creating [`Variant::Object`] values.
///
/// See the examples on [`VariantBuilder`] for usage.
pub fn new_object(&mut self) -> ObjectBuilder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrays and objects are tricky to build incrementally. The value offset and field id arrays require either knowing the number of elements/fields to be created in advance (and then worrying about what happens if the caller builds too many/few entries afterward), or building the arrays in separate storage and then moving an arbitrarily large number of buffered bytes to make room for the them after the fact.

I guess this PR takes the latter approach, but for deeply nested data that will lead to something that looks like quadratic cost -- O(num_bytes * depth) -- as the deepest fields get moved repeatedly while finishing each of the parent builders in turn.

Not sure the best solution here...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the builder can leave room for the list length and then append the values, and then go back and update the length when the list is finsihed

This would get tricky for building "large" lists as the length field may not be known upfront.

we could also introduce potentially a functon like new_large_object() or something for callers to hint up front their object has many fields, and if they use new_object but push too many values fallback to copying

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ObjectBuilder::new(self)
}

pub fn finish(self) -> (Vec<u8>, Vec<u8>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this approach can be easily extended to allow nested variants?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: Never mind -- I missed the fact that VariantBuilder is the truly top-level builder, and the array/object builders both nest around it. There's still the problem that variant can't impl From arrays/objects, but I guess we could add a call to create a new object/array builder from an existing one, in order to achieve nesting?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still the problem that variant can't impl From arrays/objects, but I guess we could add a call to create a new object/array builder from an existing one, in order to achieve nesting?

I am not sure we will be able to implement a general purpose From impl for native rust slices or objects. We could potentially offer a serde style thing (like serde_json but serde_variant 🤔 )

Variant::Float(v) => self.append_float(v),
Variant::Double(v) => self.append_double(v),
Variant::Binary(v) => self.append_binary(v),
Variant::String(s) | Variant::ShortString(s) => self.append_string(s),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will make three times deciding between short vs. long string? one to make the input, another here, and a third to consume it? Any way we could make this more direct/efficient?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to have a self.append_short_string and self.append_string and then call the appropriate ones. We could have self.append_short_string return an error potentially as well if the string was too long

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the liberty of merging up from main to resolve conflicts and updating the API t get the CI passing

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through this PR and comments again and I think it looks good enough to merge

There are several outstanding questions, that I think we can address as follow on PRs (I will file tickets)

  1. Supporting nested objects / lists
  2. Error checking for variants that have restricted values (e.g. binary that is longer than u32::MAX, valid scale for decimals, etc)
  3. improving performance of dictionary creation / handling sorted dictionaries

But all in all I think this is a great step forward if I do say so my self.

Thank you @scovich and @PinkCrow007

🚀

Comment on lines +284 to +285
.extend_from_slice(&(value.len() as u32).to_le_bytes());
self.buffer.extend_from_slice(value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should throw an error if value.len() > u32::max

Comment on lines 322 to 329
pub fn new_array(&mut self) -> ArrayBuilder {
ArrayBuilder::new(self)
}

/// Create an [`ObjectBuilder`] for creating [`Variant::Object`] values.
///
/// See the examples on [`VariantBuilder`] for usage.
pub fn new_object(&mut self) -> ObjectBuilder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the builder can leave room for the list length and then append the values, and then go back and update the length when the list is finsihed

This would get tricky for building "large" lists as the length field may not be known upfront.

we could also introduce potentially a functon like new_large_object() or something for callers to hint up front their object has many fields, and if they use new_object but push too many values fallback to copying

fn append_decimal4(&mut self, integer: i32, scale: u8) {
self.buffer
.push(primitive_header(VariantPrimitiveType::Decimal4));
self.buffer.push(scale);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should (or at least have optional checking 🤔 )

ObjectBuilder::new(self)
}

pub fn finish(self) -> (Vec<u8>, Vec<u8>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still the problem that variant can't impl From arrays/objects, but I guess we could add a call to create a new object/array builder from an existing one, in order to achieve nesting?

I am not sure we will be able to implement a general purpose From impl for native rust slices or objects. We could potentially offer a serde style thing (like serde_json but serde_variant 🤔 )

Variant::Binary(v) => self.append_binary(v),
Variant::String(s) | Variant::ShortString(s) => self.append_string(s),
Variant::Object(_) | Variant::List(_) => {
unreachable!("Object and Array variants cannot be created through Into<Variant>")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think you can append a variant object. I will write a test / file a ticket

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variant::Float(v) => self.append_float(v),
Variant::Double(v) => self.append_double(v),
Variant::Binary(v) => self.append_binary(v),
Variant::String(s) | Variant::ShortString(s) => self.append_string(s),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to have a self.append_short_string and self.append_string and then call the appropriate ones. We could have self.append_short_string return an error potentially as well if the string was too long

(metadata, self.buffer)
}

pub fn append_value<T: Into<Variant<'static, 'static>>>(&mut self, value: T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did so in d7d1449

@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

I also updated the naming to match the change to rename Variant::Array to Variant::List

The more I see that the less I like it -- I will file a separate ticket

@alamb
Copy link
Contributor

alamb commented Jun 17, 2025

All right -- I have filed a bunch of follow on tickets, and attached them to

Once I get the CI to pass I'll plan to merge this PR

Thanks again @PinkCrow007 and @scovich

@alamb alamb merged commit 56ac4dc into apache:main Jun 17, 2025
13 checks passed
@PinkCrow007
Copy link
Contributor Author

Huge thanks @alamb @scovich — I’m honestly really touched by all the care and thought you put into this. I learned so much from your improvements and comments. Really appreciate you helping push it forward! I’ll keep working on the follow-ups and keep learning from you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant]: Rust API to Create Variant Values

3 participants