Added List and Struct Encoding to arrow-avro Writer #8274
```rust
/// Avro types are not nullable, with nullability instead encoded as a union
/// where one of the variants is the null type.
///
/// To accommodate this we special case two-variant unions where one of the
/// variants is the null type, and use this to derive arrow's notion of nullability
#[derive(Debug, Copy, Clone, PartialEq)]
pub enum Nullability {
    /// The nulls are encoded as the first union variant
    NullFirst,
    /// The nulls are encoded as the second union variant
    NullSecond,
}
```
I moved this to src/schema.rs to keep the imports clean and non-circular.
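As a hedged illustration of the doc comment above (not the crate's actual parsing code), deriving `Nullability` from a two-variant union might look like this; the simplified `Schema` enum here is hypothetical:

```rust
// Hypothetical, simplified schema type for illustration only; the real
// arrow-avro schema representation is much richer than this.
#[derive(Debug, PartialEq)]
enum Schema {
    Null,
    Long,
    String,
}

#[derive(Debug, Copy, Clone, PartialEq)]
pub enum Nullability {
    NullFirst,
    NullSecond,
}

/// Derive nullability from a two-variant union, mirroring the special
/// case in the doc comment: exactly one variant must be the null type.
fn union_nullability(variants: &[Schema]) -> Option<Nullability> {
    match variants {
        [Schema::Null, v] if *v != Schema::Null => Some(Nullability::NullFirst),
        [v, Schema::Null] if *v != Schema::Null => Some(Nullability::NullSecond),
        _ => None, // not a simple nullable union
    }
}

fn main() {
    // ["null", "long"] reads as a nullable long with nulls first.
    assert_eq!(
        union_nullability(&[Schema::Null, Schema::Long]),
        Some(Nullability::NullFirst)
    );
    assert_eq!(
        union_nullability(&[Schema::String, Schema::Null]),
        Some(Nullability::NullSecond)
    );
    assert_eq!(union_nullability(&[Schema::Long, Schema::String]), None);
}
```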
arrow-avro/src/schema.rs (outdated)

```rust
    name_gen,
    order,
)?;
let items_with_meta = merge_extras(items_inner, ie);
```
Perhaps worth it to put this nullable logic in datatype_to_avro_with_order as it's duplicated here and in Map?
Solid catch, @scovich caught this one too.
arrow-avro/src/writer/encoder.rs (outdated)

```rust
#[inline]
fn precomputed_union_value_branch(&self, order: Option<Nullability>) -> Option<u8> {
    match (order, self.has_nulls()) {
        (Some(Nullability::NullFirst), false) => Some(0x02), // value branch index 1
```
This is explained elsewhere, I think we can eliminate the comments. Maybe alias the values as constants globally.
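For context on where `0x02` comes from: Avro writes a union's branch index as a zigzag-encoded long, so branch 0 becomes byte `0x00` and branch 1 becomes `0x02`. A minimal standalone sketch of that encoding (not the crate's actual `write_long`):

```rust
// Zigzag-varint encode a signed long, as Avro's binary encoding does.
// ZigZag maps signed to unsigned: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
fn zigzag_encode(value: i64) -> Vec<u8> {
    let mut n = ((value << 1) ^ (value >> 63)) as u64;
    let mut out = Vec::new();
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit set
    }
    out
}

fn main() {
    assert_eq!(zigzag_encode(0), vec![0x00]); // branch 0 (e.g. null in NullFirst)
    assert_eq!(zigzag_encode(1), vec![0x02]); // branch 1 (the value branch)
    assert_eq!(zigzag_encode(-1), vec![0x01]);
}
```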
arrow-avro/src/writer/format.rs (outdated)

```rust
    None => "null",
};
write_long(writer, 2)?;
write_long(writer, 2)?; // two entries
```
This note seems unnecessary
…implify field handling. Includes encoder type dispatch improvements, nullability support, and schema-driven write plans.
Force-pushed d9ea385 to 4e3950f
Flushing a partial review. I got all the files except encoder.rs, which is partially reviewed.
```rust
let record_name = schema
    .metadata
    .get(AVRO_NAME_METADATA_KEY)
    .map_or("topLevelRecord", |s| s.as_str());
```
aside: Is this a well-known default name? Or just an arbitrary naming choice by this package? And does it actually matter in practice? (I guess if it mattered, the schema metadata would say so)?
> aside: Is this a well-known default name? Or just an arbitrary naming choice by this package?

While not an Avro-spec default, `topLevelRecord` is used as a de-facto default because several popular tools (notably Spark/Databricks) default to the same name when they synthesize an Avro schema from a struct/row.

> And does it actually matter in practice? (I guess if it mattered, the schema metadata would say so)?

Avro requires that a record have a name to be valid. Also, because the record name participates in canonical form parsing, changing a record's name will change its fingerprint.
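To illustrate the fingerprint point: Avro actually fingerprints the Parsing Canonical Form with a 64-bit Rabin CRC, but any hash over the canonical JSON shows that renaming the record changes the result. This sketch deliberately substitutes Rust's `DefaultHasher` for the real Rabin fingerprint, and the two canonical strings are simplified examples:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustration only: not Avro's Rabin CRC-64 fingerprint algorithm.
// The point is that the record name is part of the canonical form,
// so two schemas that differ only in name fingerprint differently.
fn fingerprint(canonical_form: &str) -> u64 {
    let mut h = DefaultHasher::new();
    canonical_form.hash(&mut h);
    h.finish()
}

fn main() {
    // Parsing Canonical Form orders keys as name, type, fields.
    let a = r#"{"name":"topLevelRecord","type":"record","fields":[{"name":"id","type":"long"}]}"#;
    let b = r#"{"name":"myRecord","type":"record","fields":[{"name":"id","type":"long"}]}"#;
    assert_ne!(fingerprint(a), fingerprint(b));
}
```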
So do we need to make this tunable somehow? Or the de facto default is good enough in practice?
I think it's good enough in practice, but we can always follow-up on this if needed.
arrow-avro/src/schema.rs (outdated)

```rust
fn arrow_field_to_avro(
    field: &ArrowField,
    name_gen: &mut NameGenerator,
) -> Result<Value, ArrowError> {
    arrow_field_to_avro_with_order(field, name_gen, Nullability::NullFirst)
}
```
Why not define this down below, similar to what you did here with data_type_to_avro[_with_order]?
arrow-avro/src/schema.rs (outdated)

```rust
let (items_inner, ie) = datatype_to_avro_with_order(
    child.data_type(),
    child.name(),
    child.metadata(),
    name_gen,
    order,
)?;
let items_with_meta = merge_extras(items_inner, ie);
let items_schema = if child.is_nullable() {
    wrap_nullable(items_with_meta, order)
} else {
    items_with_meta
};
```
This exact pattern shows up four times (List/LargeList, FixedSizedList, Map, and struct field). Can we pull out a helper that all four call sites can use? It might look something like this:

```rust
fn process_datatype_with_nullability(
    dt: &DataType,
    field_name: &str,
    metadata: &HashMap<String, String>,
    name_gen: &mut NameGenerator,
    null_order: Nullability,
    is_nullable: bool,
) -> Result<Value, ArrowError> {
    let (schema, extras) = datatype_to_avro_with_order(dt, field_name, metadata, name_gen, null_order)?;
    let schema = merge_extras(schema, extras);
    if is_nullable {
        Ok(wrap_nullable(schema, null_order))
    } else {
        Ok(schema)
    }
}
```

(terrible name for the helper, but you get the idea)
There are two additional sites (Dictionary and RunEndEncoded) that could also use the helper -- they'd just need to pass `is_nullable: false`.

If you're looking for ways to split up the PR, a useful "prerefactor" might:
- plumb through the new (largely unused) `order` arg in `datatype_to_avro_with_order` and friends
- add+use this new helper at the 4-6 call sites that benefit from it
Solid catch, I'll add that in!
arrow-avro/src/writer/encoder.rs (outdated)

```rust
    impala_mode: bool, // Will be fully implemented in a follow-up PR
}
/// Plan reference passed to the unified encoder constructor (required).
type PlanRef<'p> = &'p FieldPlan;
```
I'm not sure this is a helpful type alias? It only has two use sites, both as function arg types with elided lifetimes. With the alias they look like this, because lifetime elision can't fully simplify the signature:

```rust
plan: PlanRef<'_>,
```

vs. without the alias and complete lifetime elision:

```rust
plan: &FieldPlan,
```
Good catch!
arrow-avro/src/writer/encoder.rs (outdated)

```rust
    })
}

#[inline]
```
I see a lot of these #[inline] markers, but the compiler already inlines code pretty aggressively.
Do we have some evidence that these markers actually improve performance?
That's a good callout. Before I push up my changes, I'll benchmark with and without the #[inline] markers to determine whether they add value or not.
I wouldn't be shocked if one or two inline markers actually matter... but no idea which two. Can we just remove all the markers and track a follow-up issue to make profile-guided decisions on which ones to bring back?
Removing the inlines caused about a ~4% decrease in throughput performance.
I bet profile-guided optimizations would give back far more than 4%, since it covers far more than just inlining decisions. If we care about optimizing that much, we should probably go that route? But that requires benchmarking and profiling infra that arrow-rs probably lacks. A hybrid (and probably almost optimal) approach would be to manually gather `.profdata`, and then see what `llvm-profdata show` has to say about these functions.
Humans are provably terrible at guessing where to sprinkle hints like inlining or branch prediction, and so I would expect that some of the inline markers are badly-placed and actually hurting performance -- especially for real-world (non-microbenchmark) scenarios where CPU instruction cache capacity matters -- which eats into the performance gains of the well-placed markers. Automatic or manual profile-guided optimization at least gives some concrete data to replace those guesses.
Overall, my recommended approach would be to remove all the inline markers for now (focus on getting correct code in), and if we want to chase the 4% that could be a follow-up item to do it right. Preferably after the code has stabilized, since it's still evolving rapidly right now.
100% fair point. I went ahead and removed them.
arrow-avro/src/writer/encoder.rs (outdated)

```rust
    out
}

fn encode_all(array: &dyn Array, plan: &FieldPlan, site: Option<Nullability>) -> Vec<u8> {
```
`site` is a really unintuitive name for this... earlier code used `site_nullability` which seems better?
But why not just `nullability` for both?
I'll rename to nullability, that's a good recommendation.
Co-authored-by: Ryan Johnson <[email protected]>
Force-pushed c388069 to 1ecbfce
@scovich @nathaniel-d-ef I really appreciate the solid reviews. The changes I just pushed should address most of your feedback. There are two items I wanted to follow up on in future PRs:

Let me know what you all think and what else needs to be done to get this PR into solid shape whenever you get a chance.
Bunch of nits to simplify the code, but otherwise ready to merge.
> There are two items I wanted to follow-up on in future PRs:
> 1. Further `arrow-avro/src/schema.rs` file refactoring (if possible). As I updated the file I saw other potential areas to improve. For example, I think if we introduced some sort of struct to hold context for `datatype_to_avro`, we could simplify and improve the logic.

Definitely future work.

> 2. Renaming `Nullability`. The name change will impact other areas of the code not related to the `Writer`. I'd also want to consider the impact `Union` type support will have as well when we get there. But I do think @scovich has a valid point about the current naming convention.

Yes. Mixing refactors with new functionality is always messy.

I would recommend two other follow-ups:
3. Profile-guided placement of `#[inline]` markers
4. Harmonize arg ordering for the various encoding methods that take `out`; some pass it as the first arg, others as the second arg.
arrow-avro/src/writer/encoder.rs (outdated)

```rust
let null_buffer = array.nulls().cloned().ok_or_else(|| {
    ArrowError::InvalidArgumentError(format!(
        "Array for Avro site '{}' reports nulls but has no null buffer",
        field.name()
    ))
})?;
```
aside: I'm not sure `ok_or_else` plus `?` is actually helpful... unless the thirteen extra chars from `return Err(...);` makes the inner part spill to multiple lines (which it doesn't, in this case)
```rust
let null_buffer = array.nulls().cloned().ok_or_else(|| {
    ArrowError::InvalidArgumentError(format!(
        "Array for Avro site '{}' reports nulls but has no null buffer",
        field.name()
    ))
})?;
```

becomes

```rust
let Some(nulls) = array.nulls().cloned() else {
    return Err(ArrowError::InvalidArgumentError(format!(
        "Array for Avro site '{}' reports nulls but has no null buffer",
        field.name()
    )));
};
```
arrow-avro/src/writer/encoder.rs (outdated)

```rust
// Emit a single positive block for performance, then the end marker.
write_long(out, len as i64)?;
for j in start..end {
    write_item(j, out)?;
```
aside: It's a bit disorienting that some methods (like this one and encoder `encode`) take `out` as the second param, while others (like `write_long` below) take `out` as the first param? Can we harmonize them? I would suggest `out` is always the first param, but consistency is the main concern.
That's very fair, I'll clean that up.
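For reference, the array framing discussed above — one positive block holding the item count, then the items, then a zero end marker — can be sketched standalone like this (not the crate's actual `write_long`/`write_item`; `out` is passed first, per the suggestion above):

```rust
// Zigzag-varint encode a long, as Avro does for all counts and long values.
fn write_long(out: &mut Vec<u8>, value: i64) {
    let mut n = ((value << 1) ^ (value >> 63)) as u64;
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit
    }
}

/// Encode an Avro array of longs as a single positive block:
/// item count, the items themselves, then a zero end-of-blocks marker.
fn encode_long_array(out: &mut Vec<u8>, items: &[i64]) {
    write_long(out, items.len() as i64);
    for &item in items {
        write_long(out, item);
    }
    write_long(out, 0); // end marker
}

fn main() {
    let mut out = Vec::new();
    encode_long_array(&mut out, &[1, 2]);
    // count 2 -> 0x04, item 1 -> 0x02, item 2 -> 0x04, end marker -> 0x00
    assert_eq!(out, vec![0x04, 0x02, 0x04, 0x00]);
}
```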
Co-authored-by: Ryan Johnson <[email protected]>
Force-pushed 7c8545b to 6f01bbc
@scovich All great call outs. I accepted your latest suggestions and pushed up changes that harmonized the arg ordering for the methods taking `out`.

@alamb @mbrobbel Would either of you be able to look over this PR if you get a chance?
Co-authored-by: Ryan Johnson <[email protected]>
Force-pushed 501666e to 90fc8b4
Force-pushed 6733228 to 8dd0947
Co-authored-by: Matthijs Brobbel <[email protected]>
Force-pushed 5585625 to df4d795
I'll plan to merge this PR once the CI passes
Force-pushed df4d795 to 67f70d7
Integration test failure is unrelated

Thanks @jecsand838
# Which issue does this PR close?
- Part of #4886
- Related to #8274

# Rationale for this change
This PR extends on work introduced in #8274, adding additional complex type support to the Avro writer. This brings us closer to a complete round-trip capability and Avro spec support in the arrow-avro crate.

# What changes are included in this PR?
New encoders:
- Fixed
- UUID
- IntervalMonthDayNano
- IntervalYearMonth
- IntervalDayTime
- Decimal32
- Decimal64
- Decimal128
- Decimal256

Corresponding changes in support of these encoders in `FieldEncoder` and `FieldPlan`

# Are these changes tested?
Yes, additional complex type unit tests have been added. Benchmark tests have also been written but are being omitted here to keep the diff manageable. All tests, new and existing, pass.

# Are there any user-facing changes?
n/a, arrow-avro crate is not yet public
# Which issue does this PR close?
- Part of #4886
- Related to #8274 and #8298

# Rationale for this change
This PR adds Map and Enum encoders to the arrow-avro crate writer, along with new benchmark tests for remaining types and round-trip tests.

# What changes are included in this PR?
New encoders:
- **Map**
- **Enum**

Corresponding changes in support of these encoders in `FieldEncoder` and `FieldPlan`

## Additional round trip tests in `mod.rs`
New tests follow existing file read pattern
- simple_fixed
- duration_uuid
- nonnullable.impala.avro
- decimals
- enum

## Additional benchmark tests for data types
- Utf8
- List<Utf8>
- Struct
- FixedSizeBinary16
- UUID
- IntervalMonthDayNano
- Duration
- Decimal32(bytes)
- Decimal64(bytes)
- Decimal128(bytes)
- Decimal128(fixed16)
- Decimal256(bytes)
- Map
- Enum

# Are these changes tested?
Yes, additional complex type unit tests have been added for Map and Enum. The rest of the PR beyond the new types are tests themselves. All tests, new and existing, pass.

# Are there any user-facing changes?
n/a, arrow-avro crate is not yet public

---

Co-authored-by: Connor Sanders <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?

# Rationale for this change
This refactor streamlines the `arrow-avro` writer by introducing a single, schema-driven `RecordEncoder` that plans writes up front and encodes rows using consistent, explicit rules for nullability and type dispatch. It reduces duplication in nested/struct/list handling, makes the order of Avro union branches (null-first vs null-second) an explicit choice, and aligns header schema generation with value encoding.

This should improve correctness (especially for nested optionals), make behavior easier to reason about, and pave the way for future optimizations.

# What changes are included in this PR?
High-level:
- `RecordEncoder` with a builder that walks the Avro record in Avro order and maps each field to its Arrow column, producing a reusable write plan. The encoder covers scalars and nested types (struct, (large) lists, maps, strings/binaries).

API and implementation details:

Writer / encoder refactor
- `FieldPlan` tree (variants for `Scalar`, `Struct { … }`, and `List { … }`) and per-site `nullability` carried from the Avro schema.
- Encoders covering `LargeBinary`, `Utf8`, `Utf8Large`, `List`, `LargeList`, and `Struct`.
- `write_optional_index` (writes `0x00`/`0x02` according to Null-First/Null-Second), replacing the old branch write.

Schema generation & metadata
- `Nullability` enum moved to `schema.rs` and threaded through schema generation and writer logic.
- `AvroSchema::from_arrow_with_options(schema, Option<Nullability>)` to either reuse embedded Avro JSON or build new Avro JSON that honors the requested null-union order at all nullable sites.
- `extend_with_passthrough_metadata` so Arrow schema metadata is copied into Avro JSON while skipping Avro-reserved and internal Arrow keys.
- `wrap_nullable` and `arrow_field_to_avro_with_order` to apply ordering consistently for arrays, fixed-size lists, maps, structs, and unions.

Format and glue
- `writer/format.rs` simplified by removing the `EncoderOptions` plumbing from the OCF format; `write_long` remains exported for header writing.

# Are these changes tested?
Yes.
- Tests in `writer/encoder.rs` that verify scalar and string/binary encodings (e.g., Binary/LargeBinary, Utf8/LargeUtf8) and validate length/branch encoding primitives used by the writer.
- Tests in `writer/mod.rs`.

# Are there any user-facing changes?
N/A because arrow-avro is not public yet.
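A hedged sketch of what the `write_optional_index` behavior described above amounts to — simplified, and the real helper's signature may differ. Avro encodes a union's branch index as a zigzag long, so branch 0 is byte `0x00` and branch 1 is `0x02`:

```rust
#[derive(Debug, Copy, Clone, PartialEq)]
enum Nullability {
    NullFirst,
    NullSecond,
}

/// Write the union branch index for an optional value.
/// NullFirst: null is branch 0 (0x00), the value is branch 1 (0x02).
/// NullSecond: the value is branch 0 (0x00), null is branch 1 (0x02).
fn write_optional_index(out: &mut Vec<u8>, nullability: Nullability, is_null: bool) {
    let byte = match (nullability, is_null) {
        (Nullability::NullFirst, true) | (Nullability::NullSecond, false) => 0x00,
        _ => 0x02,
    };
    out.push(byte);
}

fn main() {
    let mut out = Vec::new();
    write_optional_index(&mut out, Nullability::NullFirst, false); // value branch
    write_optional_index(&mut out, Nullability::NullFirst, true); // null branch
    write_optional_index(&mut out, Nullability::NullSecond, true); // null branch
    assert_eq!(out, vec![0x02, 0x00, 0x02]);
}
```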