
Conversation

@jecsand838
Contributor

Which issue does this PR close?

Rationale for this change

This refactor streamlines the arrow-avro writer by introducing a single, schema‑driven RecordEncoder that plans writes up front and encodes rows using consistent, explicit rules for nullability and type dispatch. It reduces duplication in nested/struct/list handling, makes the order of Avro union branches (null‑first vs null‑second) an explicit choice, and aligns header schema generation with value encoding.

This should improve correctness (especially for nested optionals), make behavior easier to reason about, and pave the way for future optimizations.

What changes are included in this PR?

High‑level:

  • Introduces a unified, schema‑driven RecordEncoder with a builder that walks the Avro record in Avro order and maps each field to its Arrow column, producing a reusable write plan. The encoder covers scalars and nested types (struct, (large) lists, maps, strings/binaries).
  • Applies a single model of nullability throughout encoding, including nested sites (list items, fixed‑size list items, map values), and uses explicit union‑branch indices according to the chosen order.

API and implementation details:

  • Writer / encoder refactor

    • Replaces the previous per‑column/child encoding paths with a FieldPlan tree (variants for Scalar, Struct { … }, and List { … }) and per‑site nullability carried from the Avro schema (see the sketch after this list).
    • Adds encoder variants for LargeBinary, Utf8, Utf8Large, List, LargeList, and Struct.
    • Encodes union branch indices with write_optional_index (writes 0x00/0x02 according to Null‑First/Null‑Second), replacing the old branch write.
  • Schema generation & metadata

    • Moves the Nullability enum to schema.rs and threads it through schema generation and writer logic.
    • Adds AvroSchema::from_arrow_with_options(schema, Option<Nullability>) to either reuse embedded Avro JSON or build new Avro JSON that honors the requested null‑union order at all nullable sites.
    • Adds extend_with_passthrough_metadata so Arrow schema metadata is copied into Avro JSON while skipping Avro‑reserved and internal Arrow keys.
    • Introduces helpers like wrap_nullable and arrow_field_to_avro_with_order to apply ordering consistently for arrays, fixed‑size lists, maps, structs, and unions.
  • Format and glue

    • Simplifies writer/format.rs by removing the EncoderOptions plumbing from the OCF format; write_long remains exported for header writing.
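For illustration, here is a minimal sketch of how these pieces fit together. The exact variant fields of FieldPlan and the signatures of write_long / write_optional_index are assumptions made for this sketch (std::io::Result stands in for the crate's error type); only the zig-zag varint rule and the null-first/null-second branch indices follow the Avro spec and the summary above.

```rust
use std::io::{Result, Write};

/// Union-branch order for two-variant unions with "null" (moved to schema.rs in this PR).
#[derive(Debug, Copy, Clone, PartialEq)]
pub enum Nullability {
    /// Nulls are the first union variant: ["null", T]
    NullFirst,
    /// Nulls are the second union variant: [T, "null"]
    NullSecond,
}

/// Assumed shape of the reusable write plan; the real variant fields may differ.
pub enum FieldPlan {
    Scalar,
    Struct { children: Vec<FieldPlan> },
    List { item_nullability: Option<Nullability>, item: Box<FieldPlan> },
}

/// Avro longs are zig-zag varints, so branch index 0 encodes as 0x00 and index 1 as 0x02.
pub fn write_long<W: Write>(out: &mut W, value: i64) -> Result<()> {
    let mut n = ((value << 1) ^ (value >> 63)) as u64; // zig-zag
    loop {
        let byte = (n & 0x7F) as u8;
        n >>= 7;
        if n == 0 {
            return out.write_all(&[byte]);
        }
        out.write_all(&[byte | 0x80])?;
    }
}

/// Write the union branch index for a nullable site: the non-null value is
/// branch 1 under NullFirst ordering and branch 0 under NullSecond.
pub fn write_optional_index<W: Write>(out: &mut W, is_null: bool, order: Nullability) -> Result<()> {
    let branch = match (order, is_null) {
        (Nullability::NullFirst, true) | (Nullability::NullSecond, false) => 0,
        _ => 1,
    };
    write_long(out, branch)
}
```

The builder described above would walk the Avro record once to produce a FieldPlan per field, so the per-row encode loop only dispatches on the precomputed plan.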

Are these changes tested?

Yes.

  • Adds focused unit tests in writer/encoder.rs that verify scalar and string/binary encodings (e.g., Binary/LargeBinary, Utf8/LargeUtf8) and validate length/branch encoding primitives used by the writer.
  • Round trip integration tests in writer/mod.rs that validate List and Struct encoding.
  • Adjusts existing schema tests (e.g., decimal metadata expectations) to align with the new schema/metadata handling.
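For a rough idea of the inputs such round-trip tests exercise, the sketch below builds a List<Int32> plus Struct batch with standard arrow-array APIs. The column names are made up, the actual tests in writer/mod.rs may construct their batches differently, and the writer/reader round-trip call itself is omitted.

```rust
use std::sync::Arc;

use arrow_array::types::Int32Type;
use arrow_array::{Array, ArrayRef, Int32Array, ListArray, RecordBatch, StringArray, StructArray};
use arrow_schema::{DataType, Field, Schema};

fn build_batch() -> RecordBatch {
    // List<Int32> column with a null item, a null row, and an empty list.
    let list: ArrayRef = Arc::new(ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
        Some(vec![Some(1), None, Some(3)]),
        None,
        Some(vec![]),
    ]));
    // Struct<{a: Int32, b: Utf8}> column with nullable children.
    let a: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));
    let b: ArrayRef = Arc::new(StringArray::from(vec![Some("x"), Some("y"), None]));
    let record: ArrayRef = Arc::new(StructArray::from(vec![
        (Arc::new(Field::new("a", DataType::Int32, true)), a),
        (Arc::new(Field::new("b", DataType::Utf8, true)), b),
    ]));
    let schema = Arc::new(Schema::new(vec![
        Field::new("items", list.data_type().clone(), true),
        Field::new("record", record.data_type().clone(), true),
    ]));
    RecordBatch::try_new(schema, vec![list, record]).unwrap()
}
```

A round-trip test would write such a batch with the Avro writer, read it back with the existing reader, and assert equality.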

Are there any user-facing changes?

N/A because arrow-avro is not public yet.

@github-actions github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels Sep 3, 2025
@jecsand838 jecsand838 changed the title from "Refactor arrow-avro writer to introduce unified RecordEncoder and s…" to "Added List and Struct Decoding to arrow-avro Writer" Sep 3, 2025
Comment on lines -31 to -42
/// Avro types are not nullable, with nullability instead encoded as a union
/// where one of the variants is the null type.
///
/// To accommodate this we special case two-variant unions where one of the
/// variants is the null type, and use this to derive arrow's notion of nullability
#[derive(Debug, Copy, Clone, PartialEq)]
pub enum Nullability {
/// The nulls are encoded as the first union variant
NullFirst,
/// The nulls are encoded as the second union variant
NullSecond,
}
Contributor Author

I moved this to src/schema.rs to keep the imports clean and non-circular.

@jecsand838
Contributor Author

@scovich @alamb Would either of you be able to review this one? This should be the last large PR for arrow-avro as it lays the full foundation for the Writer. I tried to slim it down as much as possible.

@jecsand838 jecsand838 changed the title from "Added List and Struct Decoding to arrow-avro Writer" to "Added List and Struct Encoding to arrow-avro Writer" Sep 3, 2025
name_gen,
order,
)?;
let items_with_meta = merge_extras(items_inner, ie);
Contributor

Perhaps worth it to put this nullable logic in datatype_to_avro_with_order as it's duplicated here and in Map?

Contributor Author

Solid catch, @scovich caught this one too.

#[inline]
fn precomputed_union_value_branch(&self, order: Option<Nullability>) -> Option<u8> {
match (order, self.has_nulls()) {
(Some(Nullability::NullFirst), false) => Some(0x02), // value branch index 1
Contributor

This is explained elsewhere, I think we can eliminate the comments. Maybe alias the values as constants globally.
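A sketch of what that aliasing might look like; the constant names are hypothetical, and the match arms other than the one quoted above are assumptions that follow the branch-index rule from the PR summary (index 0 zig-zag-encodes to 0x00, index 1 to 0x02):

```rust
// `Nullability` refers to the enum quoted earlier in this thread (schema.rs).
/// Single-byte zig-zag encodings of union branch indices 0 and 1.
const UNION_BRANCH_0: u8 = 0x00;
const UNION_BRANCH_1: u8 = 0x02;

/// When a nullable site contains no nulls, every row takes the value branch,
/// so its encoded index byte can be precomputed once for the whole array.
fn precomputed_union_value_branch(order: Option<Nullability>, has_nulls: bool) -> Option<u8> {
    match (order, has_nulls) {
        (Some(Nullability::NullFirst), false) => Some(UNION_BRANCH_1), // value is branch 1
        (Some(Nullability::NullSecond), false) => Some(UNION_BRANCH_0), // value is branch 0
        _ => None, // nulls present (or no union at this site): decide per row instead
    }
}
```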

None => "null",
};
write_long(writer, 2)?;
write_long(writer, 2)?; // two entries
Contributor

This note seems unnecessary

…implify field handling. Includes encoder type dispatch improvements, nullability support, and schema-driven write plans.
@jecsand838 jecsand838 force-pushed the avro-writer-struct-list-types branch from d9ea385 to 4e3950f on September 3, 2025 17:44
Contributor

@scovich left a comment

Flushing a partial review. I got all the files except encoder.rs, which is partially reviewed.

let record_name = schema
.metadata
.get(AVRO_NAME_METADATA_KEY)
.map_or("topLevelRecord", |s| s.as_str());
Contributor

aside: Is this a well-known default name? Or just an arbitrary naming choice by this package? And does it actually matter in practice? (I guess if it mattered, the schema metadata would say so)?

Contributor Author

aside: Is this a well-known default name? Or just an arbitrary naming choice by this package?

While not an Avro‑spec default, topLevelRecord is used as a de‑facto default because several popular tools (notably Spark/Databricks) default to the same name when they synthesize an Avro schema from a struct/row.

And does it actually matter in practice? (I guess if it mattered, the schema metadata would say so)?

Avro requires that a record have a name to be valid. Also, because the record name is retained in the Parsing Canonical Form, changing a record's name will change its fingerprint.
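To make the fingerprint point concrete: two schemas that differ only in their record name produce different fingerprints. A self-contained sketch using the CRC-64-AVRO (Rabin) fingerprint defined in the Avro spec (the example schemas are made up):

```rust
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// CRC-64-AVRO ("Rabin") fingerprint of a Parsing Canonical Form string, per the Avro spec.
fn rabin_fingerprint(canonical_form: &str) -> u64 {
    // Build the 256-entry table derived from the Avro polynomial.
    let mut table = [0u64; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *entry = fp;
    }
    let mut fp = EMPTY;
    for &b in canonical_form.as_bytes() {
        fp = (fp >> 8) ^ table[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}

fn main() {
    // Identical shapes, different record names => different fingerprints.
    let a = r#"{"name":"topLevelRecord","type":"record","fields":[{"name":"x","type":"long"}]}"#;
    let b = r#"{"name":"otherRecord","type":"record","fields":[{"name":"x","type":"long"}]}"#;
    assert_ne!(rabin_fingerprint(a), rabin_fingerprint(b));
}
```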

Contributor

So do we need to make this tunable somehow? Or is the de facto default good enough in practice?

Contributor Author

I think it's good enough in practice, but we can always follow-up on this if needed.

Comment on lines 875 to 880
fn arrow_field_to_avro(
field: &ArrowField,
name_gen: &mut NameGenerator,
) -> Result<Value, ArrowError> {
arrow_field_to_avro_with_order(field, name_gen, Nullability::NullFirst)
}
Contributor

Why not define this down below, similar to what you did here with datatype_to_avro[_with_order]?

Comment on lines 1001 to 1013
let (items_inner, ie) = datatype_to_avro_with_order(
child.data_type(),
child.name(),
child.metadata(),
name_gen,
order,
)?;
let items_with_meta = merge_extras(items_inner, ie);
let items_schema = if child.is_nullable() {
wrap_nullable(items_with_meta, order)
} else {
items_with_meta
};
Contributor

This exact pattern shows up four times (List/LargeList, FixedSizedList, Map, and struct field). Can we pull out a helper that all four call sites can use? It might look something like this:

fn process_datatype_with_nullability(
    dt: &DataType,
    field_name: &str,
    metadata: &HashMap<String, String>,
    name_gen: &mut NameGenerator,
    null_order: Nullability,
    is_nullable: bool,
) -> Result<Value, ArrowError> {
    let (schema, extras) =
        datatype_to_avro_with_order(dt, field_name, metadata, name_gen, null_order)?;
    let schema = merge_extras(schema, extras);
    Ok(if is_nullable {
        wrap_nullable(schema, null_order)
    } else {
        schema
    })
}

(terrible name for the helper, but you get the idea)

There are two additional sites (Dictionary and RunEndEncoded) that could also use the helper -- they'd just need to pass is_nullable: false.

If you're looking for ways to split up the PR, a useful "prerefactor" might:

  • plumb through the new (largely unused) order arg in datatype_to_avro_with_order and friends
  • add+use this new helper at the 4-6 call sites that benefit from it
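For illustration, with a helper along those lines the List item site quoted above would collapse to a single call (a sketch, reusing the names from the snippet):

```rust
let items_schema = process_datatype_with_nullability(
    child.data_type(),
    child.name(),
    child.metadata(),
    name_gen,
    order,
    child.is_nullable(),
)?;
```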

Contributor Author

Solid catch, I'll add that in!

impala_mode: bool, // Will be fully implemented in a follow-up PR
}
/// Plan reference passed to the unified encoder constructor (required).
type PlanRef<'p> = &'p FieldPlan;
Contributor

I'm not sure this is a helpful type alias? It only has two use sites, both as function arg types with elided lifetimes. With the alias they look like this, because lifetime elision can't fully simplify the signature:

        plan: PlanRef<'_>,

vs. without the alias and complete lifetime elision:

        plan: &FieldPlan,

Contributor Author

Good catch!

})
}

#[inline]
Contributor

I see a lot of these #[inline] markers, but the compiler already inlines code pretty aggressively.
Do we have some evidence that these markers actually improve performance?

Contributor Author

That's a good callout. Before I push up my changes, I'll benchmark with and without the #[inline] markers to determine whether they add value or not.

Contributor

I wouldn't be shocked if one or two inline markers actually matter... but no idea which two. Can we just remove all the markers and track a follow-up issue to make profile-guided decisions on which ones to bring back?

Contributor Author

Removing the inlines caused roughly a 4% decrease in throughput.

Contributor

@scovich Sep 5, 2025

I bet profile-guided optimization would give back far more than 4%, since it covers far more than just inlining decisions. If we care about optimizing that much, we should probably go that route? But that requires benchmarking and profiling infra that arrow-rs probably lacks. A hybrid (and probably almost optimal) approach would be to manually gather .profdata, and then see what llvm-profdata show has to say about these functions.

Humans are provably terrible at guessing where to sprinkle hints like inlining or branch prediction, and so I would expect that some of the inline markers are badly-placed and actually hurting performance -- especially for real-world (non-microbenchmark) scenarios where CPU instruction cache capacity matters -- which eats into the performance gains of the well-placed markers. Automatic or manual profile-guided optimization at least gives some concrete data to replace those guesses.

Overall, my recommended approach would be to remove all the inline markers for now (focus on getting correct code in), and if we want to chase the 4% that could be a follow-up item to do it right. Preferably after the code has stabilized, since it's still evolving rapidly right now.

Contributor Author

100% fair point. I went ahead and removed them.

out
}

fn encode_all(array: &dyn Array, plan: &FieldPlan, site: Option<Nullability>) -> Vec<u8> {
Contributor

site is a really unintuitive name for this... earlier code used site_nullability which seems better?
But why not just nullability for both?

Contributor Author

I'll rename to nullability, that's a good recommendation.

@jecsand838 jecsand838 force-pushed the avro-writer-struct-list-types branch from c388069 to 1ecbfce on September 4, 2025 17:39
@jecsand838
Contributor Author

jecsand838 commented Sep 4, 2025

@scovich @nathaniel-d-ef I really appreciate the solid reviews.

The changes I just pushed should address most of your feedback. There are two items I wanted to follow up on in future PRs:

  1. Further arrow-avro/src/schema.rs file refactoring (if possible). As I updated the file I saw other potential areas to improve. For example, I think if we introduced some sort of struct to hold context for datatype_to_avro, we could simplify and improve the logic.
  2. Renaming Nullability. The name change will impact other areas of the code not related to the Writer. I'd also want to consider the impact Union type support will have when we get there. But I do think @scovich has a valid point about the current naming convention.

Let me know what you all think and what else needs to be done to get this PR into solid shape whenever you get a chance.

Contributor

@scovich left a comment

Bunch of nits to simplify the code, but otherwise ready to merge.

There are two items I wanted to follow up on in future PRs:

1. Further `arrow-avro/src/schema.rs` file refactoring (if possible). As I updated the file I saw other potential areas to improve. For example, I think if we introduced some sort of struct to hold context for `datatype_to_avro`, we could simplify and improve the logic.

Definitely future work.

2. Renaming `Nullability`. The name change will impact other areas of the code not related to the `Writer`. I'd also want to consider the impact `Union` type support will have as well when we get there. But I do think @scovich has a valid point about the current naming convention.

Yes. Mixing refactors with new functionality is always messy.

I would recommend two other follow-ups:
3. Profile-guided placement of #[inline] markers
4. Harmonize arg ordering for the various encoding methods that take out; some pass it as the first arg, others as the second arg.

Comment on lines 213 to 218
let null_buffer = array.nulls().cloned().ok_or_else(|| {
ArrowError::InvalidArgumentError(format!(
"Array for Avro site '{}' reports nulls but has no null buffer",
field.name()
))
})?;
Contributor

aside: I'm not sure ok_or_else plus ? is actually helpful... unless the thirteen extra chars from return Err(...); makes the inner part spill to multiple lines (which it doesn't, in this case)

Suggested change
let null_buffer = array.nulls().cloned().ok_or_else(|| {
ArrowError::InvalidArgumentError(format!(
"Array for Avro site '{}' reports nulls but has no null buffer",
field.name()
))
})?;
let Some(nulls) = array.nulls().cloned() else {
return Err(ArrowError::InvalidArgumentError(format!(
"Array for Avro site '{}' reports nulls but has no null buffer",
field.name()
)));
};

// Emit a single positive block for performance, then the end marker.
write_long(out, len as i64)?;
for j in start..end {
write_item(j, out)?;
Contributor

aside: It's a bit disorienting that some methods (like this one and the encoder's encode) take out as the second param, while others (like write_long below) take out as the first param? Can we harmonize them? I would suggest out always be the first param, but consistency is the main concern.

Contributor Author

That's very fair, I'll clean that up.

@jecsand838 jecsand838 force-pushed the avro-writer-struct-list-types branch from 7c8545b to 6f01bbc on September 5, 2025 18:28
@jecsand838
Contributor Author

Bunch of nits to simplify the code, but otherwise ready to merge.

There are two items I wanted to follow up on in future PRs:

1. Further `arrow-avro/src/schema.rs` file refactoring (if possible). As I updated the file I saw other potential areas to improve. For example, I think if we introduced some sort of struct to hold context for `datatype_to_avro`, we could simplify and improve the logic.

Definitely future work.

2. Renaming `Nullability`. The name change will impact other areas of the code not related to the `Writer`. I'd also want to consider the impact `Union` type support will have as well when we get there. But I do think @scovich has a valid point about the current naming convention.

Yes. Mixing refactors with new functionality is always messy.

I would recommend two other follow-ups:
3. Profile-guided placement of #[inline] markers
4. Harmonize arg ordering for the various encoding methods that take out; some pass it as the first arg, others as the second arg.

@scovich All great call-outs. I accepted your latest suggestions and pushed up changes that harmonize the arg ordering for the methods that take `out` as a parameter. As always, thank you for the solid and thorough review 😄. We'll also be sure to follow up with those PRs.

@alamb @mbrobbel Would either of you be able to look over this PR if you get a chance?

@jecsand838 jecsand838 force-pushed the avro-writer-struct-list-types branch from 501666e to 90fc8b4 on September 6, 2025 04:56
@jecsand838 jecsand838 force-pushed the avro-writer-struct-list-types branch from 6733228 to 8dd0947 on September 7, 2025 19:44
jecsand838 and others added 3 commits September 8, 2025 10:41
@jecsand838 jecsand838 force-pushed the avro-writer-struct-list-types branch from 5585625 to df4d795 on September 8, 2025 15:43
@alamb
Contributor

alamb commented Sep 8, 2025

I'll plan to merge this PR once the CI passes

@jecsand838 jecsand838 force-pushed the avro-writer-struct-list-types branch from df4d795 to 67f70d7 on September 8, 2025 16:09
@alamb
Contributor

alamb commented Sep 8, 2025

Integration test failure is unrelated

@alamb alamb merged commit 911940f into apache:main Sep 8, 2025
22 of 23 checks passed
@alamb
Contributor

alamb commented Sep 8, 2025

Thanks @jecsand838

alamb pushed a commit that referenced this pull request Sep 13, 2025
# Which issue does this PR close?

- Part of #4886
- Related to #8274

# Rationale for this change

This PR extends on work introduced in #8274, adding additional complex
type support to the Avro writer. This brings us closer to a complete
round-trip capability and Avro spec support in the arrow-avro crate.

# What changes are included in this PR?

New encoders:

- Fixed
- UUID
- IntervalMonthDayNano
- IntervalYearMonth
- IntervalDayTime
- Decimal32
- Decimal64
- Decimal128
- Decimal256

Corresponding changes in support of these encoders in FieldEncoder and
FieldPlan

# Are these changes tested?

Yes, additional complex type unit tests have been added. Benchmark tests
have also been written but are being omitted here to keep the diff
manageable. All tests, new and existing, pass.

# Are there any user-facing changes?

n/a, arrow-avro crate is not yet public
alamb added a commit that referenced this pull request Sep 17, 2025
# Which issue does this PR close?

- Part of #4886
- Related to #8274 and
#8298

# Rationale for this change

This PR adds Map and Enum encoders to the arrow-avro crate writer, along
with new benchmark tests for the remaining types and round-trip tests.

# What changes are included in this PR?

New encoders:

- **Map**
- **Enum**

Corresponding changes in support of these encoders in FieldEncoder and
FieldPlan

## Additional round trip tests in `mod.rs`

New tests follow the existing file-read pattern:

- simple_fixed
- duration_uuid
- nonnullable.impala.avro
- decimals
- enum

## Additional benchmark tests for data types

- Utf8
- List<Utf8>
- Struct
- FixedSizeBinary16
- UUID
- IntervalMonthDayNanoDuration
- Decimal32(bytes)
- Decimal64(bytes)
- Decimal128(bytes)
- Decimal128(fixed16)
- Decimal256(bytes)
- Map
- Enum

# Are these changes tested?

Yes, additional complex type unit tests have been added for Map and
Enum. The rest of the PR beyond the new types consists of tests. All
tests, new and existing, pass.

# Are there any user-facing changes?

n/a, arrow-avro crate is not yet public

---------

Co-authored-by: Connor Sanders <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
@jecsand838 jecsand838 deleted the avro-writer-struct-list-types branch September 22, 2025 04:31