@jecsand838
Contributor

Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

Rationale for this change

Avro’s specification requires readers to materialize default values when a field exists in the reader schema but not in the writer schema, and to validate those defaults (e.g., union defaults must match the first branch; bytes/fixed defaults must be JSON strings; enums may specify a default symbol for unknown writer symbols). Implementing this behavior makes arrow-avro more standards‑compliant and improves interoperability with evolving schemas.
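
To make the validation rules above concrete, here is a minimal, std-only sketch. It is not the arrow-avro implementation: `Json`, `Schema`, `validate_default`, and `byte_len` are hypothetical names invented for illustration, and only a handful of Avro types are modeled. It assumes bytes/fixed defaults are JSON strings whose code points 0-255 each stand for one byte, and it checks a union default against the first branch only, as described in this PR.

```rust
/// Hypothetical stand-in for a parsed JSON default value (not an arrow-avro type).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
}

/// Hypothetical, heavily reduced subset of Avro schema kinds.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Null,
    Boolean,
    Long,
    String,
    Bytes,
    Fixed { size: usize },
    Enum { symbols: Vec<String> },
    Union(Vec<Schema>),
}

/// Counts the bytes encoded by a bytes/fixed default string,
/// rejecting code points that do not fit in a single byte.
fn byte_len(s: &str) -> Result<usize, String> {
    let mut len = 0;
    for c in s.chars() {
        if (c as u32) > 255 {
            return Err(format!("code point {:?} does not fit in one byte", c));
        }
        len += 1;
    }
    Ok(len)
}

/// Checks that `default` is an acceptable default for `schema`.
fn validate_default(schema: &Schema, default: &Json) -> Result<(), String> {
    match (schema, default) {
        (Schema::Null, Json::Null) => Ok(()),
        (Schema::Boolean, Json::Bool(_)) => Ok(()),
        (Schema::Long, Json::Number(n)) if n.fract() == 0.0 => Ok(()),
        (Schema::String, Json::String(_)) => Ok(()),
        // bytes: any JSON string made of single-byte code points.
        (Schema::Bytes, Json::String(s)) => byte_len(s).map(|_| ()),
        // fixed: same encoding, but the length must equal the declared size.
        (Schema::Fixed { size }, Json::String(s)) => {
            let len = byte_len(s)?;
            if len == *size {
                Ok(())
            } else {
                Err(format!("fixed default has {} bytes, expected {}", len, size))
            }
        }
        // enum: the default must be one of the declared symbols.
        (Schema::Enum { symbols }, Json::String(s)) => {
            if symbols.iter().any(|sym| sym == s) {
                Ok(())
            } else {
                Err(format!("unknown enum symbol {:?}", s))
            }
        }
        // union: only the first branch is consulted.
        (Schema::Union(branches), value) => match branches.first() {
            Some(first) => validate_default(first, value),
            None => Err("empty union has no valid default".to_string()),
        },
        (s, v) => Err(format!("default {:?} does not match schema {:?}", v, s)),
    }
}
```

With this sketch, a `["null","string"]` union accepts `null` as its default but rejects `"x"`, and a `fixed` of size 2 rejects a three-character default, which mirrors the error cases the PR's tests exercise.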

What changes are included in this PR?

High‑level summary

  • Refactor RecordDecoder around a simpler Projector‑style abstraction that consumes ResolvedRecord to: (a) skip writer‑only fields, and (b) materialize reader‑only defaulted fields, reducing branching in the hot path. (See commit subject and record decoder changes.)
    Touched files (2):

  • arrow-avro/src/reader/record.rs - refactor decoder to use precomputed mappings and defaults.

  • arrow-avro/src/reader/mod.rs - add comprehensive tests for defaults and error cases (see below).
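
The idea of precomputing what to do with each field can be sketched roughly as follows. This is an illustration only, not the code in `record.rs`: `FieldAction`, `Plan`, and `build_plan` are hypothetical names, fields are matched purely by name, and each reader field carries just a flag saying whether it declares a default.

```rust
use std::collections::HashMap;

/// What to do with one writer field while decoding a row (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum FieldAction {
    /// Decode the writer field into the reader column at this index.
    Decode { reader_index: usize },
    /// Writer-only field: consume its bytes and discard them.
    Skip,
}

/// Precomputed per-record plan, built once and reused for every row.
#[derive(Debug, PartialEq)]
struct Plan {
    /// One action per writer field, in writer order.
    writer_actions: Vec<FieldAction>,
    /// Reader column indices absent from the writer; filled from defaults.
    defaulted: Vec<usize>,
}

/// Builds a plan from writer field names and `(name, has_default)` reader fields.
fn build_plan(writer: &[&str], reader: &[(&str, bool)]) -> Result<Plan, String> {
    let reader_index: HashMap<&str, usize> = reader
        .iter()
        .enumerate()
        .map(|(i, (name, _))| (*name, i))
        .collect();

    // Track which reader fields are supplied by the writer.
    let mut seen = vec![false; reader.len()];
    let writer_actions: Vec<FieldAction> = writer
        .iter()
        .map(|name| match reader_index.get(name) {
            Some(&i) => {
                seen[i] = true;
                FieldAction::Decode { reader_index: i }
            }
            None => FieldAction::Skip,
        })
        .collect();

    // Any reader field not seen must come with a default.
    let mut defaulted = Vec::new();
    for (i, (name, has_default)) in reader.iter().enumerate() {
        if seen[i] {
            continue;
        }
        if *has_default {
            defaulted.push(i);
        } else {
            return Err(format!(
                "reader field `{}` is missing from the writer schema and has no default",
                name
            ));
        }
    }
    Ok(Plan { writer_actions, defaulted })
}
```

Because the plan is resolved up front, the per-row loop only has to walk `writer_actions` and then append defaults for `defaulted`, with no name lookups in the hot path.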

Are these changes tested?

Yes, new integration tests cover both the happy path and validation errors:

  • test_schema_resolution_defaults_all_supported_types: verifies that defaults for boolean/int/long/float/double/bytes/string/date/time/timestamp/decimal/fixed/enum/duration/uuid/array/map/nested record and unions are materialized correctly for all rows.
  • test_schema_resolution_default_enum_invalid_symbol_errors: invalid enum default symbol is rejected.
  • test_schema_resolution_default_fixed_size_mismatch_errors: mismatched fixed/bytes default lengths are rejected.

These tests assert the Avro‑spec behavior (e.g., union defaults must match the first branch; bytes/fixed defaults use JSON strings).

Are there any user-facing changes?

N/A

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Sep 8, 2025
@jecsand838 jecsand838 changed the title Add default values support and refactor RecordDecoder to simplify s… Add default values support to RecordDecoder Sep 8, 2025
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-default-values branch from da5c105 to b4f2a0c Compare September 8, 2025 03:03
…chema resolution with `Projector` abstraction. Enables efficient skipping of writer-only fields and improves handling of default field values, enums, and resolved mappings.
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-default-values branch from b4f2a0c to e948975 Compare September 8, 2025 04:17
@jecsand838 jecsand838 changed the title Add default values support to RecordDecoder Add projection with default values support to RecordDecoder Sep 8, 2025
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-default-values branch 2 times, most recently from 5c00977 to 4250668 Compare September 8, 2025 18:14
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-default-values branch 4 times, most recently from 90fed9f to 1e778ed Compare September 9, 2025 02:40
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-default-values branch from 1e778ed to 2c586b8 Compare September 10, 2025 23:08
@jecsand838 jecsand838 marked this pull request as ready for review September 11, 2025 21:55
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-default-values branch from eccfff0 to e5551ce Compare September 14, 2025 19:41
@jecsand838
Contributor Author

@alamb Would you be able to review this PR when you have a chance? About half of the code is tests, so ~650 lines of added production code.

Also, after this PR there are two Reader PRs left related to Union support before arrow-avro has the functionality to replace apache_avro in DataFusion. I'm finishing them both up right now.

Member

@mbrobbel mbrobbel left a comment

Thanks @jecsand838

@jecsand838 jecsand838 force-pushed the avro-schema-resolution-default-values branch from 86219f7 to 12d7a47 Compare September 16, 2025 06:32
@jecsand838
Contributor Author

> Thanks @jecsand838

@mbrobbel Thank you so much for the review and those catches.

I went ahead and pushed up changes addressing your comments.

@mbrobbel mbrobbel merged commit ca07b06 into apache:main Sep 16, 2025
23 checks passed
@alamb
Contributor

alamb commented Sep 16, 2025

Thanks @mbrobbel and @jecsand838

@jecsand838 jecsand838 deleted the avro-schema-resolution-default-values branch September 22, 2025 04:31
alamb added a commit that referenced this pull request Sep 24, 2025
…art 2) (#8349)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns
behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out
the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8348 (Add arrow-avro Reader support for Dense
Union and Union resolution (Part 1)), #8293 (Add projection with default
values support to RecordDecoder), #8124 (schema resolution & type
promotion for the decoder), #8223 (enum mapping for schema resolution).
These previous efforts established the foundations that this PR extends
to Union types and Union resolution.

# Rationale for this change

`arrow-avro` lacked end‑to‑end support for Avro unions and Arrow `Union`
schemas. Many Avro datasets rely on unions (e.g., `["null","string"]`,
tagged unions of different records), and without schema‐level resolution
and JSON encoding the crate could not interoperate cleanly. This PR
completes the initial Decoder support for Union types and Union
resolution.
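
For context, Avro's binary encoding writes a union value as a zig-zag varint branch index followed by the value for that branch, and an Arrow dense union stores one type id and one child offset per row. The sketch below decodes a stream of `["null","string"]` values into that shape. It is a simplified illustration, not the decoder added in this PR: `DenseUnion`, `read_long`, and `decode_nullable_strings` are hypothetical names, and the children are plain Rust collections rather than Arrow arrays.

```rust
/// Simplified dense-union layout for the Avro union ["null","string"].
#[derive(Debug, Default, PartialEq)]
struct DenseUnion {
    /// Branch taken by each row (0 = null, 1 = string).
    type_ids: Vec<i8>,
    /// Index of each row's value within its child.
    offsets: Vec<i32>,
    /// Length of the null child (it carries no data).
    null_count: usize,
    /// Values of the string child.
    strings: Vec<String>,
}

/// Reads one Avro `long`: a base-128 varint followed by zig-zag decoding.
fn read_long(buf: &[u8], pos: &mut usize) -> Result<i64, String> {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos).ok_or("unexpected end of input")?;
        *pos += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
        if shift > 63 {
            return Err("varint too long".to_string());
        }
    }
    Ok((value >> 1) as i64 ^ -((value & 1) as i64))
}

/// Decodes `rows` union values from `buf` into a dense-union layout.
fn decode_nullable_strings(buf: &[u8], rows: usize) -> Result<DenseUnion, String> {
    let mut out = DenseUnion::default();
    let mut pos = 0;
    for _ in 0..rows {
        match read_long(buf, &mut pos)? {
            // Branch 0: null has no payload bytes.
            0 => {
                out.type_ids.push(0);
                out.offsets.push(out.null_count as i32);
                out.null_count += 1;
            }
            // Branch 1: string is a long length followed by UTF-8 bytes.
            1 => {
                let len = read_long(buf, &mut pos)?;
                if len < 0 {
                    return Err("negative string length".to_string());
                }
                let end = pos.checked_add(len as usize).ok_or("length overflow")?;
                let bytes = buf.get(pos..end).ok_or("unexpected end of input")?;
                let s = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
                out.type_ids.push(1);
                out.offsets.push(out.strings.len() as i32);
                out.strings.push(s.to_string());
                pos = end;
            }
            other => return Err(format!("invalid union branch index {}", other)),
        }
    }
    Ok(out)
}
```

Union resolution between differing writer and reader unions would add a branch-mapping step on top of this, which the sketch leaves out.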

# What changes are included in this PR?

* Decoder support for Dense Union decoding and Union resolution.

# Are these changes tested?

Yes. New detailed end-to-end integration tests have been added to
`reader/mod.rs`, and unit tests covering the new Union and Union
resolution functionality are included in the `reader/record.rs` file.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>