Add schema resolution and type promotion support to arrow-avro Decoder #8124

Conversation
Thanks @jecsand838 -- this looks good and well tested to me
why does a date_string_col come through as Binary by default?
@alamb The underlying data in the alltypes_plain.avro file is typed and stored as bytes.
Here's the Avro schema info for the date_string_col field:

```json
{
  "name": "date_string_col",
  "type": ["bytes", "null"]
}
```

I also noticed that string_col is bytes as well:

```json
{
  "name": "string_col",
  "type": ["bytes", "null"]
}
```

Thank you @jecsand838
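The mapping above explains the Binary column: a two-branch Avro union containing `null` collapses to its non-null branch with nullability set, and Avro `bytes` maps to Arrow `Binary`. A minimal illustrative sketch of that rule (hypothetical types, not the actual arrow-avro internals):

```rust
// Illustrative sketch: why ["bytes","null"] surfaces as a nullable Binary field.
// The Avro/ArrowField types here are hypothetical stand-ins.

#[derive(Debug, PartialEq)]
enum Avro {
    Null,
    Bytes,
    Str,
    Union(Vec<Avro>),
}

#[derive(Debug, PartialEq)]
struct ArrowField {
    data_type: String,
    nullable: bool,
}

fn to_arrow(schema: &Avro) -> ArrowField {
    match schema {
        Avro::Null => ArrowField { data_type: "Null".into(), nullable: true },
        Avro::Bytes => ArrowField { data_type: "Binary".into(), nullable: false },
        Avro::Str => ArrowField { data_type: "Utf8".into(), nullable: false },
        Avro::Union(branches) => {
            // A union like ["bytes","null"]: strip the null branch, mark nullable.
            let non_null: Vec<&Avro> = branches.iter().filter(|b| **b != Avro::Null).collect();
            if non_null.len() == 1 {
                ArrowField { nullable: true, ..to_arrow(non_null[0]) }
            } else {
                ArrowField { data_type: "Union".into(), nullable: false }
            }
        }
    }
}

fn main() {
    // date_string_col is declared as ["bytes","null"] in alltypes_plain.avro.
    let date_string_col = Avro::Union(vec![Avro::Bytes, Avro::Null]);
    let field = to_arrow(&date_string_col);
    assert_eq!(field, ArrowField { data_type: "Binary".into(), nullable: true });
    println!("{:?}", field);
}
```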
Add array/map/fixed schema resolution and default value support to arrow-avro codec (#8292)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 ("Add Avro Support"): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to default values and additional resolvable types.

# Rationale for this change

Avro's **schema resolution** requires readers to reconcile differences between the writer and reader schemas, including:

- Using record-field **default values** when the writer lacks a field present in the reader; defaults must be type-correct (i.e., union defaults match the first union member; bytes/fixed defaults are JSON strings).
- Recursively resolving **arrays** (by item schema) and **maps** (by value schema).
- Resolving **fixed** types (size and unqualified name must match) and erroring when they do not.

Prior to this change, arrow-avro's resolution handled some cases but lacked full Codec support for **default values** and for resolving **array/map/fixed** shapes between writer and reader. This led to gaps when reading evolved data or datasets produced by heterogeneous systems. This PR implements these missing pieces so the Arrow reader behaves per the spec in common evolution scenarios.

# What changes are included in this PR?

This PR modifies **`arrow-avro/src/codec.rs`** to extend the schema-resolution path:

- **Default value handling** for record fields
  - Reads and applies default values when the reader expects a field absent from the writer, including **nested defaults**.
  - Validates defaults per the Avro spec (e.g., union defaults match the first schema; bytes/fixed defaults are JSON strings).
- **Array / Map / Fixed schema resolution**
  - **Array**: recursively resolves item schemas (writer↔reader).
  - **Map**: recursively resolves value schemas.
  - **Fixed**: enforces matching size and (unqualified) name; otherwise signals an error, consistent with the spec.
- **Codec updates**
  - Refactors internal codec logic to support the above during decoding, including resolution for **record fields** and **nested defaults**. (See commit message for the high-level summary.)

# Are these changes tested?

**Yes.** This PR includes new unit tests in `arrow-avro/src/codec.rs` covering:

1. **Default validation & persistence**: `Null`/union-nullability rules; metadata persistence of defaults (`AVRO_FIELD_DEFAULT_METADATA_KEY`).
2. **`AvroLiteral` parsing**: range checks for `i32`/`f32`; correct literals for `i64`/`f64`; `Utf8`/`Utf8View`; `uuid` strings (RFC 4122); byte-range mapping for `bytes`/`fixed` defaults; `Fixed(n)` length enforcement; `decimal` on `fixed` vs `bytes`; `duration`/interval fixed **12**-byte enforcement.
3. **Collections & records**: array/map defaults shape; enum symbol validity; record defaults for missing fields, required-field errors, and honoring field-level defaults; skip-fields retained for writer-only fields.
4. **Resolution mechanics**: element **promotion** (`int` to `long`) for arrays; **reader metadata precedence** for colliding attributes; `fixed` name/size match including **alias**.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Andrew Lamb <[email protected]>
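The fixed-resolution rule described above (size and unqualified name must match, otherwise error) can be sketched as follows. This is illustrative Rust with hypothetical types, not the actual arrow-avro codec internals:

```rust
// Illustrative sketch of Avro fixed-type resolution: a writer `fixed`
// resolves against a reader `fixed` only when both the size and the
// unqualified name agree; anything else is an error per the spec.
// `FixedSchema` is a hypothetical stand-in type.

#[derive(Debug, Clone, PartialEq)]
struct FixedSchema {
    name: String, // possibly namespace-qualified, e.g. "org.example.md5"
    size: usize,
}

/// Strip any namespace so "org.example.md5" compares equal to "md5".
fn unqualified(name: &str) -> &str {
    name.rsplit('.').next().unwrap_or(name)
}

fn resolve_fixed(writer: &FixedSchema, reader: &FixedSchema) -> Result<usize, String> {
    if unqualified(&writer.name) != unqualified(&reader.name) {
        return Err(format!(
            "fixed name mismatch: writer '{}' vs reader '{}'",
            writer.name, reader.name
        ));
    }
    if writer.size != reader.size {
        return Err(format!(
            "fixed size mismatch: writer {} vs reader {}",
            writer.size, reader.size
        ));
    }
    Ok(reader.size)
}

fn main() {
    let w = FixedSchema { name: "org.example.md5".into(), size: 16 };
    let r = FixedSchema { name: "md5".into(), size: 16 };
    assert_eq!(resolve_fixed(&w, &r), Ok(16));

    // Same name but a different size must be rejected.
    let bad = FixedSchema { name: "md5".into(), size: 32 };
    assert!(resolve_fixed(&w, &bad).is_err());
    println!("fixed resolution checks passed");
}
```

A real implementation would also need to consult reader aliases when comparing names, as the alias test in this PR exercises.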
Add projection with default values support to RecordDecoder (#8293)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 ("Add Avro Support"): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8292 (Add array/map/fixed schema resolution and default value support to arrow-avro codec), #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to default values and additional resolvable types.

# Rationale for this change

Avro's specification requires readers to materialize default values when a field exists in the **reader** schema but not in the **writer** schema, and to validate defaults (i.e., union defaults must match the first branch; bytes/fixed defaults must be JSON strings; enums may specify a default symbol for unknown writer symbols). Implementing this behavior makes `arrow-avro` more standards-compliant and improves interoperability with evolving schemas.

# What changes are included in this PR?

**High-level summary**

* **Refactor `RecordDecoder`** around a simpler **`Projector`**-style abstraction that consumes `ResolvedRecord` to: (a) skip writer-only fields, and (b) materialize reader-only defaulted fields, reducing branching in the hot path. (See commit subject and record decoder changes.)

**Touched files (2):**

* `arrow-avro/src/reader/record.rs` - refactor decoder to use precomputed mappings and defaults.
* `arrow-avro/src/reader/mod.rs` - add comprehensive tests for defaults and error cases (see below).

# Are these changes tested?

Yes, new integration tests cover both the **happy path** and **validation errors**:

* `test_schema_resolution_defaults_all_supported_types`: verifies that defaults for boolean/int/long/float/double/bytes/string/date/time/timestamp/decimal/fixed/enum/duration/uuid/array/map/nested record and unions are materialized correctly for all rows.
* `test_schema_resolution_default_enum_invalid_symbol_errors`: invalid enum default symbol is rejected.
* `test_schema_resolution_default_fixed_size_mismatch_errors`: mismatched fixed/bytes default lengths are rejected.

These tests assert the Avro-spec behavior (i.e., union defaults must match the first branch; bytes/fixed defaults use JSON strings).

# Are there any user-facing changes?

N/A
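The `Projector`-style plan described above (skip writer-only fields, materialize reader-only defaults) can be sketched like this. The names here are illustrative stand-ins, not the actual arrow-avro types:

```rust
// Illustrative sketch of a precomputed projection plan: schema resolution
// runs once up front, so the per-row decode loop only follows the plan,
// with no schema comparisons in the hot path. `Value` and `FieldPlan`
// are hypothetical stand-in types.

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Long(i64),
    Str(String),
}

/// One entry per reader field: either read it from a writer column
/// or materialize a precomputed default literal.
#[derive(Debug)]
enum FieldPlan {
    FromWriter(usize), // index into the writer's field order
    Default(Value),    // reader-only field: use its validated default
}

fn project_row(plan: &[FieldPlan], writer_row: &[Value]) -> Vec<Value> {
    plan.iter()
        .map(|p| match p {
            FieldPlan::FromWriter(i) => writer_row[*i].clone(),
            FieldPlan::Default(v) => v.clone(),
        })
        .collect()
}

fn main() {
    // Reader wants [writer column 1, then a reader-only field defaulting to 0].
    let plan = vec![FieldPlan::FromWriter(1), FieldPlan::Default(Value::Long(0))];
    // Writer column 0 is a writer-only field the plan never references.
    let writer_row = vec![Value::Str("skip-me".into()), Value::Str("a".into())];
    let row = project_row(&plan, &writer_row);
    assert_eq!(row, vec![Value::Str("a".into()), Value::Long(0)]);
    println!("{:?}", row);
}
```

Note that in a real Avro decoder, writer-only fields are not free to ignore: their bytes must still be skipped in the input stream even though the plan never materializes them.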
…art 2) (#8349)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 ("Add Avro Support"): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8348 (Add arrow-avro Reader support for Dense Union and Union resolution (Part 1)), #8293 (Add projection with default values support to RecordDecoder), #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to Union types and Union resolution.

# Rationale for this change

`arrow-avro` lacked end-to-end support for Avro unions and Arrow `Union` schemas. Many Avro datasets rely on unions (e.g., `["null","string"]`, tagged unions of different records), and without schema-level resolution and JSON encoding the crate could not interoperate cleanly. This PR completes the initial Decoder support for Union types and Union resolution.

# What changes are included in this PR?

* Decoder support for Dense Union decoding and Union resolution.

# Are these changes tested?

Yes. New detailed end-to-end integration tests have been added to `reader/mod.rs`, and unit tests covering the new Union and Union resolution functionality are included in `reader/record.rs`.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
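Union resolution amounts to remapping each writer branch onto the first compatible reader branch, which can be precomputed once per schema pair. A minimal sketch under that assumption (hypothetical types, not the actual arrow-avro decoder):

```rust
// Illustrative sketch of Avro union resolution: for each writer branch,
// find the first reader branch it resolves against; decoding then remaps
// the writer's branch index to the reader's type id in the dense union.
// `AvroType` is a hypothetical stand-in for a real schema node.

#[derive(Debug, Clone, Copy, PartialEq)]
enum AvroType {
    Null,
    Int,
    Long,
    Str,
}

/// True when a writer branch can be read as the reader branch,
/// allowing numeric promotion (int -> long) per the spec.
fn branch_matches(writer: AvroType, reader: AvroType) -> bool {
    writer == reader || (writer == AvroType::Int && reader == AvroType::Long)
}

/// Precompute writer-branch -> reader-branch mapping; `None` marks an
/// unresolvable writer branch (an error only if actually encountered).
fn union_mapping(writer: &[AvroType], reader: &[AvroType]) -> Vec<Option<usize>> {
    writer
        .iter()
        .map(|w| reader.iter().position(|r| branch_matches(*w, *r)))
        .collect()
}

fn main() {
    let writer = [AvroType::Null, AvroType::Int];
    let reader = [AvroType::Null, AvroType::Long];
    // Writer's int branch promotes to the reader's long branch.
    assert_eq!(union_mapping(&writer, &reader), vec![Some(0), Some(1)]);
    // A writer branch with no reader counterpart is unresolvable.
    assert_eq!(union_mapping(&[AvroType::Str], &reader), vec![None]);
    println!("union mapping checks passed");
}
```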
Which issue does this PR close?
Rationale for this change
Avro allows safe widening between numeric primitives and interoperability between
`bytes` and UTF-8 `string` during schema resolution. Implementing promotion-aware decoding lets the `Decoder` apply these conversions while reading.
What changes are included in this PR?
Core decoding (`arrow-avro/src/reader/record.rs`):

- New decoder variants for the supported promotions: `Int32ToInt64`, `Int32ToFloat32`, `Int32ToFloat64`, `Int64ToFloat32`, `Int64ToFloat64`, `Float32ToFloat64`, `BytesToString`, `StringToBytes`.
- `Decoder::try_new` inspects `ResolutionInfo::Promotion` and selects the appropriate variant, so conversion happens as we decode, not after.
- `decode`, `append_null`, and `flush` handle the new variants and materialize the correct Arrow arrays (`Int64Array`, `Float32Array`, `Float64Array`, `StringArray`, `BinaryArray`).
- `Utf8View` is kept for non-promoted strings; promotions to `string` materialize a `StringArray` (not `StringViewArray`) for correctness and simplicity. (StringView remains available for native UTF-8 paths.)

Integration tests & helpers (`arrow-avro/src/reader/mod.rs`):

- A schema-rewriting helper for promotions (`make_reader_schema_with_promotions`).
- End-to-end tests against `alltypes_plain` (no compression, snappy, zstd, bzip2, xz) that validate:
  - Numeric widening such as `float` to `double` and `int` to `long`.
  - `bytes` to `string` and `string` to `bytes`.
  - Illegal promotions (e.g., `boolean` to `double`) produce a descriptive error.

Are these changes tested?
Yes.
- Unit tests (in `record.rs`) for each promotion path:
  - `int` to `long`, `int` to `float`, `int` to `double`
  - `long` to `float`, `long` to `double`
  - `float` to `double`
  - `bytes` to `string` (including non-ASCII UTF-8) and `string` to `bytes`
- Integration tests (in `mod.rs`) reading real `alltypes_plain` Avro files across multiple compression codecs, asserting exact Arrow outputs for promoted fields.

Are there any user-facing changes?
N/A
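The "convert as we decode, not after" approach above can be sketched as a promotion applied to each value at decode time. This is an illustrative sketch with hypothetical types, not the actual arrow-avro `Decoder`:

```rust
// Illustrative sketch of promotion-aware decoding: the promotion chosen
// during schema resolution widens each value as it is decoded, instead of
// converting a finished column afterwards. `Promotion` and `Decoded` are
// hypothetical stand-ins for the real decoder machinery.

#[derive(Debug, Clone, Copy)]
enum Promotion {
    Direct,          // no conversion needed
    Int32ToInt64,    // safe numeric widening
    Int32ToFloat64,
    Float32ToFloat64,
}

#[derive(Debug, PartialEq)]
enum Decoded {
    I32(i32),
    I64(i64),
    F32(f32),
    F64(f64),
}

/// Apply the resolved promotion to a just-decoded writer value; any
/// combination not listed is an illegal promotion and yields an error.
fn promote(p: Promotion, v: Decoded) -> Result<Decoded, String> {
    match (p, v) {
        (Promotion::Direct, v) => Ok(v),
        (Promotion::Int32ToInt64, Decoded::I32(x)) => Ok(Decoded::I64(x as i64)),
        (Promotion::Int32ToFloat64, Decoded::I32(x)) => Ok(Decoded::F64(x as f64)),
        (Promotion::Float32ToFloat64, Decoded::F32(x)) => Ok(Decoded::F64(x as f64)),
        (p, v) => Err(format!("illegal promotion {:?} for {:?}", p, v)),
    }
}

fn main() {
    assert_eq!(promote(Promotion::Int32ToInt64, Decoded::I32(7)), Ok(Decoded::I64(7)));
    assert_eq!(promote(Promotion::Float32ToFloat64, Decoded::F32(1.5)), Ok(Decoded::F64(1.5)));
    // Mismatched input for the resolved promotion is a descriptive error.
    assert!(promote(Promotion::Int32ToInt64, Decoded::F64(1.0)).is_err());
    println!("promotion checks passed");
}
```

The real decoder builds the promoted Arrow array directly (e.g., an `Int64Array` from writer `int` values), so no second conversion pass over the batch is needed.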