Add schema resolution and type promotion support to arrow-avro Decoder #8124

Conversation
Thanks @jecsand838 -- this looks good and well tested to me
why does a date_string_col come through as Binary by default?
@alamb The underlying data in the alltypes_plain.avro file is typed and stored as bytes.
Here's the Avro schema info for the date_string_col field:

```json
{
  "name": "date_string_col",
  "type": ["bytes", "null"]
}
```

I also noticed that string_col is bytes as well:

```json
{
  "name": "string_col",
  "type": ["bytes", "null"]
}
```

Thank you @jecsand838
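The mapping above explains the Binary column: a two-branch Avro union containing `null` collapses to its non-null branch with nullability set, and Avro `bytes` maps to Arrow `Binary`. A minimal illustrative sketch of that rule (hypothetical types, not the actual arrow-avro internals):

```rust
// Illustrative sketch: why ["bytes","null"] surfaces as a nullable Binary field.
// The Avro/ArrowField types here are hypothetical stand-ins.

#[derive(Debug, PartialEq)]
enum Avro {
    Null,
    Bytes,
    Str,
    Union(Vec<Avro>),
}

#[derive(Debug, PartialEq)]
struct ArrowField {
    data_type: String,
    nullable: bool,
}

fn to_arrow(schema: &Avro) -> ArrowField {
    match schema {
        Avro::Null => ArrowField { data_type: "Null".into(), nullable: true },
        Avro::Bytes => ArrowField { data_type: "Binary".into(), nullable: false },
        Avro::Str => ArrowField { data_type: "Utf8".into(), nullable: false },
        Avro::Union(branches) => {
            // A union like ["bytes","null"]: strip the null branch, mark nullable.
            let non_null: Vec<&Avro> = branches.iter().filter(|b| **b != Avro::Null).collect();
            if non_null.len() == 1 {
                ArrowField { nullable: true, ..to_arrow(non_null[0]) }
            } else {
                ArrowField { data_type: "Union".into(), nullable: false }
            }
        }
    }
}

fn main() {
    // date_string_col is declared as ["bytes","null"] in alltypes_plain.avro.
    let date_string_col = Avro::Union(vec![Avro::Bytes, Avro::Null]);
    let field = to_arrow(&date_string_col);
    assert_eq!(field, ArrowField { data_type: "Binary".into(), nullable: true });
    println!("{:?}", field);
}
```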
Add array/map/fixed schema resolution and default value support to arrow-avro codec (#8292)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 ("Add Avro Support"): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to default values and additional resolvable types.

# Rationale for this change

Avro's **schema resolution** requires readers to reconcile differences between the writer and reader schemas, including:

- Using record-field **default values** when the writer lacks a field present in the reader; defaults must be type-correct (i.e., union defaults match the first union member; bytes/fixed defaults are JSON strings).
- Recursively resolving **arrays** (by item schema) and **maps** (by value schema).
- Resolving **fixed** types (size and unqualified name must match) and erroring when they do not.

Prior to this change, arrow-avro's resolution handled some cases but lacked full Codec support for **default values** and for resolving **array/map/fixed** shapes between writer and reader. This led to gaps when reading evolved data or datasets produced by heterogeneous systems. This PR implements these missing pieces so the Arrow reader behaves per the spec in common evolution scenarios.

# What changes are included in this PR?

This PR modifies **`arrow-avro/src/codec.rs`** to extend the schema-resolution path:

- **Default value handling** for record fields
  - Reads and applies default values when the reader expects a field absent from the writer, including **nested defaults**.
  - Validates defaults per the Avro spec (e.g., union defaults match the first schema; bytes/fixed defaults are JSON strings).
- **Array / Map / Fixed schema resolution**
  - **Array**: recursively resolves item schemas (writer↔reader).
  - **Map**: recursively resolves value schemas.
  - **Fixed**: enforces matching size and (unqualified) name; otherwise signals an error, consistent with the spec.
- **Codec updates**
  - Refactors internal codec logic to support the above during decoding, including resolution for **record fields** and **nested defaults**. (See commit message for the high-level summary.)

# Are these changes tested?

**Yes.** This PR includes new unit tests in `arrow-avro/src/codec.rs` covering:

1. **Default validation & persistence**: `Null`/union-nullability rules; metadata persistence of defaults (`AVRO_FIELD_DEFAULT_METADATA_KEY`).
2. **`AvroLiteral` parsing**: range checks for `i32`/`f32`; correct literals for `i64`/`f64`; `Utf8`/`Utf8View`; `uuid` strings (RFC 4122); byte-range mapping for `bytes`/`fixed` defaults; `Fixed(n)` length enforcement; `decimal` on `fixed` vs `bytes`; `duration`/interval fixed **12**-byte enforcement.
3. **Collections & records**: array/map defaults shape; enum symbol validity; record defaults for missing fields, required-field errors, and honoring field-level defaults; skip-fields retained for writer-only fields.
4. **Resolution mechanics**: element **promotion** (`int` to `long`) for arrays; **reader metadata precedence** for colliding attributes; `fixed` name/size match including **alias**.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Andrew Lamb <[email protected]>
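The fixed-resolution rule described above (size and unqualified name must match, otherwise error) can be sketched as follows. This is illustrative Rust with hypothetical types, not the actual arrow-avro codec internals:

```rust
// Illustrative sketch of Avro fixed-type resolution: a writer `fixed`
// resolves against a reader `fixed` only when both the size and the
// unqualified name agree; anything else is an error per the spec.
// `FixedSchema` is a hypothetical stand-in type.

#[derive(Debug, Clone, PartialEq)]
struct FixedSchema {
    name: String, // possibly namespace-qualified, e.g. "org.example.md5"
    size: usize,
}

/// Strip any namespace so "org.example.md5" compares equal to "md5".
fn unqualified(name: &str) -> &str {
    name.rsplit('.').next().unwrap_or(name)
}

fn resolve_fixed(writer: &FixedSchema, reader: &FixedSchema) -> Result<usize, String> {
    if unqualified(&writer.name) != unqualified(&reader.name) {
        return Err(format!(
            "fixed name mismatch: writer '{}' vs reader '{}'",
            writer.name, reader.name
        ));
    }
    if writer.size != reader.size {
        return Err(format!(
            "fixed size mismatch: writer {} vs reader {}",
            writer.size, reader.size
        ));
    }
    Ok(reader.size)
}

fn main() {
    let w = FixedSchema { name: "org.example.md5".into(), size: 16 };
    let r = FixedSchema { name: "md5".into(), size: 16 };
    assert_eq!(resolve_fixed(&w, &r), Ok(16));

    // Same name but a different size must be rejected.
    let bad = FixedSchema { name: "md5".into(), size: 32 };
    assert!(resolve_fixed(&w, &bad).is_err());
    println!("fixed resolution checks passed");
}
```

A real implementation would also need to consult reader aliases when comparing names, as the alias test in this PR exercises.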
Add projection with default values support to RecordDecoder (#8293)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 ("Add Avro Support"): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8292 (Add array/map/fixed schema resolution and default value support to arrow-avro codec), #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to default values and additional resolvable types.

# Rationale for this change

Avro's specification requires readers to materialize default values when a field exists in the **reader** schema but not in the **writer** schema, and to validate defaults (i.e., union defaults must match the first branch; bytes/fixed defaults must be JSON strings; enums may specify a default symbol for unknown writer symbols). Implementing this behavior makes `arrow-avro` more standards-compliant and improves interoperability with evolving schemas.

# What changes are included in this PR?

**High-level summary**

* **Refactor `RecordDecoder`** around a simpler **`Projector`**-style abstraction that consumes `ResolvedRecord` to: (a) skip writer-only fields, and (b) materialize reader-only defaulted fields, reducing branching in the hot path. (See commit subject and record decoder changes.)

**Touched files (2):**

* `arrow-avro/src/reader/record.rs` - refactor decoder to use precomputed mappings and defaults.
* `arrow-avro/src/reader/mod.rs` - add comprehensive tests for defaults and error cases (see below).

# Are these changes tested?

Yes, new integration tests cover both the **happy path** and **validation errors**:

* `test_schema_resolution_defaults_all_supported_types`: verifies that defaults for boolean/int/long/float/double/bytes/string/date/time/timestamp/decimal/fixed/enum/duration/uuid/array/map/nested record and unions are materialized correctly for all rows.
* `test_schema_resolution_default_enum_invalid_symbol_errors`: invalid enum default symbol is rejected.
* `test_schema_resolution_default_fixed_size_mismatch_errors`: mismatched fixed/bytes default lengths are rejected.

These tests assert the Avro-spec behavior (i.e., union defaults must match the first branch; bytes/fixed defaults use JSON strings).

# Are there any user-facing changes?

N/A
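The `Projector`-style plan described above (skip writer-only fields, materialize reader-only defaults) can be sketched like this. The names here are illustrative stand-ins, not the actual arrow-avro types:

```rust
// Illustrative sketch of a precomputed projection plan: schema resolution
// runs once up front, so the per-row decode loop only follows the plan,
// with no schema comparisons in the hot path. `Value` and `FieldPlan`
// are hypothetical stand-in types.

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Long(i64),
    Str(String),
}

/// One entry per reader field: either read it from a writer column
/// or materialize a precomputed default literal.
#[derive(Debug)]
enum FieldPlan {
    FromWriter(usize), // index into the writer's field order
    Default(Value),    // reader-only field: use its validated default
}

fn project_row(plan: &[FieldPlan], writer_row: &[Value]) -> Vec<Value> {
    plan.iter()
        .map(|p| match p {
            FieldPlan::FromWriter(i) => writer_row[*i].clone(),
            FieldPlan::Default(v) => v.clone(),
        })
        .collect()
}

fn main() {
    // Reader wants [writer column 1, then a reader-only field defaulting to 0].
    let plan = vec![FieldPlan::FromWriter(1), FieldPlan::Default(Value::Long(0))];
    // Writer column 0 is a writer-only field the plan never references.
    let writer_row = vec![Value::Str("skip-me".into()), Value::Str("a".into())];
    let row = project_row(&plan, &writer_row);
    assert_eq!(row, vec![Value::Str("a".into()), Value::Long(0)]);
    println!("{:?}", row);
}
```

Note that in a real Avro decoder, writer-only fields are not free to ignore: their bytes must still be skipped in the input stream even though the plan never materializes them.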
…art 2) (#8349)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 ("Add Avro Support"): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8348 (Add arrow-avro Reader support for Dense Union and Union resolution (Part 1)), #8293 (Add projection with default values support to RecordDecoder), #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to Union types and Union resolution.

# Rationale for this change

`arrow-avro` lacked end-to-end support for Avro unions and Arrow `Union` schemas. Many Avro datasets rely on unions (e.g., `["null","string"]`, tagged unions of different records), and without schema-level resolution and JSON encoding the crate could not interoperate cleanly. This PR completes the initial Decoder support for Union types and Union resolution.

# What changes are included in this PR?

* Decoder support for Dense Union decoding and Union resolution.

# Are these changes tested?

Yes. New detailed end-to-end integration tests have been added to `reader/mod.rs`, and unit tests covering the new Union and Union resolution functionality are included in `reader/record.rs`.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
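Union resolution amounts to remapping each writer branch onto the first compatible reader branch, which can be precomputed once per schema pair. A minimal sketch under that assumption (hypothetical types, not the actual arrow-avro decoder):

```rust
// Illustrative sketch of Avro union resolution: for each writer branch,
// find the first reader branch it resolves against; decoding then remaps
// the writer's branch index to the reader's type id in the dense union.
// `AvroType` is a hypothetical stand-in for a real schema node.

#[derive(Debug, Clone, Copy, PartialEq)]
enum AvroType {
    Null,
    Int,
    Long,
    Str,
}

/// True when a writer branch can be read as the reader branch,
/// allowing numeric promotion (int -> long) per the spec.
fn branch_matches(writer: AvroType, reader: AvroType) -> bool {
    writer == reader || (writer == AvroType::Int && reader == AvroType::Long)
}

/// Precompute writer-branch -> reader-branch mapping; `None` marks an
/// unresolvable writer branch (an error only if actually encountered).
fn union_mapping(writer: &[AvroType], reader: &[AvroType]) -> Vec<Option<usize>> {
    writer
        .iter()
        .map(|w| reader.iter().position(|r| branch_matches(*w, *r)))
        .collect()
}

fn main() {
    let writer = [AvroType::Null, AvroType::Int];
    let reader = [AvroType::Null, AvroType::Long];
    // Writer's int branch promotes to the reader's long branch.
    assert_eq!(union_mapping(&writer, &reader), vec![Some(0), Some(1)]);
    // A writer branch with no reader counterpart is unresolvable.
    assert_eq!(union_mapping(&[AvroType::Str], &reader), vec![None]);
    println!("union mapping checks passed");
}
```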
Which issue does this PR close?
Rationale for this change
Avro allows safe widening between numeric primitives and interoperability between
`bytes` and UTF-8 `string` during schema resolution. Implementing promotion-aware decoding lets the `Decoder` apply these conversions while reading.
What changes are included in this PR?
Core decoding (`arrow-avro/src/reader/record.rs`):

- New decoder variants for the supported promotions: `Int32ToInt64`, `Int32ToFloat32`, `Int32ToFloat64`, `Int64ToFloat32`, `Int64ToFloat64`, `Float32ToFloat64`, `BytesToString`, `StringToBytes`.
- `Decoder::try_new` inspects `ResolutionInfo::Promotion` and selects the appropriate variant, so conversion happens as we decode, not after.
- `decode`, `append_null`, and `flush` handle the new variants and materialize the correct Arrow arrays (`Int64Array`, `Float32Array`, `Float64Array`, `StringArray`, `BinaryArray`).
- `Utf8View` is kept for non-promoted strings; promotions to `string` materialize a `StringArray` (not `StringViewArray`) for correctness and simplicity. (StringView remains available for native UTF-8 paths.)

Integration tests & helpers (`arrow-avro/src/reader/mod.rs`):

- A schema-rewriting helper for promotions (`make_reader_schema_with_promotions`).
- End-to-end tests against `alltypes_plain` (no compression, snappy, zstd, bzip2, xz) that validate:
  - Numeric widening such as `float` to `double` and `int` to `long`.
  - `bytes` to `string` and `string` to `bytes`.
  - Illegal promotions (e.g., `boolean` to `double`) produce a descriptive error.

Are these changes tested?
Yes.
- Unit tests (in `record.rs`) for each promotion path:
  - `int` to `long`, `int` to `float`, `int` to `double`
  - `long` to `float`, `long` to `double`
  - `float` to `double`
  - `bytes` to `string` (including non-ASCII UTF-8) and `string` to `bytes`
- Integration tests (in `mod.rs`) reading real `alltypes_plain` Avro files across multiple compression codecs, asserting exact Arrow outputs for promoted fields.

Are there any user-facing changes?
N/A
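The "convert as we decode, not after" approach above can be sketched as a promotion applied to each value at decode time. This is an illustrative sketch with hypothetical types, not the actual arrow-avro `Decoder`:

```rust
// Illustrative sketch of promotion-aware decoding: the promotion chosen
// during schema resolution widens each value as it is decoded, instead of
// converting a finished column afterwards. `Promotion` and `Decoded` are
// hypothetical stand-ins for the real decoder machinery.

#[derive(Debug, Clone, Copy)]
enum Promotion {
    Direct,          // no conversion needed
    Int32ToInt64,    // safe numeric widening
    Int32ToFloat64,
    Float32ToFloat64,
}

#[derive(Debug, PartialEq)]
enum Decoded {
    I32(i32),
    I64(i64),
    F32(f32),
    F64(f64),
}

/// Apply the resolved promotion to a just-decoded writer value; any
/// combination not listed is an illegal promotion and yields an error.
fn promote(p: Promotion, v: Decoded) -> Result<Decoded, String> {
    match (p, v) {
        (Promotion::Direct, v) => Ok(v),
        (Promotion::Int32ToInt64, Decoded::I32(x)) => Ok(Decoded::I64(x as i64)),
        (Promotion::Int32ToFloat64, Decoded::I32(x)) => Ok(Decoded::F64(x as f64)),
        (Promotion::Float32ToFloat64, Decoded::F32(x)) => Ok(Decoded::F64(x as f64)),
        (p, v) => Err(format!("illegal promotion {:?} for {:?}", p, v)),
    }
}

fn main() {
    assert_eq!(promote(Promotion::Int32ToInt64, Decoded::I32(7)), Ok(Decoded::I64(7)));
    assert_eq!(promote(Promotion::Float32ToFloat64, Decoded::F32(1.5)), Ok(Decoded::F64(1.5)));
    // Mismatched input for the resolved promotion is a descriptive error.
    assert!(promote(Promotion::Int32ToInt64, Decoded::F64(1.0)).is_err());
    println!("promotion checks passed");
}
```

The real decoder builds the promoted Arrow array directly (e.g., an `Int64Array` from writer `int` values), so no second conversion pass over the batch is needed.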