Added arrow-avro enum mapping support for schema resolution #8223
Conversation
@alamb Let me know if you'd be able to get to this PR. I'm sure you're pretty busy catching up from being out! There should be just one more PR after this one related to schema resolution.
Thanks @jecsand838 -- I went through this PR at a fairly high level and it looks good to me
`fn resolve_enums(`
Could you add some more context about Avro enums here for future readers / maintainers?
I think more or less copying the contents of this PR's description in "Rationale for this change" is probably good enough.
@alamb I just pushed up a detailed comment on enums with links and examples. Let me know what you think!
Thanks @jecsand838 -- I took another look and it looked great to me
…row-avro codec (#8292)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to default values and additional resolvable types.

# Rationale for this change

Avro’s **schema resolution** requires readers to reconcile differences between the writer and reader schemas, including:

- Using record-field **default values** when the writer lacks a field present in the reader; defaults must be type-correct (i.e., union defaults match the first union member; bytes/fixed defaults are JSON strings).
- Recursively resolving **arrays** (by item schema) and **maps** (by value schema).
- Resolving **fixed** types (size and unqualified name must match) and erroring when they do not.

Prior to this change, arrow-avro’s resolution handled some cases but lacked full Codec support for **default values** and for resolving **array/map/fixed** shapes between writer and reader. This led to gaps when reading evolved data or datasets produced by heterogeneous systems. This PR implements these missing pieces so the Arrow reader behaves per the spec in common evolution scenarios.

# What changes are included in this PR?

This PR modifies **`arrow-avro/src/codec.rs`** to extend the schema-resolution path:

- **Default value handling** for record fields
  - Reads and applies default values when the reader expects a field absent from the writer, including **nested defaults**.
  - Validates defaults per the Avro spec (e.g., union defaults match the first schema; bytes/fixed defaults are JSON strings).
- **Array / Map / Fixed schema resolution**
  - **Array**: recursively resolves item schemas (writer↔reader).
  - **Map**: recursively resolves value schemas.
  - **Fixed**: enforces matching size and (unqualified) name; otherwise signals an error, consistent with the spec.
- **Codec updates**
  - Refactors internal codec logic to support the above during decoding, including resolution for **record fields** and **nested defaults**. (See commit message for the high-level summary.)

# Are these changes tested?

**Yes.** This PR includes new unit tests in `arrow-avro/src/codec.rs` covering:

1) **Default validation & persistence**
   - `Null`/union‑nullability rules; metadata persistence of defaults (`AVRO_FIELD_DEFAULT_METADATA_KEY`).
2) **`AvroLiteral` Parsing**
   - Range checks for `i32`/`f32`; correct literals for `i64`/`f64`; `Utf8`/`Utf8View`; `uuid` strings (RFC‑4122).
   - Byte‑range mapping for `bytes`/`fixed` defaults; `Fixed(n)` length enforcement; `decimal` on `fixed` vs `bytes`; `duration`/interval fixed **12**‑byte enforcement.
3) **Collections & records**
   - Array/map defaults shape; enum symbol validity; record defaults for missing fields, required‑field errors, and honoring field‑level defaults; skip‑fields retained for writer‑only fields.
4) **Resolution mechanics**
   - Element **promotion** (`int` to `long`) for arrays; **reader metadata precedence** for colliding attributes; `fixed` name/size match including **alias**.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Andrew Lamb <[email protected]>
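The resolution rules above lend themselves to a recursive walk over the writer and reader schemas. The following Rust sketch only illustrates those rules on a hypothetical, pared-down `Schema` type and `resolve` function (it is not the arrow-avro codec's actual types or API): arrays resolve by item schema, maps by value schema, and `fixed` must agree on unqualified name and size.

```rust
/// Hypothetical, pared-down Avro schema model used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Int,
    Long,
    Array(Box<Schema>),
    Map(Box<Schema>),
    Fixed { name: String, size: usize },
}

/// Resolve a writer schema against a reader schema, mirroring the rules above:
/// arrays resolve by item schema, maps by value schema, and fixed types must
/// agree on unqualified name and size.
fn resolve(writer: &Schema, reader: &Schema) -> Result<Schema, String> {
    match (writer, reader) {
        // int -> long is a legal Avro promotion.
        (Schema::Int, Schema::Long) => Ok(Schema::Long),
        // Arrays resolve recursively on their item schemas.
        (Schema::Array(w), Schema::Array(r)) => Ok(Schema::Array(Box::new(resolve(w, r)?))),
        // Maps resolve recursively on their value schemas.
        (Schema::Map(w), Schema::Map(r)) => Ok(Schema::Map(Box::new(resolve(w, r)?))),
        // Fixed types must match on unqualified name and size.
        (Schema::Fixed { name: wn, size: ws }, Schema::Fixed { name: rn, size: rs })
            if wn == rn && ws == rs =>
        {
            Ok(reader.clone())
        }
        // Identical schemas resolve to themselves.
        (w, r) if w == r => Ok(reader.clone()),
        (w, r) => Err(format!("cannot resolve writer {w:?} against reader {r:?}")),
    }
}
```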
# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8292 (Add array/map/fixed schema resolution and default value support to arrow-avro codec), #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to default values and additional resolvable types.

# Rationale for this change

Avro’s specification requires readers to materialize default values when a field exists in the **reader** schema but not in the **writer** schema, and to validate defaults (i.e., union defaults must match the first branch; bytes/fixed defaults must be JSON strings; enums may specify a default symbol for unknown writer symbols). Implementing this behavior makes `arrow-avro` more standards‑compliant and improves interoperability with evolving schemas.

# What changes are included in this PR?

**High‑level summary**

* **Refactor `RecordDecoder`** around a simpler **`Projector`**‑style abstraction that consumes `ResolvedRecord` to: (a) skip writer‑only fields, and (b) materialize reader‑only defaulted fields, reducing branching in the hot path. (See commit subject and record decoder changes.)

**Touched files (2):**

* `arrow-avro/src/reader/record.rs` - refactor decoder to use precomputed mappings and defaults.
* `arrow-avro/src/reader/mod.rs` - add comprehensive tests for defaults and error cases (see below).

# Are these changes tested?

Yes, new integration tests cover both the **happy path** and **validation errors**:

* `test_schema_resolution_defaults_all_supported_types`: verifies that defaults for boolean/int/long/float/double/bytes/string/date/time/timestamp/decimal/fixed/enum/duration/uuid/array/map/nested record and unions are materialized correctly for all rows.
* `test_schema_resolution_default_enum_invalid_symbol_errors`: invalid enum default symbol is rejected.
* `test_schema_resolution_default_fixed_size_mismatch_errors`: mismatched fixed/bytes default lengths are rejected.

These tests assert the Avro‑spec behavior (i.e., union defaults must match the first branch; bytes/fixed defaults use JSON strings).

# Are there any user-facing changes?

N/A
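As a rough sketch of the Projector idea described above, the decoder could precompute one plan per field so the per-row loop only matches on a small enum. The `FieldPlan`, `project_row`, and helper names below are hypothetical, not the actual `RecordDecoder` internals.

```rust
/// Hypothetical per-field plan a Projector-style decoder might precompute once
/// from the resolved writer/reader record schemas, so the per-row hot path is
/// a small match instead of repeated schema lookups.
enum FieldPlan {
    /// Writer field maps to reader column `reader_idx`: decode it.
    Decode { reader_idx: usize },
    /// Writer-only field: decode and discard the bytes.
    Skip,
    /// Reader-only field: append its default value instead of reading input.
    Default { reader_idx: usize, literal: String }, // default literal simplified to a string
}

/// Sketch of the per-row loop: walk the writer's fields in order, then fill
/// reader-only defaults. The three helpers stand in for real column builders.
fn project_row(writer_field_plans: &[FieldPlan], reader_only_defaults: &[FieldPlan]) {
    for plan in writer_field_plans.iter().chain(reader_only_defaults) {
        match plan {
            FieldPlan::Decode { reader_idx } => decode_value(*reader_idx),
            FieldPlan::Skip => skip_value(),
            FieldPlan::Default { reader_idx, literal } => append_default(*reader_idx, literal),
        }
    }
}

fn decode_value(_reader_idx: usize) { /* decode writer bytes into the reader column */ }
fn skip_value() { /* advance past a writer-only field without materializing it */ }
fn append_default(_reader_idx: usize, _literal: &str) { /* push the default literal */ }
```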
…art 2) (#8349)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8348 (Add arrow-avro Reader support for Dense Union and Union resolution (Part 1)), #8293 (Add projection with default values support to RecordDecoder), #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to Union types and Union resolution.

# Rationale for this change

`arrow-avro` lacked end‑to‑end support for Avro unions and Arrow `Union` schemas. Many Avro datasets rely on unions (e.g., `["null","string"]`, tagged unions of different records), and without schema-level resolution and JSON encoding the crate could not interoperate cleanly. This PR completes the initial Decoder support for Union types and Union resolution.

# What changes are included in this PR?

* Decoder support for Dense Union decoding and Union resolution.

# Are these changes tested?

Yes. New detailed end-to-end integration tests have been added to `reader/mod.rs`, and unit tests covering the new Union and Union resolution functionality are included in `reader/record.rs`.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
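A hedged sketch of the union-resolution step, simplified to matching branches by name (real Avro resolution also allows promotions and schema-compatibility matching between branches); `resolve_union` is a hypothetical helper, not the arrow-avro API.

```rust
/// Resolve a writer union against a reader union by matching branch names,
/// producing a lookup table from the writer's branch index to the reader's
/// branch index. At decode time the union tag read from the data is then
/// translated through this table into the reader's type id.
fn resolve_union(writer_branches: &[&str], reader_branches: &[&str]) -> Result<Vec<usize>, String> {
    writer_branches
        .iter()
        .map(|w| {
            reader_branches
                .iter()
                .position(|r| r == w)
                .ok_or_else(|| format!("writer branch {w:?} has no match in the reader union"))
        })
        .collect()
}

// Example: writer ["null", "string", "long"] against reader ["string", "long", "null"]
// yields [2, 0, 1], so a writer tag of 1 ("string") maps to reader branch 0.
```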
# Which issue does this PR close?

# Rationale for this change
Avro `enum` values are encoded by index but are semantically identified by symbol name. During schema evolution it is legal for the writer and reader to use different enum symbol orders so long as the symbol set is compatible. The Avro specification requires that, when resolving a writer enum against a reader enum, the value be mapped by symbol name, not by the writer’s numeric index. If the writer’s symbol is not present in the reader’s enum and the reader defines a default, the default is used; otherwise it is an error (see the sketch below).

# What changes are included in this PR?

**Core changes**
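A minimal, self-contained Rust sketch of the name-based mapping described in the rationale above; `map_enum_symbols` is a hypothetical helper used only for illustration, not the actual `resolve_enums` implementation added by this PR.

```rust
/// Build a table mapping each writer symbol index to the reader's index for
/// the same symbol, falling back to the reader's declared default symbol (if
/// any) when the writer symbol is unknown to the reader.
fn map_enum_symbols(
    writer_symbols: &[&str],
    reader_symbols: &[&str],
    reader_default: Option<&str>,
) -> Result<Vec<i32>, String> {
    // Index of the reader's default symbol, if a default is declared.
    let default_idx = reader_default
        .map(|d| {
            reader_symbols
                .iter()
                .position(|s| *s == d)
                .ok_or_else(|| format!("default symbol {d:?} not in reader enum"))
        })
        .transpose()?;
    writer_symbols
        .iter()
        .map(|w| match reader_symbols.iter().position(|r| r == w) {
            Some(idx) => Ok(idx as i32),
            // Unknown writer symbol: use the reader default or report an error.
            None => default_idx
                .map(|idx| idx as i32)
                .ok_or_else(|| format!("writer symbol {w:?} not in reader enum and no default")),
        })
        .collect()
}
```

Building the table once during schema resolution keeps the per-row decode path to a single indexed lookup from the writer's enum index to the reader's index.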
# Are these changes tested?

Yes. This PR adds comprehensive unit tests for enum mapping in `reader/record.rs` and a real-file integration test in `reader/mod.rs` using `avro/simple_enum.avro`.

# Are there any user-facing changes?
N/A due to `arrow-avro` not being public yet.