jecsand838
Contributor

Which issue does this PR close?

Rationale for this change

Avro enum values are encoded by index but are semantically identified by symbol name. During schema evolution it is legal for the writer and reader to use different enum symbol orders so long as the symbol set is compatible. The Avro specification requires that, when resolving a writer enum against a reader enum, the value be mapped by symbol name, not by the writer’s numeric index. If the writer’s symbol is not present in the reader’s enum and the reader defines a default, the default is used; otherwise it is an error.

What changes are included in this PR?

Core changes

  • Implement writer-to-reader enum symbol remapping:
    • Build a fast lookup table at schema resolution time from writer enum index to reader enum index using symbol names.
    • Apply this mapping during decode so the produced Arrow dictionary keys always reference the reader’s symbol order.
    • If a writer symbol is not found in the reader enum, surface a clear error.
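
The remapping described above can be illustrated with a minimal sketch. This is not the code from this PR: `build_enum_mapping` and `remap_enum_index` are hypothetical names, symbols are plain string slices rather than arrow-avro schema types, and the optional `reader_default` fallback follows the Avro rule quoted in the rationale rather than anything confirmed about the implementation.

```rust
use std::collections::HashMap;

/// Hypothetical helper: builds a table whose entry `i` is the reader enum
/// index for writer enum index `i`, matched by symbol name.
/// `reader_default` is the reader enum's optional default symbol.
fn build_enum_mapping(
    writer_symbols: &[&str],
    reader_symbols: &[&str],
    reader_default: Option<&str>,
) -> Result<Vec<i32>, String> {
    // Index the reader symbols once so each writer symbol is a single lookup.
    let reader_index: HashMap<&str, i32> = reader_symbols
        .iter()
        .enumerate()
        .map(|(i, s)| (*s, i as i32))
        .collect();

    // Resolve the default symbol (if any) to its reader index up front.
    let default_index = match reader_default {
        Some(d) => Some(
            *reader_index
                .get(d)
                .ok_or_else(|| format!("default symbol '{d}' is not in the reader enum"))?,
        ),
        None => None,
    };

    // Map every writer symbol by name; fall back to the default, else error.
    writer_symbols
        .iter()
        .map(|w| {
            reader_index
                .get(*w)
                .copied()
                .or(default_index)
                .ok_or_else(|| format!("writer symbol '{w}' not found in reader enum"))
        })
        .collect()
}

/// Hypothetical decode-time step: translate a decoded writer index into the
/// reader's symbol order using the precomputed table.
fn remap_enum_index(mapping: &[i32], writer_index: usize) -> Result<i32, String> {
    mapping
        .get(writer_index)
        .copied()
        .ok_or_else(|| format!("enum index {writer_index} is out of range"))
}
```

With writer symbols `["A", "B", "C"]` and reader symbols `["C", "A", "B"]`, the table is `[1, 2, 0]`, so a value written as index 0 (`A`) becomes dictionary key 1 on the reader side. Building the table once at resolution time keeps the per-value work during decode to a single slice lookup.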

Are these changes tested?

Yes. This PR adds comprehensive unit tests for enum mapping in reader/record.rs and a real‑file integration test in reader/mod.rs using avro/simple_enum.avro.

Are there any user-facing changes?

N/A due to arrow-avro not being public yet.

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Aug 25, 2025
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-enum-mappings branch from f1e5c68 to 6617891 Compare August 25, 2025 22:49
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-enum-mappings branch from 300113c to 264b71d Compare September 4, 2025 23:11
@jecsand838
Contributor Author

@alamb Let me know if you'd be able to get to this PR. I'm sure you're pretty busy catching up from being out!

There should be just one more PR after this one related to schema resolution.

Contributor

@alamb alamb left a comment

Thanks @jecsand838 -- I went through this PR at a fairly high level and it looks good to me

}
}

fn resolve_enums(
Contributor

Could you add some more context about Avro enums here for future readers / maintainers?

I think more or less copying the contents of this PR's description in "Rationale for this change" is probably good enough.

Contributor Author

@alamb I just pushed up a detailed comment on enums with links and examples. Let me know what you think!

@jecsand838 jecsand838 force-pushed the avro-schema-resolution-enum-mappings branch from 7914444 to 17689ac Compare September 5, 2025 22:18
@jecsand838 jecsand838 force-pushed the avro-schema-resolution-enum-mappings branch from 17689ac to 0f1ae72 Compare September 5, 2025 22:18
@alamb alamb merged commit 8c80fe1 into apache:main Sep 6, 2025
23 checks passed
@alamb
Contributor

alamb commented Sep 6, 2025

Thanks @jecsand838 -- I took another look and it looked great to me

@jecsand838 jecsand838 deleted the avro-schema-resolution-enum-mappings branch September 8, 2025 02:51
alamb added a commit that referenced this pull request Sep 11, 2025
…row-avro codec (#8292)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns
behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out
the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8124 (schema resolution & type promotion for
the decoder), #8223 (enum mapping for schema resolution). These previous
efforts established the foundations that this PR extends to default
values and additional resolvable types.

# Rationale for this change

Avro’s **schema resolution** requires readers to reconcile differences
between the writer and reader schemas, including:
- Using record-field **default values** when the writer lacks a field
present in the reader; defaults must be type-correct (e.g., union
defaults match the first union member; bytes/fixed defaults are JSON
strings).
- Recursively resolving **arrays** (by item schema) and **maps** (by
value schema).
- Resolving **fixed** types (size and unqualified name must match) and
erroring when they do not.

Prior to this change, arrow-avro’s resolution handled some cases but
lacked full Codec support for **default values** and for resolving
**array/map/fixed** shapes between writer and reader. This led to gaps
when reading evolved data or datasets produced by heterogeneous systems.
This PR implements these missing pieces so the Arrow reader behaves per
the spec in common evolution scenarios.

# What changes are included in this PR?

This PR modifies **`arrow-avro/src/codec.rs`** to extend the
schema-resolution path:

- **Default value handling** for record fields
  - Reads and applies default values when the reader expects a field
    absent from the writer, including **nested defaults**.
  - Validates defaults per the Avro spec (e.g., union defaults match the
    first schema; bytes/fixed defaults are JSON strings).

- **Array / Map / Fixed schema resolution**
  - **Array**: recursively resolves item schemas (writer↔reader).
  - **Map**: recursively resolves value schemas.
  - **Fixed**: enforces matching size and (unqualified) name; otherwise
    signals an error, consistent with the spec.

- **Codec updates**
  - Refactors internal codec logic to support the above during decoding,
    including resolution for **record fields** and **nested defaults**. (See
    commit message for the high-level summary.)
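
As an illustration of the **Fixed** rule above, here is a minimal sketch, not the code in `codec.rs`. `FixedSchema`, `unqualified`, and `resolve_fixed` are hypothetical names, and the alias check assumes the usual Avro convention that aliases are declared on the reader schema.

```rust
/// Hypothetical, simplified view of an Avro `fixed` schema.
struct FixedSchema<'a> {
    name: &'a str,           // may be namespace-qualified, e.g. "com.example.Md5"
    aliases: &'a [&'a str],  // alternate names declared on this schema
    size: usize,             // number of bytes
}

/// Strips any namespace prefix: "com.example.Md5" becomes "Md5".
fn unqualified(name: &str) -> &str {
    name.rsplit('.').next().unwrap_or(name)
}

/// Resolves a writer `fixed` against a reader `fixed`.
/// Returns the shared size on success, or an error describing the mismatch.
fn resolve_fixed(writer: &FixedSchema<'_>, reader: &FixedSchema<'_>) -> Result<usize, String> {
    if writer.size != reader.size {
        return Err(format!(
            "fixed size mismatch: writer {} vs reader {}",
            writer.size, reader.size
        ));
    }
    let writer_name = unqualified(writer.name);
    // The reader matches by its own unqualified name or by one of its aliases.
    let name_matches = unqualified(reader.name) == writer_name
        || reader.aliases.iter().any(|a| unqualified(a) == writer_name);
    if !name_matches {
        return Err(format!(
            "fixed name mismatch: writer '{}' vs reader '{}'",
            writer.name, reader.name
        ));
    }
    Ok(reader.size)
}
```

In this sketch a writer `com.example.Md5` of size 16 resolves against a reader `org.other.Md5` of size 16, because only the unqualified names are compared, while a reader of size 32 or a reader named `Sha` without an `Md5` alias is rejected.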

# Are these changes tested?

**Yes.** This PR includes new unit tests in `arrow-avro/src/codec.rs`
covering:

1) **Default validation & persistence**
   - `Null`/union‑nullability rules; metadata persistence of defaults
     (`AVRO_FIELD_DEFAULT_METADATA_KEY`).
2) **`AvroLiteral` parsing**
   - Range checks for `i32`/`f32`; correct literals for `i64`/`f64`;
     `Utf8`/`Utf8View`; `uuid` strings (RFC‑4122).
   - Byte‑range mapping for `bytes`/`fixed` defaults; `Fixed(n)` length
     enforcement; `decimal` on `fixed` vs `bytes`; `duration`/interval fixed
     **12**‑byte enforcement.
3) **Collections & records**
   - Array/map defaults shape; enum symbol validity; record defaults for
     missing fields, required‑field errors, and honoring field‑level
     defaults; skip‑fields retained for writer‑only fields.
4) **Resolution mechanics**
   - Element **promotion** (`int` to `long`) for arrays; **reader metadata
     precedence** for colliding attributes; `fixed` name/size match including
     **alias**.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Andrew Lamb <[email protected]>
mbrobbel pushed a commit that referenced this pull request Sep 16, 2025
# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns
behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out
the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8292 (Add array/map/fixed schema resolution
and default value support to arrow-avro codec), #8124 (schema resolution
& type promotion for the decoder), #8223 (enum mapping for schema
resolution). These previous efforts established the foundations that
this PR extends to default values and additional resolvable types.

# Rationale for this change

Avro’s specification requires readers to materialize default values when
a field exists in the **reader** schema but not in the **writer**
schema, and to validate defaults (e.g., union defaults must match the
first branch; bytes/fixed defaults must be JSON strings; enums may
specify a default symbol for unknown writer symbols). Implementing this
behavior makes `arrow-avro` more standards‑compliant and improves
interoperability with evolving schemas.

# What changes are included in this PR?

**High‑level summary**

* **Refactor `RecordDecoder`** around a simpler **`Projector`**‑style
abstraction that consumes `ResolvedRecord` to: (a) skip writer‑only
fields, and (b) materialize reader‑only defaulted fields, reducing
branching in the hot path. (See commit subject and record decoder
changes.)

**Touched files (2):**

* `arrow-avro/src/reader/record.rs` - refactor decoder to use
precomputed mappings and defaults.
* `arrow-avro/src/reader/mod.rs` - add comprehensive tests for defaults
and error cases (see below).
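
To make the two projection duties concrete (skipping writer-only fields and materializing reader-only defaults), here is a minimal sketch under stated assumptions. It is not the `RecordDecoder` or `Projector` from this PR: `Value`, `ReaderField`, `Projection`, `plan_projection`, and `project_row` are hypothetical, fields are matched by name only (no aliases), and rows are plain vectors rather than Arrow builders.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a decoded Avro value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i32),
    Str(String),
}

/// Hypothetical reader field: a name plus an optional default value.
struct ReaderField {
    name: &'static str,
    default: Option<Value>,
}

/// Precomputed plan, built once per (writer, reader) schema pair.
struct Projection {
    /// Per writer field, in writer order: Some(reader slot) to keep, None to skip.
    writer_to_reader: Vec<Option<usize>>,
    /// (reader slot, default value) for fields the writer does not provide.
    defaults: Vec<(usize, Value)>,
    reader_len: usize,
}

fn plan_projection(
    writer_fields: &[&str],
    reader_fields: &[ReaderField],
) -> Result<Projection, String> {
    let reader_slots: HashMap<&str, usize> = reader_fields
        .iter()
        .enumerate()
        .map(|(i, f)| (f.name, i))
        .collect();
    let writer_to_reader: Vec<Option<usize>> = writer_fields
        .iter()
        .map(|w| reader_slots.get(*w).copied())
        .collect();
    let mut defaults = Vec::new();
    for (slot, field) in reader_fields.iter().enumerate() {
        if writer_to_reader.contains(&Some(slot)) {
            continue; // the writer supplies this field
        }
        match &field.default {
            Some(v) => defaults.push((slot, v.clone())),
            None => {
                return Err(format!(
                    "reader field '{}' is missing from the writer and has no default",
                    field.name
                ))
            }
        }
    }
    Ok(Projection { writer_to_reader, defaults, reader_len: reader_fields.len() })
}

/// Applies the plan to one row given in writer field order.
fn project_row(plan: &Projection, writer_row: &[Value]) -> Result<Vec<Value>, String> {
    if writer_row.len() != plan.writer_to_reader.len() {
        return Err("row length does not match the writer schema".to_string());
    }
    let mut out: Vec<Option<Value>> = vec![None; plan.reader_len];
    for (value, slot) in writer_row.iter().zip(&plan.writer_to_reader) {
        if let Some(slot) = slot {
            out[*slot] = Some(value.clone()); // writer-only fields are skipped
        }
    }
    for (slot, value) in &plan.defaults {
        out[*slot] = Some(value.clone()); // reader-only fields take their default
    }
    out.into_iter()
        .map(|v| v.ok_or_else(|| "unfilled reader slot".to_string()))
        .collect()
}
```

Because every keep, skip, or default decision is made in `plan_projection`, the per-row loop in this sketch only follows the precomputed table, which is the kind of branching reduction the summary above describes.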

# Are these changes tested?

Yes, new integration tests cover both the **happy path** and
**validation errors**:
* `test_schema_resolution_defaults_all_supported_types`: verifies that
defaults for
boolean/int/long/float/double/bytes/string/date/time/timestamp/decimal/fixed/enum/duration/uuid/array/map/nested
record and unions are materialized correctly for all rows.
* `test_schema_resolution_default_enum_invalid_symbol_errors`: invalid
enum default symbol is rejected.
* `test_schema_resolution_default_fixed_size_mismatch_errors`:
mismatched fixed/bytes default lengths are rejected.

These tests assert the Avro‑spec behavior (e.g., union defaults must
match the first branch; bytes/fixed defaults use JSON strings).

# Are there any user-facing changes?

N/A
alamb added a commit that referenced this pull request Sep 24, 2025
…art 2) (#8349)

# Which issue does this PR close?

This work continues arrow-avro schema resolution support and aligns
behavior with the Avro spec.

- **Related to**: #4886 (“Add Avro Support”): ongoing work to round out
the reader/decoder, including schema resolution and type promotion.
- **Follow-ups/Context**: #8348 (Add arrow-avro Reader support for Dense
Union and Union resolution (Part 1)), #8293 (Add projection with default
values support to RecordDecoder), #8124 (schema resolution & type
promotion for the decoder), #8223 (enum mapping for schema resolution).
These previous efforts established the foundations that this PR extends
to Union types and Union resolution.

# Rationale for this change

`arrow-avro` lacked end‑to‑end support for Avro unions and Arrow `Union`
schemas. Many Avro datasets rely on unions (e.g., `["null","string"]`,
tagged unions of different records), and without schema‐level resolution
and JSON encoding the crate could not interoperate cleanly. This PR
completes the initial Decoder support for Union types and Union
resolution.

# What changes are included in this PR?

* Decoder support for Dense Union decoding and Union resolution.

# Are these changes tested?

Yes. New detailed end-to-end integration tests have been added to
`reader/mod.rs`, and unit tests covering the new Union and Union
resolution functionality are included in the `reader/record.rs` file.

# Are there any user-facing changes?

N/A

---------

Co-authored-by: Ryan Johnson <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>