-
Notifications
You must be signed in to change notification settings - Fork 1k
Add projection with default values support to RecordDecoder
#8293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add projection with default values support to RecordDecoder
#8293
Conversation
RecordDecoder to simplify s…RecordDecoder
da5c105 to
b4f2a0c
Compare
…chema resolution with `Projector` abstraction. Enables efficient skipping of writer-only fields and improves handling of default field values, enums, and resolved mappings.
b4f2a0c to
e948975
Compare
RecordDecoderRecordDecoder
5c00977 to
4250668
Compare
90fed9f to
1e778ed
Compare
1e778ed to
2c586b8
Compare
eccfff0 to
e5551ce
Compare
|
@alamb Would you be able to review this PR when you have a chance? About half of the code is tests, so ~650 lines of added production code. Also after this PR, there's two Reader PRs left related to Union support before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jecsand838
86219f7 to
12d7a47
Compare
@mbrobbel Thank you so much for the review and those catches. I went ahead and pushed up changes addressing your comments. |
|
Thanks @mbrobbel and @jecsand838 |
…art 2) (#8349) # Which issue does this PR close? This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec. - **Related to**: #4886 (“Add Avro Support”): ongoing work to round out the reader/decoder, including schema resolution and type promotion. - **Follow-ups/Context**: #8348 (Add arrow-avro Reader support for Dense Union and Union resolution (Part 1)), #8293 (Add projection with default values support to RecordDecoder), #8124 (schema resolution & type promotion for the decoder), #8223 (enum mapping for schema resolution). These previous efforts established the foundations that this PR extends to Union types and Union resolution. # Rationale for this change `arrow-avro` lacked end‑to‑end support for Avro unions and Arrow `Union` schemas. Many Avro datasets rely on unions (i.e.., `["null","string"]`, tagged unions of different records), and without schema‐level resolution and JSON encoding the crate could not interoperate cleanly. This PR complete the initial Decoder support for Union types and Union resolution. # What changes are included in this PR? * Decoder support for Dense Union decoding and Union resolution. # Are these changes tested? Yes, New detailed end to end integration tests have been added to `reader/mod.rs` and unit tests covering the new Union and Union resolution functionality are included in the `reader/record.rs` file. # Are there any user-facing changes? N/A --------- Co-authored-by: Ryan Johnson <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
This work continues arrow-avro schema resolution support and aligns behavior with the Avro spec.
Rationale for this change
Avro’s specification requires readers to materialize default values when a field exists in the reader schema but not in the writer schema, and to validate defaults (i.e., union defaults must match the first branch; bytes/fixed defaults must be JSON strings; enums may specify a default symbol for unknown writer symbols). Implementing this behavior makes
arrow-avromore standards‑compliant and improves interoperability with evolving schemas.What changes are included in this PR?
High‑level summary
Refactor
RecordDecoderaround a simplerProjector‑style abstraction that consumesResolvedRecordto: (a) skip writer‑only fields, and (b) materialize reader‑only defaulted fields, reducing branching in the hot path. (See commit subject and record decoder changes.)Touched files (2):
arrow-avro/src/reader/record.rs- refactor decoder to use precomputed mappings and defaults.arrow-avro/src/reader/mod.rs- add comprehensive tests for defaults and error cases (see below).Are these changes tested?
Yes, new integration tests cover both the happy path and validation errors:
test_schema_resolution_defaults_all_supported_types: verifies that defaults for boolean/int/long/float/double/bytes/string/date/time/timestamp/decimal/fixed/enum/duration/uuid/array/map/nested record and unions are materialized correctly for all rows.test_schema_resolution_default_enum_invalid_symbol_errors: invalid enum default symbol is rejected.test_schema_resolution_default_fixed_size_mismatch_errors: mismatched fixed/bytes default lengths are rejected.These tests assert the Avro‑spec behavior (i.e., union defaults must match the first branch; bytes/fixed defaults use JSON strings).
Are there any user-facing changes?
N/A