Skip to content

Conversation

@veronica-m-ef
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

This change introduces support for Avro files generated by systems like Impala, which have a specific convention for representing nullable fields. In Avro, nullability is typically represented by a union of a type and a type. This PR updates the Avro reader to correctly interpret these schemas, ensuring proper handling of nullable data and improving interoperability with Impala-generated data. null

What changes are included in this PR?

This pull request introduces several changes to support Impala-style nullability in the Avro reader:

  • The Avro schema parser has been updated to recognize unions where is the second type (e.g., ['type', 'null']) as a nullable field. null
  • Logic has been added to handle this nullability convention during Avro decoding.
  • New tests are included to verify that Avro files using this nullability format are read correctly while ensuring that strict mode properly identifies them.

Are these changes tested?

Yes, I added new test cases covering these changes to the tests named: test_nonnullable_impala, test_nonnullable_impala_strict, test_nullable_impala and test_nullable_impala_strict.

Are there any user-facing changes?

N/A

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 17, 2025
Comment on lines 304 to 316
let union_order = match data_type.nullability() {
None => None,
Some(Nullability::NullFirst) => Some(Nullability::NullFirst),
Some(Nullability::NullSecond) => {
if strict_mode {
return Err(ArrowError::ParseError(
"Found Avro union of the form ['T','null'], which is disallowed in strict_mode"
.to_string(),
));
}
Some(Nullability::NullSecond)
}
};
Copy link
Contributor

@jecsand838 jecsand838 Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veronica-m-ef After looking through this again, I'm beginning to think it may make more sense to move this check to the Schema::Union(f) branch inside the make_data_type method in codec.rs. That way we can determine whether the Avro Schema is valid when strict_mode is true before attempting to construct the RecordDecoder.

By doing this we'd get an error earlier on line 224 in mod.rs, i.e. let root_field = AvroField::try_from(schema)?;

You'd need to find a way to pass down strict_mode into AvroField though. Maybe we could implement an AvroFieldBuilder in codec.rs containing with_utf8view and with_strict_mode methods to achieve this?

Then in the make_data_type method we could do:

        Schema::Union(f) => {
            // Special case the common case of nullable primitives
            let null = f
                .iter()
                .position(|x| x == &Schema::TypeName(TypeName::Primitive(PrimitiveType::Null)));
            match (f.len() == 2, null) {
                (true, Some(0)) => {
                    let mut field = make_data_type(&f[1], namespace, resolver, use_utf8view, strict_mode)?;
                    field.nullability = Some(Nullability::NullFirst);
                    Ok(field)
                }
                (true, Some(1)) => {
                    if strict_mode {
                        return Err(ArrowError::SchemaError(
                            "Found Avro union of the form ['T','null'], which is disallowed in strict_mode"
                                .to_string(),
                        ));
                    }
                    let mut field = make_data_type(&f[0], namespace, resolver, use_utf8view, strict_mode)?;
                    field.nullability = Some(Nullability::NullSecond);
                    Ok(field)
                }
                _ => Err(ArrowError::NotYetImplemented(format!(
                    "Union of {f:?} not currently supported"
                ))),
            }
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jecsand838 thank you for your comment. You're absolutely right, I have implemented the necessary changes and removed the strict_mode check in the Decoder::try_new method. Additionally I removed the strict_mode option from the RecordDecoder as it wouldn't be needed anymore. Let me know what you think :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veronica-m-ef Looking good! I just left three small follow-up suggestions.

@alamb @scovich Would either of you be able to give this a review when you get a chance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jecsand838 thank you! All done now :)

Co-authored-by: Connor Sanders <[email protected]>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @veronica-m-ef and @jecsand838

I started the CI checks on this PR. Please let me know when you think it is ready to merge

self.utf8_view,
self.strict_mode,
)
let root_field = AvroFieldBuilder::new(schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this field builder API looks very nice to me -- thank you @veronica-m-ef

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb no errors from clippy locally, we're good to go now :)

@alamb alamb merged commit 291e6e5 into apache:main Jul 21, 2025
23 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 21, 2025

Thanks again @veronica-m-ef

@jecsand838 jecsand838 deleted the avro-code-nulls branch July 22, 2025 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants