-
Couldn't load subscription status.
- Fork 1k
Add arrow-avro support for Impala Nullability #7954
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fb790fb to
848de3b
Compare
arrow-avro/src/reader/record.rs
Outdated
| let union_order = match data_type.nullability() { | ||
| None => None, | ||
| Some(Nullability::NullFirst) => Some(Nullability::NullFirst), | ||
| Some(Nullability::NullSecond) => { | ||
| if strict_mode { | ||
| return Err(ArrowError::ParseError( | ||
| "Found Avro union of the form ['T','null'], which is disallowed in strict_mode" | ||
| .to_string(), | ||
| )); | ||
| } | ||
| Some(Nullability::NullSecond) | ||
| } | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@veronica-m-ef After looking through this again, I'm beginning to think it may make more sense to move this check to the Schema::Union(f) branch inside the make_data_type method in codec.rs. That way we can determine whether the Avro Schema is valid when strict_mode is true before attempting to construct the RecordDecoder.
By doing this we'd get an error earlier on line 224 in mod.rs, i.e. let root_field = AvroField::try_from(schema)?;
You'd need to find a way to pass down strict_mode into AvroField though. Maybe we could implement an AvroFieldBuilder in codec.rs containing with_utf8view and with_strict_mode methods to achieve this?
Then in the make_data_type method we could do:
Schema::Union(f) => {
// Special case the common case of nullable primitives
let null = f
.iter()
.position(|x| x == &Schema::TypeName(TypeName::Primitive(PrimitiveType::Null)));
match (f.len() == 2, null) {
(true, Some(0)) => {
let mut field = make_data_type(&f[1], namespace, resolver, use_utf8view, strict_mode)?;
field.nullability = Some(Nullability::NullFirst);
Ok(field)
}
(true, Some(1)) => {
if strict_mode {
return Err(ArrowError::SchemaError(
"Found Avro union of the form ['T','null'], which is disallowed in strict_mode"
.to_string(),
));
}
let mut field = make_data_type(&f[0], namespace, resolver, use_utf8view, strict_mode)?;
field.nullability = Some(Nullability::NullSecond);
Ok(field)
}
_ => Err(ArrowError::NotYetImplemented(format!(
"Union of {f:?} not currently supported"
))),
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jecsand838 thank you for your comment. You're absolutely right, I have implemented the necessary changes and removed the strict_mode check in the Decoder::try_new method. Additionally I removed the strict_mode option from the RecordDecoder as it wouldn't be needed anymore. Let me know what you think :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@veronica-m-ef Looking good! I just left three small follow-up suggestions.
@alamb @scovich Would either of you be able to give this a review when you get a chance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jecsand838 thank you! All done now :)
Co-authored-by: Connor Sanders <[email protected]>
Co-authored-by: Connor Sanders <[email protected]>
Co-authored-by: Connor Sanders <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me -- thank you @veronica-m-ef and @jecsand838
I started the CI checks on this PR. Please let me know when you think it is ready to merge
| self.utf8_view, | ||
| self.strict_mode, | ||
| ) | ||
| let root_field = AvroFieldBuilder::new(schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this field builder API looks very nice to me -- thank you @veronica-m-ef
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb no errors from clippy locally, we're good to go now :)
|
Thanks again @veronica-m-ef |
Which issue does this PR close?
Rationale for this change
This change introduces support for Avro files generated by systems like Impala, which have a specific convention for representing nullable fields. In Avro, nullability is typically represented by a union of a type and a type. This PR updates the Avro reader to correctly interpret these schemas, ensuring proper handling of nullable data and improving interoperability with Impala-generated data.
nullWhat changes are included in this PR?
This pull request introduces several changes to support Impala-style nullability in the Avro reader:
['type', 'null']) as a nullable field.nullAre these changes tested?
Yes, I added new test cases covering these changes to the tests named:
test_nonnullable_impala,test_nonnullable_impala_strict,test_nullable_impalaandtest_nullable_impala_strict.Are there any user-facing changes?
N/A