-
Notifications
You must be signed in to change notification settings - Fork 1k
Add Fixed, Uuid support to arrow-avro #7557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| /// Represents Avro fixed type, maps to Arrow's FixedSizeBinary data type | ||
| /// The i32 parameter indicates the fixed binary size | ||
| Fixed(i32), | ||
| /// Represents Avro Uuid type, a FixedSizeBinary with a length of 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the rfc link, it says that A UUID is 128 bits long, and requires no central registration process, does the length 16 here suffice for that?
for the uuid crate, the length for the example 67e55044-10b1-426f-9247-bb680e5fe0c8 is 36
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should; 16 bytes == 128 bits. It looks to me like the UUID crate specifies 16 & 128 as well. Do you mean character count? That's independent of byte size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, thanks
| } | ||
|
|
||
| #[test] | ||
| fn test_fixed_decoding() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these tests. Do we need to add a test for the UUID type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems unnecessary to me, as it would be nearly identical to fixed but with a hard-coded length. I can add one though, if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure about this; we could involve a maintainer for a double check.
just noticed that we already have tests for other types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a test explicitly for reading uuids is important (maybe it is covered by end to end coverage somewhere else however)
I think it is important to ensure we don't accidentally break UUID handling in the future\
Perhaps we can add one in a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb We don't have an E2E test covering UUID. I'll be sure to create an Avro UUID test file (unless there's already one you'd want us to use) and include an E2E UUID test along with the other E2E tests in our second to last PR. Our final PR would then contain the completed arrow-avro implementations of the ReaderBuilder and Reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(unless there's already one you'd want us to use)
Not that i know of -- but I am not familar with apache avro -- if there is some sort of standardized test corpus it would be great to use that
I took a look around quickly but couldn't find anything like https://github.com/apache/parquet-testing for avro 🤔
Introduced handling for Avro `fixed` type in the decoder, enabling proper initialization, decoding, and array construction. Added corresponding tests to validate decoding functionality and edge cases like empty arrays. Implemented `get_fixed` method in `AvroCursor` for precise byte reading.
…d minimize diffs with earlier iteration
Introduce a new `Codec::Uuid` variant to handle Avro Uuid logical types, mapping it to a `FixedSizeBinary` of length 16 in Arrow. This includes updates to codec handling, relevant match branches, and a unit test to validate functionality.
6de7cc7 to
23cc74d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution, LGTM % whether to need to add a test for uuid type.
| /// Represents Avro fixed type, maps to Arrow's FixedSizeBinary data type | ||
| /// The i32 parameter indicates the fixed binary size | ||
| Fixed(i32), | ||
| /// Represents Avro Uuid type, a FixedSizeBinary with a length of 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, thanks
| } | ||
|
|
||
| #[test] | ||
| fn test_fixed_decoding() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure about this; we could involve a maintainer for a double check.
just noticed that we already have tests for other types.
|
@klion26 Thank you for your review; I went ahead and added the test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @nathaniel-d-ef and @klion26 -- this looks like an improvement to me
@klion26 once you are happy that a PR is looking good please feel free to @ mention me for a final stamp of approval / merge
| } | ||
|
|
||
| #[test] | ||
| fn test_fixed_decoding() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a test explicitly for reading uuids is important (maybe it is covered by end to end coverage somewhere else however)
I think it is important to ensure we don't accidentally break UUID handling in the future\
Perhaps we can add one in a follow on PR
| let mut decoder = Decoder::try_new(&avro_type).expect("Failed to create decoder"); | ||
|
|
||
| let data1 = [1u8, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]; | ||
| let mut cursor1 = AvroCursor::new(&data1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to extend this test, perhaps in a follow on PR, to ensure the UUID was read correctly
|
@nathaniel-d-ef can. you please merge @alamb @tustvold Any ideas when you'd be able to give this one a final review? The next set of PRs I was going to open depend on this one getting approved and merged in. @jecsand838 sorry for the delay |
@alamb No worries at all! Thank you so much for responding and getting back so quickly. |
|
Thanks again everyone |
@alamb got it, will ping you in the future |
Which issue does this PR close?
Part of 4886
Related to 6965
Rationale for this change
This change expands upon the Avro reader logic by adding full support for the Fixed and Uuid types (Uuid relies on Fixed). It builds out the
Fixedpath currently stubbed out.What changes are included in this PR?
Adds
FixedandUuidsupport to the arrow-avro crate with changes to the following:Uuidtypeget_fixedhelper method to read the specified bytes into a buffer.nyiCodec::Fixedin theDecoder.Are there any user-facing changes?
n/a