Skip to content

Conversation

@nathaniel-d-ef
Copy link
Contributor

@nathaniel-d-ef nathaniel-d-ef commented May 28, 2025

Which issue does this PR close?

Part of 4886

Related to 6965

Rationale for this change

This change expands upon the Avro reader logic by adding full support for the Fixed and Uuid types (Uuid relies on Fixed). It builds out the Fixed path currently stubbed out.

What changes are included in this PR?

Adds Fixed and Uuid support to the arrow-avro crate with changes to the following:

  1. arrow-avro/src/codec.rs
  • Adds support for Uuid type
  • Adds test
  1. arrow-avro/src/reader/cursor.rs:
  • Adds a get_fixed helper method to read the specified bytes into a buffer.
  1. arrow-avro/src/reader/record.rs:
  • Introduces the Fixed decoding path, building out the nyi Codec::Fixed in the Decoder.
  • Introduces the Uuid decoding path, building off of Fixed
  • Adds tests.

Are there any user-facing changes?

n/a

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 28, 2025
@nathaniel-d-ef nathaniel-d-ef changed the title Add Fixed support to arrow-avro Add Fixed, Uuid support to arrow-avro May 28, 2025
/// Represents Avro fixed type, maps to Arrow's FixedSizeBinary data type
/// The i32 parameter indicates the fixed binary size
Fixed(i32),
/// Represents Avro Uuid type, a FixedSizeBinary with a length of 16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the rfc link, it says that A UUID is 128 bits long, and requires no central registration process, does the length 16 here suffice for that?

for the uuid crate, the length for the example 67e55044-10b1-426f-9247-bb680e5fe0c8 is 36

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should; 16 bytes == 128 bits. It looks to me like the UUID crate specifies 16 & 128 as well. Do you mean character count? That's independent of byte size.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks

}

#[test]
fn test_fixed_decoding() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these tests. Do we need to add a test for the UUID type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems unnecessary to me, as it would be nearly identical to fixed but with a hard-coded length. I can add one though, if necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure about this; we could involve a maintainer for a double check.
just noticed that we already have tests for other types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a test explicitly for reading uuids is important (maybe it is covered by end to end coverage somewhere else however)

I think it is important to ensure we don't accidentally break UUID handling in the future\

Perhaps we can add one in a follow on PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb We don't have an E2E test covering UUID. I'll be sure to create an Avro UUID test file (unless there's already one you'd want us to use) and include an E2E UUID test along with the other E2E tests in our second to last PR. Our final PR would then contain the completed arrow-avro implementations of the ReaderBuilder and Reader.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(unless there's already one you'd want us to use)

Not that i know of -- but I am not familar with apache avro -- if there is some sort of standardized test corpus it would be great to use that

I took a look around quickly but couldn't find anything like https://github.com/apache/parquet-testing for avro 🤔

Introduced handling for Avro `fixed` type in the decoder, enabling proper initialization, decoding, and array construction. Added corresponding tests to validate decoding functionality and edge cases like empty arrays. Implemented `get_fixed` method in `AvroCursor` for precise byte reading.
Introduce a new `Codec::Uuid` variant to handle Avro Uuid logical types, mapping it to a `FixedSizeBinary` of length 16 in Arrow. This includes updates to codec handling, relevant match branches, and a unit test to validate functionality.
Copy link
Member

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution, LGTM % whether to need to add a test for uuid type.

/// Represents Avro fixed type, maps to Arrow's FixedSizeBinary data type
/// The i32 parameter indicates the fixed binary size
Fixed(i32),
/// Represents Avro Uuid type, a FixedSizeBinary with a length of 16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks

}

#[test]
fn test_fixed_decoding() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure about this; we could involve a maintainer for a double check.
just noticed that we already have tests for other types.

@nathaniel-d-ef
Copy link
Contributor Author

@klion26 Thank you for your review; I went ahead and added the test.

@jecsand838
Copy link
Contributor

@alamb @tustvold Any ideas when you'd be able to give this one a final review? The next set of PRs I was going to open depend on this one getting approved and merged in.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nathaniel-d-ef and @klion26 -- this looks like an improvement to me

@klion26 once you are happy that a PR is looking good please feel free to @ mention me for a final stamp of approval / merge

}

#[test]
fn test_fixed_decoding() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a test explicitly for reading uuids is important (maybe it is covered by end to end coverage somewhere else however)

I think it is important to ensure we don't accidentally break UUID handling in the future\

Perhaps we can add one in a follow on PR

let mut decoder = Decoder::try_new(&avro_type).expect("Failed to create decoder");

let data1 = [1u8, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16];
let mut cursor1 = AvroCursor::new(&data1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to extend this test, perhaps in a follow on PR, to ensure the UUID was read correctly

@alamb
Copy link
Contributor

alamb commented Jun 27, 2025

@nathaniel-d-ef can. you please merge main into this PR and run cargo fmt so we can get a clean CI run ? Then I'll merge it

@alamb @tustvold Any ideas when you'd be able to give this one a final review? The next set of PRs I was going to open depend on this one getting approved and merged in.

@jecsand838 sorry for the delay

@jecsand838
Copy link
Contributor

@jecsand838 sorry for the delay

@alamb No worries at all! Thank you so much for responding and getting back so quickly.

@alamb alamb merged commit 674dc17 into apache:main Jun 28, 2025
23 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 28, 2025

Thanks again everyone

@klion26
Copy link
Member

klion26 commented Jun 29, 2025

Thank you @nathaniel-d-ef and @klion26 -- this looks like an improvement to me

@klion26 once you are happy that a PR is looking good please feel free to @ mention me for a final stamp of approval / merge

@alamb got it, will ping you in the future

@jecsand838 jecsand838 deleted the avro-codec-fixed branch June 30, 2025 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants