Skip to content

Conversation

@nathaniel-d-ef
Copy link
Contributor

@nathaniel-d-ef nathaniel-d-ef commented Sep 8, 2025

Which issue does this PR close?

Rationale for this change

This PR extends on work introduced in #8274, adding additional complex type support to the Avro writer. This brings us closer to a complete round-trip capability and Avro spec support in the arrow-avro crate.

What changes are included in this PR?

New encoders:
Fixed
UUID
IntervalMonthDayNano
IntervalYearMonth
IntervalDayTime
Decimal32
Decimal64
Decimal128
Decimal256

Corresponding changes in support of these encoders in FieldEncoder and FieldPlan

Are these changes tested?

Yes, additional complex type unit tests have been added. Benchmark tests have also been written but are being omitted here to keep the diff manageable. All tests, new and existing, pass.

Are there any user-facing changes?

n/a, arrow-avro crate is not yet public

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Sep 8, 2025
@nathaniel-d-ef nathaniel-d-ef reopened this Sep 8, 2025
@jecsand838
Copy link
Contributor

jecsand838 commented Sep 8, 2025

@nathaniel-d-ef Thanks for getting this up!

I did a quick pre-review glance over and had two high-level items:

  1. We'll probably want to feature flag Decimal32 and Decimal64 using small_decimals.
  2. I'd bring over the round trip integration tests for these new types in arrow-avro/src/writer/mod.rs. I know that makes the PR bigger, but it also verifies the functionality. You could always split this into two independent PRs if it get's too large, i.e. (Fixed, Decimal, Uuid, etc.) and (Map and Enum, etc.).

@nathaniel-d-ef
Copy link
Contributor Author

@jecsand838 Both good points. I added the feature flag and pulled out the Enum and Map to make this more manageable. I'll follow up with a new PR including those and the round trip tests.

@nathaniel-d-ef nathaniel-d-ef marked this pull request as ready for review September 10, 2025 15:36
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contributon @nathaniel-d-ef -- the code looks good to me and I think this is close to being ready to merge. I think we need a few more tests, however. I left some comments - let me know what you think

assert_bytes_eq(&got, &expected);
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need coverage for the other types? For example, FixedSizeBinary, UUID, etc?

Here is the list from the PR description

Fixed
UUID
IntervalMonthDayNano
IntervalYearMonth
IntervalDayTime
Decimal32
Decimal64
Decimal128
Decimal256

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure, I'll push additional unit test coverage.

assert_bytes_eq(&got, &expected);
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also please add some "end to end" round trip tests here?

Specifically

  1. Create an Arrow array of the relevant type
  2. Write the array an avro file
  3. Read the avro file back
  4. Assert that the array that comes back is the same as was written

maybe @jecsand838 knows of existing tests which we can pattern new tests on

Copy link
Contributor

@jecsand838 jecsand838 Sep 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nathaniel-d-ef You could probably do something like this for the end to end tests in arrow-avro/src/writer/mod.rs:

    #[test]
    fn test_roundtrip_decimals_via_writer() -> Result<(), ArrowError> {
        let files: [(&str, bool); 8] = [
            ("avro/fixed_length_decimal.avro", true), // fixed-backed -> Decimal128(25,2)
            ("avro/fixed_length_decimal_legacy.avro", true), // legacy fixed[8] -> Decimal64(13,2)
            ("avro/int32_decimal.avro", true),        // bytes-backed -> Decimal32(4,2)
            ("avro/int64_decimal.avro", true),        // bytes-backed -> Decimal64(10,2)
            ("test/data/int256_decimal.avro", false), // bytes-backed -> Decimal256(76,2)
            ("test/data/fixed256_decimal.avro", false), // fixed[32]-backed -> Decimal256(76,10)
            ("test/data/fixed_length_decimal_legacy_32.avro", false), // legacy fixed[4] -> Decimal32(9,2)
            ("test/data/int128_decimal.avro", false), // bytes-backed -> Decimal128(38,2)
        ];
        for (rel, in_test_data_dir) in files {
            // Resolve path the same way as reader::test_decimal
            let path: String = if in_test_data_dir {
                arrow_test_data(rel)
            } else {
                PathBuf::from(env!("CARGO_MANIFEST_DIR"))
                    .join(rel)
                    .to_string_lossy()
                    .into_owned()
            };
            // Read original file into a single RecordBatch for comparison
            let f_in = File::open(&path).expect("open input avro");
            let mut rdr = ReaderBuilder::new().build(BufReader::new(f_in))?;
            let in_schema = rdr.schema();
            let in_batches = rdr.collect::<Result<Vec<_>, _>>()?;
            let original =
                arrow::compute::concat_batches(&in_schema, &in_batches).expect("concat input");
            // Write it out with the OCF writer (no special compression)
            let tmp = NamedTempFile::new().expect("create temp file");
            let out_path = tmp.into_temp_path();
            let out_file = File::create(&out_path).expect("create temp avro");
            let mut writer = AvroWriter::new(out_file, original.schema().as_ref().clone())?;
            writer.write(&original)?;
            writer.finish()?;
            // Read back the file we just wrote and compare equality (schema + data)
            let f_rt = File::open(&out_path).expect("open roundtrip avro");
            let mut rt_rdr = ReaderBuilder::new().build(BufReader::new(f_rt))?;
            let rt_schema = rt_rdr.schema();
            let rt_batches = rt_rdr.collect::<Result<Vec<_>, _>>()?;
            let roundtrip =
                arrow::compute::concat_batches(&rt_schema, &rt_batches).expect("concat rt");
            assert_eq!(roundtrip, original, "decimal round-trip mismatch for {rel}");
        }
        Ok(())
    }

EDIT: Just be sure to add the small_decimals feature flag into test_roundtrip_decimals_via_writer and to add a second test using the arrow-avro/test/data/duration_uuid.avro file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Yes, good callout. I have additional round trip tests a fast-follow up draft PR, but I'll apply @jecsand838 's recommendation here too for decimals. I also plan to do a wrap-up PR to cover any gaps in the tests, docs, etc.

Copy link
Contributor

@jecsand838 jecsand838 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nathaniel-d-ef Just finished reviewing the code. Looks really good. I just had some additional suggestions for improving performance and maintainability.

…ng; consolidate interval encoding logic and add extensive unit tests.
@alamb
Copy link
Contributor

alamb commented Sep 12, 2025

@alamb Yes, good callout. I have additional round trip tests a fast-follow up draft PR, but I'll apply @jecsand838 's recommendation here too for decimals. I also plan to do a wrap-up PR to cover any gaps in the tests, docs, etc.

Round trip tests in a follow on PR sounds good to me -- thank you

Note I still think there is value is direct round trip tests where the source data was created directly in memory as Arrow rather than being read from an existing file. This would ensure there is a pattern to cover all potential data types and would be easy to add additional cases to (without having to create / check in a avro file from some other systems)

However, the round trip tests in #8308 look reasonable

Thanks for pushing this along. Please just give me a ping when it is ready for another look

}
DataType::Duration(_) => {
return Err(ArrowError::NotYetImplemented(
"Avro writer: Arrow Duration(TimeUnit) has no standard Avro mapping; cast to Interval(MonthDayNano) to use Avro 'duration'".into(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be completed in a follow up PR

@nathaniel-d-ef
Copy link
Contributor Author

@alamb Yes, good callout. I have additional round trip tests a fast-follow up draft PR, but I'll apply @jecsand838 's recommendation here too for decimals. I also plan to do a wrap-up PR to cover any gaps in the tests, docs, etc.

Round trip tests in a follow on PR sounds good to me -- thank you

Note I still think there is value is direct round trip tests where the source data was created directly in memory as Arrow rather than being read from an existing file. This would ensure there is a pattern to cover all potential data types and would be easy to add additional cases to (without having to create / check in a avro file from some other systems)

However, the round trip tests in #8308 look reasonable

Thanks for pushing this along. Please just give me a ping when it is ready for another look

@alamb That's a good point that I hadn't considered. I'll work on that pattern for the coming PRs. Thanks again for digging in here; it's good for another review.

Copy link
Contributor

@jecsand838 jecsand838 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nathaniel-d-ef -- this looks great.

Thanks as well to @jecsand838 for the help reviewing

@alamb alamb merged commit 70f9250 into apache:main Sep 13, 2025
23 checks passed
alamb added a commit that referenced this pull request Sep 17, 2025
# Which issue does this PR close?

- Part of #4886
- Related to #8274 and
#8298

# Rationale for this change

This PR adds Map and Enum encoders to the arrow-avro crate writer, along
with new benchmark tests for remaining types and round-trip tests.

# What changes are included in this PR?

New encoders:
**Map**
**Enum**

Corresponding changes in support of these encoders in FieldEncoder and
FieldPlan

## Additional round trip tests in `mod.rs`

New tests follow existing file read pattern

- simple_fixed
- duration_uuid
- nonnullable.impala.avro
- decimals
- enum

## Additional benchmark tests for data types

- Utf8
- List<Utf8>
- Struct
- FixedSizeBinary16
- UUID
- IntervalMonthDayNanoDuration
- Decimal32(bytes)
- Decimal64(bytes)
- Decimal128(bytes)
- Decimal128(fixed16)
- Decimal256(bytes)
- Map
- Enum

# Are these changes tested?

Yes, additional complex type unit tests have been added for Map and
Enum. The rest of the PR beyond the new types are tests themselves. All
tests, new and existing, pass.

# Are there any user-facing changes?

n/a, arrow-avro crate is not yet public

---------

Co-authored-by: Connor Sanders <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
@jecsand838 jecsand838 deleted the CAN-58 branch September 22, 2025 04:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-avro arrow-avro crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants