-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Adds additional type support to arrow-avro writer #8298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… arrow-avro encoder
…pes including utf8, fixed-size binary, interval, struct, and decimals
…cimal, Map) in arrow-avro
|
@nathaniel-d-ef Thanks for getting this up! I did a quick pre-review glance over and had two high-level items:
|
…xed, Enum, Map, Decimal, and Interval types
|
@jecsand838 Both good points. I added the feature flag and pulled out the Enum and Map to make this more manageable. I'll follow up with a new PR including those and the round trip tests. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contributon @nathaniel-d-ef -- the code looks good to me and I think this is close to being ready to merge. I think we need a few more tests, however. I left some comments - let me know what you think
| assert_bytes_eq(&got, &expected); | ||
| } | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need coverage for the other types? For example, FixedSizeBinary, UUID, etc?
Here is the list from the PR description
Fixed
UUID
IntervalMonthDayNano
IntervalYearMonth
IntervalDayTime
Decimal32
Decimal64
Decimal128
Decimal256
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure, I'll push additional unit test coverage.
| assert_bytes_eq(&got, &expected); | ||
| } | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also please add some "end to end" round trip tests here?
Specifically
- Create an Arrow array of the relevant type
- Write the array an avro file
- Read the avro file back
- Assert that the array that comes back is the same as was written
maybe @jecsand838 knows of existing tests which we can pattern new tests on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nathaniel-d-ef You could probably do something like this for the end to end tests in arrow-avro/src/writer/mod.rs:
#[test]
fn test_roundtrip_decimals_via_writer() -> Result<(), ArrowError> {
let files: [(&str, bool); 8] = [
("avro/fixed_length_decimal.avro", true), // fixed-backed -> Decimal128(25,2)
("avro/fixed_length_decimal_legacy.avro", true), // legacy fixed[8] -> Decimal64(13,2)
("avro/int32_decimal.avro", true), // bytes-backed -> Decimal32(4,2)
("avro/int64_decimal.avro", true), // bytes-backed -> Decimal64(10,2)
("test/data/int256_decimal.avro", false), // bytes-backed -> Decimal256(76,2)
("test/data/fixed256_decimal.avro", false), // fixed[32]-backed -> Decimal256(76,10)
("test/data/fixed_length_decimal_legacy_32.avro", false), // legacy fixed[4] -> Decimal32(9,2)
("test/data/int128_decimal.avro", false), // bytes-backed -> Decimal128(38,2)
];
for (rel, in_test_data_dir) in files {
// Resolve path the same way as reader::test_decimal
let path: String = if in_test_data_dir {
arrow_test_data(rel)
} else {
PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join(rel)
.to_string_lossy()
.into_owned()
};
// Read original file into a single RecordBatch for comparison
let f_in = File::open(&path).expect("open input avro");
let mut rdr = ReaderBuilder::new().build(BufReader::new(f_in))?;
let in_schema = rdr.schema();
let in_batches = rdr.collect::<Result<Vec<_>, _>>()?;
let original =
arrow::compute::concat_batches(&in_schema, &in_batches).expect("concat input");
// Write it out with the OCF writer (no special compression)
let tmp = NamedTempFile::new().expect("create temp file");
let out_path = tmp.into_temp_path();
let out_file = File::create(&out_path).expect("create temp avro");
let mut writer = AvroWriter::new(out_file, original.schema().as_ref().clone())?;
writer.write(&original)?;
writer.finish()?;
// Read back the file we just wrote and compare equality (schema + data)
let f_rt = File::open(&out_path).expect("open roundtrip avro");
let mut rt_rdr = ReaderBuilder::new().build(BufReader::new(f_rt))?;
let rt_schema = rt_rdr.schema();
let rt_batches = rt_rdr.collect::<Result<Vec<_>, _>>()?;
let roundtrip =
arrow::compute::concat_batches(&rt_schema, &rt_batches).expect("concat rt");
assert_eq!(roundtrip, original, "decimal round-trip mismatch for {rel}");
}
Ok(())
}EDIT: Just be sure to add the small_decimals feature flag into test_roundtrip_decimals_via_writer and to add a second test using the arrow-avro/test/data/duration_uuid.avro file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb Yes, good callout. I have additional round trip tests a fast-follow up draft PR, but I'll apply @jecsand838 's recommendation here too for decimals. I also plan to do a wrap-up PR to cover any gaps in the tests, docs, etc.
jecsand838
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nathaniel-d-ef Just finished reviewing the code. Looks really good. I just had some additional suggestions for improving performance and maintainability.
…ng; consolidate interval encoding logic and add extensive unit tests.
Round trip tests in a follow on PR sounds good to me -- thank you Note I still think there is value is direct round trip tests where the source data was created directly in memory as Arrow rather than being read from an existing file. This would ensure there is a pattern to cover all potential data types and would be easy to add additional cases to (without having to create / check in a avro file from some other systems) However, the round trip tests in #8308 look reasonable Thanks for pushing this along. Please just give me a ping when it is ready for another look |
| } | ||
| DataType::Duration(_) => { | ||
| return Err(ArrowError::NotYetImplemented( | ||
| "Avro writer: Arrow Duration(TimeUnit) has no standard Avro mapping; cast to Interval(MonthDayNano) to use Avro 'duration'".into(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be completed in a follow up PR
@alamb That's a good point that I hadn't considered. I'll work on that pattern for the coming PRs. Thanks again for digging in here; it's good for another review. |
jecsand838
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nathaniel-d-ef LGTM!
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nathaniel-d-ef -- this looks great.
Thanks as well to @jecsand838 for the help reviewing
# Which issue does this PR close? - Part of #4886 - Related to #8274 and #8298 # Rationale for this change This PR adds Map and Enum encoders to the arrow-avro crate writer, along with new benchmark tests for remaining types and round-trip tests. # What changes are included in this PR? New encoders: **Map** **Enum** Corresponding changes in support of these encoders in FieldEncoder and FieldPlan ## Additional round trip tests in `mod.rs` New tests follow existing file read pattern - simple_fixed - duration_uuid - nonnullable.impala.avro - decimals - enum ## Additional benchmark tests for data types - Utf8 - List<Utf8> - Struct - FixedSizeBinary16 - UUID - IntervalMonthDayNanoDuration - Decimal32(bytes) - Decimal64(bytes) - Decimal128(bytes) - Decimal128(fixed16) - Decimal256(bytes) - Map - Enum # Are these changes tested? Yes, additional complex type unit tests have been added for Map and Enum. The rest of the PR beyond the new types are tests themselves. All tests, new and existing, pass. # Are there any user-facing changes? n/a, arrow-avro crate is not yet public --------- Co-authored-by: Connor Sanders <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Rationale for this change
This PR extends on work introduced in #8274, adding additional complex type support to the Avro writer. This brings us closer to a complete round-trip capability and Avro spec support in the arrow-avro crate.
What changes are included in this PR?
New encoders:
Fixed
UUID
IntervalMonthDayNano
IntervalYearMonth
IntervalDayTime
Decimal32
Decimal64
Decimal128
Decimal256
Corresponding changes in support of these encoders in FieldEncoder and FieldPlan
Are these changes tested?
Yes, additional complex type unit tests have been added. Benchmark tests have also been written but are being omitted here to keep the diff manageable. All tests, new and existing, pass.
Are there any user-facing changes?
n/a, arrow-avro crate is not yet public