Add Decimal32 and Decimal64 support to arrow-avro Reader #8255
Conversation
- Expanded `Decoder` to handle `Decimal32` and `Decimal64` types alongside existing `Decimal128` and `Decimal256` support.
- Added precision and fixed-size validation for `Decimal32` and `Decimal64` to ensure Arrow compatibility.
- Updated the `flush`, `decode`, and `append_null` methods in the `Decoder` to process these new decimal types correctly.
- Refined the test suite with new test cases to validate `Decimal32`/`Decimal64` decoding and integration logic.
- Introduced thorough coverage across data-backed, fixed-size, and precision-based decimal paths.
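The precision-based narrowing this PR performs can be sketched as follows. This is a standalone illustration with hypothetical names (`DecimalWidth`, `narrowest_width`), not the actual `arrow-avro` code:

```rust
// Sketch: pick the narrowest Arrow decimal width for an Avro decimal
// by its declared precision (documented maximums: 9, 18, 38, 76).
#[derive(Debug, PartialEq)]
enum DecimalWidth {
    W32,
    W64,
    W128,
    W256,
}

fn narrowest_width(precision: u8) -> Result<DecimalWidth, String> {
    match precision {
        0..=9 => Ok(DecimalWidth::W32),
        10..=18 => Ok(DecimalWidth::W64),
        19..=38 => Ok(DecimalWidth::W128),
        39..=76 => Ok(DecimalWidth::W256),
        p => Err(format!("precision {p} exceeds the Decimal256 maximum of 76")),
    }
}

fn main() {
    // Boundary cases: 9 still fits 32 bits, 10 needs 64 bits.
    assert_eq!(narrowest_width(9), Ok(DecimalWidth::W32));
    assert_eq!(narrowest_width(10), Ok(DecimalWidth::W64));
    assert!(narrowest_width(77).is_err());
    println!("ok");
}
```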
alamb
left a comment
Thanks @jecsand838
Can we please add a description of the new binary files added?
Also, one thing I worry about is that Decimal32 and Decimal64 are relatively new types (thanks to @CurtHagenlocher 🙏 ) but their support across the various arrow kernels now is not very large.
Thus if you have the avro reader return Decimal32/Decimal64 arrays, it might be hard to use them in the rest of the arrow-rs world until we complete the rest of the support.
How were these files created? Maybe we could add a README explaining what is in them and how they were made?
Sure, I can do that. I have an arrow-testing PR outstanding for these files and the script used to create them saved as a gist: apache/arrow-testing#112
https://gist.github.com/jecsand838/3890349bdb33082a3e8fdcae3257eef7
> Also, one thing I worry about is that Decimal32 and Decimal64 are relatively new types (thanks to @CurtHagenlocher 🙏 ) but their support across the various arrow kernels now is not very large.
Ah, that makes sense. Do you think it's worth feature flagging the Decimal32 and Decimal64 support for the time being?
@alamb I went ahead and got the README.md file and new small_decimals feature flag pushed up. Let me know what you think!
Also if we need to hold off until Decimal32 and Decimal64 support has matured, then that's totally fine with me as well.
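As a rough illustration of what a `small_decimals` feature gate could look like (a hypothetical sketch, not the implementation pushed in this PR; `decimal_width` is an invented helper): with the feature disabled, small-precision decimals would keep decoding to the wider types, preserving today's behavior.

```rust
// Hypothetical sketch of a `small_decimals` feature gate; not the
// actual arrow-avro code. With the feature off, small-precision
// decimals still decode to the wider 128-bit representation.
fn decimal_width(precision: usize) -> usize {
    #[cfg(feature = "small_decimals")]
    {
        if precision <= 9 {
            return 32;
        }
        if precision <= 18 {
            return 64;
        }
    }
    if precision <= 38 {
        128
    } else {
        256
    }
}

fn main() {
    // Compiled without the feature, precision 5 maps to 128 bits.
    println!("{}", decimal_width(5));
}
```

This keeps the opt-in surface small: downstream users who cannot yet consume `Decimal32`/`Decimal64` arrays simply leave the feature off.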
Force-push: 94f473f to e2a153e
alamb
left a comment
Thank you @jecsand838
Which issue does this PR close?
Rationale for this change
Apache Avro’s `decimal` logical type annotates either `bytes` or `fixed` and carries `precision` and `scale`. Implementations should reject invalid combinations such as `scale > precision`, and the underlying bytes are the two’s-complement big-endian representation of the unscaled integer. On the Arrow side, Rust now exposes first-class `Decimal32`, `Decimal64`, `Decimal128`, and `Decimal256` data types with documented maximum precisions (9, 18, 38, and 76 respectively). Until now, `arrow-avro` decoded all Avro decimals to 128/256-bit Arrow decimals, even when a narrower type would suffice.
What changes are included in this PR?
`arrow-avro/src/codec.rs`:
- Map `Codec::Decimal(precision, scale, _size)` to Arrow’s `Decimal32`/`64`/`128`/`256` by precision, preferring the narrowest type (≤9 → 32, ≤18 → 64, ≤38 → 128, otherwise 256).
- Reject `scale > precision`.
- Error when `precision` exceeds Arrow’s maximum (`Decimal256`).
- For `fixed`, check that the declared `precision` fits the byte width (≤4 bytes → max precision 9, ≤8 → 18, ≤16 → 38, ≤32 → 76).
- Update the docs for `Codec::Decimal` to mention `Decimal32`/`64`.

`arrow-avro/src/reader/record.rs`:
- Add `Decoder::Decimal32` and `Decoder::Decimal64` variants with corresponding builders (`Decimal32Builder`, `Decimal64Builder`).
- Builder selection: choose the builder that matches the Arrow decimal type mapped in `codec.rs`.
- Implement decode paths that sign-extend Avro’s two’s-complement payload to 4/8 bytes and append values to the new builders; update `append_null`/`flush` for 32/64-bit decimals.
`arrow-avro/src/reader/mod.rs` (tests):
- Expand `test_decimal` to assert that decoded decimals map to `Decimal32`, `Decimal64`, and `Decimal128` by precision (e.g. precision 10 maps to `Decimal64`).
- Add a nulls-path test for bytes-backed `Decimal32`.

Are these changes tested?
Yes. Unit tests under `arrow-avro/src/reader/mod.rs` construct expected `Decimal32Array`/`Decimal64Array`/`Decimal128Array` values with `with_precision_and_scale`, and compare them against batches decoded from Avro files (including legacy fixed and bytes-backed cases). The tests also exercise small batch sizes to cover buffering paths; a new Avro data file is added for higher-width decimals.
New Avro test file details:
These new Avro test files were created using this script: https://gist.github.com/jecsand838/3890349bdb33082a3e8fdcae3257eef7
There is also an arrow-testing PR for these new files: apache/arrow-testing#112
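For context on the byte format these files and decode paths deal in: Avro stores the unscaled decimal value as a minimal two's-complement big-endian byte string, and a 32/64-bit decode path must sign-extend it back to a fixed width. A standalone sketch of that round trip (assumptions: `encode_unscaled` and `decode_i64` are illustrative helpers, not code from this PR or the generation gist):

```rust
// Illustrative sketch of Avro decimal byte handling; not code from
// this PR. Encode: minimal two's-complement big-endian bytes.
fn encode_unscaled(value: i64) -> Vec<u8> {
    let full = value.to_be_bytes();
    let mut start = 0;
    // Drop redundant leading sign bytes (0x00 for non-negative,
    // 0xFF for negative) while the sign bit stays representable.
    while start < full.len() - 1 {
        let (b, next) = (full[start], full[start + 1]);
        let redundant = (b == 0x00 && next & 0x80 == 0)
            || (b == 0xFF && next & 0x80 != 0);
        if !redundant {
            break;
        }
        start += 1;
    }
    full[start..].to_vec()
}

// Decode: sign-extend a short payload back to a fixed-width integer,
// mirroring what a 32/64-bit decimal decode path must do.
fn decode_i64(bytes: &[u8]) -> i64 {
    assert!(!bytes.is_empty() && bytes.len() <= 8, "payload too wide");
    let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 8];
    buf[8 - bytes.len()..].copy_from_slice(bytes);
    i64::from_be_bytes(buf)
}

fn main() {
    // Round-trip a few values through the minimal representation.
    for v in [0i64, 1, -1, 256, -129, 123_456_789_012] {
        let bytes = encode_unscaled(v);
        assert_eq!(decode_i64(&bytes), v);
    }
    // -1 needs only a single 0xFF byte.
    assert_eq!(encode_unscaled(-1), vec![0xFF]);
    println!("round trip ok");
}
```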
Are there any user-facing changes?
N/A, due to `arrow-avro` not being public.