Description
Part of #6163
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@efredine recently added support for extracting statistics from parquet files as arrays in #6046, using StatisticsConverter.
During development we also added support for StringViewArray and BinaryViewArray in #5374.
Currently there is no way to read StringViewArray and BinaryViewArray statistics, and trying to read data page level statistics for these types panics, as I found in apache/datafusion#11723:
not implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
External error: query failed: DataFusion error: Join Error
caused by
Describe the solution you'd like
- Implement the ability to extract parquet statistics as StringView and BinaryView
- Remove the panic caused by unimplemented! at _ => unimplemented!()
The code is in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_reader/statistics.rs
Describe alternatives you've considered
You can avoid the panic by following the model of this:
arrow-rs/parquet/src/arrow/arrow_reader/statistics.rs
Lines 465 to 467 in 2905ce6
    let len = $iterator.count();
    // don't know how to extract statistics, so return a null array
    Ok(new_null_array($data_type, len))
Then you can probably write a test following the model of the utf8 and binary tests:
arrow-rs/parquet/src/arrow/arrow_reader/statistics.rs
Lines 1897 to 1917 in 2905ce6
    fn roundtrip_utf8() {
        Test {
            input: utf8_array([
                // row group 1
                Some("A"),
                None,
                Some("Q"),
                // row group 2
                Some("ZZ"),
                Some("AA"),
                None,
                // row group 3
                None,
                None,
                None,
            ]),
            expected_min: utf8_array([Some("A"), Some("AA"), None]),
            expected_max: utf8_array([Some("Q"), Some("ZZ"), None]),
        }
        .run()
    }
arrow-rs/parquet/src/arrow/arrow_reader/statistics.rs
Lines 1956 to 1984 in 2905ce6
    fn roundtrip_binary() {
        Test {
            input: Arc::new(BinaryArray::from_opt_vec(vec![
                // row group 1
                Some(b"A"),
                None,
                Some(b"Q"),
                // row group 2
                Some(b"ZZ"),
                Some(b"AA"),
                None,
                // row group 3
                None,
                None,
                None,
            ])),
            expected_min: Arc::new(BinaryArray::from_opt_vec(vec![
                Some(b"A"),
                Some(b"AA"),
                None,
            ])),
            expected_max: Arc::new(BinaryArray::from_opt_vec(vec![
                Some(b"Q"),
                Some(b"ZZ"),
                None,
            ])),
        }
        .run()
    }
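To make the expected_min / expected_max values in these tests easier to check by hand, here is a minimal std-only sketch of how per-row-group minimums and maximums fall out of the input. It is not the arrow-rs implementation: the function name `min_max_per_row_group` is hypothetical, and the row group size of 3 is an assumption taken from the `// row group N` comments in the tests.

```rust
/// Computes (mins, maxes), one entry per row group, skipping nulls.
/// Hypothetical helper for illustration only; not part of arrow-rs.
/// `row_group_size` must be non-zero (`chunks` panics on 0).
fn min_max_per_row_group<T: Ord + Clone>(
    values: &[Option<T>],
    row_group_size: usize,
) -> (Vec<Option<T>>, Vec<Option<T>>) {
    let mut mins = Vec::new();
    let mut maxes = Vec::new();
    for group in values.chunks(row_group_size) {
        // `flatten` drops the nulls, so an all-null group yields None for both.
        mins.push(group.iter().flatten().min().cloned());
        maxes.push(group.iter().flatten().max().cloned());
    }
    (mins, maxes)
}
```

Called on the nine values from roundtrip_utf8 with a group size of 3, this produces the same three minimums and maximums that the test expects, including the null entry for the all-null third row group.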
And then implement the missing pieces of code (use StringViewBuilder / BinaryViewBuilder instead of StringBuilder / BinaryBuilder).
I have a hacky version in apache/datafusion#11753 that looks something like:
    DataType::Utf8View => {
        let iterator = [<$stat_type_prefix ByteArrayStatsIterator>]::new($iterator);
        let mut builder = StringViewBuilder::new();
        for x in iterator {
            let Some(x) = x else {
                builder.append_null(); // no statistics value
                continue;
            };
            let Ok(x) = std::str::from_utf8(x) else {
                log::debug!("Utf8 statistics is a non-UTF8 value, ignoring it.");
                builder.append_null();
                continue;
            };
            builder.append_value(x);
        }
        Ok(Arc::new(builder.finish()))
    },
Additional context
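For reference, the null-handling policy of the hacky Utf8View branch above can be modeled without the arrow crate. This is only a sketch under stated assumptions: `utf8_stats` is a hypothetical name, and a `Vec<Option<&str>>` stands in for what StringViewBuilder would accumulate.

```rust
/// Maps raw statistics bytes to optional strings.
/// Hypothetical stand-in for the StringViewBuilder loop; not arrow-rs code.
/// A missing statistic stays None, and bytes that are not valid UTF-8
/// also become None instead of causing a panic.
fn utf8_stats<'a, I>(stats: I) -> Vec<Option<&'a str>>
where
    I: IntoIterator<Item = Option<&'a [u8]>>,
{
    stats
        .into_iter()
        .map(|bytes| bytes.and_then(|b| std::str::from_utf8(b).ok()))
        .collect()
}
```

A BinaryView branch would presumably skip the UTF-8 check entirely and append the bytes as they are, since any byte sequence is a valid binary value.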