
Add support for StringView and BinaryView statistics in StatisticsConverter #6164

@alamb


Part of #6163

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@efredine recently added support for extracting statistics from parquet files as arrays in #6046 using StatisticsConverter

During development we have also added support for StringViewArray and BinaryViewArray in #5374

Currently there is no way to read StringViewArray and BinaryViewArray statistics, and trying to read data page level statistics for these types panics, as I found on apache/datafusion#11723:

not implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
External error: query failed: DataFusion error: Join Error
caused by

Describe the solution you'd like

  1. Implement the ability to extract parquet statistics as StringView and BinaryView
  2. Remove the panic caused by unimplemented! at

The code is in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_reader/statistics.rs

Describe alternatives you've considered

You can avoid the panic by following this model:

let len = $iterator.count();
// don't know how to extract statistics, so return a null array
Ok(new_null_array($data_type, len))

Then, you can probably write a test following the model of the existing utf8 and binary tests:

fn roundtrip_utf8() {
    Test {
        input: utf8_array([
            // row group 1
            Some("A"),
            None,
            Some("Q"),
            // row group 2
            Some("ZZ"),
            Some("AA"),
            None,
            // row group 3
            None,
            None,
            None,
        ]),
        expected_min: utf8_array([Some("A"), Some("AA"), None]),
        expected_max: utf8_array([Some("Q"), Some("ZZ"), None]),
    }
    .run()
}

fn roundtrip_binary() {
    Test {
        input: Arc::new(BinaryArray::from_opt_vec(vec![
            // row group 1
            Some(b"A"),
            None,
            Some(b"Q"),
            // row group 2
            Some(b"ZZ"),
            Some(b"AA"),
            None,
            // row group 3
            None,
            None,
            None,
        ])),
        expected_min: Arc::new(BinaryArray::from_opt_vec(vec![
            Some(b"A"),
            Some(b"AA"),
            None,
        ])),
        expected_max: Arc::new(BinaryArray::from_opt_vec(vec![
            Some(b"Q"),
            Some(b"ZZ"),
            None,
        ])),
    }
    .run()
}
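
To make the expected_min / expected_max values in these tests easier to follow, here is a rough std-only sketch of how they fall out of the input. It is not code from arrow-rs: the function name min_max_per_row_group and the rows_per_group parameter are hypothetical, and it assumes three rows per row group (as the comments in the tests suggest) and plain byte-wise string ordering.

```rust
/// Compute the min and max of each "row group" of nullable strings.
/// Hypothetical helper for illustration only; `rows_per_group` stands in
/// for whatever row group size the parquet writer was configured with.
fn min_max_per_row_group<'a>(
    values: &[Option<&'a str>],
    rows_per_group: usize,
) -> (Vec<Option<&'a str>>, Vec<Option<&'a str>>) {
    assert!(rows_per_group > 0, "rows_per_group must be non-zero");
    let mut mins = Vec::new();
    let mut maxes = Vec::new();
    for group in values.chunks(rows_per_group) {
        // Nulls are skipped; a group containing only nulls yields None,
        // i.e. there is no min/max statistic for it.
        mins.push(group.iter().flatten().copied().min());
        maxes.push(group.iter().flatten().copied().max());
    }
    (mins, maxes)
}
```

Calling it on the nine input values above with rows_per_group = 3 gives one min and one max per row group, with None for the all-null third group. A Utf8View / BinaryView test would presumably expect the same values, just wrapped in the view array types.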

And then implement the missing pieces of code (use StringViewBuilder / BinaryViewBuilder instead of StringBuilder / BinaryBuilder)

I have a hacky version in apache/datafusion#11753 that looks something like this:

            DataType::Utf8View => {
                let iterator = [<$stat_type_prefix ByteArrayStatsIterator>]::new($iterator);
                let mut builder = StringViewBuilder::new();
                for x in iterator {
                    let Some(x) = x else {
                        builder.append_null(); // no statistics value
                        continue;
                    };

                    let Ok(x) = std::str::from_utf8(x) else {
                        log::debug!("Utf8 statistics is a non-UTF8 value, ignoring it.");
                        builder.append_null();
                        continue;
                    };

                    builder.append_value(x);
                }
                Ok(Arc::new(builder.finish()))
            },
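
The null-handling in that snippet can be tried out without the macro or the arrow crates. The following is only a std-only sketch of the same decision logic: byte_stats_to_utf8 is a hypothetical name, and a Vec<Option<&str>> stands in for StringViewBuilder (None where the snippet calls append_null, Some where it calls append_value).

```rust
/// Convert raw byte-array statistics into nullable strings.
/// Hypothetical stand-in for the builder loop above: a missing statistic
/// and a statistic that is not valid UTF-8 both become None.
fn byte_stats_to_utf8<'a, I>(stats: I) -> Vec<Option<&'a str>>
where
    I: IntoIterator<Item = Option<&'a [u8]>>,
{
    stats
        .into_iter()
        .map(|stat| {
            // `and_then` keeps None as None (no statistics value);
            // `.ok()` turns a UTF-8 validation error into None as well.
            stat.and_then(|bytes| std::str::from_utf8(bytes).ok())
        })
        .collect()
}
```

For BinaryView the validation step should not be needed, since any byte sequence is a valid binary value, so only the missing-statistic case would map to null.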

Additional context
