ParquetExec::statistics() does not read statistics for many column types (like timestamps, strings, etc.) #8295

@alamb

Describe the bug

While working on #8229 I found another bug that is non-obvious but can now be clearly seen thanks to #8110 and #8111 from @NGA-TRAN.

To Reproduce

❯ copy (values ('foo'), ('bar'), ('baz')) to '/tmp/strings.parquet';
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.023 seconds.

Then look at the explain verbose output; you can see there are no min/max statistics shown:

❯ explain verbose select * from '/tmp/strings.parquet';

|                                                            |                                                                                                                                                                |
| physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[private/tmp/strings.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Null=Exact(0))]] |
|                                                            |                                                                                                                                                                |
+------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
80 rows in set. Query took 0.002 seconds.

Expected behavior

I expect min/max values to be extracted in the statistics for strings, as they are for integers (Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3))):

❯ copy (values (1), (2), (3)) to '/tmp/ints.parquet';
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.023 seconds.
❯ explain verbose select * from '/tmp/ints.parquet';
...
                                                                                                               |
| physical_plan                                              | ParquetExec: file_groups={1 group: [[private/tmp/ints.parquet]]}, projection=[column1]                                                                                                              |
|                                                            |                                                                                                                                                                                                     |
| physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[private/tmp/ints.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3)) Null=Exact(0))]] |
|                                                            |                                                                                                                                                                                                     |
+------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
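
To make the expected result concrete, here is a minimal sketch of how file-level min/max for a string column could be folded together from per-row-group values. It is an illustrative model only, not DataFusion's implementation: `ColumnStats` and `merge_stats` are hypothetical names, and the row-group split is made up for the example. It also assumes plain ASCII values, where Python's string ordering matches a byte-wise comparison.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-row-group statistics for one string column."""
    min_value: Optional[str]  # None models an absent minimum
    max_value: Optional[str]  # None models an absent maximum
    null_count: int


def merge_stats(groups: Iterable[ColumnStats]) -> ColumnStats:
    """Fold row-group statistics into one file-level summary."""
    groups = list(groups)
    null_count = sum(g.null_count for g in groups)
    # If any row group lacks a bound, the file-level bound is unknown too.
    if not groups or any(
        g.min_value is None or g.max_value is None for g in groups
    ):
        return ColumnStats(None, None, null_count)
    return ColumnStats(
        min(g.min_value for g in groups),
        max(g.max_value for g in groups),
        null_count,
    )


# Assumed split of the values 'foo', 'bar', 'baz' into two row groups.
file_stats = merge_stats([
    ColumnStats("bar", "foo", 0),
    ColumnStats("baz", "baz", 0),
])
print(file_stats)
```

Under these assumptions the strings file from the reproduction would be expected to report a minimum of bar and a maximum of foo alongside the null count, in the same way the integer file reports 1 and 3.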

Additional context

No response

Labels: bug
