1000x slowdown opening parquet file due to partitions #16676

@asayers

Describe the bug

Parquet metadata load time was 15 seconds with the default configuration. Setting target_partitions to 1 brought it down to 15 milliseconds.

With the default config I was seeing 20 file groups in the DataSourceExec (since my machine has 20 cores), and loading the metadata was taking forever. Forcing target_partitions to 1 fixed it: I now see just 1 file group in the DataSourceExec, and metadata load time is down. (Added bonus: the plan no longer requires a MergeExec, which was the slowest part of the whole query.)

To Reproduce

The query I was running is very basic: just filter + sort + select. The parquet file is 130.00 MiB with 1.08 MiB of metadata (6.42 MiB when expanded in memory).
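
For anyone trying to reproduce the comparison, here is a minimal timing sketch. It is not the script used for this report. It assumes a hypothetical callable `run_query(n)` that builds a session with target_partitions set to `n` and runs the query against the file; that callable is not shown here and would need to be written against whichever DataFusion binding is in use. The helper names `time_per_setting` and `slowdown` are also made up for illustration.

```python
import time
from typing import Callable, Dict, Iterable


def time_per_setting(
    run_query: Callable[[int], object],
    partition_counts: Iterable[int],
) -> Dict[int, float]:
    """Return wall-clock seconds taken by run_query for each partition count."""
    timings: Dict[int, float] = {}
    for n in partition_counts:
        start = time.perf_counter()
        # Hypothetical: run_query(n) runs the query with target_partitions = n.
        run_query(n)
        timings[n] = time.perf_counter() - start
    return timings


def slowdown(timings: Dict[int, float], baseline: int, other: int) -> float:
    """How many times slower the `other` setting is than the `baseline` setting."""
    return timings[other] / timings[baseline]


# Figures reported in this issue: 15 s with 20 partitions, 15 ms with 1.
reported = {1: 0.015, 20: 15.0}
print(round(slowdown(reported, baseline=1, other=20)))  # prints 1000
```

This times the whole call, so it only isolates metadata loading if `run_query` is limited to that step. Timing each setting several times and in both orders would help rule out file-system cache effects.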

Expected behavior

I wouldn't have thought going from 1 file group to 20 would slow down metadata parsing by 1000x.

Additional context

No response

Labels: bug
