Description
Describe the bug
Parquet metadata load time was 15 seconds with the default configuration. Setting target_partitions to 1 brought it down to 15 milliseconds.
With the default config I was seeing 20 file groups in the DataSourceExec (since my machine has 20 cores), and loading the metadata took 15 seconds. Forcing target_partitions to 1 fixed it: I now see just 1 file group in the DataSourceExec, and the metadata load time is down to 15 milliseconds. (Added bonus: the plan no longer requires a MergeExec, which was the slowest part of the whole query.)
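For readers unfamiliar with why a single file shows up as 20 file groups, here is a minimal sketch of the idea, not the engine's actual code. It assumes the planner splits one file into contiguous, near-equal byte ranges, one per target partition. `ByteRange` and `split_into_file_groups` are hypothetical names made up for illustration. If this is DataFusion, the setting itself is exposed as `datafusion.execution.target_partitions` (or `SessionConfig::with_target_partitions` in Rust).

```rust
/// Half-open byte range [start, end) of a file assigned to one file group.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct ByteRange {
    start: u64,
    end: u64,
}

/// Hypothetical helper: split a single file of `file_size` bytes into at most
/// `target_partitions` contiguous ranges of near-equal size.
/// Assumption: the real planner does something similar, but details may differ.
fn split_into_file_groups(file_size: u64, target_partitions: usize) -> Vec<ByteRange> {
    // Treat 0 as 1 so we never divide by zero.
    let n = target_partitions.max(1) as u64;
    // Ceiling division so the ranges cover the whole file.
    let chunk = (file_size + n - 1) / n;

    let mut groups = Vec::new();
    let mut start = 0;
    while start < file_size {
        let end = (start + chunk).min(file_size);
        groups.push(ByteRange { start, end });
        start = end;
    }
    groups
}

fn main() {
    // File size taken from this report: 130.00 MiB.
    let file_size: u64 = 130 * 1024 * 1024;

    // 20 matches the core count on the reporting machine; 1 is the workaround.
    for target_partitions in [20, 1] {
        let groups = split_into_file_groups(file_size, target_partitions);
        println!(
            "target_partitions={} -> {} file group(s)",
            target_partitions,
            groups.len()
        );
    }
}
```

Under this model every file group points at the same physical file, so any per-group work that touches the footer would be repeated once per group. Whether that is what actually happens here is not established by this report.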
To Reproduce
The query I was running is very basic: just filter + sort + select. The parquet file is 130.00 MiB with 1.08 MiB of metadata (6.42 MiB when expanded in memory).
Expected behavior
I wouldn't have thought going from 1 file group to 20 would slow down metadata parsing by 1000x.
Additional context
No response