Datafusion v19.rc1 scan parquet 20x slower than DuckDB v0.6.1 on 15GB ClickBench data #5404

@jychen7

Describe the problem
This is NOT a bug, but a potential improvement goal.

DataFusion v19.rc1 turns on repartition_file_scans by default (see #5295).

On my local MacBook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB 2667 MHz DDR4), for the following query on the ClickBench 14GB hits.parquet:

  • v19.rc1 took 12.343 seconds (yeah, about 6.8x faster than v18, which took 83.863 seconds)
  • DuckDB v0.6.1 took real 0.566s, user 1.876031s, sys 0.357483s
    • clock time 566ms
    • CPU time 1.87s
    • I think clock time is smaller than CPU time because it uses multiple CPU cores in parallel.
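
The ratios implied by these timings can be checked with quick arithmetic. The sketch below only uses the numbers reported above; the variable names are mine, and "effective cores" is just user CPU time divided by clock time, a rough indicator rather than a precise measure of parallelism.

```python
# Timings copied from the measurements above, in seconds.
datafusion_v18 = 83.863
datafusion_v19 = 12.343
duckdb_real = 0.566      # clock time
duckdb_user = 1.876031   # user CPU time
duckdb_sys = 0.357483    # system CPU time

# How much faster v19.rc1 is than v18 (roughly 6.8x).
v19_vs_v18 = datafusion_v18 / datafusion_v19

# How much slower v19.rc1 is than DuckDB in clock time (a bit over 20x).
v19_vs_duckdb = datafusion_v19 / duckdb_real

# Rough number of cores DuckDB kept busy: user CPU time / clock time.
duckdb_effective_cores = duckdb_user / duckdb_real

print(f"v19.rc1 vs v18:    {v19_vs_v18:.1f}x faster")
print(f"v19.rc1 vs DuckDB: {v19_vs_duckdb:.1f}x slower")
print(f"DuckDB effective cores: {duckdb_effective_cores:.1f}")
```

A ratio above 1 for user CPU time over clock time is consistent with the guess that DuckDB spreads the scan across several cores.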

To Reproduce
Download data file

wget --continue https://datasets.clickhouse.com/hits_compatible/hits.parquet

Prepare SQL
Create a file called create.sql:

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits.parquet';

Create a file called q23_no_order_limit_1.sql:

SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;

Datafusion

git clone https://github.com/apache/arrow-datafusion.git
cd arrow-datafusion
git checkout 19.0.0-rc1
cd datafusion-cli
cargo build --release

target/release/datafusion-cli -f create.sql q23_no_order_limit_1.sql
// output: 1 row in set. Query took 12.343 seconds
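
datafusion-cli reports only the elapsed query time, while DuckDB's .timer prints real, user and sys. To get comparable numbers for both, one option is to time the whole process from outside. This is a minimal sketch using only the Python standard library on Unix-like systems; `time_command` is a hypothetical helper name, and the result includes process startup and the CREATE EXTERNAL TABLE step, so it will not match the "Query took" figure exactly.

```python
import resource
import subprocess
import time


def time_command(argv):
    """Run argv to completion and return (real, user, sys) in seconds.

    CPU times come from the rusage counters of waited-for child
    processes, so they cover all threads of the child.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    # Discard the query output; only the timings matter here.
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    sys_time = after.ru_stime - before.ru_stime
    return real, user, sys_time


# Example call (assumes the binary and SQL files from the steps above):
# real, user, sys_time = time_command(
#     ["target/release/datafusion-cli", "-f", "create.sql",
#      "q23_no_order_limit_1.sql"])
```

If user time is much larger than real time, the scan is running in parallel; if the two are close, it is effectively single-threaded.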

DuckDB

brew install duckdb
duckdb
> .timer on
> SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
// output: Run Time (s): real 0.566 user 1.876031 sys 0.357483

Expected behavior

  1. With a single core, datafusion-cli takes about 2s (like the CPU time of DuckDB)
  2. With multiple cores, datafusion-cli takes about 0.6s (like the real time of DuckDB)
