Closed
Labels
bug: Something isn't working
Description
Describe the problem
This is NOT a bug, but a potential improvement goal.
DataFusion v19.rc1 turns on repartition_file_scans by default (#5295).
On my local MacBook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB 2667 MHz DDR4), the following query on the 14 GB ClickBench hits.parquet gave these timings:
- v19.rc1 took 12.343 seconds (about 6.8x faster than v18, which took 83.863 seconds)
 - DuckDB v0.6.1 took 0.566 seconds (real 0.566 user 1.876031 sys 0.357483)
   - clock time 566ms
   - cpu time 1.87s
   - I think the clock time is smaller than the CPU time because DuckDB uses multiple CPU cores in parallel.
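For reference, the ratios implied by the numbers above can be worked out with a small script. This is only a sketch of the arithmetic: the variable names are made up for illustration, and dividing CPU time by wall-clock time is just a rough estimate of how many cores were busy on average, not something DuckDB reports.

```python
# Timings reported in this issue, in seconds.
datafusion_v18 = 83.863
datafusion_v19 = 12.343
duckdb_real, duckdb_user, duckdb_sys = 0.566, 1.876031, 0.357483

# How much faster v19.rc1 is than v18 on this query.
speedup_v19_over_v18 = datafusion_v18 / datafusion_v19

# How far v19.rc1 still is from DuckDB in wall-clock time.
gap_to_duckdb = datafusion_v19 / duckdb_real

# (user + sys) / real approximates the average number of busy cores.
avg_busy_cores = (duckdb_user + duckdb_sys) / duckdb_real

print(f"v19 vs v18 speedup: {speedup_v19_over_v18:.1f}x")
print(f"v19 vs DuckDB wall-clock gap: {gap_to_duckdb:.1f}x")
print(f"DuckDB average busy cores: {avg_busy_cores:.1f}")
```

A busy-core estimate well above 1 is consistent with the guess that DuckDB spreads this scan over several cores.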
 
 
To Reproduce
Download data file
wget --continue https://datasets.clickhouse.com/hits_compatible/hits.parquet
Prepare SQL
create a file called create.sql
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits.parquet';
create a file called q23_no_order_limit_1.sql
SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;
Datafusion
git clone https://github.com/apache/arrow-datafusion.git
cd arrow-datafusion
git checkout 19.0.0-rc1
cd datafusion-cli
cargo build --release
target/release/datafusion-cli -f create.sql q23_no_order_limit_1.sql
// output: 1 row in set. Query took 12.343 seconds
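datafusion-cli only prints the elapsed query time, while DuckDB's timer also prints user and sys CPU time. To compare like with like, one option is to wrap the command in a small timing helper. The sketch below is an assumption of mine, not part of either tool: `time_command` is a hypothetical helper name, it relies on Python's Unix-only `resource` module, and it measures the whole process (startup and table creation included), so the numbers will be slightly higher than the "Query took" line.

```python
import resource
import subprocess
import time


def time_command(argv):
    """Run argv and return (real, user, sys) in seconds, similar to the shell's `time`."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    # Discard the command's output; only the timings matter here.
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # CPU time used by child processes that have finished since `before`.
    user = after.ru_utime - before.ru_utime
    sys_time = after.ru_stime - before.ru_stime
    return real, user, sys_time


# Example call, using the command from the steps above:
# real, user, sys_time = time_command(
#     ["target/release/datafusion-cli", "-f", "create.sql", "q23_no_order_limit_1.sql"]
# )
# print(f"real {real:.3f} user {user:.3f} sys {sys_time:.3f}")
```

If user + sys comes out close to the real time, the query is effectively running on one core; if it is several times larger, the scan is already parallel and the remaining gap lies elsewhere.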
DuckDB
brew install duckdb
duckdb
> .timer on
> SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
// output: Run Time (s): real 0.566 user 1.876031 sys 0.357483
Expected behavior
- With a single core, datafusion-cli takes about 2 s (like the CPU time of DuckDB)
 - With multiple cores, datafusion-cli takes about 0.6 s (like the real time of DuckDB)
 