Original SQLStorm implementation: https://github.com/SQL-Storm/SQLStorm
Tracking issue in DataFusion: apache/datafusion#17698
Generated datasets live in the `data/` directory at the project root. The location is configurable when running the setup utility.

```
<project-root>/data
---- lineitem.parquet
---- orders.parquet
...
```
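As a rough sketch of how this layout maps onto query execution, each Parquet file can be registered as a table named after its file stem (the naming convention is an assumption here; the project's actual setup code may differ):

```python
from pathlib import Path

from datafusion import SessionContext

# Hypothetical helper: register every Parquet file in <project-root>/data
# as a table named after its file stem (lineitem.parquet -> lineitem).
def register_tables(ctx: SessionContext, data_dir: Path) -> None:
    for parquet_file in sorted(data_dir.glob("*.parquet")):
        ctx.register_parquet(parquet_file.stem, str(parquet_file))

ctx = SessionContext()
register_tables(ctx, Path("data"))
```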
- Install tpchgen-cli: https://github.com/clflushopt/tpchgen-rs/blob/main/tpchgen-cli/README.md
- Run `tpchgen-cli -s 0.1 --format=parquet`
The valid TPC-H queries from SQLStorm are used: https://github.com/SQL-Storm/SQLStorm/tree/master/v1.0/tpch/queries
- "valid" means they are runnable in PostgreSQL
```sh
# Run benchmarks (defaults use <project-root>/data and the SQLStorm query set)
uv run datafusion-sqlstorm --output benchmark_results.csv

# Re-run a single query for debugging (mode is "error" or "rows")
uv run python -m datafusion_sqlstorm.reproduce rows 5 --timeout 5
```
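Under the hood, a benchmark run amounts to timing one query per engine and counting result rows. A minimal sketch under that assumption (the real harness also enforces the timeout and records peak RSS, which this sketch omits):

```python
import time

import duckdb
from datafusion import SessionContext

def run_datafusion(ctx: SessionContext, sql: str) -> tuple[float, int]:
    """Return (wall-clock seconds, row count) for one DataFusion query."""
    start = time.perf_counter()
    batches = ctx.sql(sql).collect()  # fully materialize the result
    elapsed = time.perf_counter() - start
    return elapsed, sum(batch.num_rows for batch in batches)

def run_duckdb(con: duckdb.DuckDBPyConnection, sql: str) -> tuple[float, int]:
    """Return (wall-clock seconds, row count) for one DuckDB query."""
    start = time.perf_counter()
    rows = con.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    return elapsed, len(rows)
```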
- `--data-dir <path>`: Parquet source directory (default `<project-root>/data`).
- `--queries-dir <path>`: SQL files directory (default `<project-root>/../SQLStorm/v1.0/tpch/queries/`).
- `--output <path>`: Benchmark CSV destination (default `benchmark_results.csv`).
- `--timeout <seconds>`: Per-engine timeout in seconds (default `3.0`).
- `--start-query-id <int>`: First numeric query ID to execute (inclusive, optional).
- `--end-query-id <int>`: Final numeric query ID to execute (exclusive, optional).
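For illustration, these flags map naturally onto an `argparse` interface; a hedged sketch (the actual entry point may parse them differently, but defaults follow the list above):

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(prog="datafusion-sqlstorm")
parser.add_argument("--data-dir", type=Path, default=Path("data"))
parser.add_argument("--queries-dir", type=Path,
                    default=Path("../SQLStorm/v1.0/tpch/queries"))
parser.add_argument("--output", type=Path, default=Path("benchmark_results.csv"))
parser.add_argument("--timeout", type=float, default=3.0)
parser.add_argument("--start-query-id", type=int, default=None)  # inclusive
parser.add_argument("--end-query-id", type=int, default=None)    # exclusive
args = parser.parse_args()
```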
- `query_id`: Query file stem executed (e.g. `5` for `5.sql`).
- `df_time` / `duckdb_time`: Wall-clock seconds per engine (`TIMEOUT` or `FAIL` if the run exceeded the limit or errored).
- `df_mem` / `duckdb_mem`: Peak resident set size in bytes (`TIMEOUT`/`FAIL` mirror the time columns).
- `unexpected_err`: `true` when DataFusion failed with an error outside the whitelist.
- `row_count_match`: `true` when both engines succeeded and returned matching row counts; `false` when both engines succeeded but returned different row counts; otherwise `NA`.
- `time_ratio`: `df_time / duckdb_time` rounded to 6 decimals; `INF` when DataFusion timed out while DuckDB finished; `NA` when the ratio cannot be computed (see the sketch after this list).
- `memory_ratio`: `df_mem / duckdb_mem` rounded to 6 decimals; `NA` when either engine failed or memory usage was zero.
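To make the ratio rules concrete, here is a minimal sketch of how `time_ratio` could be derived from the raw cell values described above (illustrative only; the harness's actual implementation may differ):

```python
def time_ratio(df_time: str, duckdb_time: str) -> str:
    """Sketch of the time_ratio rules; cells are a number, "TIMEOUT", or "FAIL"."""
    if df_time == "TIMEOUT" and duckdb_time not in ("TIMEOUT", "FAIL"):
        return "INF"  # DataFusion timed out while DuckDB finished
    try:
        # Both cells parse as numbers: report the ratio to 6 decimals.
        return f"{float(df_time) / float(duckdb_time):.6f}"
    except (ValueError, ZeroDivisionError):
        return "NA"  # ratio cannot be computed (FAIL/TIMEOUT cell or zero time)
```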
```sh
cat benchmark_results.csv
query_id,df_time,df_mem,duckdb_time,duckdb_mem,unexpected_err,row_count_match,time_ratio,memory_ratio
5,FAIL,FAIL,FAIL,FAIL,false,NA,NA,NA
6,0.040621,110575616,0.033876,93175808,false,true,1.199119,1.186742
12,0.038163,269795328,0.031675,206192640,false,true,1.204827,1.308462
13,FAIL,FAIL,FAIL,FAIL,true,NA,NA,NA
...
```
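The CSV is straightforward to analyze afterwards; for example, a small pandas sketch (column handling assumed from the schema above):

```python
import numpy as np
import pandas as pd

results = pd.read_csv("benchmark_results.csv")

# Queries where DataFusion errored outside the whitelist.
unexpected = results[results["unexpected_err"].astype(str).str.lower() == "true"]
print(f"{len(unexpected)} queries with unexpected DataFusion errors")

# Finite numeric time ratios ("FAIL"/"NA" coerce to NaN, "INF" to infinity).
ratios = pd.to_numeric(results["time_ratio"], errors="coerce")
ratios = ratios[np.isfinite(ratios)]
print("10 largest slowdowns vs DuckDB:")
print(ratios.sort_values(ascending=False).head(10))
```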
The original bug-triggering query is usually quite complex, and it is hard to tell from the raw reproducer how the bug is triggered, so the query has to be reduced to its minimal form.
This is done by prompting an agentic AI coding tool (Codex, Cursor, or Claude Code) with "please reduce this bug to the minimal form" plus the failing query.
Example of a reported reduced bug: apache/datafusion#17699
Each dot represents a single query's speed ratio, computed as `df_time / duckdb_time`. Dots in the top-right mean DataFusion's execution time is 100x DuckDB's on the same query.
Note that this run used a 10-second timeout, so queries that timed out may have even higher time ratios.
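One plausible way to produce such a scatter plot from the results CSV, sketched with matplotlib (the axes and reference lines are assumptions about the original figure):

```python
import matplotlib.pyplot as plt
import pandas as pd

results = pd.read_csv("benchmark_results.csv")
# Keep only rows where both engines produced numeric timings.
times = results[["df_time", "duckdb_time"]].apply(pd.to_numeric, errors="coerce").dropna()

plt.figure(figsize=(6, 6))
plt.scatter(times["duckdb_time"], times["df_time"], s=8, alpha=0.4)
plt.xscale("log")
plt.yscale("log")
# Reference lines: equal time (1x) and the 100x slowdown boundary.
lims = [times.min().min(), times.max().max()]
plt.plot(lims, lims, "k--", label="1x")
plt.plot(lims, [100 * v for v in lims], "r--", label="100x")
plt.xlabel("DuckDB time (s)")
plt.ylabel("DataFusion time (s)")
plt.legend()
plt.savefig("time_scatter.png")
```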
Around 2,500 of the ~17,000 queries fail on DataFusion even though they are runnable in PostgreSQL. Most of them are likely duplicated bugs, or unsupported features such as certain kinds of subqueries.
- Support the TPC-H dataset
- Support other datasets (TPC-DS, JOB)
- Triage bugs
- Investigate queries that are significantly slower than on DuckDB (most are likely due to bad join orders, or subqueries that DataFusion cannot yet unnest; how can we filter them out automatically? See the sketch after this list for a first cut)
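As a first cut at that last item, a sketch that flags queries above an assumed slowdown threshold for manual triage:

```python
import pandas as pd

SLOWDOWN_THRESHOLD = 10.0  # hypothetical cutoff; tune as needed

results = pd.read_csv("benchmark_results.csv")
ratio = pd.to_numeric(results["time_ratio"], errors="coerce")

# "INF" parses as infinity, so DataFusion timeouts are flagged as well;
# "NA" and failed runs coerce to NaN and are excluded by the comparison.
slow = results[ratio >= SLOWDOWN_THRESHOLD]
slow.to_csv("slow_queries.csv", index=False)
print(f"{len(slow)} queries flagged for manual triage")
```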