Original SQLStorm implementation: https://github.com/SQL-Storm/SQLStorm
Tracking issue in DataFusion: apache/datafusion#17698
Generated datasets live in the `data/` directory at the project root. The location is configurable when running the setup utility.

```
<project-root>/data
---- lineitem.parquet
---- orders.parquet
...
```
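As a rough sketch of how this layout maps onto query execution, each Parquet file can be registered as a table named after its file stem (the naming convention is an assumption here; the project's actual setup code may differ):

```python
from pathlib import Path

from datafusion import SessionContext

# Hypothetical helper: register every Parquet file in <project-root>/data
# as a table named after its file stem (lineitem.parquet -> lineitem).
def register_tables(ctx: SessionContext, data_dir: Path) -> None:
    for parquet_file in sorted(data_dir.glob("*.parquet")):
        ctx.register_parquet(parquet_file.stem, str(parquet_file))

ctx = SessionContext()
register_tables(ctx, Path("data"))
```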
- Install tpchgen-cli: https://github.com/clflushopt/tpchgen-rs/blob/main/tpchgen-cli/README.md
- Run `tpchgen-cli -s 0.1 --format=parquet`
The valid TPC-H queries from SQLStorm are used: https://github.com/SQL-Storm/SQLStorm/tree/master/v1.0/tpch/queries
- "valid" means they are runnable in PostgreSQL
```sh
# Run benchmarks (defaults use <project-root>/data and the SQLStorm query set)
uv run datafusion-sqlstorm --output benchmark_results.csv

# Re-run a single query for debugging (mode is "error" or "rows")
uv run python -m datafusion_sqlstorm.reproduce rows 5 --timeout 5
```
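Under the hood, a benchmark run amounts to timing one query per engine and counting result rows. A minimal sketch under that assumption (the real harness also enforces the timeout and records peak RSS, which this sketch omits):

```python
import time

import duckdb
from datafusion import SessionContext

def run_datafusion(ctx: SessionContext, sql: str) -> tuple[float, int]:
    """Return (wall-clock seconds, row count) for one DataFusion query."""
    start = time.perf_counter()
    batches = ctx.sql(sql).collect()  # fully materialize the result
    elapsed = time.perf_counter() - start
    return elapsed, sum(batch.num_rows for batch in batches)

def run_duckdb(con: duckdb.DuckDBPyConnection, sql: str) -> tuple[float, int]:
    """Return (wall-clock seconds, row count) for one DuckDB query."""
    start = time.perf_counter()
    rows = con.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    return elapsed, len(rows)
```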
- `--data-dir <path>`: Parquet source directory (default `<project-root>/data`).
- `--queries-dir <path>`: SQL files directory (default `<project-root>/../SQLStorm/v1.0/tpch/queries/`).
- `--output <path>`: Benchmark CSV destination (default `benchmark_results.csv`).
- `--timeout <seconds>`: Per-engine timeout in seconds (default `3.0`).
- `--start-query-id <int>`: First numeric query ID to execute (inclusive, optional).
- `--end-query-id <int>`: Final numeric query ID to execute (exclusive, optional).
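For illustration, these flags map naturally onto an `argparse` interface; a hedged sketch (the actual entry point may parse them differently, but defaults follow the list above):

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(prog="datafusion-sqlstorm")
parser.add_argument("--data-dir", type=Path, default=Path("data"))
parser.add_argument("--queries-dir", type=Path,
                    default=Path("../SQLStorm/v1.0/tpch/queries"))
parser.add_argument("--output", type=Path, default=Path("benchmark_results.csv"))
parser.add_argument("--timeout", type=float, default=3.0)
parser.add_argument("--start-query-id", type=int, default=None)  # inclusive
parser.add_argument("--end-query-id", type=int, default=None)    # exclusive
args = parser.parse_args()
```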
- `query_id`: Query file stem executed (e.g. `5` for `5.sql`).
- `df_time` / `duckdb_time`: Wall-clock seconds per engine (`TIMEOUT` or `FAIL` if the run exceeded the limit or errored).
- `df_mem` / `duckdb_mem`: Peak resident set size in bytes (`TIMEOUT`/`FAIL` mirror the time columns).
- `unexpected_err`: `true` when DataFusion failed with an error outside the whitelist.
- `row_count_match`: `true` when both engines succeeded and returned matching row counts; `false` when both engines succeeded but returned different row counts; otherwise `NA`.
- `time_ratio`: `df_time / duckdb_time` rounded to 6 decimals; `INF` when DataFusion timed out while DuckDB finished; `NA` when the ratio cannot be computed (see the sketch after this list).
- `memory_ratio`: `df_mem / duckdb_mem` rounded to 6 decimals; `NA` when either engine failed or memory usage was zero.
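To make the ratio rules concrete, here is a minimal sketch of how `time_ratio` could be derived from the raw cell values described above (illustrative only; the harness's actual implementation may differ):

```python
def time_ratio(df_time: str, duckdb_time: str) -> str:
    """Sketch of the time_ratio rules; cells are a number, "TIMEOUT", or "FAIL"."""
    if df_time == "TIMEOUT" and duckdb_time not in ("TIMEOUT", "FAIL"):
        return "INF"  # DataFusion timed out while DuckDB finished
    try:
        # Both cells parse as numbers: report the ratio to 6 decimals.
        return f"{float(df_time) / float(duckdb_time):.6f}"
    except (ValueError, ZeroDivisionError):
        return "NA"  # ratio cannot be computed (FAIL/TIMEOUT cell or zero time)
```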
```sh
cat benchmark_results.csv
query_id,df_time,df_mem,duckdb_time,duckdb_mem,unexpected_err,row_count_match,time_ratio,memory_ratio
5,FAIL,FAIL,FAIL,FAIL,false,NA,NA,NA
6,0.040621,110575616,0.033876,93175808,false,true,1.199119,1.186742
12,0.038163,269795328,0.031675,206192640,false,true,1.204827,1.308462
13,FAIL,FAIL,FAIL,FAIL,true,NA,NA,NA
...
```
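The CSV is straightforward to analyze afterwards; for example, a small pandas sketch (column handling assumed from the schema above):

```python
import numpy as np
import pandas as pd

results = pd.read_csv("benchmark_results.csv")

# Queries where DataFusion errored outside the whitelist.
unexpected = results[results["unexpected_err"].astype(str).str.lower() == "true"]
print(f"{len(unexpected)} queries with unexpected DataFusion errors")

# Finite numeric time ratios ("FAIL"/"NA" coerce to NaN, "INF" to infinity).
ratios = pd.to_numeric(results["time_ratio"], errors="coerce")
ratios = ratios[np.isfinite(ratios)]
print("10 largest slowdowns vs DuckDB:")
print(ratios.sort_values(ascending=False).head(10))
```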
The original bug-triggering query is usually quite complex, and it is hard to tell from the raw reproducer how the bug is triggered, so the query has to be reduced to its minimal form.
This is done by prompting an agentic AI coding tool (Codex, Cursor, or Claude Code) with "please reduce this bug to the minimal form" plus the failing query.
Example of a reported reduced bug: apache/datafusion#17699
Each dot represents a single query's speed ratio, computed as `df_time / duckdb_time`. Dots in the top-right mean DataFusion's execution time is 100x DuckDB's on the same query.
Note that this run used a 10-second timeout, so queries that timed out may have even higher time ratios.
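One plausible way to produce such a scatter plot from the results CSV, sketched with matplotlib (the axes and reference lines are assumptions about the original figure):

```python
import matplotlib.pyplot as plt
import pandas as pd

results = pd.read_csv("benchmark_results.csv")
# Keep only rows where both engines produced numeric timings.
times = results[["df_time", "duckdb_time"]].apply(pd.to_numeric, errors="coerce").dropna()

plt.figure(figsize=(6, 6))
plt.scatter(times["duckdb_time"], times["df_time"], s=8, alpha=0.4)
plt.xscale("log")
plt.yscale("log")
# Reference lines: equal time (1x) and the 100x slowdown boundary.
lims = [times.min().min(), times.max().max()]
plt.plot(lims, lims, "k--", label="1x")
plt.plot(lims, [100 * v for v in lims], "r--", label="100x")
plt.xlabel("DuckDB time (s)")
plt.ylabel("DataFusion time (s)")
plt.legend()
plt.savefig("time_scatter.png")
```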
Around 2,500 of the ~17,000 queries fail on DataFusion even though they are runnable in PostgreSQL. Most of them are likely duplicated bugs, or unsupported features such as certain kinds of subqueries.
- Support the TPC-H dataset
- Support other datasets (TPC-DS, JOB)
- Triage bugs
- Investigate queries that are significantly slower than on DuckDB (most are likely due to bad join orders, or subqueries that DataFusion cannot yet unnest; how can we filter them out automatically? See the sketch after this list for a first cut)
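As a first cut at that last item, a sketch that flags queries above an assumed slowdown threshold for manual triage:

```python
import pandas as pd

SLOWDOWN_THRESHOLD = 10.0  # hypothetical cutoff; tune as needed

results = pd.read_csv("benchmark_results.csv")
ratio = pd.to_numeric(results["time_ratio"], errors="coerce")

# "INF" parses as infinity, so DataFusion timeouts are flagged as well;
# "NA" and failed runs coerce to NaN and are excluded by the comparison.
slow = results[ratio >= SLOWDOWN_THRESHOLD]
slow.to_csv("slow_queries.csv", index=False)
print(f"{len(slow)} queries flagged for manual triage")
```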