Skip to content

2010YOUY01/datafusion-sqlstorm

Repository files navigation

datafusion-sqlstorm

Original SQLStorm implementation: https://github.com/SQL-Storm/SQLStorm

Tracking issue in DataFusion: apache/datafusion#17698

Dataset setup

Generated datasets live in the data/ directory at the project root. The location is configurable when running the setup utility.

<project-root>/data
---- lineitem.parquet
---- orders.parquet
...

Generating tpch dataset

  1. Install tpchgen-cli https://github.com/clflushopt/tpchgen-rs/blob/main/tpchgen-cli/README.md
  2. Run tpchgen-cli -s 0.1 --format=parquet

Queries setup

Valid TPCH queries in SQLStorm is used https://github.com/SQL-Storm/SQLStorm/tree/master/v1.0/tpch/queries

  • valid means they're runnable in PostgreSQL

Running command

# Run benchmarks (defaults use <project-root>/data and SQLStorm query set)
uv run datafusion-sqlstorm --output benchmark_results.csv

# Re-run a single query for debugging ("error" or "rows")
uv run python -m datafusion_sqlstorm.reproduce rows 5 --timeout 5

Benchmark options:

  • --data-dir <path>: Parquet source directory (default <project-root>/data).
  • --queries-dir <path>: SQL files directory (default <project-root>/../SQLStorm/v1.0/tpch/queries/).
  • --output <path>: Benchmark CSV destination (default benchmark_results.csv).
  • --timeout <seconds>: Per-engine timeout in seconds (default 3.0).
  • --start-query-id <int>: First numeric query ID to execute (inclusive, optional).
  • --end-query-id <int>: Final numeric query ID to execute (exclusive, optional).

benchmark_results.csv columns:

  • query_id: Query file stem executed (e.g. 5 for 5.sql).
  • df_time / duckdb_time: Wall-clock seconds per engine (TIMEOUT or FAIL if the run exceeded the limit or errored).
  • df_mem / duckdb_mem: Peak resident set size in bytes (TIMEOUT/FAIL mirror the time columns).
  • unexpected_err: true when DataFusion failed with an error outside the whitelist.
  • row_count_match: true when both engines succeeded and returned matching row counts, if both engine succeed but return a different number of rows, then false, otherwise NA.
  • time_ratio: df_time / duckdb_time rounded to 6 decimals; INF when DataFusion timed out while DuckDB finished; NA when the ratio cannot be computed.
  • memory_ratio: df_mem / duckdb_mem rounded to 6 decimals; NA when either engine failed or memory usage was zero.

Example

cat benchmark_results.csv

query_id,df_time,df_mem,duckdb_time,duckdb_mem,unexpected_err,row_count_match,time_ratio,memory_ratio
5,FAIL,FAIL,FAIL,FAIL,false,NA,NA,NA
6,0.040621,110575616,0.033876,93175808,false,true,1.199119,1.186742
12,0.038163,269795328,0.031675,206192640,false,true,1.204827,1.308462
13,FAIL,FAIL,FAIL,FAIL,true,NA,NA,NA
...

Reporting bugs

The original bug-triggering query is quite complex, it hard to tell how is the bug triggered from the raw reproducer. So it's necessary to reduce the bug to its minimal form.

It's done by prompting 'please reduce this bug to the minimal form' + failing query, in agentic AI coding tools like Codex/Cursor/Claude Code

Example of reporting reduced bug: apache/datafusion#17699

Initial running result

Speed comparison

Each dot is for a single query's speed ratio from df_time / duckdb_time. The top-right dots means DataFusion's execution time is 100x slower than DuckDB's execution time on the same query. Note its timeout is 10 second, there might exist queries with higher time ratio if they're timed out. My local image

Failed queries

Around 2500/17000 queries are failed on DataFusion. They should be able to run in PostgreSQL. It's likely that they're mostly duplicated bugs, or unsupported features like certain subqueries.

TODO

  • Support TPCH dataset
  • Support other datasets (TPC-DS, JOB)
  • Triage bugs
  • Investigate queries that are significantly slower than DuckDB (most of them should be due to bad join order, or subqueries that DF are not able to unnest now, how can we filter them out automatically?)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages