Add IMDB queries (a.k.a. JOB - Join Order Benchmark) to DataFusion benchmark suite #12311

@doupache

Is your feature request related to a problem or challenge?

JOB (Join Order Benchmark) was proposed by a research team from TUM in the paper "How Good Are Query Optimizers, Really?".

It is used by HyPer, DuckDB, and CedarDB, and it is part of DuckDB's regression test suite. It is a good benchmark for testing join ordering and join operators.

I think if we add this test suite, it will also help with improvements like those discussed in #7955.

Describe the solution you'd like

JOB uses the IMDB dataset. The dataset is provided in csv.gz format and contains real-world data, which makes it ideal for testing DataFusion.

Tasks

  • Convert the dataset from csv.gz format to Parquet.
  • Add the IMDB license to the LICENSE file.
  • Add the benchmark queries.
  • Integrate the benchmark suite into dfbench.
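
For the first task, a rough sketch of the conversion step might look like the following. It is only an illustration, not the proposed implementation: it assumes `pyarrow` is installed, the helper names `parquet_path_for` and `convert` are made up here, and whether the IMDB files carry a header row is an assumption exposed as the `has_header` parameter. The file name `title.csv.gz` in the usage note is likewise illustrative.

```python
from pathlib import Path


def parquet_path_for(csv_gz: Path, out_dir: Path) -> Path:
    """Map '<name>.csv.gz' to '<out_dir>/<name>.parquet' (hypothetical helper)."""
    name = csv_gz.name
    if not name.endswith(".csv.gz"):
        raise ValueError(f"expected a .csv.gz file, got {name!r}")
    return out_dir / (name[: -len(".csv.gz")] + ".parquet")


def convert(csv_gz: Path, out_dir: Path, has_header: bool = True) -> Path:
    """Read one gzip-compressed CSV file and write it back out as Parquet."""
    # pyarrow is imported lazily so the path helper above works without it.
    import pyarrow.csv as pacsv
    import pyarrow.parquet as pq

    # Assumption: if the file has no header row, let pyarrow generate
    # placeholder column names instead of guessing the real schema.
    read_opts = pacsv.ReadOptions(autogenerate_column_names=not has_header)

    # read_csv decompresses automatically when the path ends in '.gz'.
    table = pacsv.read_csv(str(csv_gz), read_options=read_opts)

    out_path = parquet_path_for(csv_gz, out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pq.write_table(table, str(out_path))
    return out_path
```

Calling `convert(Path("title.csv.gz"), Path("parquet"))` would then write `parquet/title.parquet`. A real conversion would probably also need explicit column names and types per table, which this sketch leaves out.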

Once everything is set up, we will be able to easily run benchmarks using the following command:

cargo run --bin dfbench -- imdb --query=5

I would like to work on this!
Can someone help me understand the usual process for adding a third-party license in an Apache project?

cc @jayzhan211 @alamb

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement