TopK Fuzz Tests 🐝  #7749

@alamb

Description

Is your feature request related to a problem or challenge?

After #7721, a SortExec with a limit will use a special TopK implementation. We have basic unit tests, but I think the coverage could be improved, specifically with fuzz testing.

Describe the solution you'd like

What I would like is a new fuzz test added to the existing fuzz cases: https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/fuzz_cases

The structure of SortTest in https://github.com/apache/arrow-datafusion/blob/e95a24b5a260e0e2f603d52682d36cce192676f8/datafusion/core/tests/fuzz_cases/sort_fuzz.rs#L111 might be a good one to follow

The basic outline would be:

  1. Create an input with several columns (integers, strings, floats)
  2. Reorder the input randomly
  3. Divide the input into multiple batches using make_staggered_batches
  4. Run a query like SELECT * FROM t ORDER BY <col(s)> LIMIT <N> and collect the output
  5. Compute the expected result programmatically (e.g. by sorting the data prior to creating RecordBatches)
  6. Ensure the output matches the expected result
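
The outline above could be modelled roughly as follows. This is only a minimal std-only sketch, not DataFusion code: `Row`, `XorShift`, `staggered_batches`, `top_k_by_int`, and `run_case` are hypothetical names invented for illustration, plain `Vec<Row>` values stand in for RecordBatches, and a bounded heap stands in for the query under test. The integer key is kept unique so that the top N rows are unambiguous; with ties, a `LIMIT` query could legitimately return different rows.

```rust
use std::collections::BinaryHeap;

/// One row of a hypothetical test table with int, string and float columns.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub i: i64,
    pub s: String,
    pub f: f64,
}

/// Tiny deterministic xorshift generator (seed must be non-zero),
/// used so the sketch needs no external crates.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Step 1: build the input; `i` is unique so the expected order has no ties.
pub fn make_rows(n: usize) -> Vec<Row> {
    (0..n)
        .map(|idx| Row { i: idx as i64, s: format!("s{:04}", n - idx), f: idx as f64 * 0.5 })
        .collect()
}

/// Step 2: Fisher-Yates shuffle.
pub fn shuffle<T>(v: &mut [T], rng: &mut XorShift) {
    for i in (1..v.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        v.swap(i, j);
    }
}

/// Step 3: split into batches of random size between 1 and 50 rows
/// (a stand-in for make_staggered_batches, not its real behaviour).
pub fn staggered_batches(rows: &[Row], rng: &mut XorShift) -> Vec<Vec<Row>> {
    let mut batches = Vec::new();
    let mut rest = rows;
    while !rest.is_empty() {
        let take = 1 + (rng.next_u64() as usize % rest.len().min(50));
        let (head, tail) = rest.split_at(take);
        batches.push(head.to_vec());
        rest = tail;
    }
    batches
}

/// Step 4 stand-in: keep the k smallest keys with a bounded max-heap.
pub fn top_k_by_int(batches: &[Vec<Row>], k: usize) -> Vec<i64> {
    let mut heap: BinaryHeap<i64> = BinaryHeap::new();
    for row in batches.iter().flatten() {
        heap.push(row.i);
        if heap.len() > k {
            heap.pop(); // drop the largest key seen so far
        }
    }
    heap.into_sorted_vec()
}

/// Step 5: expected result from a full sort followed by truncation.
pub fn expected_by_int(rows: &[Row], k: usize) -> Vec<i64> {
    let mut keys: Vec<i64> = rows.iter().map(|r| r.i).collect();
    keys.sort();
    keys.truncate(k);
    keys
}

/// Steps 1-5 for one seed; the caller performs step 6 by comparing the pair.
pub fn run_case(seed: u64, n_rows: usize, k: usize) -> (Vec<i64>, Vec<i64>) {
    let mut rng = XorShift(seed);
    let mut rows = make_rows(n_rows);
    let expected = expected_by_int(&rows, k);
    shuffle(&mut rows, &mut rng);
    let batches = staggered_batches(&rows, &mut rng);
    (top_k_by_int(&batches, k), expected)
}
```

A caller would do something like `let (actual, expected) = run_case(42, 1000, 10); assert_eq!(actual, expected);`. In the real fuzz test, step 4 would instead run the `ORDER BY ... LIMIT` query through DataFusion on the staggered batches.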

Input size: 1000 rows

Parameters to vary

  1. sort cols: (int), (string), (float), (int, string), (string, int), etc.
  2. N: 1, 10, 100, 300 (aka how many are kept)
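
One possible way to enumerate these parameters is a small case matrix, sketched below with std only. `Row`, `SortCol`, `compare`, `cases`, and `expected` are hypothetical names, and the float ordering uses `f64::total_cmp` as an assumption; the ordering DataFusion applies to NaNs and nulls would need to be checked before relying on this comparator for the expected result.

```rust
use std::cmp::Ordering;

/// Hypothetical row with the three column types from the outline.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub i: i64,
    pub s: String,
    pub f: f64,
}

/// Which column a sort key refers to.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SortCol {
    Int,
    Str,
    Float,
}

/// Lexicographic comparison over the given sort columns, in order.
pub fn compare(a: &Row, b: &Row, cols: &[SortCol]) -> Ordering {
    for col in cols {
        let ord = match col {
            SortCol::Int => a.i.cmp(&b.i),
            SortCol::Str => a.s.cmp(&b.s),
            SortCol::Float => a.f.total_cmp(&b.f), // assumed float ordering
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    Ordering::Equal
}

/// The sort column sets listed above (the "etc." is left open).
pub fn sort_column_sets() -> Vec<Vec<SortCol>> {
    use SortCol::*;
    vec![vec![Int], vec![Str], vec![Float], vec![Int, Str], vec![Str, Int]]
}

/// The LIMIT values listed above.
pub const LIMITS: [usize; 4] = [1, 10, 100, 300];

/// Cross product of sort column sets and limits.
pub fn cases() -> Vec<(Vec<SortCol>, usize)> {
    let mut out = Vec::new();
    for cols in sort_column_sets() {
        for &n in LIMITS.iter() {
            out.push((cols.clone(), n));
        }
    }
    out
}

/// Expected output for one case: stable full sort, then keep the first n rows.
pub fn expected(rows: &[Row], cols: &[SortCol], n: usize) -> Vec<Row> {
    let mut sorted = rows.to_vec();
    sorted.sort_by(|a, b| compare(a, b, cols));
    sorted.truncate(n);
    sorted
}
```

With this shape, each `(cols, n)` pair from `cases()` would drive one query and one call to `expected`. Adding a column type would mean one new `SortCol` variant and one new match arm in `compare`.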

Bonus points
Make it easy to add new columns / types (e.g. a string dictionary column)

Describe alternatives you've considered

No response

Additional context

No response
