Add PRIMARY KEY Aggregate support to dataframe API #8356

mustafasrepo · 2023-11-29T12:07:30Z

Which issue does this PR close?

Closes #.

Rationale for this change

Currently, we can rewrite GROUP BY expressions based on primary key information.
Consider the query below:

SELECT ts, sn, SUM(amount) as sum1
  FROM sales_global_with_pk
  GROUP BY sn

Normally, after aggregation, only sn and SUM(amount) as sum1 would be available for output. However, given that sn is the PRIMARY KEY, we know that each sn value maps to a fixed ts value (i.e., ts is functionally dependent on sn). Hence, by treating the query above as

SELECT ts, sn, SUM(amount) as sum1
  FROM sales_global_with_pk
  GROUP BY ts, sn

under the hood, we can also output ts at the end. However, this feature was only available through the SQL API. See discussion.
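The rewrite described above can be illustrated with a small, std-only Rust sketch. This is a toy model and not DataFusion's implementation: the `FunctionalDependency` struct and the `expand_group_by` function are hypothetical names introduced only for this example. It assumes a dependency is given as a set of determinant columns and the columns they fix, and it adds a selected column to the GROUP BY list only when the existing group-by columns cover the determinants.

```rust
/// Hypothetical model of a functional dependency (not a DataFusion type):
/// every combination of `determinants` values fixes the `dependents` values.
pub struct FunctionalDependency {
    pub determinants: Vec<String>,
    pub dependents: Vec<String>,
}

/// Hypothetical helper: extends `group_by` with each selected column that is
/// functionally determined by the original group-by columns.
/// Columns that are not determined are left out, so the caller can still
/// reject them as invalid projections.
pub fn expand_group_by(
    group_by: &[String],
    selected: &[String],
    deps: &[FunctionalDependency],
) -> Vec<String> {
    let mut expanded = group_by.to_vec();
    for col in selected {
        // Already a grouping column: nothing to add.
        if expanded.contains(col) {
            continue;
        }
        // The column qualifies if some dependency has all of its determinants
        // among the original group-by columns and lists `col` as a dependent.
        let determined = deps.iter().any(|dep| {
            dep.determinants.iter().all(|d| group_by.contains(d))
                && dep.dependents.contains(col)
        });
        if determined {
            expanded.push(col.clone());
        }
    }
    expanded
}
```

With a dependency stating that sn determines ts and amount, calling `expand_group_by` with group-by columns [sn] and selected columns [ts, sn] yields [sn, ts]. The order differs from the GROUP BY ts, sn form shown above, but the order of grouping columns does not change which groups are produced.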

This PR brings this support to the dataframe API as well. With this PR, the following dataframe query can be executed, given that col_id is the PRIMARY KEY.

        // group by contains id column
        let group_expr = vec![col_id.clone()];
        let aggr_expr = vec![];
        let df = df.aggregate(group_expr, aggr_expr)?;

        // expr list contains id, name (i.e., the projection)
        let expr_list = vec![col_id, col_name];
        let df = df.select(expr_list)?;

What changes are included in this PR?

Are these changes tested?

Yes, most of the changes come from either plan tests or .slt tests.

Are there any user-facing changes?

alamb · 2023-11-30T22:57:18Z

I plan to review this PR tomorrow if no one beats me to it

ozankabak

I reviewed this very carefully (over 6 sittings as you can see in the commit history!) and it is almost ready to merge. It pays off a long-standing tech debt where functional dependencies worked in SQL but not in the dataframe API.

I asked @mustafasrepo to make one little optional fix and then it will be ready to go.

ozankabak · 2023-12-08T13:40:44Z

Since this has been around for a week, doesn't make any potentially disruptive changes, pays off an old technical debt, and I have already reviewed it very carefully, I will go ahead and merge this.

In case anything breaks, let us know and we will promptly fix it.

* Aggregate rewrite for dataframe API.
* Simplifications
* Minor changes
* Minor changes
* Add new test
* Add new tests
* Minor changes
* Add rule, for aggregate simplification
* Simplifications
* Simplifications
* Simplifications
* Minor changes
* Simplifications
* Add new test condition
* Tmp
* Push requirement below aggregate
* Add join and subqeury alias
* Add cross join support
* Minor changes
* Add logical plan repartition support
* Add union support
* Add table scan
* Add limit
* Minor changes, buggy
* Add new tests, fix existing bugs
* change concat type array_concat
* Resolve some of the bugs
* Comment out a rule
* All tests pass, when single distinct is closed
* Fix aggregate bug
* Change analyze and explain implementations
* All tests pass
* Resolve linter errors
* Simplifications, remove unnecessary codes
* Comment out tests
* Remove pushdown projection
* Pushdown empty projections
* Fix failing tests
* Simplifications
* Update comments, simplifications
* Remove eliminate projection rule, Add method for group expr len aggregate
* Simplifications, subquery support
* Update comments, add unnest support, simplifications
* Remove eliminate projection pass
* Change name
* Minor changes
* Minor changes
* Add comments
* Fix failing test
* Minor simplifications
* update
* Minor
* Remove ordering
* Minor changes
* add merge projections
* Add comments, resolve linter errors
* Minor changes
* Minor changes
* Minor changes
* Minor changes
* Minor changes
* Minor changes
* Minor changes
* Minor changes
* Review Part 1
* Review Part 2
* Fix quadratic search, Change trim_expr impl
* Review Part 3
* Address reviews
* Minor changes
* Review Part 4
* Add case expr support
* Review Part 5
* Review Part 6
* Finishing touch: Improve comments

---------

Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>

mustafasrepo added 30 commits October 9, 2023 16:02

Aggregate rewrite for dataframe API.

c7bae26

Merge branch 'apache_main' into enhance/aggregate_pk

66704c0

# Conflicts: # datafusion/common/src/functional_dependencies.rs

Simplifications

c7374d1

Minor changes

f669ef0

Minor changes

5d9dffb

Add new test

b353c7d

Add new tests

d2a902e

Minor changes

2c4fcd9

Add rule, for aggregate simplification

bfb8a78

Simplifications

16b55b3

Simplifications

209a262

Simplifications

9f8e5ab

Minor changes

aca4d9d

Simplifications

1c3d8a0

Add new test condition

d57be29

Tmp

3221c77

Push requirement below aggregate

1e45b13

Add join and subqeury alias

95476f2

Add cross join support

baa07b2

Minor changes

9dc7cfc

Add logical plan repartition support

4be0b04

Add union support

273890f

Add table scan

67338d2

Add limit

58aa3bc

Minor changes, buggy

d887f27

Add new tests, fix existing bugs

4ec506b

change concat type array_concat

0270546

Resolve some of the bugs

9e69390

Comment out a rule

c6f2fe4

All tests pass, when single distinct is closed

547d13f

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Nov 29, 2023

alamb mentioned this pull request Nov 30, 2023

DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 #8329

Closed

8 tasks

ozankabak and others added 13 commits December 4, 2023 17:40

Review Part 1

4346ac7

Merge branch 'apache_main' into enhance/aggregate_pk

ff9447a

# Conflicts: # datafusion/optimizer/src/optimize_projections.rs # datafusion/optimizer/src/optimizer.rs

Review Part 2

7fc6626

Fix quadratic search, Change trim_expr impl

05fe152

Merge branch 'apache_main' into enhance/aggregate_pk

c2ef739

# Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.

Review Part 3

77035b5

Address reviews

a8c314b

Minor changes

af172f8

Review Part 4

d4bf02a

Add case expr support

7502651

Review Part 5

ec39bab

Merge branch 'apache_main' into enhance/aggregate_pk

e6f9333

# Conflicts: # datafusion/optimizer/src/optimize_projections.rs

Review Part 6

0170942

ozankabak approved these changes Dec 8, 2023

mustafasrepo and others added 2 commits December 8, 2023 14:34

Merge branch 'apache_main' into enhance/aggregate_pk

38a3df6

# Conflicts: # datafusion/optimizer/src/optimize_projections.rs

Finishing touch: Improve comments

2b9de85

ozankabak merged commit 8f9d6e3 into apache:main Dec 8, 2023

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed

anlinc mentioned this pull request Jan 29, 2025

[substrait] Synthetically added grouping expressions in Aggregates can cause mismatched output columns #14348

Closed
