-
Couldn't load subscription status.
- Fork 1.7k
Add PRIMARY KEY Aggregate support to dataframe API #8356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# Conflicts: # datafusion/common/src/functional_dependencies.rs
|
I plan to review this PR tomorrow if no one beats me to it |
# Conflicts: # datafusion/optimizer/src/optimize_projections.rs # datafusion/optimizer/src/optimizer.rs
# Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
# Conflicts: # datafusion/optimizer/src/optimize_projections.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed this very carefully (over 6 sittings as you can see in the commit history!) and it is almost ready to merge. It pays a longstanding tech debt where functional dependencies worked in SQL but not in dataframe API.
I asked @mustafasrepo to make one little optional fix and then it will be ready to go.
# Conflicts: # datafusion/optimizer/src/optimize_projections.rs
|
Since this has been around for a week, it doesn't make any potentially disruptive change, pays an old technical debt, and that I have already reviewed it very carefully; I will go ahead and merge this. In case anything breaks, let us know and we will promptly fix it. |
* Aggregate rewrite for dataframe API. * Simplifications * Minor changes * Minor changes * Add new test * Add new tests * Minor changes * Add rule, for aggregate simplification * Simplifications * Simplifications * Simplifications * Minor changes * Simplifications * Add new test condition * Tmp * Push requirement below aggregate * Add join and subqeury alias * Add cross join support * Minor changes * Add logical plan repartition support * Add union support * Add table scan * Add limit * Minor changes, buggy * Add new tests, fix existing bugs * change concat type array_concat * Resolve some of the bugs * Comment out a rule * All tests pass, when single distinct is closed * Fix aggregate bug * Change analyze and explain implementations * All tests pass * Resolve linter errors * Simplifications, remove unnecessary codes * Comment out tests * Remove pushdown projection * Pushdown empty projections * Fix failing tests * Simplifications * Update comments, simplifications * Remove eliminate projection rule, Add method for group expr len aggregate * Simplifications, subquery support * Update comments, add unnest support, simplifications * Remove eliminate projection pass * Change name * Minor changes * Minor changes * Add comments * Fix failing test * Minor simplifications * update * Minor * Remove ordering * Minor changes * add merge projections * Add comments, resolve linter errors * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Review Part 1 * Review Part 2 * Fix quadratic search, Change trim_expr impl * Review Part 3 * Address reviews * Minor changes * Review Part 4 * Add case expr support * Review Part 5 * Review Part 6 * Finishing touch: Improve comments --------- Co-authored-by: berkaysynnada <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Which issue does this PR close?
Closes #.
Rationale for this change
Currently we can rewrite group by expressions according to primary key.
Consider query below
normally, after aggregation only
snandSUM(amount) as sum1would be available for output. However, givensnisPRIMARY_KEY, we know that eachsnvalue maps to a fixedtsvalue (e.g They are functionally depandant). Hence, by treating query above asunder the hood, we can output
tsalso at the end. However, this feature was only available withSQLAPI. See discussion.This PR brings this support to the dataframe API also. With this PR following dataframe query can be executed given
col_idisPRIMARY_KEY.What changes are included in this PR?
Are these changes tested?
Yes, most of the changes comes from either plan tests, or
.slttestAre there any user-facing changes?