- 
                Notifications
    You must be signed in to change notification settings 
- Fork 28.9k
[SPARK-27604][SQL] Add filter reduction #24495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| I filed https://issues.apache.org/jira/browse/SPARK-27604 to cover some improvements regarding constraints. This PR adds some improvement although there is room for more. cc. @gengliangwang @cloud-fan @mgaido91 @viirya @hvanhovell @gatorsmile @dongjoon-hyun as I saw you worked on similar topics before. | 
Change-Id: I8031a37f380d5fe5fff37488460b790dd65c0161
Change-Id: I79ec5286b4762d98abaea638f8a9e5e9f7f753d0
| * SELECT * FROM table WHERE i = 5 AND j = i + 3 | ||
| * ==> SELECT * FROM table WHERE i = 5 AND j = 8 | ||
| * SELECT * FROM table WHERE i = 5 AND j = i + 3 => SELECT * FROM table WHERE i = 5 AND j = 8 | ||
| * SELECT * FROM table WHERE i <= 5 AND i = 5 => SELECT * FROM table WHERE i = 5 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, @peter-toth . I don't think this aims the equal goal. Could you make a separate rule instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filter reduction should be a separate rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you all for the feedback. The 2 rules seemed similar and easy to combine to me, but I will not mix them then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I separated the 2 rules. I added some improvement to the ConstantPropagation and created a new FilterReduction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved all new filter reduction related logic to FilterReduction and extracted constant propagation enhancements to a separate PR.
        
          
                ...talyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ConstantPropagationSuite.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | * SELECT * FROM table WHERE i = 5 AND j = i + 3 | ||
| * ==> SELECT * FROM table WHERE i = 5 AND j = 8 | ||
| * SELECT * FROM table WHERE i = 5 AND j = i + 3 => SELECT * FROM table WHERE i = 5 AND j = 8 | ||
| * SELECT * FROM table WHERE i <= 5 AND i = 5 => SELECT * FROM table WHERE i = 5 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Change-Id: Ia68242ba5067e84143e35a1126a49017c5a203f6
        
          
                sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...talyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | val newCondition = | ||
| splitConjunctivePredicates(condition) | ||
| .map(rewriteEqual) | ||
| .sortBy(p => scala.util.hashing.MurmurHash3.seqHash(Seq(p.getClass, p))) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the class of expression to the sorting order here and in rewriteEqual() as hashCode() of TreeNode doesn't take the class of the expression into account and
comparePlans(testRelation.where('a < 2 && 'a === 2).analyze, testRelation.where('a === 2 && 'a < 2).analyze) would fail.
Change-Id: Ice97dbafb67bf6cf84107b9064f42be991cbd9f8
| Hi, @peter-toth . Thank you for splitting the rule. | 
| Thanks @dongjoon-hyun, I extracted the improved  | 
| * - Using this mapping, replace occurrence of the expressions with the corresponding constant | ||
| * values in the AND node. | ||
| */ | ||
| object ConstantPropagation extends Rule[LogicalPlan] with PredicateHelper { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since new PR is created, we had better remove this change from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, reverted.
Please note that some of the new UTs will require the enhanced constant propagation to work as expected.
        
          
                sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
          
            Show resolved
            Hide resolved
        
      Change-Id: I8ba31ae5043184d53662a4cedfe011a36d63759f
        
          
                sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterReductionSuite.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | Batch("AnalysisNodes", Once, | ||
| EliminateSubqueryAliases) :: | ||
| Batch("FilterReduction", FixedPoint(10), | ||
| ConstantPropagation, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which test case requires this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like a <= 2 AND 2 <= a where I would expect a = 2 reduced result. FilterReduction can transforms it into a <= 2 AND 2 = a. Then ConstantPropagation (only the enhanced in #24553) can transform it into 2 <= 2 AND 2 = a ...
Change-Id: Ifef383125dd8827fdeea90e8101cb432de9e704e
        
          
                sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | case c EqualNullSafe d if planEqual(a, d) => EqualTo(c, d) | ||
| } | ||
| case a EqualTo b => expression transformUp { | ||
| case c LessThan d if planEqual(b, d) && planEqual(a, c) => Literal.FalseLiteral | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually these too are wrong with nulls, as they should return null instead of false and the difference is significant (consider the case of a Not in front of it..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I think you are right and this is an issue at many other places too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the solution could be checking if a, b, c and d not nullable and then we can use transform. Otherwise limiting a top-down way replacement until we have any of And, Or, CaseWhen or If nodes and can't go deeper if we hit something else. This is what ReplaceNullWithFalseInPredicate rule does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the issue and added some new UTs.
| ok to test | 
| Test build #105735 has finished for PR 24495 at commit  
 | 
| Thanks @HyukjinKwon for enabling the tests. Some UT failures are expected until #24553 is not merged. | 
| can we enrich the PR description with how it is implemented? e.g. the overall design, the algorithm that is used. | 
| 
 @cloud-fan, I'm happy to take this up again, but there is a prerequisite PR to this one which I would like to polish first: #24553 | 
| We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. | 
What changes were proposed in this pull request?
This PR adds a new
FilterReductionrule that can reduceBinaryComparisonfilters like these:How was this patch tested?
Existing and new UTs.