[SPARK-17868][SQL] Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS #15484
Conversation
We don't check whether the expression is in the GROUP BY list here; that check has moved to the Analysis stage.
That is fine.
The expression in `condition` could be unresolved; we should check for this and leave the operator unchanged until `Filter.condition` is resolved. A test case in `ResolveGroupingAnalyticsSuite` would fail before this PR.
Will add negative cases for this.
Test build #66960 has finished for PR 15484 at commit
Test build #66967 has finished for PR 15484 at commit
@hvanhovell Could you look at this please? Thank you!
This is too complex. It is hard to grasp what is going on here. Let's make this imperative:
val buffer = mutable.Buffer.empty[Seq[Expression]]
var current = exprs
while (current.nonEmpty) {
  buffer += current
  current = current.init
}
buffer
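For reference, a self-contained version of this suggestion (the name `rollupExprs` and the `String` element type are assumptions for illustration; the real code operates on Catalyst expressions):

```scala
import scala.collection.mutable

// Build the ROLLUP grouping sets by repeatedly dropping the last
// expression, longest prefix first.
def rollupExprs(exprs: Seq[String]): Seq[Seq[String]] = {
  val buffer = mutable.Buffer.empty[Seq[String]]
  var current = exprs
  while (current.nonEmpty) {
    buffer += current
    current = current.init
  }
  buffer.toSeq
}

rollupExprs(Seq("a", "b", "c"))
// => Seq(Seq(a, b, c), Seq(a, b), Seq(a))
```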
We could make this one more concise:
def cubeExprs(exprs: Seq[String]): Seq[Seq[String]] = exprs match {
  case x :: xs =>
    val initial = cubeExprs(xs)
    initial.map(x +: _) ++ initial
  case Nil =>
    Seq(Seq.empty)
}
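A runnable sketch of this suggestion, with one caveat made explicit: the `::`/`Nil` patterns only match `List`, so converting via `toList` keeps the match total for any `Seq` implementation:

```scala
def cubeExprs[T](exprs: Seq[T]): Seq[Seq[T]] = exprs.toList match {
  case x :: xs =>
    // Every subsequence of the rest, once with x prepended, once without.
    val initial = cubeExprs(xs)
    initial.map(x +: _) ++ initial
  case Nil =>
    Seq(Seq.empty)
}

cubeExprs(Seq("a", "b"))
// => Seq(Seq(a, b), Seq(a), Seq(b), Seq())
// All 2^n subsequences, preserving the original order, as CUBE requires.
```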
For future reference: `exprs.drop(1)` -> `exprs.tail`
For future reference: `exprs.take(1)` -> `exprs.head`
A more generic comment here would be to put the common code used by Cube, Rollup & Grouping Sets in a single method. It seems a bit wasteful to rewrite this into a GroupingSets plan, only to pick that up later down the line (especially since determining things like nullability is trivial for Cube and Rollup).
Just use a map here: _.expression().asScala.map(e => expression(e))
That is fine.
You only need `foldLeft` when you want to traverse the collection in a certain order. Folds are typically not the easiest to understand, so use them sparingly and prefer more imperative constructs.
Could you rewrite this using a more imperative approach?
Just for kicks & giggles:
groupingSetAttrs.map(attrMap).map(index => ~(1 << (numAttributes - 1 - index))).reduce(_ & _)
@hvanhovell Thank you for your comments, they are awesome! I've made some changes following your advice; I hope the code looks better now. Thanks a lot!
Test build #67253 has finished for PR 15484 at commit
Test build #67325 has finished for PR 15484 at commit
Could you add a little bit of documentation on the mask? It is non-trivial to understand.
It might also be a good idea to split this into two separate statements: one to calculate the attribute masks and one to reduce them.
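A sketch of what that split could look like (hedged; the setup values are made up, only the two-statement shape matters):

```scala
// Hypothetical setup: three grouping attributes a, b, c at indices 0..2,
// and the grouping set (a, c).
val attrMap = Map("a" -> 0, "b" -> 1, "c" -> 2)
val numAttributes = attrMap.size
val groupingSetAttrs = Seq("a", "c")

// Statement 1: one mask per selected attribute. ~(1 << pos) clears the
// bit at pos = numAttributes - 1 - index; a 0 bit means the attribute
// is a grouping column in this set, a 1 bit means it is grouped away.
val attrMasks = groupingSetAttrs.map(attrMap).map { index =>
  ~(1 << (numAttributes - 1 - index))
}

// Statement 2: reduce. Start from all-ones (everything grouped away)
// and clear one bit per present attribute; (a, c) yields binary 010 = 2.
val fullMask = (1 << numAttributes) - 1
val bitmask = attrMasks.foldLeft(fullMask)(_ & _)
```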
idx is not used?
+1. Looking at it more, I feel `zipWithIndex` is not needed at all and `map` would suffice.
We don't need the `case` here.
cc @davies too
Avoid using `ArrayBuffer`, as insertions would lead to expansion of the underlying array and copying of the data to the new one. Since you know the size upfront, you could create an `Array` of the required size.
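A minimal sketch of that preallocation (the function name and `String` element type are placeholders):

```scala
// One grouping set per non-empty prefix, so the final size is known.
def rollupExprsPrealloc(exprs: Seq[String]): Seq[Seq[String]] = {
  val result = new Array[Seq[String]](exprs.length)
  var current = exprs
  var i = 0
  while (current.nonEmpty) {
    result(i) = current // longest prefix first, as in the buffer version
    i += 1
    current = current.init
  }
  result.toSeq
}
```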
The use of `ArrayBuffer` makes this piece of code more concise, and since the sequence of `exprs` is usually not very long, performance is probably not the major concern here. I'd prefer to keep this one, is that OK? @hvanhovell
Is this just `exprs.inits`? To be honest, this is the first time I've seen the use of `init`/`inits` on a trait.
`exprs.inits` is much more concise.
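For anyone else seeing it for the first time, `inits` iterates over all prefixes, longest first, including the empty one:

```scala
Seq(1, 2, 3).inits.toList
// => List(List(1, 2, 3), List(1, 2), List(1), List())
// Note: unlike the imperative buffer loop above, this includes the empty
// sequence, i.e. the grand-total grouping set, so the two are not
// drop-in equivalents.
```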
Can you also display the GROUP BY list in the message?
@tejasapatil @rxin I've addressed most of your comments, thanks for reviewing this!
Test build #67405 has finished for PR 15484 at commit
@davies Would you please have a look at this PR? Thank you!
@jiangxb1987 Could you say a little bit more about the "minor bug"? That would help us decide whether this patch should be backported or not.
It is not really a bug. Bitmap manipulation has bitten us quite a few times in the past, so I would rather use expressions.
I see, this PR just improves readability, so we don't need to backport it. Looks good to me overall.
What else should I update on this PR? Please don't hesitate to request any changes, thanks!
ping @hvanhovell
I'd also write unit tests specifically for `cubeExprs` and `rollupExprs`.
Also I think you can just use subsets? e.g.
scala> Seq(1, 2, 3).toSet.subsets.foreach(println)
Set()
Set(1)
Set(2)
Set(3)
Set(1, 2)
Set(1, 3)
Set(2, 3)
Set(1, 2, 3)
I'm afraid we can't just map the `exprs` to a set, because we want to keep the original order.
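A quick demonstration of the ordering concern (hedged; Scala's small immutable sets happen to preserve insertion order up to four elements, so the effect only shows on larger ones):

```scala
// With more than four elements, the immutable Set is hash-based, so its
// iteration order generally differs from insertion order and the
// expressions inside each subset could come back shuffled.
Seq("e", "d", "c", "b", "a").toSet.subsets(2).foreach(println)
```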
Force-pushed the branch from f8c6a04 to ef3a733.
Test build #68245 has finished for PR 15484 at commit
Test build #68246 has finished for PR 15484 at commit
Does this version look good now?
ping @hvanhovell
LGTM - merging to master. Thanks!
[SPARK-17868][SQL] Do not use bitmasks during parsing and analysis of CUBE/ROLLUP/GROUPING SETS

## What changes were proposed in this pull request?
We generate bitmasks for grouping sets during the parsing process, and use these during analysis. These bitmasks are difficult to work with in practice and have led to numerous bugs. This PR removes these and uses actual sets instead; however, we still need to generate these offsets for the grouping_id.
This PR does the following work:
1. Replace bitmasks by actual grouping sets during the Parsing/Analysis stage of CUBE/ROLLUP/GROUPING SETS;
2. Add a new test suite `ResolveGroupingAnalyticsSuite` to test the `Analyzer.ResolveGroupingAnalytics` rule directly;
3. Fix a minor bug in `ResolveGroupingAnalytics`.

## How was this patch tested?
By existing test cases, plus the new test suite `ResolveGroupingAnalyticsSuite` to test the rule directly.

Author: jiangxingbo <[email protected]>
Closes apache#15484 from jiangxb1987/group-set.
## What changes were proposed in this pull request?
Spark with Scala 2.10 fails with a group by cube:
```
spark.range(1).select($"id" as "a", $"id" as "b").write.partitionBy("a").mode("overwrite").saveAsTable("rollup_bug")
spark.sql("select 1 from rollup_bug group by rollup ()").show
```
It can be traced back to #15484, which made `Expand.projections` a lazy `Stream` for group by cube. In Scala 2.10 `Stream` captures a lot of stuff, and in this case it captures the entire query plan, which has some un-serializable parts. This change is also good for the master branch, to reduce the serialized size of `Expand.projections`.

## How was this patch tested?
Manually verified with Spark with Scala 2.10.

Author: Wenchen Fan <[email protected]>
Closes #19289 from cloud-fan/bug.
(cherry picked from commit ce6a71e)
Signed-off-by: gatorsmile <[email protected]>
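A minimal sketch of the Stream-capture failure mode described above (hedged; `QueryPlanLike` and `lazyProjections` are made-up stand-ins, not Spark classes):

```scala
// A lazy Stream's tail is an unevaluated thunk that closes over its
// environment, so serializing the Stream drags that environment along.
class QueryPlanLike { // stand-in for a non-serializable query plan
  def next(i: Int): Int = i + 1
}

def lazyProjections(plan: QueryPlanLike): Stream[Int] =
  Stream.iterate(0)(plan.next) // every cell's tail captures `plan`

// Shipping such a Stream to executors fails with
// java.io.NotSerializableException; materializing it up front, e.g.
// lazyProjections(plan).take(4).toList, avoids the capture.
```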
What changes were proposed in this pull request?
We generate bitmasks for grouping sets during the parsing process, and use these during analysis. These bitmasks are difficult to work with in practice and have led to numerous bugs. This PR removes these and uses actual sets instead; however, we still need to generate these offsets for the grouping_id.
This PR does the following work:
1. Replace bitmasks by actual grouping sets during the Parsing/Analysis stage of CUBE/ROLLUP/GROUPING SETS;
2. Add a new test suite `ResolveGroupingAnalyticsSuite` to test the `Analyzer.ResolveGroupingAnalytics` rule directly;
3. Fix a minor bug in `ResolveGroupingAnalytics`.
How was this patch tested?
By existing test cases, plus the new test suite `ResolveGroupingAnalyticsSuite` to test the rule directly.
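For context, a quick illustration of the expansions this analysis rule produces (illustrative only; `spark` is a `SparkSession`, and the table and column names are made up):

```scala
// CUBE(dept, role)   expands to grouping sets ((dept, role), (dept), (role), ())
// ROLLUP(dept, role) expands to grouping sets ((dept, role), (dept), ())
// grouping_id() reports which columns are grouped away in each output row.
spark.sql("""
  SELECT dept, role, count(*), grouping_id()
  FROM employees
  GROUP BY dept, role GROUPING SETS ((dept, role), (dept), ())
""").show()
```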