[SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide #16329
Conversation
Test build #70319 has finished for PR 16329 at commit
Maybe add a little explanation here. For example, when I first saw this, I tried to figure out where "salary" appears in the code, since in practice it is accessed by index only (input.getLong(0)).
@assafmendelson Yes, your point is definitely reasonable. Now I am thinking whether I should keep "salary" here. As an option, I can replace "salary" with "inputColumn" or something like this to make MyAverage more generic. No reason to bind it to salary. What's your opinion?
I would go with inputColumn.
What I think should be explained more strongly is that this is the schema of the input to the aggregate function, not of the source DataFrame. Otherwise someone might think their original DataFrame needs a column with this name.
I believe an explanation of what MutableAggregationBuffer is should be added.
Basically, explain that it is a row, how to access it, and what it means for it to be mutable (probably including that array and map types are immutable even if the buffer itself is mutable), etc.
Agree, I will try to add a small but meaningful explanation here.
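To make the point above concrete, here is a Spark-free sketch of the idea behind MutableAggregationBuffer: a row whose top-level slots are read and written by position, not by the source column name. The class and method names below are illustrative stand-ins, not Spark's actual implementation.

```java
// Illustrative stand-in for a mutable aggregation buffer: a row whose
// fields are accessed positionally, like buffer.getLong(0) in a UDAF.
class SketchBuffer {
    private final Object[] values;

    SketchBuffer(int size) {
        this.values = new Object[size];
    }

    // Positional read, analogous to buffer.getLong(0).
    long getLong(int i) {
        return (Long) values[i];
    }

    // Positional write, analogous to buffer.update(0, newValue).
    // Only these top-level slots are mutable; a value stored in a slot
    // (e.g. an array or map) must be replaced wholesale, not edited in place.
    void update(int i, Object value) {
        values[i] = value;
    }
}
```

In a real UDAF, the update step would then look roughly like `buffer.update(0, buffer.getLong(0) + input.getLong(0))`: read a slot, compute, and write the slot back.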
Test build #70374 has finished for PR 16329 at commit
It's a little confusing to have the comment here for this optimization but then not implement it.
It might be a little clearer if this was a Person with a name and salary.
Same comment here with object reuse.
Maybe comment on what `name` is doing here. I actually had to look it up.
This is great! Thanks for taking the time to write up such complete examples. I think this was a big gap in the existing docs. One other ask: the screenshot is great, but I'd like to see which parts actually make it into the code snippets in the doc. Ideally you could post a link to the compiled doc. If that's hard, I can also try to build locally.

@marmbrus I have updated the pull request. The compiled docs can be found here. I did not manage to build the Java API docs; I believe the problem is in my local installation. Therefore, I checked each URL manually; they should work once the API docs are compiled. I will verify everything one more time in the nightly build.

Test build #70405 has finished for PR 16329 at commit
docs/sql-programming-guide.md
Outdated
As a suggestion, I'd change this to read:
"The built-in DataFrames functions provide common aggregations such as count(), countDistinct(), avg(), max(), and min()."
Yes, that will be easier to read. Thanks
docs/sql-programming-guide.md
Outdated
I think it'd be worth showing a Spark SQL example using the included/pre-defined functions. Since your example implements 'avg', maybe use 'min' / 'max'?
Alternatively, the example could be added to the SQL statements in the main driver for the UserDefinedAggregateFunction implementations.
I also thought about this. In my view, it would be appropriate to have a separate subsection before Aggregations to show how to apply predefined SQL functions, including writing your own UDFs. That would be worth another pull request. Alternatively, I can also try to extend this one to add an example of max() or min(). @marmbrus what's your opinion?
Should all UDAFs written in Java be static classes? Similarly, should Scala implementations be Scala objects?
I tried to group all relevant entities for each example into one single class. Static classes are used here just to avoid the cumbersome code for creating the entities inside the main() method. Scala objects are used for a similar reason. I can mention this, but it will add more comments to the already long examples.
My comments are generally small suggestions; the PR looks great to me.
I wonder if a discussion of the types and functions in a UDAF would be worthwhile.
Also, the code looks similar to the Hive test code here: https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java.
Why are you constructing a new object instead of modifying the vars in one of the parameters? Is it required in the merge method and not in the reduce method?
@michalsenkyr It is not required to create a new object in the merge method. One can modify the vars and return the existing object just like in the reduce method. However, it is less critical here since this method will be called on pre-aggregated data and not for every element. On the one hand, I can apply here the same approach as in the reduce method to make the example consistent. On the other hand, the current code shows that it is not mandatory to modify vars. Probably, a comment might help. I am not sure which approach is better. Therefore, I am open to suggestions.
Personally, I prefer consistency. When I saw this, I immediately wondered whether there was a specific reason you did it this way.
I'd rather see both methods use the same paradigm; in this case, probably the immutable one, as the option of mutability is already mentioned in the comment above.
Or you can mention it again in the comment on this method if you want to provide examples of both. As it stands, it just seems a little confusing.
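The two paradigms under discussion can be shown side by side with a plain-Java stand-in for the aggregation buffer (the Avg class and method names here are illustrative; Spark is not required). Both variants compute the same result; the in-place variant avoids an allocation per call, which matters most in reduce (invoked per element) and less in merge (invoked once per partial aggregate).

```java
// Intermediate state of an average aggregation: a running sum and count.
class Avg {
    long sum;
    long count;

    Avg(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }
}

class MergeStyles {
    // Style 1: mutate one argument and return it (the paradigm used in reduce).
    static Avg mergeInPlace(Avg b1, Avg b2) {
        b1.sum += b2.sum;
        b1.count += b2.count;
        return b1;
    }

    // Style 2: build and return a fresh object, leaving both inputs intact.
    static Avg mergeFresh(Avg b1, Avg b2) {
        return new Avg(b1.sum + b2.sum, b1.count + b2.count);
    }
}
```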
If you are having trouble building Javadoc, try switching to Java 7 temporarily. Java 8 introduced stricter Javadoc rules that may fail the docs build. Unfortunately, Jenkins doesn't catch these, so new errors get introduced over time.

Test build #70558 has finished for PR 16329 at commit
More detailed comments
Improved consistency
Force-pushed from 0c55b94 to 0b17e13
Test build #71863 has finished for PR 16329 at commit
Except for one small question, the text/code looks OK. Others who know the content seem to approve, and you have gotten tests to pass, so this seems reasonable.
```java
public static class MyAverage extends Aggregator<Employee, Average, Double> {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  public Average zero() {
```
Is this meant to be MyAverage?
@srowen Average is a Java bean that holds the current sum and count. It is defined earlier. Here it represents a zero value. MyAverage, in turn, is the actual aggregator that accepts instances of the Employee class, stores intermediate results using an instance of Average, and produces a Double as a result.
I can rename MyAverage to MyAverageAggregator if this makes things clearer.
My bad, I read this incorrectly while skimming.
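The "any b + zero = b" property mentioned in the snippet's comment can be checked directly with a plain-Java stand-in for the Average bean (the class and method names are illustrative; Spark is not required). The zero value (sum 0, count 0) is the identity element: merging it into any buffer leaves that buffer's state unchanged.

```java
// Buffer type of the aggregation: a running sum and count.
class Average {
    long sum;
    long count;

    Average(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }
}

class AvgLaws {
    // The identity element for the aggregation.
    static Average zero() {
        return new Average(0L, 0L);
    }

    // Combining two intermediate buffers.
    static Average merge(Average b1, Average b2) {
        return new Average(b1.sum + b2.sum, b1.count + b2.count);
    }
}
```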
Sorry for the delay. This LGTM, but I'm currently away from my Apache SSH keys. Other committers should feel free to merge if you get there before I do.

Sure, let me quickly go over the changes. Will merge it after that.
LGTM
## What changes were proposed in this pull request?

- A separate subsection for Aggregations under "Getting Started" in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
- Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
- Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
- Python is not covered.
- The PR might not resolve the ticket since I do not know what exactly was planned by the author.

In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references these examples and does not contain hard-coded snippets.

## How was this patch tested?

The patch was tested locally by building the docs. The examples were run as well.



Author: aokolnychyi <[email protected]>

Closes #16329 from aokolnychyi/SPARK-16046.

(cherry picked from commit 3fdce81)
Signed-off-by: gatorsmile <[email protected]>
Thanks! Merging to master/2.1