[SPARK-27207][SQL] : Ensure aggregate buffers are initialized again for So… #24149

pgandhi999 · 2019-03-19T21:33:49Z

…rtBasedAggregate

Normally, the operations invoked on an aggregation buffer for a User Defined Aggregate Function (UDAF) follow the order initialize(), update(), eval() or initialize(), merge(), eval(). However, once the threshold configured by spark.sql.objectHashAggregate.sortBased.fallbackThreshold is reached, ObjectHashAggregate falls back to SortBasedAggregator, which invokes merge() or update() without first calling initialize() on the aggregate buffer.
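To make the expected call order concrete, here is a minimal sketch in plain Scala. It does not use Spark's API: `AggLifecycleSketch`, `MaxBuffer`, and `MaxAgg` are hypothetical names invented for illustration, and the buffer slot is modelled as a one-element array. It shows the two normal paths and what may go wrong when merge() runs on a slot that was never initialized.

```scala
// Hypothetical, simplified stand-in for a UDAF; this is NOT Spark's actual API.
object AggLifecycleSketch {
  final class MaxBuffer(var value: Int)

  // The function object keeps no state of its own; everything lives in the slot it is handed.
  object MaxAgg {
    def initialize(slot: Array[MaxBuffer]): Unit = {
      slot(0) = new MaxBuffer(Int.MinValue)
    }
    def update(slot: Array[MaxBuffer], input: Int): Unit = {
      slot(0).value = math.max(slot(0).value, input)
    }
    def merge(slot: Array[MaxBuffer], other: MaxBuffer): Unit = {
      slot(0).value = math.max(slot(0).value, other.value)
    }
    def eval(slot: Array[MaxBuffer]): Int = slot(0).value
  }

  // Expected order: initialize(), update(), eval().
  def updatePath(inputs: Seq[Int]): Int = {
    val slot = new Array[MaxBuffer](1)
    MaxAgg.initialize(slot)
    inputs.foreach(i => MaxAgg.update(slot, i))
    MaxAgg.eval(slot)
  }

  // Expected order: initialize(), merge(), eval().
  def mergePath(partials: Seq[MaxBuffer]): Int = {
    val slot = new Array[MaxBuffer](1)
    MaxAgg.initialize(slot)
    partials.foreach(p => MaxAgg.merge(slot, p))
    MaxAgg.eval(slot)
  }

  // Failure mode: merge() on a slot whose buffer object was never created.
  def mergeWithoutInitialize(partial: MaxBuffer): Either[String, Int] = {
    val slot = new Array[MaxBuffer](1) // slot(0) is still null
    try {
      MaxAgg.merge(slot, partial)
      Right(MaxAgg.eval(slot))
    } catch {
      case _: NullPointerException => Left("buffer was never initialized")
    }
  }
}
```

In this toy model the uninitialized case surfaces as a NullPointerException; whether a real UDAF fails visibly depends on how its buffer is represented.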

What changes were proposed in this pull request?

The fix is to initialize the aggregate buffers again when the fallback to the SortBasedAggregate operator happens.

How was this patch tested?

The patch was tested as part of SPARK-24935 as documented in PR #23778.

pgandhi999 · 2019-03-19T21:39:53Z

cc @cloud-fan This is a bug in SortBasedAggregate that was exposed while testing PR #23778. I have filed a separate JIRA along with this PR. Could you please review it? Thank you.

pgandhi999 · 2019-03-19T21:41:40Z

ok to test

SparkQA · 2019-03-20T02:09:46Z

Test build #103691 has finished for PR 24149 at commit 400db3d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-03-20T08:55:19Z

...core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala


  // Hacking the aggregation mode to call AggregateFunction.merge to merge two aggregation buffers
-  private val mergeAggregationBuffers: (InternalRow, InternalRow) => Unit = {
+  var (sortBasedAggExpressions, sortBasedAggFunctions): (


why it's var instead of val?

Changed it to val

cloud-fan · 2019-03-20T08:58:33Z

...core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala

-  private def createNewAggregationBuffer(): SpecificInternalRow = {
-    val bufferFieldTypes = aggregateFunctions.flatMap(_.aggBufferAttributes.map(_.dataType))
+  private def createNewAggregationBuffer(
+    functions: Array[AggregateFunction]): SpecificInternalRow = {


4 space indentation here.

cloud-fan · 2019-03-20T08:59:07Z

...core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala

  }

-  private def initAggregationBuffer(buffer: SpecificInternalRow): Unit = {
+  private def initAggregationBuffer(


it's only called once, let's inline it

cloud-fan · 2019-03-20T09:00:01Z

good catch! Can we add a UT? we can create a TypedImperativeAggregate implementation which fails if initialize is not called.

Adding unit test and refactoring code

pgandhi999 · 2019-03-21T16:27:30Z

@cloud-fan Thank you for reviewing. Have added a unit test which was failing before the code change.

SparkQA · 2019-03-21T16:27:44Z

Test build #103773 has started for PR 24149 at commit ea050f7.

SparkQA · 2019-03-21T16:37:11Z

Test build #103775 has started for PR 24149 at commit 0714876.

cloud-fan · 2019-03-21T18:40:43Z

sql/core/src/test/scala/org/apache/spark/sql/TypedImperativeAggregateSuite.scala

+    var maxValueBuffer: MaxValue = null
+    override def createAggregationBuffer(): MaxValue = {
+      // Returns Int.MinValue if all inputs are null
+      maxValueBuffer = new MaxValue(Int.MinValue)


why do we need to save it to a member variable? I think the bug can be exposed even if we just return the buffer here.

@cloud-fan I am still looking into it, but for some reason calling merge() without invoking initialize() does not cause any visible exception for normal UDAFs, yet it fails with a NullPointerException when I run the test case described in SPARK-24935 (PR #24144). My guess is that the exception shows up for that test case because two different aggregation buffer instances are created (SketchState and UnionState). I will investigate further and get back to you soon. Thank you.

shaneknapp · 2019-03-21T20:00:57Z

test this please

SparkQA · 2019-03-22T00:16:37Z

Test build #103782 has finished for PR 24149 at commit 0714876.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

pgandhi999 · 2019-03-25T21:33:49Z

@cloud-fan Regarding our discussion in PR #24144, I just found a case where Spark initializes a UDAF, runs update, and then runs merge. It happens in SortBasedAggregator, so the code blows up in this case. The relevant code in ObjectAggregationIterator.scala is pasted below:

      // Two-way merges initialAggBufferIterator and inputIterator
      private def findNextSortedGroup(): Boolean = {
        if (hasNextInput || hasNextAggBuffer) {
          // Find smaller key of the initialAggBufferIterator and inputIterator
          groupingKey = findGroupingKey()
          result = new AggregationBufferEntry(groupingKey, makeEmptyAggregationBuffer)

          // Firstly, update the aggregation buffer with input rows.
          while (hasNextInput &&
            groupingKeyOrdering.compare(inputIterator.getKey, groupingKey) == 0) {
            processRow(result.aggregationBuffer, inputIterator.getValue)
            hasNextInput = inputIterator.next()
          }

          // Secondly, merge the aggregation buffer with existing aggregation buffers.
          // NOTE: the ordering of these two while-block matter, mergeAggregationBuffer() should
          // be called after calling processRow.
          while (hasNextAggBuffer &&
            groupingKeyOrdering.compare(initialAggBufferIterator.getKey, groupingKey) == 0) {
            mergeAggregationBuffers(result.aggregationBuffer, initialAggBufferIterator.getValue)
            hasNextAggBuffer = initialAggBufferIterator.next()
          }

          true
        } else {
          false
        }
      }

It calls update first and then calls merge on the same buffer. I found out the issue while testing this PR today.
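The update-then-merge pattern on a single result buffer can be sketched outside Spark as follows. This is a simplified model under stated assumptions, not the real iterator: `SortedGroupMergeSketch`, `SumBuffer`, and `mergeSorted` are hypothetical names, keys are plain strings, and the aggregate is a sum.

```scala
// Hypothetical model of a two-way merge over two key-sorted streams; not Spark code.
object SortedGroupMergeSketch {
  // The aggregation buffer is a single mutable cell.
  final class SumBuffer(var sum: Long)

  def mergeSorted(
      inputs: Seq[(String, Long)],       // raw rows, sorted by key
      spilled: Seq[(String, SumBuffer)]  // partial buffers, sorted by key
  ): Seq[(String, Long)] = {
    val in = inputs.iterator.buffered
    val agg = spilled.iterator.buffered
    val out = Seq.newBuilder[(String, Long)]
    while (in.hasNext || agg.hasNext) {
      // Pick the smaller head key of the two sorted streams.
      val key =
        if (!agg.hasNext) in.head._1
        else if (!in.hasNext) agg.head._1
        else if (in.head._1.compareTo(agg.head._1) <= 0) in.head._1
        else agg.head._1
      // A fresh, initialized buffer for every group.
      val result = new SumBuffer(0L)
      // First: update the buffer with raw input rows for this key.
      while (in.hasNext && in.head._1 == key) {
        result.sum += in.next()._2
      }
      // Second: merge existing partial buffers for this key into the same buffer.
      while (agg.hasNext && agg.head._1 == key) {
        result.sum += agg.next()._2.sum
      }
      out += ((key, result.sum))
    }
    out.result()
  }
}
```

For a sum, update and merge happen to be the same arithmetic, so sharing one buffer is harmless here; the discussion in this thread is about UDAFs whose update and merge paths expect different buffer states.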

cloud-fan · 2019-03-25T22:42:44Z

But we call update and merge for different copies of the aggregate functions, don't we?

pgandhi999 · 2019-03-25T23:07:10Z

Turns out that is not true in the case of SortBasedAggregator. In the code that I have pasted above, processRow() is called on result.aggregationBuffer which performs update and later, mergeAggregationBuffers is called on the same result.aggregationBuffer which performs merge. This is what I could infer from the code as well as debug logs that I added, correct me if I am wrong.

cloud-fan · 2019-03-26T00:47:45Z

Yea it's the same buffer instance, but not same aggregate function instance, IIUC.

pgandhi999 · 2019-03-27T15:42:02Z

@cloud-fan So after going through the code, I see that we are calling update and merge on different copies of the aggregate functions but using the buffer created for one copy. I am not an expert in the aggregation framework, so could you guide me here by elaborating on how Spark aggregate functions use the aggregation buffer? Thank you once again for your continued guidance and support in this matter.

cloud-fan · 2019-03-27T16:48:32Z

Each aggregate function will create its own buffer. When we feed a buffer to an agg func, we are not asking the agg func to replace its own buffer with the new one; we are asking it to merge the new buffer into its own buffer.

pgandhi999 · 2019-03-29T18:40:36Z

@cloud-fan So I realized that the bug arose because I was creating the aggregate buffer for sortBasedMergeAggFunctions, calling processRow (the update operation) with aggregateFunctions, and then calling merge with sortBasedMergeAggFunctions. I fixed it by using a separate update buffer initialized with aggregateFunctions: the update runs on that buffer, its results are merged into a new aggregate buffer initialized for sortBasedMergeAggFunctions, merge is then called on that buffer, and the final result is returned. It may not be the best solution, so your guidance on this is really appreciated. Thank you.

SparkQA · 2019-03-29T20:47:27Z

Test build #104090 has finished for PR 24149 at commit 088cbc6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-03-30T02:02:45Z

Test build #104099 has finished for PR 24149 at commit 6a5ed71.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-04-30T18:33:21Z

Test build #105032 has finished for PR 24149 at commit db46cf7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

pgandhi999 · 2019-04-30T18:35:21Z

@cloud-fan Have updated the PR. Thank you.

pgandhi999 · 2019-05-06T14:38:15Z

Hello @cloud-fan , WDYT about the updated PR? Thank you.

cloud-fan · 2019-05-06T16:20:47Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDAFSuite.scala

    }
  }

+  test("SPARK-27207: customized Hive UDAF with two aggregation buffers for Sort" +


does this test fail without your patch? I think we can write a test with a custom UDAF which fails without initialization, but Hive UDAF does not fail without initialization.

So it used to fail earlier without my patch, but your latest patch seems to have fixed it. Will come up with another test case. Thank you.

SparkQA · 2019-05-06T21:08:22Z

Test build #105167 has finished for PR 24149 at commit df330fa.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-05-07T00:28:04Z

Test build #105168 has finished for PR 24149 at commit 5bd474c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-05-07T05:46:34Z

sql/core/src/test/scala/org/apache/spark/sql/TypedImperativeAggregateSuite.scala

+   * in aggregation buffer.
+   */
+  private case class TypedMax2(
+    child: Expression,


4 space indentation.

cloud-fan · 2019-05-07T05:49:46Z

sql/core/src/test/scala/org/apache/spark/sql/TypedImperativeAggregateSuite.scala

+   * Calculate the max value with object aggregation buffer. This stores class MaxValue
+   * in aggregation buffer.
+   */
+  private case class TypedMax2(


can we simplify it? I think we just need to do some initialization work in createAggregationBuffer.

case class MyUDAF ... {
  var initialized = false
  override def createAggregationBuffer(): MyBuffer = {
    initialized = true
    null
  }
  override def update(buffer: MyBuffer, input: InternalRow): MyBuffer = {
    assert(initialized)
    null
  }
  ...
}

cloud-fan · 2019-05-07T16:42:51Z

sql/core/src/test/scala/org/apache/spark/sql/TypedImperativeAggregateSuite.scala

+    withSQLConf("spark.sql.objectHashAggregate.sortBased.fallbackThreshold" -> "5") {
+      val df = data.toDF("value", "key").coalesce(2)
+      val query = df.groupBy($"key").agg(typedMax2($"value"), count($"value"), typedMax2($"value"))
+      query.show(10, false)


nit: we should not use show in the test. Let's use checkAnswer.

cloud-fan · 2019-05-07T16:50:34Z

...core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala

+    (newExpressions, initializeAggregateFunctions(newExpressions, 0))
+  }
+
+  // Hacking the aggregation mode to call AggregateFunction.merge to merge two aggregation buffers


I think this comment should be put before

val newExpressions = aggregateExpressions.map {
  case agg @ AggregateExpression(_, Partial, _, _) => agg.copy(mode = PartialMerge)
  case agg @ AggregateExpression(_, Complete, _, _) => agg.copy(mode = Final)
  case other => other
}

cloud-fan · 2019-05-07T17:02:25Z

After some more thought, I have some different ideas now.

I checked the TypedImperativeAggregate implementations: the initialize method is used to initialize the aggregate buffer, not to initialize the TypedImperativeAggregate instance. That said, TypedImperativeAggregate implementations should be stateless.

Coming back to this bug, it can only be exposed if the UDAF needs to initialize itself, which should not be allowed. I think we can just add some doc to TypedImperativeAggregate saying that it must be stateless. Sorry for the back and forth!

pgandhi999 · 2019-05-07T17:54:57Z

@cloud-fan I do see your point, but don't you think that, if the fallback to SortBasedAggregate occurs, the initialize method for the new aggregate function should still be invoked? As I recall, you said earlier:

Each aggregate function will create its own buffer. When we feed a buffer to a agg func, we are not asking the agg func to replace its own buffer with the new one, but we ask it to merge the new buffer to its own buffer.

By the above logic, we still need to ensure that the respective aggregate buffer is initialized for the new set of aggregate functions. Or we should just use the same set of aggregate functions again for SortBasedAggregate. Correct me if I am wrong here.

cloud-fan · 2019-05-07T18:16:29Z

Each aggregate function will create its own buffer, but the aggregate function doesn't hold the buffer; the buffer is managed by Spark. Aggregate functions should be stateless.

Let's say we have an aggregate expression expr. expr creates an aggregate function f1 to do the work before sort fallback happens. f1 creates a buffer and starts accumulating into it. When sort fallback happens, expr creates a new aggregate function f2. We still ask f1 to create the buffer, and then f2 starts working and accumulates into the buffer.

Since f1 and f2 are the same function, this should be fine.
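The f1/f2 hand-off described above can be illustrated with a small, self-contained sketch. It is plain Scala, not Spark's TypedImperativeAggregate: `StatelessAggSketch`, `StatelessMax`, and `StatefulMax` are hypothetical classes invented here. The stateless variant tolerates the hand-off; the stateful variant, which remembers whether its own instance created the buffer, does not.

```scala
import scala.util.Try

// Hypothetical illustration of why aggregate functions should be stateless; not Spark's API.
object StatelessAggSketch {
  final class MaxBuffer(var value: Int)

  // Stateless: any instance can continue work on a buffer created by another instance.
  final class StatelessMax {
    def createAggregationBuffer(): MaxBuffer = new MaxBuffer(Int.MinValue)
    def update(buffer: MaxBuffer, input: Int): MaxBuffer = {
      buffer.value = math.max(buffer.value, input)
      buffer
    }
  }

  // Stateful: update() only works on the instance that created the buffer.
  final class StatefulMax {
    private var initialized = false
    def createAggregationBuffer(): MaxBuffer = {
      initialized = true
      new MaxBuffer(Int.MinValue)
    }
    def update(buffer: MaxBuffer, input: Int): MaxBuffer = {
      require(initialized, "createAggregationBuffer() was not called on this instance")
      buffer.value = math.max(buffer.value, input)
      buffer
    }
  }

  // f1 creates and starts filling the buffer; f2 takes over after the (simulated) fallback.
  def handOffStateless(): Int = {
    val f1 = new StatelessMax
    val f2 = new StatelessMax
    val buffer = f1.createAggregationBuffer()
    f1.update(buffer, 3)
    f2.update(buffer, 8).value
  }

  // Returns true when the same hand-off fails for the stateful variant.
  def handOffStatefulFails(): Boolean = {
    val f1 = new StatefulMax
    val f2 = new StatefulMax
    val buffer = f1.createAggregationBuffer()
    f1.update(buffer, 3)
    Try(f2.update(buffer, 8)).isFailure
  }
}
```

Under this reading, the failing case is the one a stateful UDAF would hit after sort fallback, which is why documenting the stateless requirement is an alternative to re-running initialization.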

SparkQA · 2019-05-07T19:22:57Z

Test build #105221 has finished for PR 24149 at commit 006616e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

pgandhi999 · 2019-05-07T21:10:18Z

I see. Sure I can go ahead and close this PR. Thank you.

cloud-fan · 2019-05-08T05:59:48Z

@pgandhi999 it would be great if you could update the classdoc of AggregateFunction to say that it should be stateless. Thanks!

pgandhi999 · 2019-05-08T14:39:28Z

@cloud-fan I have updated the doc. If it also needs to be updated someplace else, do let me know. Thank you.

SparkQA · 2019-05-08T17:07:07Z

Test build #105257 has finished for PR 24149 at commit 28ea0f9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

pgandhi999 · 2019-05-08T17:54:17Z

test this please.

SparkQA · 2019-05-08T21:09:16Z

Test build #105265 has finished for PR 24149 at commit 28ea0f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-05-09T03:13:37Z

thanks, merging to master!

cloud-fan reviewed Mar 20, 2019

[SPARK-27207] : Adding Unit Test and addressing reviews

ea050f7

Adding unit test and refactoring code

[SPARK-27207] : Fixing Scalastyle tests

0714876

cloud-fan reviewed Mar 21, 2019

Merge branch 'master' of https://github.com/pgandhi999/spark into SPA…

b4eaf31

…RK-27207 [SPARK-27207] : Upmerging with master branch

pgandhi added 2 commits March 29, 2019 13:28

[SPARK-27207] : Fix SortBasedAggregator to run with different aggrega…

4dc1007

…te functions and write unit test Fix SortBasedAggregator to ensure that update and merge are performed with two different sets of aggregate functions, one for update and one for merge respectively.

[SPARK-27207] : Fix new line for TypedImperativeAggregateSuite

088cbc6

[SPARK-27207] : Changing design to use one buffer but initializing fo…

6a5ed71

…r different aggregate functions

Revert "[SPARK-27207] : Changing design to use one buffer but initial…

fb9fea8

…izing for different aggregate functions" This reverts commit 6a5ed71. Reverting to previous commit

pgandhi added 2 commits April 30, 2019 09:41

[SPARK-27207] : Reverting the two buffer logic and simplifying the code

8f5c6b0

Since, apache#24459 fixes the init-update-merge issue, the fix here is reverted.

Merge branch 'master' of https://github.com/pgandhi999/spark into SPA…

db46cf7

…RK-27207 [SPARK-27207] : Upmerging with master branch

pgandhi999 changed the title ~~[SPARK-27207] : Ensure aggregate buffers are initialized again for So…~~ [SPARK-27207][SQL] : Ensure aggregate buffers are initialized again for So… Apr 30, 2019

cloud-fan reviewed May 6, 2019

[SPARK-27207] : Coming up with a unit test for custom UDAF

df330fa

[SPARK-27207] : Fixing Scalastyle Tests

5bd474c

cloud-fan reviewed May 7, 2019

[SPARK-27207] : Simplifying unit test and indentation

006616e

cloud-fan reviewed May 7, 2019

pgandhi added 2 commits May 8, 2019 09:35

[SPARK-27207] : Reverting changes and updating doc

c8959f4

Merge branch 'master' of https://github.com/pgandhi999/spark into SPA…

28ea0f9

…RK-27207 [SPARK-27207] : Upmerging with master branch

cloud-fan closed this in 0969d7a May 9, 2019
