[SPARK-22600][SQL] Fix 64kb limit for deeply nested expressions under wholestage codegen #19813

viirya · 2017-11-24T09:59:43Z

What changes were proposed in this pull request?

SPARK-22543 fixes the 64kb compile error for deeply nested expressions under non-wholestage codegen. This PR extends that fix to wholestage codegen.

This patch adds some util methods that extract the parameters an expression needs when it is split into a function.

The util methods are put in object ExpressionCodegen under codegen. The main entry is getExpressionInputParams, which returns all the parameters needed to evaluate the given expression in a split function.

These util methods can be used to split expressions too. That is left as a TODO item.

How was this patch tested?

Added test.

viirya · 2017-11-24T10:00:44Z

This patch needs #19800 to fix another issue in codegen.

SparkQA · 2017-11-24T10:20:41Z

Test build #84158 has finished for PR 19813 at commit 1cf6a48.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-11-24T10:54:58Z

sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala

Whole stage codegen may fall back to the normal code path if the code size is too large, so we need to make sure the query is actually whole-stage codegened.

OK. Added an assert to make sure of this.

just to double check, this test fails on the master branch, right?

cloud-fan · 2017-11-24T11:04:36Z

We need to clearly define what the current input is according to the codegen context. For the normal code path, it's always ctx.INPUT_ROW, which means that when we split code into methods, we just need to pass InternalRow ctx.INPUT_ROW to those methods.

However for the whole stage codegen path, it's way more complex:

1. Some of ctx.currentVars are just variables whose code has already been generated. But some are not. Those whose code has not been generated are not valid inputs.
2. ctx.currentVars is not null but has null slots, and ctx.INPUT_ROW is not null. Then both ctx.currentVars and ctx.INPUT_ROW are valid inputs.

viirya · 2017-11-25T08:16:35Z

However for whole stage codegen path, it's way more complex:

some of ctx.currentVars are just variables, their codes have already been generated before. But some are not. For those whose codes are not generated, they are not valid inputs.

ctx.currentVars is not null but has null slots, and ctx.INPUT_ROW is not null. Then both ctx.currentVars and ctx.INPUT_ROW are valid inputs.

Yes, this is correct.

So, for 1, we don't include variables that are not evaluated yet as parameters.
For 2, null slots in ctx.currentVars won't be included as parameters either. ctx.INPUT_ROW will be included only if it is not null.
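
The two rules above can be sketched as a small filter. This is only an illustration under stated assumptions: `InputVar` and `SplitParams` are hypothetical stand-ins, not Spark classes, and the sketch ignores the isNull flags, eliminated subexpressions and deferred expressions that a real implementation would also have to handle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one slot of ctx.currentVars; not Spark's ExprCode.
final class InputVar {
    final String javaType;   // e.g. "int"
    final String name;       // generated variable name, e.g. "value_0"
    final boolean evaluated; // true once the code defining the variable has been emitted

    InputVar(String javaType, String name, boolean evaluated) {
        this.javaType = javaType;
        this.name = name;
        this.evaluated = evaluated;
    }
}

final class SplitParams {
    // Collects "type name" parameter declarations for a split function.
    // currentVars may be null or contain null slots; inputRow may be null.
    static List<String> collect(List<InputVar> currentVars, String inputRow) {
        List<String> params = new ArrayList<>();
        if (inputRow != null) {
            // The input row is a valid input only when it is not null.
            params.add("InternalRow " + inputRow);
        }
        if (currentVars != null) {
            for (InputVar v : currentVars) {
                // Rule 2: skip null slots. Rule 1: skip variables not evaluated yet.
                if (v != null && v.evaluated) {
                    params.add(v.javaType + " " + v.name);
                }
            }
        }
        return params;
    }

    public static void main(String[] args) {
        List<InputVar> vars = new ArrayList<>();
        vars.add(new InputVar("int", "value_0", true));
        vars.add(null);
        vars.add(new InputVar("long", "value_2", false));
        System.out.println(collect(vars, "inputRow"));
    }
}
```

In this example only `inputRow` and `value_0` end up in the parameter list; the null slot and the not-yet-evaluated `value_2` are skipped.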

SparkQA · 2017-11-25T08:50:34Z

Test build #84182 has finished for PR 19813 at commit 65d07d5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-11-27T03:04:31Z

If we have a clear rule, I think it makes more sense to do this in CodegenContext, i.e. having a def splitExpressions(expressions: Seq[String]): String, which automatically extracts the current inputs and puts them into the parameter list.

SparkQA · 2017-11-27T08:05:01Z

Test build #84207 has finished for PR 19813 at commit 9f848be.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ExprCode(
case class ExprInputVar(val expr: Expression, val exprCode: ExprCode)

viirya · 2017-11-27T08:06:09Z

retest this please.

viirya · 2017-11-27T08:52:55Z

@cloud-fan @kiszk ctx.currentVars and ctx.INPUT_ROW are not the only sources for expression evaluation under wholestage codegen. There are also eliminated subexpressions, and the input rows and variables referred to by deferred expressions.

With an API like def splitExpressions(expressions: Seq[String]): String, it seems to me those sources are not easy to access.

SparkQA · 2017-11-27T10:13:18Z

Test build #84208 has finished for PR 19813 at commit 9f848be.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ExprCode(
case class ExprInputVar(val expr: Expression, val exprCode: ExprCode)

SparkQA · 2017-11-27T11:00:09Z

Test build #84211 has finished for PR 19813 at commit 57b1add.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-11-27T12:09:31Z

splitExpressions is the most common way to deal with large code in the codegen framework. If we can't make it work with whole stage codegen, we are not adding much value.

viirya · 2017-11-28T00:23:40Z

A more advanced version of splitExpressions may work. We can provide necessary function parameters to it.

cloud-fan · 2017-11-28T01:54:53Z

BTW splitExpressions hasn't worked with subexpressions since the beginning; integrating them is another topic.

SparkQA · 2017-11-28T04:57:33Z

Test build #84239 has finished for PR 19813 at commit d051f9e.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-28T05:33:04Z

Test build #84241 has finished for PR 19813 at commit 6368702.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-11-28T17:43:51Z

Test build #84260 has finished for PR 19813 at commit 8c7f749.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-11-28T22:32:17Z

retest this please.

cloud-fan · 2017-12-18T08:21:43Z

replacing them via string is too dangerous, logically we wanna replace some nodes in AST, which needs an AST based codegen framework, or we need to refactor the current framework a little bit to do it safely.

viirya · 2017-12-18T08:25:22Z

An AST based codegen framework sounds like too big a step from the current status. I think we either follow the new contract or refactor the current framework a little bit.

kiszk · 2017-12-20T18:42:26Z

I agree that string replacement is too dangerous (e.g. replacing a + 1 would also match inside a + 10).
How about a contract with adding assertions?
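
The hazard can be reproduced in a few lines. This is a minimal sketch, not code from the PR; `NaiveReplace` and the generated snippet are made up for illustration.

```java
final class NaiveReplace {
    // Replaces every textual occurrence of `statement` with `variable`,
    // with no knowledge of token boundaries in the generated code.
    static String replaceStatement(String code, String statement, String variable) {
        return code.replace(statement, variable);
    }

    public static void main(String[] args) {
        String code = "int x = a + 1; int y = a + 10;";
        // The second statement is corrupted because "a + 10" starts with "a + 1".
        System.out.println(replaceStatement(code, "a + 1", "v"));
        // prints: int x = v; int y = v0;
    }
}
```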

kiszk · 2017-12-20T18:44:28Z

I found a few problems that this PR can ideally solve. If this is not available soon, I will use workaround in upcoming PRs.

viirya · 2017-12-20T23:23:33Z

@kiszk Yea, I have now checked through the codegen. I haven't found any place that can cause that issue (a + 1 as the codegen output value) yet. I may submit another PR that lets us easily identify such codegen output, so we can easily add assertions for it.

As we don't use such statement as codegen output, I think the easiest approach is adding assertions. @cloud-fan WDYT?

kiszk · 2017-12-22T05:05:37Z

I think that this PR is necessary to fix SPARK-22868 and SPARK-22869

cloud-fan · 2017-12-22T07:49:33Z

To me the whole stage codegen compilation fix is less important, as we can fall back to non whole stage codegen, so we don't need to rush.

As we don't use such statement as codegen output, I think the easiest approach is adding assertions

What about future? Will we need to output statement for some reason? like reducing the usage of local variables?

viirya · 2017-12-22T08:18:35Z

What about future? Will we need to output statement for some reason? like reducing the usage of local variables?

I think that we won't have a strong motivation to use statement output generally. The reason is, although it helps to reduce the usage of local variables (is that as beneficial as reducing global variables?), it also means the chained evaluation of expressions needs to be run at every occurrence.

cloud-fan · 2017-12-22T14:27:38Z

it also means the chained evaluation of expressions needs to be run at every occurrence.

We can introduce some mechanism to save statement to local variables if it's going to be re-computed. A possible benefit is reducing code size. Anyway I think this is a valid possibility to improve the codegen framework and we should not totally give it up.

viirya · 2017-12-23T10:04:43Z

Not all expressions can use a statement as output. It seems to me there are only very few scenarios (e.g., a chain of expressions that all happen to support statement output) where we can really save local variables. If we also consider possible JIT optimization, the benefit might be only marginal. That is why I am not sure it is necessary to consider statement output generally.

kiszk · 2017-12-25T03:10:45Z

IMHO, in general, the output ev.value would be declared as local variable by parent as

s"""${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};"""

Such cases cannot have an expression in ev.value.
As @viirya pointed out, I imagine there are only a few scenarios. Would it be possible to show an example, and the place in the source code where an expression is used as output, so that we can correctly understand the issue?

cloud-fan · 2017-12-26T03:35:17Z

I did a search but can't find one in the current codebase, but I do think this is a valid idea. A simple example would be a + b + ... + z: if expressions can output statements, then we just generate code like

int result = a + b ... + z;
boolean isNull = false;

instead of

int result1 = a + b;
boolean isNull1 = false;
int result2 = result1 + c;
boolean isNull2 = false;
...

This can apply to both whole stage codegen and normal codegen, and reduce the code size dramatically, and make whole stage codegen less likely to hit 64kb compile error.
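
To make the size difference concrete, here is a small sketch that emits both shapes for a chain of int additions. `ChainCodegen` is a hypothetical helper, not part of Spark, and source length is only a rough proxy: the 64kb limit applies to the bytecode of a single method, not to the Java source text.

```java
import java.util.Arrays;
import java.util.List;

final class ChainCodegen {
    // Statement output: one declaration for the whole chain.
    static String asSingleStatement(List<String> operands) {
        return "int result = " + String.join(" + ", operands) + ";\n"
            + "boolean isNull = false;\n";
    }

    // Variable output: one value variable and one isNull flag per addition.
    static String asIntermediateVariables(List<String> operands) {
        StringBuilder sb = new StringBuilder();
        String prev = operands.get(0);
        for (int i = 1; i < operands.size(); i++) {
            String cur = "result" + i;
            sb.append("int ").append(cur).append(" = ")
              .append(prev).append(" + ").append(operands.get(i)).append(";\n");
            sb.append("boolean isNull").append(i).append(" = false;\n");
            prev = cur;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> operands = Arrays.asList("a", "b", "c", "d");
        System.out.print(asSingleStatement(operands));
        System.out.print(asIntermediateVariables(operands));
    }
}
```

The single-statement form grows by a few characters per operand, while the intermediate-variable form grows by two full declarations per operand.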

Another thing I'm working on is: do not create global variables if ctx.splitExpressions doesn't split. This optimization should be much more useful if expressions can output statements.

viirya · 2017-12-26T04:05:09Z

This is only valid when all the expressions involved happen to support statement output. From what I see in the codebase, I think only a few expressions can output a statement. This may not apply generally to reduce code size.

cloud-fan · 2017-12-26T04:56:52Z

I think all arithmetic, predicate and bitwise expressions can benefit from it, and they are very common expressions in SQL. More importantly, allowing expressions to output statement may have other benefits that we haven't discovered yet, I don't think we should sacrifice it just for supporting splitting code in whole stage codegen, which is only for performance not stability.

For now I think we can fix the 64kb compile error caused by the whole stage codegen framework rather than by expressions. I remember @maropu has a PR to fix that, and I prefer to prioritize reviewing that PR.

viirya · 2017-12-26T07:43:10Z

Arithmetic, predicate and bitwise expressions are very common expressions in SQL, but it doesn't mean we commonly see a long chain of arithmetic/predicate/bitwise expressions.

Which one is more common? A chain of arithmetic expressions long enough that we need to get rid of intermediate variables? Or a deeply nested expression? I don't see strong evidence in this discussion that supports statement output. The only possible benefit for now is reducing code size. This is also for performance, not stability. On the contrary, isn't using local variables more stable? Don't forget we would need to introduce another mechanism to fix the problems of statement output, like the re-evaluation I pointed out above.

I'm not saying it is not good to support statement output. But for now, the reason to support it is very vague.

cloud-fan · 2017-12-27T01:46:01Z

Do we have to sacrifice one of them? If we do then I agree deeply nested expression is more common than a long chain of arithmetic expressions and we should get this patch. I think we should explore more about how to split methods in whole stage codegen before making this decision, at least now I'm not convinced that we have to forbid expressions to output statement.

viirya · 2017-12-27T03:28:29Z

I agree that the best outcome is to have both of them.

I have a proposal to replace statement output in split methods. Maybe you can check if it sounds good.

With #20043, we have a StatementValue wrapping statement output. Instead of immediately embedding the statement in the code, we use a special placeholder like %STATEMENT_1% for it. Normally we replace this with the actual statement. If we need to split methods, we replace it with a generated variable name. As it is a special placeholder, I think it should be safer.

This is the idea for replacing statements with generated variable names more safely under the string based framework.

E.g., if we have a statement output a + 1, we represent it with a placeholder, and the code looks like:

...
int d = %STATEMENT_1% + b;
if (%STATEMENT_1% > 10 && c) {
  ...
}

If we split a method, the method body looks like:

void splitMethod1(int %STATEMENT_1%, int b, boolean c) {
 ...
 int d = %STATEMENT_1% + b;
 if (%STATEMENT_1% > 10 && c) {
    ...
 }
}
...
splitMethod1(%STATEMENT_1%, b, c);

After replacing statement with variable:

void splitMethod1(int varAPlusOne, int b, boolean c) {
 ...
 int d = varAPlusOne + b;
 if (varAPlusOne > 10 && c) {
    ...
 }
}

...
int varAPlusOne = a + 1;
splitMethod1(varAPlusOne, b, c);
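
The two rendering paths of the proposal can be sketched as follows. `StatementPlaceholder` is a hypothetical helper written for this illustration, not the API that was eventually implemented, and wrapping the inlined statement in parentheses is an added assumption to keep operator precedence intact.

```java
final class StatementPlaceholder {
    static final String PLACEHOLDER = "%STATEMENT_1%";

    // Normal path: the placeholder becomes the statement itself.
    // The parentheses are an assumption of this sketch, not part of the proposal.
    static String inline(String template, String statement) {
        return template.replace(PLACEHOLDER, "(" + statement + ")");
    }

    // Split path: the placeholder becomes a generated variable name.
    // Only the reserved token is matched, so ordinary code such as "a + 10" is untouched.
    static String withVariable(String template, String varName) {
        return template.replace(PLACEHOLDER, varName);
    }

    public static void main(String[] args) {
        String template =
            "void splitMethod1(int %STATEMENT_1%, int b, boolean c) {\n"
          + "  int d = %STATEMENT_1% + b;\n"
          + "  int e = a + 10;\n"
          + "}\n";
        // The caller evaluates the statement once and passes the variable in.
        System.out.print(withVariable(template, "varAPlusOne"));
        System.out.println("int varAPlusOne = a + 1;");
        System.out.println("splitMethod1(varAPlusOne, b, c);");
    }
}
```

Because the placeholder is a token that cannot occur in ordinary generated Java, the replacement does not suffer from the prefix-matching problem of replacing the statement text directly.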

cloud-fan · 2017-12-27T04:01:01Z

This is a pretty cool idea that can work with the current string based codegen framework, LGTM!

kiszk · 2017-12-27T08:28:53Z

Cool. LGTM, too.

maropu · 2017-12-28T02:46:13Z

LGTM, great work!

viirya · 2018-04-10T04:08:19Z

@cloud-fan Since #20043 is merged now, I will polish this and implement the above idea. I will submit a PR for it when it is ready.

## What changes were proposed in this pull request?

This patch tries to implement this [proposal](#19813 (comment)) to add an API for handling expression code generation. It should allow us to manipulate how to generate codes for expressions.

In details, this adds an new abstraction `CodeBlock` to `JavaCode`. `CodeBlock` holds the code snippet and inputs for generating actual java code.

For example, in following java code:

```java
int ${variable} = 1;
boolean ${isNull} = ${CodeGenerator.defaultValue(BooleanType)};
```

`variable`, `isNull` are two `VariableValue` and `CodeGenerator.defaultValue(BooleanType)` is a string. They are all inputs to this code block and held by `CodeBlock` representing this code.

For codegen, we provide a specified string interpolator `code`, so you can define a code like this:

```scala
val codeBlock =
  code"""
     |int ${variable} = 1;
     |boolean ${isNull} = ${CodeGenerator.defaultValue(BooleanType)};
   """.stripMargin
// Generates actual java code.
codeBlock.toString
```

Because those inputs are held separately in `CodeBlock` before generating code, we can safely manipulate them, e.g., replacing statements to aliased variables, etc..

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #21193 from viirya/SPARK-24121.
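
Why holding inputs separately makes replacement safe can be shown with a much smaller analog. The `Block` class below is a hypothetical sketch written for this thread, not Spark's `CodeBlock`: it keeps literal text and input values as separate parts and replaces an input by object identity instead of by text matching.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified analog of a code block; not Spark's CodeBlock.
final class Block {
    // An input held by the block, e.g. a variable name or a statement.
    static final class Value {
        final String code;
        Value(String code) { this.code = code; }
    }

    // Parts are either String literals or Value inputs, in source order.
    private final List<Object> parts = new ArrayList<>();

    Block lit(String text) { parts.add(text); return this; }
    Block input(Value value) { parts.add(value); return this; }

    // Returns a new block where `target` (matched by identity) is swapped for `replacement`.
    Block replace(Value target, Value replacement) {
        Block out = new Block();
        for (Object p : parts) {
            out.parts.add(p == target ? replacement : p);
        }
        return out;
    }

    // Generates the actual Java code by concatenating all parts.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Object p : parts) {
            sb.append(p instanceof Value ? ((Value) p).code : (String) p);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Value stmt = new Value("a + 1");
        Block block = new Block().lit("int x = ").input(stmt).lit("; int y = a + 10;");
        System.out.println(block);
        System.out.println(block.replace(stmt, new Value("v")));
    }
}
```

Here the literal `a + 10` survives the replacement, because only the part that is the `stmt` object itself is swapped.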

cloud-fan reviewed Nov 24, 2017

mgaido91 mentioned this pull request Nov 24, 2017

[SPARK-22520][SQL] Support code generation for large CaseWhen #19752

Closed

viirya added 2 commits November 25, 2017 07:27

Support wholestage codegen for reducing expression codes to prevent 6…

34abc22

…4k limit.

Merge remote-tracking branch 'upstream/master' into reduce-expr-code-…

e0d111e

…for-wholestage

viirya force-pushed the reduce-expr-code-for-wholestage branch from 1cf6a48 to 65d07d5 Compare November 25, 2017 08:09

Assert the added test is under wholestage codegen.

65d07d5

viirya changed the title ~~[SPARK-22600][SQL] Fix 64kb limit for deeply nested expressions under wholestage codegen~~ [WIP][SPARK-22600][SQL] Fix 64kb limit for deeply nested expressions under wholestage codegen Nov 25, 2017

Put input rows and evaluated columns referred by deferred expressions…

9f848be

… into parameter list.

Revert unnecessary changes.

57b1add

viirya added 2 commits November 28, 2017 02:35

Fix subexpression isNull for non nullable case. Fix columnar batch sc…

d051f9e

…an's rowIdx.

Let rowidx as global variable instead of early evaluation of column o…

6368702

…utput.

Fix the problematic case.

8c7f749

mgaido91 mentioned this pull request Apr 11, 2018

[SPARK-23951][SQL] Use actual java class instead of string representation. #21026

Closed

This was referenced Apr 24, 2018

[SPARK-22600][SQL][WIP] Fix 64kb limit for deeply nested expressions under wholestage codegen #21140

Closed

[SPARK-24121][SQL] Add API for handling expression code generation #21193

Closed

viirya mentioned this pull request May 23, 2018

[SPARK-24361][SQL] Polish code block manipulation API #21405

Closed

mgaido91 mentioned this pull request Aug 15, 2018

[SPARK-24505][SQL] Convert strings in codegen to blocks: Cast and BoundAttribute #21537

Closed

viirya deleted the reduce-expr-code-for-wholestage branch December 27, 2023 18:35
