Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Nov 24, 2017

What changes were proposed in this pull request?

SPARK-22543 fixes the 64kb compile error for deeply nested expression for non-wholestage codegen. This PR extends it to support wholestage codegen.

This patch brings some util methods in to extract necessary parameters for an expression if it is split to a function.

The util methods are put in object ExpressionCodegen under codegen. The main entry is getExpressionInputParams which returns all necessary parameters to evaluate the given expression in a split function.

This util methods can be used to split expressions too. This is a TODO item later.

How was this patch tested?

Added test.

@viirya
Copy link
Member Author

viirya commented Nov 24, 2017

This patch needs #19800 to fix another issue in codegen.

@SparkQA
Copy link

SparkQA commented Nov 24, 2017

Test build #84158 has finished for PR 19813 at commit 1cf6a48.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whole stage codegen may fallback to normal code path if code size is too large, so we need to make sure the query is whole stage codegened.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Add assert to make sure this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to double check, this test fails on the master branch, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@cloud-fan
Copy link
Contributor

We need to clearly define what is the current input according to the codegen context. For normal code path, it's always ctx.INPUT_ROW, which means when we split codes to methods, we just need to pass InternalRow ctx.INPUT_ROW to those methods.

However for whole stage codegen path, it's way more complex:

  1. some of ctx.currentVars are just variables, their codes have already been generated before. But some are not. For those whose codes are not generated, they are not valid inputs.
  2. ctx.currentVars is not null but has null slots, and ctx.INPUT_ROW is not null. Then both ctx.currentVars and ctx.INPUT_ROW are valid inputs.

@viirya viirya force-pushed the reduce-expr-code-for-wholestage branch from 1cf6a48 to 65d07d5 Compare November 25, 2017 08:09
@viirya
Copy link
Member Author

viirya commented Nov 25, 2017

However for whole stage codegen path, it's way more complex:

  1. some of ctx.currentVars are just variables, their codes have already been generated before. But some are not. For those whose codes are not generated, they are not valid inputs.
  2. ctx.currentVars is not null but has null slots, and ctx.INPUT_ROW is not null. Then both ctx.currentVars and ctx.INPUT_ROW are valid inputs.

Yes, this is correct.

So, for 1, only the variables not evaluate yet, we don't include them as parameters.
For 2, null slots in ctx.currentVars won't be included as parameters too. ctx.INPUT_ROW will be included only if it is not null.

@SparkQA
Copy link

SparkQA commented Nov 25, 2017

Test build #84182 has finished for PR 19813 at commit 65d07d5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-22600][SQL] Fix 64kb limit for deeply nested expressions under wholestage codegen [WIP][SPARK-22600][SQL] Fix 64kb limit for deeply nested expressions under wholestage codegen Nov 25, 2017
@cloud-fan
Copy link
Contributor

If we have a clear rule, I think it makes more sense to do this in CodegenContext, i.e. having a def splitExpressions(expressions: Seq[String]): String, which automatically extract the current inputs and put them into the parameter list.

@SparkQA
Copy link

SparkQA commented Nov 27, 2017

Test build #84207 has finished for PR 19813 at commit 9f848be.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExprCode(
  • case class ExprInputVar(val expr: Expression, val exprCode: ExprCode)

@viirya
Copy link
Member Author

viirya commented Nov 27, 2017

retest this please.

@viirya
Copy link
Member Author

viirya commented Nov 27, 2017

@cloud-fan @kiszk ctx.currentVars and ctx.INPUT_ROW are not the only sources for expression evaluation under wholestage codegen. There are also eliminated subexpressions and the input rows and variables referred by deferred expressions.

In an API like def splitExpressions(expressions: Seq[String]): String, seems to me those sources are not easily to access.

@SparkQA
Copy link

SparkQA commented Nov 27, 2017

Test build #84208 has finished for PR 19813 at commit 9f848be.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ExprCode(
  • case class ExprInputVar(val expr: Expression, val exprCode: ExprCode)

@SparkQA
Copy link

SparkQA commented Nov 27, 2017

Test build #84211 has finished for PR 19813 at commit 57b1add.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

splitExpressions is the most common way we use in the codegen framework to deal with large code. If we can't make it work with whole stage codegen, we are not making many values.

@viirya
Copy link
Member Author

viirya commented Nov 28, 2017

A more advanced version of splitExpressions may work. We can provide necessary function parameters to it.

@cloud-fan
Copy link
Contributor

BTW splitExpressions doesn't work with subexpressions since the beinning, it's another topic to integrate them.

@SparkQA
Copy link

SparkQA commented Nov 28, 2017

Test build #84239 has finished for PR 19813 at commit d051f9e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 28, 2017

Test build #84241 has finished for PR 19813 at commit 6368702.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 28, 2017

Test build #84260 has finished for PR 19813 at commit 8c7f749.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Nov 28, 2017

retest this please.

@cloud-fan
Copy link
Contributor

replacing them via string is too dangerous, logically we wanna replace some nodes in AST, which needs an AST based codegen framework, or we need to refactor the current framework a little bit to do it safely.

@viirya
Copy link
Member Author

viirya commented Dec 18, 2017

AST based codegen framework sounds a too far step from current status. I think we either follow the new contract or refactor the current framework a little bit.

@kiszk
Copy link
Member

kiszk commented Dec 20, 2017

I agree that string replacement is too dangerous (e.g. a + 1 = a + 10 with a + 1).
How about a contract with adding assertions?

@kiszk
Copy link
Member

kiszk commented Dec 20, 2017

I found a few problems that this PR can ideally solve. If this is not available soon, I will use workaround in upcoming PRs.

@viirya
Copy link
Member Author

viirya commented Dec 20, 2017

@kiszk Yea, I have now checked through the codegen. I didn't find the places that can cause that issue (a + 1 as the codegen output value) yet. I may submit another PR to let us easily identify such codegen output so we can easily do adding assertions for it.

As we don't use such statement as codegen output, I think the easiest approach is adding assertions. @cloud-fan WDYT?

@kiszk
Copy link
Member

kiszk commented Dec 22, 2017

I think that this PR is necessary to fix SPARK-22868 and SPARK-22869

@cloud-fan
Copy link
Contributor

To me whole stage codegen compilation fix is less important as we can fallback to non whole stage codegen, so we don't need to rush.

As we don't use such statement as codegen output, I think the easiest approach is adding assertions

What about future? Will we need to output statement for some reason? like reducing the usage of local variables?

@viirya
Copy link
Member Author

viirya commented Dec 22, 2017

What about future? Will we need to output statement for some reason? like reducing the usage of local variables?

I think that we won't have strong motivation to use output statement generally. The reason is, although it helps to reduce the usage of local variables (is it any beneficial like reducing global variables?), it also means the chained evaluation of expressions needs to be run at every occurrence.

@cloud-fan
Copy link
Contributor

it also means the chained evaluation of expressions needs to be run at every occurrence.

We can introduce some mechanism to save statement to local variables if it's going to be re-computed. A possible benefit is reducing code size. Anyway I think this is a valid possibility to improve the codegen framework and we should not totally give it up.

@viirya
Copy link
Member Author

viirya commented Dec 23, 2017

Not all of expressions can use statement as output. Seems to me there are only very few scenarios (e.g., a chain of expressions that all can use statement output by coincidence) we can really save local variables. If we also consider possible optimization of JIT, the benefit might be only marginal. That is why I am not sure if it is necessary to consider statement output generally.

@kiszk
Copy link
Member

kiszk commented Dec 25, 2017

IMHO, in general, the output ev.value would be declared as local variable by parent as

s"""${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};"""

Such as cases cannot have an expression in ev.value.
As @viirya pointed out, I imagine there are a few scenarios. Would it be possible to show an example and place in source code where an expression is used as output in order to correctly understand the issue?

@cloud-fan
Copy link
Contributor

cloud-fan commented Dec 26, 2017

I did a search but can't find one in the current codebase, but I do think this is a valid idea, e.g. a simple example would be a + b + .... z, if expressions can output statement, then we just generate code like

int result = a + b ... + z;
boolean isNull = false;

instead of

int result 1 = a + b;
boolean isNull1 = false;
int result2 = result1 + c;
boolean isNull2 = false;
...

This can apply to both whole stage codegen and normal codegen, and reduce the code size dramatically, and make whole stage codegen less likely to hit 64kb compile error.

Another thing I'm working on is: do not create global variables if ctx.spiltExpression doesn't spit. This optimization should be much more useful if expressions can output statement.

@viirya
Copy link
Member Author

viirya commented Dec 26, 2017

This is only valid when by coincidence the all expressions involved can use statement as output. As I looked at the codebase, I think only few expressions can output statement. This may not apply generally to reduce code size.

@cloud-fan
Copy link
Contributor

I think all arithmetic, predicate and bitwise expressions can benefit from it, and they are very common expressions in SQL. More importantly, allowing expressions to output statement may have other benefits that we haven't discovered yet, I don't think we should sacrifice it just for supporting splitting code in whole stage codegen, which is only for performance not stability.

For now I think we can fix the 64kb compile error caused by the whole stage codegen framework not expressions. I remember @maropu has a PR to fix that and I prefer to take priority to review that PR.

@viirya
Copy link
Member Author

viirya commented Dec 26, 2017

Arithmetic, predicate and bitwise expressions are very common expressions in SQL, but it doesn't mean we commonly see a long chain of arithmetic/predicate/bitwise expressions.

Which one is more common? A chain of arithmetic expressions long enough to cause some issues we need to get rid of intermediate variables? Or a deeply nested expression? I don't see strong evidence that supports statement output from the discussion. The only one possibility for now is to reducing code size. This is also for performance, not stability. On the contrary, isn't using local variable more stable? Don't forget we need to introduce other mechanism to fix the problem of statement output like re-evaluation I pointed out above.

I'm not saying it is not good to support statement output. But for now, the reason to support it is very vague.

@cloud-fan
Copy link
Contributor

Do we have to sacrifice one of them? If we do then I agree deeply nested expression is more common than a long chain of arithmetic expressions and we should get this patch. I think we should explore more about how to split methods in whole stage codegen before making this decision, at least now I'm not convinced that we have to forbid expressions to output statement.

@viirya
Copy link
Member Author

viirya commented Dec 27, 2017

I agree that the best is we can have both of them.

I have a proposal to replace statement output in split methods. Maybe you can check if it sounds good.

By #20043, we have a StatementValue wrapping statement output. Instead of immediately embedding the statement in codes, we use a special replacement like %STATEMENT_1% for it. Normally we replace this with actual statement. If we need split methods, we replace this with a generated variable name. As it is special replacement, I think it should be safer.

This is the idea to more safely replace statement with generate variable name under the string based framework.

E.g., if we have a statement output a + 1, we represent it as a replacement, the code looks like:

...
int d = %STATEMENT_1% + b;
if (%STATEMENT_1% > 10 && c) {
  ...
}

If we split a method, the method body like:

void splitMethod1(int %STATEMENT_1%, int b, boolean c) {
 ...
 int d = %STATEMENT_1% + b;
 if (%STATEMENT_1% > 10 && c) {
    ...
 }
}
...
splitMethod1(%STATEMENT_1%, c);

After replacing statement with variable:

void splitMethod1(int varAPlusOne, int b, boolean c) {
 ...
 int d = varAPlusOne + b;
 if (varAPlusOne > 10 && c) {
    ...
 }
}

...
int varAPlusOne = a + 1;
splitMethod1(varAPlusOne, b, c);

@cloud-fan
Copy link
Contributor

This is a pretty cool idea that can work with the current string based codegen framework, LGTM!

@kiszk
Copy link
Member

kiszk commented Dec 27, 2017

Cool. LGTM, too.

@maropu
Copy link
Member

maropu commented Dec 28, 2017

LGTM, great work!

@viirya
Copy link
Member Author

viirya commented Apr 10, 2018

@cloud-fan Since #20043 was merged now, I will go to polish this and implement the above idea. Will submit a PR for this when it is ready.

asfgit pushed a commit that referenced this pull request May 22, 2018
## What changes were proposed in this pull request?

This patch tries to implement this [proposal](#19813 (comment)) to add an API for handling expression code generation. It should allow us to manipulate how to generate codes for expressions.

In details, this adds an new abstraction `CodeBlock` to `JavaCode`. `CodeBlock` holds the code snippet and inputs for generating actual java code.

For example, in following java code:

```java
  int ${variable} = 1;
  boolean ${isNull} = ${CodeGenerator.defaultValue(BooleanType)};
```

`variable`, `isNull` are two `VariableValue` and `CodeGenerator.defaultValue(BooleanType)` is a string. They are all inputs to this code block and held by `CodeBlock` representing this code.

For codegen, we provide a specified string interpolator `code`, so you can define a code like this:
```scala
  val codeBlock =
    code"""
         |int ${variable} = 1;
         |boolean ${isNull} = ${CodeGenerator.defaultValue(BooleanType)};
        """.stripMargin
  // Generates actual java code.
  codeBlock.toString
```

Because those inputs are held separately in `CodeBlock` before generating code, we can safely manipulate them, e.g., replacing statements to aliased variables, etc..

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #21193 from viirya/SPARK-24121.
@viirya viirya deleted the reduce-expr-code-for-wholestage branch December 27, 2023 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants