-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-22600][SQL] Fix 64kb limit for deeply nested expressions under wholestage codegen #19813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This patch needs #19800 to fix another issue in codegen. |
Test build #84158 has finished for PR 19813 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whole stage codegen may fallback to normal code path if code size is too large, so we need to make sure the query is whole stage codegened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Add assert to make sure this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to double check, this test fails on the master branch, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
We need to clearly define what is the current input according to the codegen context. For normal code path, it's always However for whole stage codegen path, it's way more complex:
|
1cf6a48
to
65d07d5
Compare
Yes, this is correct. So, for 1, only the variables not evaluate yet, we don't include them as parameters. |
Test build #84182 has finished for PR 19813 at commit
|
If we have a clear rule, I think it makes more sense to do this in |
… into parameter list.
Test build #84207 has finished for PR 19813 at commit
|
retest this please. |
@cloud-fan @kiszk In an API like |
Test build #84208 has finished for PR 19813 at commit
|
Test build #84211 has finished for PR 19813 at commit
|
|
A more advanced version of |
BTW |
Test build #84239 has finished for PR 19813 at commit
|
Test build #84241 has finished for PR 19813 at commit
|
Test build #84260 has finished for PR 19813 at commit
|
retest this please. |
replacing them via string is too dangerous, logically we wanna replace some nodes in AST, which needs an AST based codegen framework, or we need to refactor the current framework a little bit to do it safely. |
AST based codegen framework sounds a too far step from current status. I think we either follow the new contract or refactor the current framework a little bit. |
I agree that string replacement is too dangerous (e.g. |
I found a few problems that this PR can ideally solve. If this is not available soon, I will use workaround in upcoming PRs. |
@kiszk Yea, I have now checked through the codegen. I didn't find the places that can cause that issue ( As we don't use such statement as codegen output, I think the easiest approach is adding assertions. @cloud-fan WDYT? |
I think that this PR is necessary to fix SPARK-22868 and SPARK-22869 |
To me whole stage codegen compilation fix is less important as we can fallback to non whole stage codegen, so we don't need to rush.
What about future? Will we need to output statement for some reason? like reducing the usage of local variables? |
I think that we won't have strong motivation to use output statement generally. The reason is, although it helps to reduce the usage of local variables (is it any beneficial like reducing global variables?), it also means the chained evaluation of expressions needs to be run at every occurrence. |
We can introduce some mechanism to save statement to local variables if it's going to be re-computed. A possible benefit is reducing code size. Anyway I think this is a valid possibility to improve the codegen framework and we should not totally give it up. |
Not all of expressions can use statement as output. Seems to me there are only very few scenarios (e.g., a chain of expressions that all can use statement output by coincidence) we can really save local variables. If we also consider possible optimization of JIT, the benefit might be only marginal. That is why I am not sure if it is necessary to consider statement output generally. |
IMHO, in general, the output
Such as cases cannot have an expression in |
I did a search but can't find one in the current codebase, but I do think this is a valid idea, e.g. a simple example would be
instead of
This can apply to both whole stage codegen and normal codegen, and reduce the code size dramatically, and make whole stage codegen less likely to hit 64kb compile error. Another thing I'm working on is: do not create global variables if |
This is only valid when by coincidence the all expressions involved can use statement as output. As I looked at the codebase, I think only few expressions can output statement. This may not apply generally to reduce code size. |
I think all arithmetic, predicate and bitwise expressions can benefit from it, and they are very common expressions in SQL. More importantly, allowing expressions to output statement may have other benefits that we haven't discovered yet, I don't think we should sacrifice it just for supporting splitting code in whole stage codegen, which is only for performance not stability. For now I think we can fix the 64kb compile error caused by the whole stage codegen framework not expressions. I remember @maropu has a PR to fix that and I prefer to take priority to review that PR. |
Arithmetic, predicate and bitwise expressions are very common expressions in SQL, but it doesn't mean we commonly see a long chain of arithmetic/predicate/bitwise expressions. Which one is more common? A chain of arithmetic expressions long enough to cause some issues we need to get rid of intermediate variables? Or a deeply nested expression? I don't see strong evidence that supports statement output from the discussion. The only one possibility for now is to reducing code size. This is also for performance, not stability. On the contrary, isn't using local variable more stable? Don't forget we need to introduce other mechanism to fix the problem of statement output like re-evaluation I pointed out above. I'm not saying it is not good to support statement output. But for now, the reason to support it is very vague. |
Do we have to sacrifice one of them? If we do then I agree deeply nested expression is more common than a long chain of arithmetic expressions and we should get this patch. I think we should explore more about how to split methods in whole stage codegen before making this decision, at least now I'm not convinced that we have to forbid expressions to output statement. |
I agree that the best is we can have both of them. I have a proposal to replace statement output in split methods. Maybe you can check if it sounds good. By #20043, we have a This is the idea to more safely replace statement with generate variable name under the string based framework. E.g., if we have a statement output ...
int d = %STATEMENT_1% + b;
if (%STATEMENT_1% > 10 && c) {
...
} If we split a method, the method body like: void splitMethod1(int %STATEMENT_1%, int b, boolean c) {
...
int d = %STATEMENT_1% + b;
if (%STATEMENT_1% > 10 && c) {
...
}
}
...
splitMethod1(%STATEMENT_1%, c); After replacing statement with variable: void splitMethod1(int varAPlusOne, int b, boolean c) {
...
int d = varAPlusOne + b;
if (varAPlusOne > 10 && c) {
...
}
}
...
int varAPlusOne = a + 1;
splitMethod1(varAPlusOne, b, c); |
This is a pretty cool idea that can work with the current string based codegen framework, LGTM! |
Cool. LGTM, too. |
LGTM, great work! |
@cloud-fan Since #20043 was merged now, I will go to polish this and implement the above idea. Will submit a PR for this when it is ready. |
## What changes were proposed in this pull request? This patch tries to implement this [proposal](#19813 (comment)) to add an API for handling expression code generation. It should allow us to manipulate how to generate codes for expressions. In details, this adds an new abstraction `CodeBlock` to `JavaCode`. `CodeBlock` holds the code snippet and inputs for generating actual java code. For example, in following java code: ```java int ${variable} = 1; boolean ${isNull} = ${CodeGenerator.defaultValue(BooleanType)}; ``` `variable`, `isNull` are two `VariableValue` and `CodeGenerator.defaultValue(BooleanType)` is a string. They are all inputs to this code block and held by `CodeBlock` representing this code. For codegen, we provide a specified string interpolator `code`, so you can define a code like this: ```scala val codeBlock = code""" |int ${variable} = 1; |boolean ${isNull} = ${CodeGenerator.defaultValue(BooleanType)}; """.stripMargin // Generates actual java code. codeBlock.toString ``` Because those inputs are held separately in `CodeBlock` before generating code, we can safely manipulate them, e.g., replacing statements to aliased variables, etc.. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <[email protected]> Closes #21193 from viirya/SPARK-24121.
What changes were proposed in this pull request?
SPARK-22543 fixes the 64kb compile error for deeply nested expression for non-wholestage codegen. This PR extends it to support wholestage codegen.
This patch brings some util methods in to extract necessary parameters for an expression if it is split to a function.
The util methods are put in object
ExpressionCodegen
undercodegen
. The main entry isgetExpressionInputParams
which returns all necessary parameters to evaluate the given expression in a split function.This util methods can be used to split expressions too. This is a TODO item later.
How was this patch tested?
Added test.