[SPARK-21413][SQL] Fix 64KB JVM bytecode limit problem in multiple projections with CASE WHEN #18641
Conversation
Test build #79632 has finished for PR 18641 at commit
Test build #79636 has finished for PR 18641 at commit
@viirya @cloud-fan could you please take a look?
We can move this condition above and reuse it.
Shall we check INPUT_ROW too for these three functions?
NVM. I saw there's the check already.
Is it enough to only consider the generated code of each individual condition? Even if each one is less than 512 (e.g. 500), the combination of all conditions can still be large enough to cause the same issue.
Good catch. I updated it to use the same logic as If.
Test build #79660 has finished for PR 18641 at commit
Jenkins, retest this please
Test build #79666 has finished for PR 18641 at commit
@viirya Are there any additional comments?
Actually, my previous comment meant the sum of the generated code of all branches.
With the latest commit, it is possible that the code size of each branch is within the range, but the sum of all branches still breaks the 64KB limit.
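The point can be shown with a small sketch (the per-branch threshold and sizes here are illustrative, not Spark's actual values): every branch can pass a per-branch size check while the concatenation of all branches still exceeds the JVM's 64KB-per-method bytecode limit.

```java
import java.util.Arrays;

// Sketch: per-branch checks pass, but the combined method is too large.
public class BranchSizeSketch {
    static final int PER_BRANCH_LIMIT = 512;   // hypothetical per-branch check
    static final int METHOD_LIMIT = 64 * 1024; // JVM limit on bytecode per method

    // True if every branch alone is under the per-branch threshold.
    static boolean eachBranchFits(int[] branchSizes) {
        for (int s : branchSizes) {
            if (s >= PER_BRANCH_LIMIT) return false;
        }
        return true;
    }

    // True if the sum of all branches still fits in one method.
    static boolean totalFits(int[] branchSizes) {
        int sum = 0;
        for (int s : branchSizes) sum += s;
        return sum < METHOD_LIMIT;
    }

    public static void main(String[] args) {
        int[] sizes = new int[200];
        Arrays.fill(sizes, 500);                   // 200 branches, 500 each
        System.out.println(eachBranchFits(sizes)); // true
        System.out.println(totalFits(sizes));      // false: 100000 > 65536
    }
}
```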
Ah, got it. You mean that we also have to split deeply-nested if-then-else statements into multiple methods.
I will work on that.
@viirya I addressed the three possible cases that may break the 64KB limit.
Test build #79747 has finished for PR 18641 at commit
Test build #79750 has finished for PR 18641 at commit
Test build #79762 has finished for PR 18641 at commit
ping @cloud-fan
Is this comment in the wrong place?
good catch. done.
Test build #80045 has finished for PR 18641 at commit
ping @cloud-fan
Jenkins, retest this please
Test build #82677 has finished for PR 18641 at commit
Jenkins, retest this please
Test build #83165 has finished for PR 18641 at commit
@cloud-fan would it be possible to review if you have time?
can you fix the conflict?
Test build #83803 has finished for PR 18641 at commit
val schema = StructType(StructField("a", IntegerType) :: Nil)
val df = spark.createDataFrame(sparkContext.parallelize(Seq(Row(1))), schema)
val df1 =
  df.withColumn("a", when($"a" === 0, null).otherwise($"a"))
maybe use a loop here?
Sure, the following code can throw an exception:
...
var df1 = df
for (i <- 1 to 10) {
  df1 = df1.withColumn("a", when($"a" === 0, null).otherwise($"a"))
}
...
val isNull = ctx.freshName("caseWhenIsNull")
val value = ctx.freshName("caseWhenValue")
val cases = branches.map { case (condExpr, valueExpr) =>
Can we follow what we did for And and Or, and just check the code length at the beginning? To be honest, I still don't understand your change after several minutes of reading.
For CaseWhen, the code bloat occurs within a single case class: CaseWhenCodegen.doGenCode can generate deeply-nested if-then-else statements, as shown in the comment above. Each element in cases holds only a single if-then, so it is not possible to insert the code size check here. Since And and Or generate deeply nested if-then-else by calling doGenCode many times, checking the code size there works well.
This line generates the nested if-then-else, so the code size check is performed after this line.
What do you think?
We only codegen CASE WHEN if there are fewer than 20 branches, so I think checking the code size here is good enough.
I see. Do you want to add code like this?
val cases = ...
val genCode = if (cases.map(s => s.length).sum <= 1024) {
  cases.mkString("\nelse {\n")
} else {
  // current code
  var isGlobalVariable = false
  ...
  generatedCode
}
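The shape being discussed can be sketched as a tiny code generator (the 1024 threshold mirrors the snippet above, but the helper names and structure are illustrative, not Spark's actual implementation): concatenate branches into one nested if-else chain when the total generated code is small, otherwise wrap each branch in its own method so no single method grows too big.

```java
import java.util.List;

// Sketch: choose between inlined nested if-else and per-branch methods
// based on the total size of the generated branch code.
public class CaseWhenGen {
    static final int INLINE_THRESHOLD = 1024; // hypothetical, from the snippet above

    // Small total: concatenate branches into one nested if-else chain.
    static String inline(List<String> cases) {
        return String.join("\nelse {\n", cases);
    }

    // Large total: emit one private method per branch.
    static String split(List<String> cases) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cases.size(); i++) {
            sb.append("private boolean caseWhenBranch").append(i)
              .append("() {\n").append(cases.get(i)).append("\n}\n");
        }
        return sb.toString();
    }

    static String generate(List<String> cases) {
        int total = cases.stream().mapToInt(String::length).sum();
        return total <= INLINE_THRESHOLD ? inline(cases) : split(cases);
    }
}
```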
Test build #83842 has finished for PR 18641 at commit
Test build #83854 has finished for PR 18641 at commit
Test build #83880 has finished for PR 18641 at commit
Test build #83882 has finished for PR 18641 at commit
retest this please
Test build #83885 has finished for PR 18641 at commit
…essions

## What changes were proposed in this pull request?

A frequently reported issue of Spark is the Java 64kb compile error. This is because Spark generates a very big method and it's usually caused by 3 reasons:

1. a deep expression tree, e.g. a very complex filter condition
2. many individual expressions, e.g. expressions can have many children, operators can have many expressions
3. a deep query plan tree (with whole stage codegen)

This PR focuses on 1. There are already several patches (apache#15620 apache#18972 apache#18641) trying to fix this issue and some of them are already merged. However this is an endless job as every non-leaf expression has this issue. This PR proposes to fix this issue in `Expression.genCode`, to make sure the code for a single expression won't grow too big.

According to maropu's benchmark, no regression is found with TPCDS (thanks maropu!): https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing

## How was this patch tested?

existing test

Author: Wenchen Fan <[email protected]>
Author: Wenchen Fan <[email protected]>

Closes apache#19767 from cloud-fan/codegen.
#19752 will cover this solution.
What changes were proposed in this pull request?
This PR changes CASE WHEN code generation to place the generated code of condition and value expressions into separate methods when their size could be large. When such a method is generated, variables for isNull and value are declared as instance variables so these values (e.g. isNull1409 and value1409) can be passed to the callers of the generated method. This PR resolves three cases:
Case 1 before this PR
Case 2 after this PR
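As a rough hand-written illustration of the "after" shape (the class, method, and field names here are invented for illustration; this is not Spark's actual generated code): each branch becomes its own small method, and results travel back to the caller through instance fields, analogous to the isNull1409/value1409 instance variables mentioned above.

```java
// Sketch of splitting a CASE WHEN into per-branch methods that
// communicate through instance fields instead of local variables.
public class SplitCaseWhen {
    private boolean resultIsNull; // plays the role of an isNullNNNN field
    private int resultValue;      // plays the role of a valueNNNN field

    // First WHEN branch: returns true if its condition matched.
    private boolean caseWhenBranch0(int a) {
        if (a == 0) { resultIsNull = true; resultValue = -1; return true; }
        return false; // fall through to the next branch
    }

    // The ELSE branch: always matches.
    private boolean caseWhenBranch1(int a) {
        resultIsNull = false;
        resultValue = a;
        return true;
    }

    // Analogue of when($"a" === 0, null).otherwise($"a"), with null
    // mapped to 0 for this int-returning sketch.
    public int eval(int a) {
        if (!caseWhenBranch0(a)) caseWhenBranch1(a);
        return resultIsNull ? 0 : resultValue;
    }
}
```

Because each helper method stays small, no single method approaches the 64KB bytecode limit even when the number of branches grows.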
How was this patch tested?
Added new test suites into CodeGenerationSuite and DataFrameSuite.