- 
                Notifications
    You must be signed in to change notification settings 
- Fork 28.9k
[SPARK-18091] [SQL] Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit #15620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @rxin @liancheng @sameeragarwal Please review | 
| ok to test. | 
| Test build #68451 has finished for PR 15620 at commit  
 | 
62f7b4e    to
    6059572      
    Compare
  
    | Test build #68464 has finished for PR 15620 at commit  
 | 
| I ran the tests on my machine and they passed. @cloud-fan Can you please trigger a re-run of test build? | 
| retest this please | 
| Test build #68832 has started for PR 15620 at commit  | 
3213f2e    to
    19622e2      
    Compare
  
    | Test build #68838 has finished for PR 15620 at commit  
 | 
| @cloud-fan the same tests which failed earlier are failing again. Could you issue retest request one more time? | 
| retest this please | 
| Test build #68851 has finished for PR 15620 at commit  
 | 
| Looks like it is not flaky test. The test failure is due to compilation error of janino: A local variable can not be accessed in the generated evaluation function. | 
| I did not want to run tests using ./dev/run-tests on my machine because it'll take hours. I ran the mllib module tests using "mvn test" which worked fine on my machine. What could be the difference between such a run on my machine and the Jenkins Test Build? Please guide me on further steps. | 
| can you try sbt? The jenkins run tests using sbt instead of maven | 
| I've tested one of the failed tests locally with sbt. It failed. | 
| Let me try running the unit tests with sbt | 
| I have been able to reproduce the failures using sbt. Investigating the cause. | 
| @kapilsingh5050 any update? | 
| I'm not sure whether I understand the problem completely. It seems WholeStage CodeGen wants the complete stage code in a single function. The fix I made violates this because it creates separate methods for codegen of If expression's condition, if block and then block (of course, when the total code is beyond a threshold). Now, if my assumption about WholeStage CodeGen is correct then how does the split expressions logic work? Or maybe the problem is specific to initialisation of input attributes of a stage. | 
19622e2    to
    6a7e9ac      
    Compare
  
    | The unit tests passed after incorporating related changes from SPARK-15327 and SPARK-17115 | 
| Test build #69324 has finished for PR 15620 at commit  
 | 
| @viirya @cloud-fan @rxin Please review and merge | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The limitation is 64k, we don't need to be so conservative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the same limit as in following change:
https://github.com/apache/spark/pull/14692/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cL595
which is based on some benchmarks. In addition, for JIT to do its optimisations, I think the limit is 8k.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please follow the existing code style in Spark, i.e.
def xxx(
  param1: xxx,
  param2: xxx): xxx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like we don't need to pass in the expr, but only its data type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'll fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if condition and true/false expressions are bound to currentVars, we still exceed JVM code size limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about how other similar patches handle whole stage codegen, any ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will pass needed local variables to the separated functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we want to do it here. It will add a bit complexity. If for simplicity, I think we can not to support it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the conversation on the change I mentioned also went like there will be some refactoring required in whole stage code generation to support that
| } | ||
|  | ||
| val expressions = Seq(strExpr) | ||
| val plan = GenerateUnsafeProjection.generate(expressions, true) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: other similar tests use GenerateMutableProjection to test the codegen, any specific reason we have to use GenerateUnsafeProjection here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue I had encountered involved GenerateUnsafeProjection and I was trying to reproduce the same in the unit test so I went with it. I can't think of any other reason. Any specific reason you guys use GenerateMutableProjection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no specific reason too.
| assert(actual(0) == cases) | ||
| } | ||
|  | ||
| test("SPARK-18091: split large if expressions into blocks due to JVM code size limit") { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test looks better now.
| This looks good now. | 
| Test build #69569 has finished for PR 15620 at commit  
 | 
| @viirya @cloud-fan Are we good to merge? Please let me know if there is any other action on my end. | 
| LGTM and cc @cloud-fan for another check. | 
…Projection code to exceed JVM code size limit ## What changes were proposed in this pull request? Fix for SPARK-18091 which is a bug related to large if expressions causing generated SpecificUnsafeProjection code to exceed JVM code size limit. This PR changes if expression's code generation to place its predicate, true value and false value expressions' generated code in separate methods in context so as to never generate too long combined code. ## How was this patch tested? Added a unit test and also tested manually with the application (having transformations similar to the unit test) which caused the issue to be identified in the first place. Author: Kapil Singh <[email protected]> Closes #15620 from kapilsingh5050/SPARK-18091-IfCodegenFix. (cherry picked from commit e463678) Signed-off-by: Wenchen Fan <[email protected]>
| thanks, merging to master/2.1! | 
| @cloud-fan Can we merge this change to 1.6 and 2.0 too? | 
…Projection code to exceed JVM code size limit ## What changes were proposed in this pull request? Fix for SPARK-18091 which is a bug related to large if expressions causing generated SpecificUnsafeProjection code to exceed JVM code size limit. This PR changes if expression's code generation to place its predicate, true value and false value expressions' generated code in separate methods in context so as to never generate too long combined code. ## How was this patch tested? Added a unit test and also tested manually with the application (having transformations similar to the unit test) which caused the issue to be identified in the first place. Author: Kapil Singh <[email protected]> Closes #15620 from kapilsingh5050/SPARK-18091-IfCodegenFix. (cherry picked from commit e463678) Signed-off-by: Wenchen Fan <[email protected]>
| merged to 2.0. Can you send a new PR to backport this to 1.6? There are a lot of code changes beween 1.6 and 2.1, it's safer to open a PR and run all tests. | 
| @cloud-fan created #16146 for backporting to 1.6 | 
| @cloud-fan @kapilsingh5050 it's weird, but it looks like after this change, all of the Maven-based 2.0 Jenkins jobs now time out. It looks consistent: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ They all end like this:  | 
|  | ||
| var strExpr: Expression = inputStrAttr | ||
| for (_ <- 1 to 13) { | ||
| strExpr = If(EqualTo(Decode(Encode(strExpr, "utf-8"), "utf-8"), inputStrAttr), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @srowen I think this is the root cause. This test is an overkill, although this PR fixed the code size limitation problem, this test may still hit constants pool size limitation, which is a known limiation and has not been fixed yet. It seems that maven and sbt have different JVM settings when run test, so the problem only exists at maven side.
I'm going to submit a PR to simplify this test a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interested in this, we know with 5 iterations instead of a 13 we don't get the problem (which makes sense as we'd only be generating so much code, not hitting the 64k constant pool size limit). I'm testing with IBM's SDK for Java so curious if it manifests itself in a different way for us or if we have a problem to fix on our end.
I have log files exceeding 2 GB from the test printing the generated code on failure.
If we add prints for the strExpr we see something like
The problem is
CodeGenerationSuite:
- multithreaded eval
- metrics are recorded on compile
- SPARK-8443: split wide projections into blocks due to JVM code size limit
- SPARK-13242: case-when expression with large number of branches (or cases)
- SPARK-18091: split large if expressions into blocks due to JVM code size limit *** FAILED ***
  java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0xFFFF
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private boolean isNull42;
/* 009 */   private boolean value42;
/* 010 */   private boolean isNull43;
/* 011 */   private UTF8String value43;
/* 012 */   private boolean isNull44;
/* 013 */   private UTF8String value44;
/* 014 */   private boolean isNull58;
With my prints:
debug, row: [StringForTesting]
input string attr: input[0, string, true]
in the loop
strExpr is: if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true]
in the loop
strExpr is: if ((decode(encode(if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true], utf-8), utf-8) = input[0, string, true])) if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true] else if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true]
in the loop
strExpr is: if ((decode(encode(if ((decode(encode(if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true], utf-8), utf-8) = input[0, string, true])) if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true] else if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true], utf-8), utf-8) = input[0, string, true])) if ((decode(encode(if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true], utf-8), utf-8) = input[0, string, true])) if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true] else if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true] else if ((decode(encode(if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true], utf-8), utf-8) = input[0, string, true])) if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true] else if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, true]
etc
Is this the same issue we're referring to? I've also seen timeouts and our Jenkins farm has 5h limits, I see the problem against branch-2.0, branch-2.1, master, but didn't see it for Spark 2.1.0 RC1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@a-roberts I think this is merged into branch-2.1 after RC1 coming out.
## What changes were proposed in this pull request? After #15620 , all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in #15620 (comment) , it seems that the regression test is an overkill and may hit constants pool size limitation, which is a known issue and hasn't been fixed yet. Since #15620 only fix the code size limitation problem, we can simplify the test to avoid hitting constants pool size limitation. ## How was this patch tested? test only change Author: Wenchen Fan <[email protected]> Closes #16244 from cloud-fan/minor. (cherry picked from commit 9abd05b) Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? After #15620 , all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in #15620 (comment) , it seems that the regression test is an overkill and may hit constants pool size limitation, which is a known issue and hasn't been fixed yet. Since #15620 only fix the code size limitation problem, we can simplify the test to avoid hitting constants pool size limitation. ## How was this patch tested? test only change Author: Wenchen Fan <[email protected]> Closes #16244 from cloud-fan/minor.
## What changes were proposed in this pull request? After #15620 , all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in #15620 (comment) , it seems that the regression test is an overkill and may hit constants pool size limitation, which is a known issue and hasn't been fixed yet. Since #15620 only fix the code size limitation problem, we can simplify the test to avoid hitting constants pool size limitation. ## How was this patch tested? test only change Author: Wenchen Fan <[email protected]> Closes #16244 from cloud-fan/minor. (cherry picked from commit 9abd05b) Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? After apache#15620 , all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in apache#15620 (comment) , it seems that the regression test is an overkill and may hit constants pool size limitation, which is a known issue and hasn't been fixed yet. Since apache#15620 only fix the code size limitation problem, we can simplify the test to avoid hitting constants pool size limitation. ## How was this patch tested? test only change Author: Wenchen Fan <[email protected]> Closes apache#16244 from cloud-fan/minor.
…Projection code to exceed JVM code size limit ## What changes were proposed in this pull request? Fix for SPARK-18091 which is a bug related to large if expressions causing generated SpecificUnsafeProjection code to exceed JVM code size limit. This PR changes if expression's code generation to place its predicate, true value and false value expressions' generated code in separate methods in context so as to never generate too long combined code. ## How was this patch tested? Added a unit test and also tested manually with the application (having transformations similar to the unit test) which caused the issue to be identified in the first place. Author: Kapil Singh <[email protected]> Closes apache#15620 from kapilsingh5050/SPARK-18091-IfCodegenFix.
## What changes were proposed in this pull request? After apache#15620 , all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in apache#15620 (comment) , it seems that the regression test is an overkill and may hit constants pool size limitation, which is a known issue and hasn't been fixed yet. Since apache#15620 only fix the code size limitation problem, we can simplify the test to avoid hitting constants pool size limitation. ## How was this patch tested? test only change Author: Wenchen Fan <[email protected]> Closes apache#16244 from cloud-fan/minor.
…Projection code to exceed JVM code size limit ## What changes were proposed in this pull request? Fix for SPARK-18091 which is a bug related to large if expressions causing generated SpecificUnsafeProjection code to exceed JVM code size limit. This PR changes if expression's code generation to place its predicate, true value and false value expressions' generated code in separate methods in context so as to never generate too long combined code. ## How was this patch tested? Added a unit test and also tested manually with the application (having transformations similar to the unit test) which caused the issue to be identified in the first place. Author: Kapil Singh <[email protected]> Closes apache#15620 from kapilsingh5050/SPARK-18091-IfCodegenFix.
## What changes were proposed in this pull request? After apache#15620 , all of the Maven-based 2.0 Jenkins jobs time out consistently. As I pointed out in apache#15620 (comment) , it seems that the regression test is an overkill and may hit constants pool size limitation, which is a known issue and hasn't been fixed yet. Since apache#15620 only fix the code size limitation problem, we can simplify the test to avoid hitting constants pool size limitation. ## How was this patch tested? test only change Author: Wenchen Fan <[email protected]> Closes apache#16244 from cloud-fan/minor.
…essions ## What changes were proposed in this pull request? A frequently reported issue of Spark is the Java 64kb compile error. This is because Spark generates a very big method and it's usually caused by 3 reasons: 1. a deep expression tree, e.g. a very complex filter condition 2. many individual expressions, e.g. expressions can have many children, operators can have many expressions. 3. a deep query plan tree (with whole stage codegen) This PR focuses on 1. There are already several patches(apache#15620 apache#18972 apache#18641) trying to fix this issue and some of them are already merged. However this is an endless job as every non-leaf expression has this issue. This PR proposes to fix this issue in `Expression.genCode`, to make sure the code for a single expression won't grow too big. According to maropu 's benchmark, no regression is found with TPCDS (thanks maropu !): https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing ## How was this patch tested? existing test Author: Wenchen Fan <[email protected]> Author: Wenchen Fan <[email protected]> Closes apache#19767 from cloud-fan/codegen.
What changes were proposed in this pull request?
Fix for SPARK-18091 which is a bug related to large if expressions causing generated SpecificUnsafeProjection code to exceed JVM code size limit.
This PR changes if expression's code generation to place its predicate, true value and false value expressions' generated code in separate methods in context so as to never generate too long combined code.
How was this patch tested?
Added a unit test and also tested manually with the application (having transformations similar to the unit test) which caused the issue to be identified in the first place.