-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-21751][SQL] CodeGeneraor.splitExpressions counts code size more precisely #18966
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add an internal SQLConf for it and make it adjustable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
Test build #80749 has finished for PR 18966 at commit
|
|
Test build #80776 has finished for PR 18966 at commit
|
|
ping @gatorsmile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my understanding, this is not the number of characters? This is the length of source codes, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the length of source code does not mean the number of lines of the source code, you are right. This is because we check the sum of String.length.
If it is correct, I will update them based on the length of source codes.
Here are two examples.
abc -> 3
ab
c
-> 4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, using the number of characters looks pretty fragile. It is even worse than using the number of lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this PR, do we change this parameter to use number of lines instead of number of characters, too? It is possible technically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @gatorsmile
|
Test build #81100 has finished for PR 18966 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this threshold impact this test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile Sorry, I updated the result. This threshold impacts this test case.
|
Test build #81109 has finished for PR 18966 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag does not take an effect at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? Could you elaborate this review comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you set the conf to a new value at runtime, can you get the value from SQLConf.get.maxCodegenLinesPerFunction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I had another interpretation that "this value may not change performance".
Let me check this while I did it like other flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Depends on calling context, it may take an effect or not. Should we pass I am thinking whether we can pass SQLConf to this method?SQLConf to a constructor or not.
Is there any other thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this option effective by executing SparkEnv.get.conf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile Since to make it configurable takes long time, can we do it using hard-coded parameter?
Even in this case, this PR makes better since the estimation does not include characters for comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiszk You know, I am just afraid new regression could be introduced due to this change. Sorry for the delay. I really do not have a better solution. I kind of agree on your original solution. Just exclude the characters for comment. At least, it becomes better and take a less risk to hit a regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile I understand your concern about the possibility of new performance regression. I will use the original threshold (max characters) as hard-coded value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splited -> split
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to emphasize this is not for whole-stage codegen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is not correct. Explain it for expressions only? rename it to
spark.sql.codegen.expressions.maxCodegenLinesPerFunction
|
Test build #81154 has finished for PR 18966 at commit
|
|
retest this please |
|
Test build #81160 has finished for PR 18966 at commit
|
|
ping @gatorsmile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not following what we are doing for the other SQLConf. I am also thinking if we should just put it into StaticSQLConf. Let me check it with others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile Is there any progress?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple related PRs. Maybe we can wait until the reviews are finished there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @gatorsmile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use bytecode size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we do not know precise bytecode size. Will we use estimated bytecode size based on characters per line? Or, other ideas to get precise bytecode size before compiling a method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gentle ping @gatorsmile
|
Test build #82455 has finished for PR 18966 at commit
|
|
Test build #82518 has finished for PR 18966 at commit
|
| """([ |\t]*?\/\/[\s\S]*?\n)""").r // strip //comment | ||
| val codeWithoutComment = commentReg.replaceAllIn(input, "") | ||
| codeWithoutComment.replaceAll("""\n\s*\n""", "\n") // strip ExtraNewLines | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add back the test case of this function?
|
LGTM pending Jenkins |
|
Test build #82596 has finished for PR 18966 at commit
|
|
Test build #82598 has finished for PR 18966 at commit
|
|
Thanks! Merged to master. |
What changes were proposed in this pull request?
Current
CodeGeneraor.splitExpressionssplits statements into methods if the total length of statements is more than 1024 characters. The length may include comments or empty line.This PR excludes comment or empty line from the length to reduce the number of generated methods in a class, by using
CodeFormatter.stripExtraNewLinesAndComments()method.How was this patch tested?
Existing tests