-
Couldn't load subscription status.
- Fork 28.9k
[SPARK-22498][SQL] Fix 64KB JVM bytecode limit problem with concat #19728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #83745 has finished for PR 19728 at commit
|
|
Test build #83746 has finished for PR 19728 at commit
|
|
Test build #83747 has finished for PR 19728 at commit
|
| ctx.addMutableState("UTF8String[]", args, "") | ||
|
|
||
| val inputs = evals.zipWithIndex.map { case (eval, index) => | ||
| if (eval.isNull != "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If eval.isNull is not a pre-evaluated constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If eval.isNull is not a pre-evaluated constant, I expect that the following code at lines 73-74 will be generated. Only when we ensure it is true, we can avoid assigning a value (use default null value).
| ev.copy(evals.map(_.code).mkString("\n") + s""" | ||
| UTF8String ${ev.value} = UTF8String.concatWs($inputs); | ||
| val inputs = evals.tail.zipWithIndex.map { case (eval, index) => | ||
| if (eval.isNull != "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| }.unzip | ||
|
|
||
| ev.copy(evals.map(_.code).mkString("\n") + | ||
| // ev.copy(evals.map(_.code).mkString("\n") + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forget to remove commented code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks :)
| if (eval.isNull != "true") { | ||
| s""" | ||
| ${eval.code} | ||
| $args[$index] = ${eval.value}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If eval.isNull is evaluated to null at runtime, eval.value is useless. We should assign null in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
| if (eval.isNull != "true") { | ||
| s""" | ||
| ${eval.code} | ||
| $args[$index] = ${eval.value}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (${ev.value} == null) { | ||
| ${ev.isNull} = true; | ||
| val argNums = evals.length | ||
| val args = ctx.freshName("argLen") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
argLen? I think it's not length of args?
| ${eval.code} | ||
| if (!${eval.isNull}) { | ||
| $args[$index] = ${eval.value}; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are better to assign null too if eval.isNull == null, because args is global variable and we need to override all values for previous row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, while args is global, UTF8String[] is allocated before executing these assignments. Thus, we can ensure all of elements in args are null.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is guaranteed before the first row processed. After we process rows, args are updated for each row. E.g., args[0] can be updated and assigned with a string for row 0. In next row, if eval.isNull is evaluated to true, we don't override back to null, so arg[0] is still the string value for row 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the next row is processed, $args = new UTF8String[$argNums] is executed again in apply() method. In other words, I think that current implementation does not reuse UTF8String[] between different rows.
Do you mean UTF8String[] is reused between different rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. I didn't notice that args is not globally initialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for taking your time for my code.
It is not easy to imagine the generated code. I may have to put the generated code.
|
Test build #83770 has finished for PR 19728 at commit
|
|
Test build #83779 has finished for PR 19728 at commit
|
| ${ev.isNull} = true; | ||
| val argNums = evals.length | ||
| val args = ctx.freshName("args") | ||
| ctx.addMutableState("UTF8String[]", args, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can avoid defining this as a global variable, what do you think?
| UTF8String ${ev.value} = UTF8String.concatWs($inputs); | ||
| val argNums = evals.length | ||
| val args = ctx.freshName("argLen") | ||
| ctx.addMutableState("UTF8String[]", args, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also can be kept method local, IMHO
| val inputs = evals.tail.zipWithIndex.map { case (eval, index) => | ||
| if (eval.isNull != "true") { | ||
| s""" | ||
| ${eval.code} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: in this and the next 4 lines an indentation space is missing if I am not wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
|
|
||
| ev.copy(evals.map(_.code).mkString("\n") + s""" | ||
| UTF8String ${ev.value} = UTF8String.concatWs($inputs); | ||
| val argNums = evals.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be evals.length -1, or even better, I'd suggest to declare two variables: sep and strings (or the name you prefer) to hold head and tail. This would help readability too IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry that I cannot understand your suggestion.
argNums is referred as $argNums to allocate an array. Are you suggesting to allocate an array as new UTF8String[$argNums + 1]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, I am saying the opposite. Let's consider an example, maybe I better can explain what I mean in this way. Let's assume that we are running concat_ws(',', 'a', 'b'). Then, evals would contain 3 elements. So here your argNums would be 3. But here you would be using only $args[0] and $args[1], because the first element (',', the separator) is handled differently.
Thus, I would suggest to have something like:
val sep = evals.head
val strings = evals.tail
val argNums = strings.length // note that this is evals.length -1
...
I think that in this way the code would be much clearer (other than fixing this little bug).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your kindly explanation. I totally agree with you.
|
Test build #83787 has finished for PR 19728 at commit
|
|
Test build #83791 has finished for PR 19728 at commit
|
|
Test build #83795 has finished for PR 19728 at commit
|
| UTF8String ${ev.value} = UTF8String.concat($inputs); | ||
| if (${ev.value} == null) { | ||
| ${ev.isNull} = true; | ||
| val argNums = evals.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: numArgs.
| val args = ctx.freshName("args") | ||
|
|
||
| val inputs = evals.zipWithIndex.map { case (eval, index) => | ||
| if (eval.isNull != "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do not mix in optimizations with bug fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I will remove this.
|
Test build #83944 has finished for PR 19728 at commit
|
| if (${ev.value} == null) { | ||
| ${ev.isNull} = true; | ||
| } | ||
| val numArgs = evals.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can inline it
| UTF8String ${ev.value} = UTF8String.concatWs($inputs); | ||
| val separator = evals.head | ||
| val strings = evals.tail | ||
| val argNums = strings.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: numArgs
| ev.copy(s""" | ||
| UTF8String[] $args = new UTF8String[$argNums]; | ||
| $codes | ||
| UTF8String ${ev.value} = UTF8String.concatWs(${separator.value}, $args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
val code = if (ctx.INPUT_ROW != null && ctx.currentVars == null)
...
ev.copy(code = s"""
UTF8String[] $args = new UTF8String[$argNums];
${separator.code}
...
""")
|
can you split it into 3 PRs? The approaches for these 3 expressions are quite different. |
|
Test build #83974 has finished for PR 19728 at commit
|
|
Test build #83983 has finished for PR 19728 at commit
|
|
thanks, merging to master/2.2! |
## What changes were proposed in this pull request? This PR changes `concat` code generation to place generated code for expression for arguments into separated methods if these size could be large. This PR resolved the case of `concat` with a lot of argument ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <[email protected]> Closes #19728 from kiszk/SPARK-22498. (cherry picked from commit d54bfec) Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? This PR changes `concat` code generation to place generated code for expression for arguments into separated methods if these size could be large. This PR resolved the case of `concat` with a lot of argument ## How was this patch tested? Added new test cases into `StringExpressionsSuite` Author: Kazuaki Ishizaki <[email protected]> Closes apache#19728 from kiszk/SPARK-22498. (cherry picked from commit d54bfec) Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR changes
concatcode generation to place generated code for expression for arguments into separated methods if these size could be large.This PR resolved the case of
concatwith a lot of argumentHow was this patch tested?
Added new test cases into
StringExpressionsSuite