-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24121][SQL] Add API for handling expression code generation #21193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@viirya I just glanced over this. We currently use a lot of mutable code for in code generation with - TBH - way too complex side effects. Your proposal adds another layer of side effects. Why don't we add a bit more structure to the JavaCode abstraction, and figure out that we need splitting from that? It will probably be more involved than this because you probably cannot fix the names upfront. |
|
retest this please. |
|
@hvanhovell Thanks for your comment! I tried to add new JavaCode abstraction |
|
retest this please. |
|
Test build #90004 has finished for PR 21193 at commit
|
|
Test build #90009 has finished for PR 21193 at commit
|
|
Test build #90059 has finished for PR 21193 at commit
|
|
retest this please. |
|
Test build #90083 has finished for PR 21193 at commit
|
|
retest this please. |
|
Test build #90092 has finished for PR 21193 at commit
|
|
retest this please. |
|
Test build #90104 has finished for PR 21193 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work @viirya. I do like this approach.
|
|
||
| object ExprValue { | ||
| implicit def exprValueToString(exprValue: ExprValue): String = exprValue.toString | ||
| implicit def exprValueToString(exprValue: ExprValue): String = exprValue.code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we use toString?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because toString will call exprValueToString. We should have only one place to convert an ExprValue to string.
| EmptyBlock | ||
| } else { | ||
| args.foreach { | ||
| case _: ExprValue => true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why true? Can't we just do nothing after the =>?
| implicit def blocksToBlock(blocks: Seq[Block]): Block = Blocks(blocks) | ||
|
|
||
| implicit class BlockHelper(val sc: StringContext) extends AnyVal { | ||
| def code(args: Any*): Block = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chat about adding checkLengths(args)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
| val expressions = exprValues.iterator | ||
| var buf = new StringBuffer(strings.next) | ||
| while (strings.hasNext) { | ||
| if (expressions.hasNext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not needed (especially if we add the checkLengths), because if it is not true we are in a bad situation
|
Test build #90120 has finished for PR 21193 at commit
|
|
retest this please. |
|
Test build #90135 has finished for PR 21193 at commit
|
|
Test build #90139 has finished for PR 21193 at commit
|
|
I think this can be reviewed for now. Thanks. cc @cloud-fan @hvanhovell @kiszk @maropu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except some minor comments
| // as a function before. In that case, we just re-use it. | ||
| ExprCode(ctx.registerComment(this.toString), subExprState.isNull, subExprState.value) | ||
| ExprCode(ctx.registerComment(this.toString), subExprState.isNull, | ||
| subExprState.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary change
| buf append inputs.next | ||
| buf append StringContext.treatEscapes(strings.next) | ||
| } | ||
| buf.toString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we add an assert that assert(!inputs.hasNext)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not needed because we have checkLengths at line 159. This is the same approach which is used by the StringContext in the scala library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see
| ev.copy(code = s"final ${CodeGenerator.javaType(dataType)} ${ev.value} = " + | ||
| s"$className.getInputFilePath();", isNull = FalseLiteral) | ||
| ev.copy(code = code"final ${CodeGenerator.javaType(dataType)} ${ev.value} = " + | ||
| code"$className.getInputFilePath();", isNull = FalseLiteral) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the Blocks.code is defined as override def code: String = blocks.map(_.toString).mkString("\n"), is it OK here?
maybe we can do
val typeDef = s"final ${CodeGenerator.javaType(dataType)}"
ev.code(code = code"$typeDef ${ev.value} = $className.getInputFilePath();", ...)
| val className = InputFileBlockHolder.getClass.getName.stripSuffix("$") | ||
| ev.copy(code = s"final ${CodeGenerator.javaType(dataType)} ${ev.value} = " + | ||
| s"$className.getStartOffset();", isNull = FalseLiteral) | ||
| ev.copy(code = code"final ${CodeGenerator.javaType(dataType)} ${ev.value} = " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
| val className = InputFileBlockHolder.getClass.getName.stripSuffix("$") | ||
| ev.copy(code = s"final ${CodeGenerator.javaType(dataType)} ${ev.value} = " + | ||
| s"$className.getLength();", isNull = FalseLiteral) | ||
| ev.copy(code = code"final ${CodeGenerator.javaType(dataType)} ${ev.value} = " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
| } | ||
|
|
||
| test("Throws exception when interpolating unexcepted object in code block") { | ||
| val obj = TestClass(100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can simply use a Tuple2 here, like 1 -> 1
| case _: SimpleExprValue => aliasedParam | ||
| case other => other | ||
| } | ||
| val aliasedCode = CodeBlock(code.asInstanceOf[CodeBlock].codeParts, aliasedInputs).stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the follow up, we may need a better API to do this. The current way is a little hacky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 with @cloud-fan 's comment above. The current demonstrated way of manipulating code is less than ideal. But at least this PR does provide a good carrier for us to build the future code manipulation API so it's good enough for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agreed. In the follow up, we definitely should have a better API for manipulating code. I think here it is used to show what we can do for CodeBlock right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I just think about this and have initial idea. I'd prototype it locally and make follow up for your review after this.
| val str = s"columnVector[$columnVar, $ordinal, ${dataType.simpleString}]" | ||
| val code = s"${ctx.registerComment(str)}\n" + (if (nullable) { | ||
| s""" | ||
| val code = code"${ctx.registerComment(str)}\n" + (if (nullable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the \n is not needed now, since we always add \n between code blocks.
|
|
||
| case class Blocks(blocks: Seq[Block]) extends Block { | ||
| override lazy val exprValues: Set[ExprValue] = blocks.flatMap(_.exprValues).toSet | ||
| override def code: String = blocks.map(_.toString).mkString("\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we also make it lazy val?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one of the goals of this is letting us changing the string representation of an expression, so making it a lazy val may be fine for now, but would impede us from doing that. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it depends on how we would implement code rewriting later. If we wanna keep JavaCode immutable, then lazy val is OK here, because once we rewrite the code, we will have a new Blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think so. Ideally I think JavaCode should be immutable. Once we change code in it, we should have a new object instead of chancing current one.
| protected def evaluateVariables(variables: Seq[ExprCode]): String = { | ||
| val evaluate = variables.filter(_.code != "").map(_.code.trim).mkString("\n") | ||
| variables.foreach(_.code = "") | ||
| val evaluate = variables.filter(_.code.toString != "").map(_.code.toString).mkString("\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: _.code.toString.nonEmpty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love having more structure in the codegen framework. Thanks for tackling this, @viirya !
Left some nitpick comments inline. Overall I'd say this does a good job in laying down the foundations, but there's still a lot to be done for building up a nice API for the actual code manipulation.
@viirya do you have concrete follow-up ideas / tasks that would use this API? If so, it'd be nice to see how you're planning to use the API. I imagine the original intent of #19813 would be the main use case?
| case other => throw new IllegalArgumentException( | ||
| s"Can not interpolate ${other.getClass.getName} into code block.") | ||
| } | ||
| CodeBlock(sc.parts, args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering: does it make sense to eagerly fold the literal args into the parts, instead of carrying them over into CodeBlock? That's similar to what you're already doing with CodeBlock.exprValues, but just done a bit more eagerly.
Basically, the only things that are safe to manipulate are the structured ones like ExprValue and Block; none of the others are safe to change because we don't know what's in there. So no need to even carry them as separate argument into CodeBlock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea. We should not change literal args. Keeping them in block inputs seems meaningless.
| } | ||
|
|
||
| test("replace expr values in code block") { | ||
| val statement = JavaCode.expression("1 + 1", IntegerType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we rename this statement variable to something like expr or someExpr or something? It's an expression not a statement.
| case _: SimpleExprValue => aliasedParam | ||
| case other => other | ||
| } | ||
| val aliasedCode = CodeBlock(code.asInstanceOf[CodeBlock].codeParts, aliasedInputs).stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 with @cloud-fan 's comment above. The current demonstrated way of manipulating code is less than ideal. But at least this PR does provide a good carrier for us to build the future code manipulation API so it's good enough for now.
|
@rednaxelafx Thanks for your comments! Yes, #19813 is should be the first use case. However, I'd like to first have follow up for some points we discussed so far, e.g., preventing string interpolation, manipulating code API, etc.. So this API can be more easily to be used. |
|
Test build #90833 has finished for PR 21193 at commit
|
|
retest this please. |
|
Test build #90835 has finished for PR 21193 at commit
|
|
retest this please. |
|
Test build #90858 has finished for PR 21193 at commit
|
|
@cloud-fan @rednaxelafx Your last comments are addressed. Please check if you have other comments. Thanks for review. |
| // Folds eagerly the literal args into the code parts. | ||
| private def foldLiteralArgs(parts: Seq[String], args: Seq[Any]): (Seq[String], Seq[Any]) = { | ||
| val codeParts = ArrayBuffer.empty[String] | ||
| val blockInputs = ArrayBuffer.empty[Any] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we make the type JavaCode instead of Any?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. JavaCode is better.
| val inputs = args.iterator | ||
| val buf = new StringBuilder(Block.CODE_BLOCK_BUFFER_LENGTH) | ||
|
|
||
| buf append strings.next |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use java style here? buf.append(strings.next).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my comments! LGTM now, no more comments from my side.
Just anxiously waiting to play with it more and also the polished code manipulation API ;-)
| assert(code.toString == "boolean expr1_isNull = false;") | ||
| } | ||
|
|
||
| test("Literals are folded into string code parts instead of block inputs") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, I like it!
| val buf = new StringBuilder(Block.CODE_BLOCK_BUFFER_LENGTH) | ||
|
|
||
| buf append strings.next | ||
| while (strings.hasNext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
|
Test build #90931 has finished for PR 21193 at commit
|
|
retest this please. |
|
Test build #90941 has finished for PR 21193 at commit
|
|
LGTM, can we fix the conflict? |
|
@cloud-fan Thanks. Fixed. |
|
Test build #90964 has finished for PR 21193 at commit
|
|
thanks, merging to master! |
|
Thanks for your review @hvanhovell @cloud-fan @kiszk @maropu @mgaido91 @ueshin @rednaxelafx |
What changes were proposed in this pull request?
This patch tries to implement this proposal to add an API for handling expression code generation. It should allow us to manipulate how to generate codes for expressions.
In details, this adds an new abstraction
CodeBlocktoJavaCode.CodeBlockholds the code snippet and inputs for generating actual java code.For example, in following java code:
variable,isNullare twoVariableValueandCodeGenerator.defaultValue(BooleanType)is a string. They are all inputs to this code block and held byCodeBlockrepresenting this code.For codegen, we provide a specified string interpolator
code, so you can define a code like this:Because those inputs are held separately in
CodeBlockbefore generating code, we can safely manipulate them, e.g., replacing statements to aliased variables, etc..How was this patch tested?
Added tests.