[SPARK-21751][SQL] CodeGenerator.splitExpressions counts code size more precisely #18966

kiszk · 2017-08-16T19:37:51Z

What changes were proposed in this pull request?

Currently, CodeGenerator.splitExpressions splits statements into methods when the total length of the statements exceeds 1024 characters. That length may include comments and empty lines.

This PR excludes comments and empty lines from the length by using the CodeFormatter.stripExtraNewLinesAndComments() method, which reduces the number of generated methods in a class.
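The effect on the method count can be illustrated with a small standalone sketch. It is not the Spark implementation: `SplitSketch`, `stripped`, and `countMethods` are hypothetical names, `stripped` is a simplified stand-in for CodeFormatter.stripExtraNewLinesAndComments() that only drops blank lines and whole-line `//` comments, and the greedy "start a new method once the running length exceeds 1024" loop is an assumption based on the description above.

```scala
object SplitSketch {
  // Threshold quoted in the PR description.
  val Threshold = 1024

  // Simplified, hypothetical stand-in for the real stripping method:
  // drops blank lines and lines that contain only a `//` comment.
  def stripped(code: String): String =
    code.split("\n")
      .filterNot(line => line.trim.isEmpty || line.trim.startsWith("//"))
      .mkString("\n")

  // Counts the methods a greedy splitter would emit if it starts a new method
  // whenever the accumulated size already exceeds the threshold.
  def countMethods(statements: Seq[String], size: String => Int): Int = {
    var methods = if (statements.isEmpty) 0 else 1
    var length = 0
    for (s <- statements) {
      if (length > Threshold) {
        methods += 1
        length = 0
      }
      length += size(s)
    }
    methods
  }

  def main(args: Array[String]): Unit = {
    // Each statement carries a long comment that inflates the raw length.
    val statement = "// " + ("x" * 60) + "\n" + "value = value + 1;\n"
    val statements = Seq.fill(40)(statement)
    println(countMethods(statements, _.length))                // raw character count
    println(countMethods(statements, s => stripped(s).length)) // comments excluded
  }
}
```

With comment-heavy statements like these, the raw count crosses the threshold several times, while the stripped count stays below it and everything fits in a single method.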

How was this patch tested?

Existing tests

gatorsmile · 2017-08-16T21:10:31Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

Could you add an internal SQLConf for it and make it adjustable?

SparkQA · 2017-08-16T22:23:44Z

Test build #80749 has finished for PR 18966 at commit 5ef1f53.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-17T10:12:57Z

Test build #80776 has finished for PR 18966 at commit 51b6253.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-08-20T08:23:17Z

ping @gatorsmile

gatorsmile · 2017-08-20T22:27:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Based on my understanding, this is not the number of characters? This is the length of the source code, right?

If "the length of source code" does not mean the number of lines of source code, you are right: we check the sum of String.length.
If that is correct, I will update the wording to refer to the length of the source code.

Here are two examples.

abc -> 3

ab c -> 4
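The difference between the two candidate metrics can be made concrete with a short sketch. It reads the second example as the two-line string "ab\nc", which is an assumption because the thread shows it flattened; `LengthSketch`, `charCount`, and `lineCount` are hypothetical names used only for illustration.

```scala
object LengthSketch {
  // Metric currently used: String.length, which also counts the line break.
  def charCount(code: String): Int = code.length

  // Alternative metric raised in the discussion: number of source lines.
  def lineCount(code: String): Int = code.split("\n").length

  def main(args: Array[String]): Unit = {
    for (code <- Seq("abc", "ab\nc")) {
      println(s"chars=${charCount(code)} lines=${lineCount(code)}")
    }
  }
}
```

Under that reading, the line break is what takes the second example from 3 to 4 characters, while the line count goes from 1 to 2.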

To be honest, using the number of characters looks pretty fragile. It is even worse than using the number of lines.

Should this PR also change this parameter to use the number of lines instead of the number of characters? It is technically possible.

cc @gatorsmile

SparkQA · 2017-08-24T21:31:11Z

Test build #81100 has finished for PR 18966 at commit e59168c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-24T23:52:25Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BenchmarkWideTable.scala

Does this threshold impact this test case?

@gatorsmile Sorry, I updated the result. This threshold impacts this test case.

SparkQA · 2017-08-25T03:42:10Z

Test build #81109 has finished for PR 18966 at commit ea9fea4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-25T18:29:08Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

This flag does not take effect at runtime.

What do you mean? Could you elaborate on this review comment?

If you set the conf to a new value at runtime, can you get the value from SQLConf.get.maxCodegenLinesPerFunction?

I see. I had interpreted it as "this value may not change performance".
Let me check this; I implemented it the same way as the other flags.

Got it. Depending on the calling context, it may or may not take effect. ~~Should we pass SQLConf to this method?~~ I am considering whether we can pass SQLConf to a constructor.
Are there any other thoughts?

I made this option effective by calling SparkEnv.get.conf.

@gatorsmile Since making it configurable is taking a long time, can we use a hard-coded parameter instead?
Even then, this PR is an improvement, since the estimate no longer includes the characters in comments.

@kiszk You know, I am just afraid that a new regression could be introduced by this change. Sorry for the delay. I really do not have a better solution. I kind of agree with your original solution: just exclude the characters in comments. At least it is an improvement and carries less risk of a regression.

@gatorsmile I understand your concern about the possibility of a new performance regression. I will use the original threshold (max characters) as a hard-coded value.

gatorsmile · 2017-08-25T18:36:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

splited -> split

gatorsmile · 2017-08-25T18:41:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

We might need to emphasize this is not for whole-stage codegen.

My understanding is not correct. Explain that it is for expressions only? Rename it to spark.sql.codegen.expressions.maxCodegenLinesPerFunction

SparkQA · 2017-08-26T19:28:52Z

Test build #81154 has finished for PR 18966 at commit 77f01e4.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-08-26T23:32:34Z

retest this please

SparkQA · 2017-08-27T02:14:37Z

Test build #81160 has finished for PR 18966 at commit 77f01e4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-08-27T08:20:40Z

ping @gatorsmile

gatorsmile · 2017-08-27T16:34:31Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

This does not follow what we do for the other SQLConf entries. I am also wondering whether we should just put it into StaticSQLConf. Let me check it with others.

@gatorsmile Is there any progress?

There are multiple related PRs. Maybe we can wait until the reviews are finished there?

ping @gatorsmile

Can we use bytecode size?

Here, we do not know the precise bytecode size. Would we use an estimated bytecode size based on characters per line? Or are there other ideas for getting the precise bytecode size before compiling a method?

gentle ping @gatorsmile

SparkQA · 2017-10-04T21:18:20Z

Test build #82455 has finished for PR 18966 at commit a489938.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-06T19:08:46Z

Test build #82518 has finished for PR 18966 at commit b04c09c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-10-10T18:07:12Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala

+       """([ |\t]*?\/\/[\s\S]*?\n)""").r               // strip //comment
+    val codeWithoutComment = commentReg.replaceAllIn(input, "")
+    codeWithoutComment.replaceAll("""\n\s*\n""", "\n") // strip ExtraNewLines
+  }


Could you also add back the test case of this function?
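A check of the kind requested here might look roughly like the sketch below. It uses only the two replacements visible in the quoted diff; the earlier part of the comment pattern is elided there and is deliberately not reconstructed, so this handles `//` comments and extra newlines only. `StripSketch` and `stripLineCommentsAndExtraNewLines` are hypothetical names, not the names used in Spark.

```scala
object StripSketch {
  // Only the `//` alternative of the pattern appears in the quoted diff.
  private val lineCommentReg = """([ |\t]*?\/\/[\s\S]*?\n)""".r

  def stripLineCommentsAndExtraNewLines(input: String): String = {
    // Remove each `//` comment up to and including its line break.
    val codeWithoutComment = lineCommentReg.replaceAllIn(input, "")
    // Collapse runs of blank lines into a single line break.
    codeWithoutComment.replaceAll("""\n\s*\n""", "\n")
  }

  def main(args: Array[String]): Unit = {
    val input =
      "int a = 1;\n" +
      "// explain b\n" +
      "\n" +
      "\n" +
      "int b = 2;\n"
    val output = stripLineCommentsAndExtraNewLines(input)
    assert(output == "int a = 1;\nint b = 2;\n")
    println(s"${input.length} characters before, ${output.length} after")
  }
}
```

The assertion in `main` doubles as a minimal test: the comment line and the blank lines disappear, and only the two statements contribute to the counted length.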

gatorsmile · 2017-10-10T18:39:40Z

LGTM pending Jenkins

SparkQA · 2017-10-10T20:49:01Z

Test build #82596 has finished for PR 18966 at commit 4c47802.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-10T21:22:53Z

Test build #82598 has finished for PR 18966 at commit 516a72a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-10-11T03:28:53Z

Thanks! Merged to master.

kiszk added 6 commits October 5, 2017 02:38

4ac090a initial commit
d96f8e5 make threshold configurable
87578db use lines per method as split threshold instead of chars per method
073e9e5 update benchmark results
63377a6 make a new option effective at runtime
a489938 rebase with master

kiszk force-pushed the SPARK-21751 branch from 77f01e4 to a489938 October 4, 2017 18:34

b04c09c avoid to use SparkEnv.get
4c47802 use the original threshold against Java code excluding comments

516a72a revert test case

asfgit closed this in 76fb173 Oct 11, 2017

kiszk mentioned this pull request Dec 13, 2017

[SPARK-18016][SQL] Code Generation: Constant Pool Limit - reduce entries for mutable state #19811

Closed
