[SPARK-24206][SQL] Improve DataSource read benchmark code #21266
|
I'll add benchmark results right after #21070 is merged. |
|
Also, I'll make a follow-up PR for pushdown benchmarks. |
|
Test build #90359 has finished for PR 21266 at commit
|
@maropu, since you are merging the ParquetReadBenchmark and OrcReadBenchmark benchmarks, let's remove OrcReadBenchmark.
I feel we still need OrcReadBenchmark to compare native ORC with Hive built-in ORC?
I see. Never mind. I deleted the previous comment. The scope is only testing native orc here.
ok, thanks!
|
Test build #90392 has finished for PR 21266 at commit
|
8aedbf0 to
171e89a
Compare
|
Test build #90437 has finished for PR 21266 at commit
|
171e89a to
1d93d99
Compare
|
Test build #90438 has finished for PR 21266 at commit
|
1d93d99 to
fc96adb
Compare
|
Test build #90572 has finished for PR 21266 at commit
|
|
retest this please |
|
Test build #90579 has finished for PR 21266 at commit
|
I think this type of comment should not be included in a Scala file in this directory, because the doc comments of files here are translated to Javadoc and the angle-bracket placeholders are then parsed as unknown tags.
If you want to keep this comment, the Scala file should be moved to sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/, where files are not translated to docs.
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/target/java/org/apache/spark/sql/DataSourceReadBenchmark.java:5: error: unknown tag: this
[error] * spark-submit --class <this class> <spark sql test jar>
[error] ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/target/java/org/apache/spark/sql/DataSourceReadBenchmark.java:5: error: unknown tag: spark
[error] * spark-submit --class <this class> <spark sql test jar>
[error]
ok
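For readers following the Javadoc errors above, here is a minimal sketch of one way to keep the usage text without tripping the doc build. It is not necessarily what this PR ended up doing (the reviewer's suggestion was to move the file under the test directory). The object name `DataSourceReadBenchmarkSketch` and the `usage` value are hypothetical, introduced only for illustration. The sketch relies on the fact that documentation tools only pick up `/** ... */` comments, so a plain `/* ... */` block comment should not be translated.

```scala
/*
 * Plain block comment (not a doc comment): doc tools only process
 * comments that start with "/**", so the angle-bracket placeholders
 * below should not be parsed as tags.
 *
 * To run this:
 *   spark-submit --class <this class> <spark sql test jar>
 */
object DataSourceReadBenchmarkSketch {
  // Hypothetical helper: keeps the same usage text available at runtime.
  val usage: String = "spark-submit --class <this class> <spark sql test jar>"

  def main(args: Array[String]): Unit = {
    println(s"Usage: $usage")
  }
}
```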
3b6f541 to
fad31b7
Compare
|
Test build #90875 has finished for PR 21266 at commit
|
fad31b7 to
d8c308f
Compare
|
Test build #90877 has finished for PR 21266 at commit
|
|
Test build #90879 has finished for PR 21266 at commit
|
|
retest this please |
|
Test build #90884 has finished for PR 21266 at commit
|
Let's explicitly set these confs. Here, we expect the perf numbers measured with ORC_COPY_BATCH_TO_SPARK set to false. Please also double-check the other benchmarks and add the related confs there too.
ok
I checked that ORC_COPY_BATCH_TO_SPARK=false in the other tests (I didn't find any performance differences after explicitly setting it to false in line 50).
https://github.com/apache/spark/pull/21266/files#diff-ae11b49db05c9e6829cad071b112a742R50
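To illustrate the request to set the conf explicitly rather than rely on its default, here is a minimal sketch. It assumes the key behind `SQLConf.ORC_COPY_BATCH_TO_SPARK` is `spark.sql.orc.copyBatchToSpark` (as in Spark 2.3/2.4) and sets it through the public `spark.conf` API; the object name and the local session setup are hypothetical, and the benchmark code in this PR may set the conf differently.

```scala
import org.apache.spark.sql.SparkSession

object ExplicitOrcConfSketch {
  // Assumption: this is the key behind SQLConf.ORC_COPY_BATCH_TO_SPARK.
  val OrcCopyBatchToSparkKey: String = "spark.sql.orc.copyBatchToSpark"

  def main(args: Array[String]): Unit = {
    // Hypothetical local session, only for demonstration.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-orc-conf-sketch")
      .getOrCreate()
    try {
      // Pin the value so the measured numbers do not depend on the default.
      spark.conf.set(OrcCopyBatchToSparkKey, "false")
      println(s"$OrcCopyBatchToSparkKey = ${spark.conf.get(OrcCopyBatchToSparkKey)}")
    } finally {
      spark.stop()
    }
  }
}
```

Pinning the value in the benchmark itself keeps the reported numbers reproducible even if the default changes in a later release.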
|
@maropu Great work! Thanks for helping with this! |
274aa8a to
5eab1a5
Compare
|
I'll update the benchmark results soon. |
|
Test build #91012 has finished for PR 21266 at commit
|
|
retest this please |
|
Test build #91017 has finished for PR 21266 at commit
|
|
retest this please |
|
Test build #91036 has finished for PR 21266 at commit
|
|
retest this please |
|
Test build #91049 has finished for PR 21266 at commit
|
|
LGTM Thanks! Merged to master. |
What changes were proposed in this pull request?
This PR adds the benchmark code DataSourceReadBenchmark for orc, parquet, csv, and json, based on the existing ParquetReadBenchmark and OrcReadBenchmark.
How was this patch tested?
N/A