[SPARK-32564][SQL][TEST] Inject data statistics to simulate plan generation on actual TPCDS data #29384
Conversation
```scala
// These data statistics are extracted from generated TPCDS data with SF=1
// scalastyle:off line.size.limit
val sf1TableStats = Map(
```
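For context, each entry of such a map pairs a TPCDS table name with its pre-computed statistics. Below is only a hypothetical sketch of what one entry could look like, using Spark's `CatalogStatistics`/`CatalogColumnStat` types; the shape of the actual constant in this PR may differ, and the size value is illustrative (the row count and `d_date_sk` range are the standard TPCDS values for `date_dim`).

```scala
import org.apache.spark.sql.catalyst.catalog.{CatalogColumnStat, CatalogStatistics}

// Hypothetical entry shape; the real TPCDSTableStats constant may differ.
// date_dim always has 73049 rows in TPCDS; sizeInBytes here is illustrative.
val dateDimStats: (String, CatalogStatistics) = "date_dim" -> CatalogStatistics(
  sizeInBytes = BigInt(2000000L),
  rowCount = Some(BigInt(73049L)),
  colStats = Map(
    "d_date_sk" -> CatalogColumnStat(
      distinctCount = Some(BigInt(73049L)),
      min = Some("2415022"),
      max = Some("2488070"),
      nullCount = Some(BigInt(0L)),
      avgLen = Some(4L),
      maxLen = Some(4L))))
```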
I will replace this with a larger data case, e.g., SF=100.
Test build #127197 has finished for PR 29384 at commit
dongjoon-hyun left a comment
It's great, @maropu!
+1, LGTM. Merged to master!
@maropu. Can we have this nice patch in `branch-3.0`?
viirya left a comment
lgtm too
late LGTM. Shall we run the test twice with 100T stats and empty tables?
Test build #127205 has finished for PR 29384 at commit
Thanks, all! @dongjoon-hyun Sure!
… generation on actual TPCDS data

### What changes were proposed in this pull request?

`TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then checks if plans can be generated correctly. But the generated plans can be different from actual ones because the input tables are empty (e.g., the plans always use broadcast-hash joins, but actual ones use sort-merge joins for larger tables). To mitigate the issue, this PR defines data statistics constants extracted from generated TPCDS data in `TPCDSTableStats`, then injects the statistics via `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in `TPCDSQuerySuite`. Please see the link below for how to extract the table statistics:
- https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3

For example, the generated plans of TPCDS `q2` are different with/without this fix:
```
==== w/ this fix: q2 ====
== Physical Plan ==
* Sort (43)
+- Exchange (42)
   +- * Project (41)
      +- * SortMergeJoin Inner (40)
         :- * Sort (28)
         :  +- Exchange (27)
         :     +- * Project (26)
         :        +- * BroadcastHashJoin Inner BuildRight (25)
         :           :- * HashAggregate (19)
         :           :  +- Exchange (18)
         :           :     +- * HashAggregate (17)
         :           :        +- * Project (16)
         :           :           +- * BroadcastHashJoin Inner BuildRight (15)
         :           :              :- Union (9)
         :           :              :  :- * Project (4)
         :           :              :  :  +- * Filter (3)
         :           :              :  :     +- * ColumnarToRow (2)
         :           :              :  :        +- Scan parquet default.web_sales (1)
         :           :              :  +- * Project (8)
         :           :              :     +- * Filter (7)
         :           :              :        +- * ColumnarToRow (6)
         :           :              :           +- Scan parquet default.catalog_sales (5)
         :           :              +- BroadcastExchange (14)
         :           :                 +- * Project (13)
         :           :                    +- * Filter (12)
         :           :                       +- * ColumnarToRow (11)
         :           :                          +- Scan parquet default.date_dim (10)
         :           +- BroadcastExchange (24)
         :              +- * Project (23)
         :                 +- * Filter (22)
         :                    +- * ColumnarToRow (21)
         :                       +- Scan parquet default.date_dim (20)
         +- * Sort (39)
            +- Exchange (38)
               +- * Project (37)
                  +- * BroadcastHashJoin Inner BuildRight (36)
                     :- * HashAggregate (30)
                     :  +- ReusedExchange (29)
                     +- BroadcastExchange (35)
                        +- * Project (34)
                           +- * Filter (33)
                              +- * ColumnarToRow (32)
                                 +- Scan parquet default.date_dim (31)

==== w/o this fix: q2 ====
== Physical Plan ==
* Sort (40)
+- Exchange (39)
   +- * Project (38)
      +- * BroadcastHashJoin Inner BuildRight (37)
         :- * Project (26)
         :  +- * BroadcastHashJoin Inner BuildRight (25)
         :     :- * HashAggregate (19)
         :     :  +- Exchange (18)
         :     :     +- * HashAggregate (17)
         :     :        +- * Project (16)
         :     :           +- * BroadcastHashJoin Inner BuildRight (15)
         :     :              :- Union (9)
         :     :              :  :- * Project (4)
         :     :              :  :  +- * Filter (3)
         :     :              :  :     +- * ColumnarToRow (2)
         :     :              :  :        +- Scan parquet default.web_sales (1)
         :     :              :  +- * Project (8)
         :     :              :     +- * Filter (7)
         :     :              :        +- * ColumnarToRow (6)
         :     :              :           +- Scan parquet default.catalog_sales (5)
         :     :              +- BroadcastExchange (14)
         :     :                 +- * Project (13)
         :     :                    +- * Filter (12)
         :     :                       +- * ColumnarToRow (11)
         :     :                          +- Scan parquet default.date_dim (10)
         :     +- BroadcastExchange (24)
         :        +- * Project (23)
         :           +- * Filter (22)
         :              +- * ColumnarToRow (21)
         :                 +- Scan parquet default.date_dim (20)
         +- BroadcastExchange (36)
            +- * Project (35)
               +- * BroadcastHashJoin Inner BuildRight (34)
                  :- * HashAggregate (28)
                  :  +- ReusedExchange (27)
                  +- BroadcastExchange (33)
                     +- * Project (32)
                        +- * Filter (31)
                           +- * ColumnarToRow (30)
                              +- Scan parquet default.date_dim (29)
```

This comes from the cloud-fan comment: #29270 (comment)

This is the backport of #29384.

### Why are the changes needed?

For better test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29390 from maropu/SPARK-32564-BRANCH3.0.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ty tables

### What changes were proposed in this pull request?

This is the follow-up PR of #29384 to address the cloud-fan comment: #29384 (comment)

This PR re-enables `TPCDSQuerySuite` with empty tables for better test coverage.

### Why are the changes needed?

For better test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29391 from maropu/SPARK-32564-FOLLOWUP.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ty tables

### What changes were proposed in this pull request?

This is the follow-up PR of #29384 to address the cloud-fan comment: #29384 (comment)

This PR re-enables `TPCDSQuerySuite` with empty tables for better test coverage.

### Why are the changes needed?

For better test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29391 from maropu/SPARK-32564-FOLLOWUP.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 1df855b)
Signed-off-by: Dongjoon Hyun <[email protected]>
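The follow-up keeps both configurations: plans generated against empty tables and plans generated with the injected statistics. One way this could be structured is a base suite with a statistics toggle plus a subclass that flips it. The sketch below uses assumed names (`injectStats`, the suite names) and Spark's test helpers `QueryTest`/`SharedSparkSession`; it is not necessarily the actual implementation.

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

// Sketch only: run the same TPCDS plan-generation checks twice.
abstract class TPCDSPlanCheckBase extends QueryTest with SharedSparkSession {
  // Subclasses decide whether SF=1 statistics are injected after the empty
  // TPCDS tables are created (via spark.sessionState.catalog.alterTableStats).
  protected def injectStats: Boolean
  // ... table creation, optional stats injection, and per-query plan checks ...
}

// Plans generated with empty tables (broadcast-heavy by default).
class TPCDSPlanCheckWithEmptyTables extends TPCDSPlanCheckBase {
  override def injectStats: Boolean = false
}

// Plans generated with injected statistics (e.g., sort-merge joins appear).
class TPCDSPlanCheckWithStats extends TPCDSPlanCheckBase {
  override def injectStats: Boolean = true
}
```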
What changes were proposed in this pull request?
`TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then checks if plans can be generated correctly. But the generated plans can be different from actual ones because the input tables are empty (e.g., the plans always use broadcast-hash joins, but actual ones use sort-merge joins for larger tables). To mitigate the issue, this PR defines data statistics constants extracted from generated TPCDS data in `TPCDSTableStats`, then injects the statistics via `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in `TPCDSQuerySuite`. Please see the link below for how to extract the table statistics:
- https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3
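Roughly, that extraction amounts to analyzing the generated data and reading the resulting statistics back from the catalog. The following is only a sketch of that idea, not the gist's exact code: the table list is truncated, and it assumes the generated SF=1 data has already been registered as tables and that the code runs where `sessionState` is accessible (it is package-private to `org.apache.spark.sql`).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Only two tables listed here for brevity; TPCDS defines 24 tables in total.
val tpcdsTables = Seq("store_sales", "date_dim")

tpcdsTables.foreach { table =>
  // Compute table-level and per-column statistics over the generated data.
  spark.sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR ALL COLUMNS")
  // Read the statistics back from the catalog; these values can then be dumped
  // as Scala constants (e.g., into TPCDSTableStats).
  val stats = spark.sessionState.catalog.getTableMetadata(TableIdentifier(table)).stats
  println(s"$table -> $stats")
}
```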
For example, the generated plans of TPCDS `q2` are different with/without this fix (see the plans shown above). This comes from the @cloud-fan comment: #29270 (comment)
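The injection itself is a single catalog call per table. Below is a minimal sketch of the idea; the table definition and statistic values are illustrative, and like the extraction sketch it assumes code running in the `org.apache.spark.sql` package (such as this test suite), where `sessionState` is accessible.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

val spark = SparkSession.builder().master("local[1]").getOrCreate()

// Create an empty table, much as TPCDSQuerySuite does for each TPCDS table
// (the real schema has many more columns).
spark.sql("CREATE TABLE store_sales (ss_item_sk INT, ss_ticket_number INT) USING parquet")

// Pre-computed statistics for the table (values here are illustrative).
val stats = CatalogStatistics(
  sizeInBytes = BigInt(400000000L),
  rowCount = Some(BigInt(2880404L)))

// Attach the statistics to the empty table so the optimizer plans as if the
// table held real SF=1 data (e.g., preferring sort-merge joins over broadcasts).
spark.sessionState.catalog.alterTableStats(TableIdentifier("store_sales"), Some(stats))
```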
Why are the changes needed?
For better test coverage.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.