You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[SPARK-37392][SQL] Fix the performance bug when inferring constraints for Generate
### What changes were proposed in this pull request?
This is a performance regression since Spark 3.1, caused by https://issues.apache.org/jira/browse/SPARK-32295
If you run the query in the JIRA ticket
```
Seq(
(1, "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x", "x")
).toDF()
.checkpoint() // or save and reload to truncate lineage
.createOrReplaceTempView("sub")
session.sql("""
SELECT
*
FROM
(
SELECT
EXPLODE( ARRAY( * ) ) result
FROM
(
SELECT
_1 a, _2 b, _3 c, _4 d, _5 e, _6 f, _7 g, _8 h, _9 i, _10 j, _11 k, _12 l, _13 m, _14 n, _15 o, _16 p, _17 q, _18 r, _19 s, _20 t, _21 u
FROM
sub
)
)
WHERE
result != ''
""").show()
```
You will hit OOM. The reason is that:
1. We infer additional predicates with `Generate`. In this case, it's `size(array(cast(_1#21 as string), _2#22, _3#23, ...) > 0`
2. Because of the cast, the `ConstantFolding` rule can't optimize this `size(array(...))`.
3. We end up with a plan containing this part
```
+- Project [_1#21 AS a#106, _2#22 AS b#107, _3#23 AS c#108, _4#24 AS d#109, _5#25 AS e#110, _6#26 AS f#111, _7#27 AS g#112, _8#28 AS h#113, _9#29 AS i#114, _10#30 AS j#115, _11#31 AS k#116, _12#32 AS l#117, _13#33 AS m#118, _14#34 AS n#119, _15#35 AS o#120, _16#36 AS p#121, _17#37 AS q#122, _18#38 AS r#123, _19#39 AS s#124, _20#40 AS t#125, _21#41 AS u#126]
+- Filter (size(array(cast(_1#21 as string), _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41), true) > 0)
+- LogicalRDD [_1#21, _2#22, _3#23, _4#24, _5#25, _6#26, _7#27, _8#28, _9#29, _10#30, _11#31, _12#32, _13#33, _14#34, _15#35, _16#36, _17#37, _18#38, _19#39, _20#40, _21#41]
```
When calculating the constraints of the `Project`, we generate around 2^20 expressions, due to this code
```
var allConstraints = child.constraints
projectList.foreach {
case a Alias(l: Literal, _) =>
allConstraints += EqualNullSafe(a.toAttribute, l)
case a Alias(e, _) =>
// For every alias in `projectList`, replace the reference in constraints by its attribute.
allConstraints ++= allConstraints.map(_ transform {
case expr: Expression if expr.semanticEquals(e) =>
a.toAttribute
})
allConstraints += EqualNullSafe(e, a.toAttribute)
case _ => // Don't change.
}
```
There are 3 issues here:
1. We may infer complicated predicates from `Generate`
2. `ConstanFolding` rule is too conservative. At least `Cast` has no side effect with ANSI-off.
3. When calculating constraints, we should have a upper bound to avoid generating too many expressions.
This fixes the first 2 issues, and leaves the third one for the future.
### Why are the changes needed?
fix a performance issue
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new tests, and run the query in JIRA ticket locally.
Closesapache#34823 from cloud-fan/perf.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 1fac7a9)
Signed-off-by: Wenchen Fan <[email protected]>
0 commit comments