[SPARK-28169][SQL] Fix Partition table partition PushDown failed by "OR" expression #24973
Conversation
Thank you @AngersZhuuuu. Has this issue been fixed by SPARK-27699?
val extractedPruningPredicates = extractPushDownPredicate(predicates, partitionKeyIds)
  .filter(_ != null)
@AngersZhuuuu, just for clarification, this code path does support OR expressions, but you want to do a partial pushdown, right? Considering it needs a lot of code, as @wangyum pointed out, I think we should rather promote using (or converting to) Spark's Parquet or ORC. It looks like overkill to me.
@HyukjinKwon What I do is extract the conditions about partition keys. The old code:
val (pruningPredicates, otherPredicates) = predicates.partition { predicate => !predicate.references.isEmpty && predicate.references.subsetOf(partitionKeyIds) }
If an expression also references any non-partition key, it won't be pushed to HiveTableScanExec. So what I do to fix this situation is extract all conditions about partition keys and push them to HiveTableScanExec, which can handle complex combined expressions.
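The old check quoted above can be sketched with a toy model (Pred and splitPruning here are illustrative stand-ins, not Spark's actual classes): a predicate survives only if every attribute it references is a partition column, so an OR branch mixing dt with id fails the subset test.

```scala
// Toy model of the old pruning rule (illustrative, not Spark's real API):
// a predicate is pushed down only when ALL of its referenced attributes
// are partition columns.
case class Pred(references: Set[String])

def splitPruning(preds: Seq[Pred], partitionKeys: Set[String]): (Seq[Pred], Seq[Pred]) =
  preds.partition(p => p.references.nonEmpty && p.references.subsetOf(partitionKeys))
```

With partition key dt, a predicate referencing only dt is pushed down, while dt=... OR id IN (...) references {dt, id} and falls through to the generic filter.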
@HyukjinKwon
Spark's Parquet and ORC readers are great and can push down filter conditions, but they can't solve the problem that when we read a Hive table, the first operation is a scan. What this PR wants to do is reduce the time spent resolving file info and partition metadata, and the number of files we scan. When the file or partition count is large, that takes too long.
I think we should convert Hive table reading operations into Spark's ones via, for instance, the spark.sql.hive.convertMetastoreParquet conf. If the diff were small, I might be fine with it, but this does look like overkill to me. I haven't taken a close look, but it virtually looks like we need a fix like #24598.
I won't object if some other committers are fine with that.
@HyukjinKwon I know that it's better to convert Hive table reading operations into Spark's, but that can't fix every situation. In our production env, we just changed Hive data's default storage type to ORC. For a partitioned table, if different partitions' serdes are not the same, the conversion will fail, since during conversion it checks every partition's files against the table-level serde.
Shall we just add an optimizer rule to do CNF conversion?
@cloud-fan I think it's not about the optimizer, since this is specific to partition push down, and the original rule in HiveStrategies is too simple.
CNF conversion can solve this problem, can't it?
I understand you. You mean converting the condition to CNF at the Optimizer level; but that alone can't solve this PR's problem, and it may seriously affect other rules. Converting it to CNF in HiveStrategies can work well.
PS: At first I wanted to convert the predicate to CNF, but since I found that HiveTableScanExec can resolve complex And & Or combinations of partition-key expressions, I just need to keep the original And/Or relationship of the partition-key expressions.
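For context on the CNF alternative discussed here, distributing OR over AND turns the running example into a conjunction whose partition-only conjuncts could be pushed down. A minimal sketch over a toy condition type (Cond, Leaf, toCnf are hypothetical names, not Spark code):

```scala
// Illustrative CNF conversion: repeatedly distribute OR over AND.
// Not the PR's approach — the PR instead keeps the original And/Or shape.
sealed trait Cond
case class Leaf(s: String) extends Cond
case class CAnd(l: Cond, r: Cond) extends Cond
case class COr(l: Cond, r: Cond) extends Cond

def toCnf(c: Cond): Cond = c match {
  case CAnd(l, r)         => CAnd(toCnf(l), toCnf(r))
  case COr(CAnd(a, b), r) => CAnd(toCnf(COr(a, r)), toCnf(COr(b, r)))
  case COr(l, CAnd(a, b)) => CAnd(toCnf(COr(l, a)), toCnf(COr(l, b)))
  case COr(l, r) =>
    val (cl, cr) = (toCnf(l), toCnf(r))
    if (cl != l || cr != r) toCnf(COr(cl, cr)) else c
  case leaf => leaf
}
```

On DT = 20190601 OR (DT = 20190602 AND C = "TEST") this yields (DT = 20190601 OR DT = 20190602) AND (DT = 20190601 OR C = "TEST"), where the first conjunct references only the partition column.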
@srowen Could you review this PR and give some advice?
This part really isn't my area |
This place is definitely not the only place that extracts partition predicates (e.g. |
Fix the situation where A is a partition key: SELECT * FROM A WHERE A = 1 OR B = 2. In this case, we should ignore this condition.
@cloud-fan
 * FROM DEFAULT.PARTITION_TABLE
 * WHERE DT = 20190601 OR (DT = 20190602 AND C = "TEST")
 *
 * Where condition "DT = 20190601 OR (DT = 20190602 AND C = "TEST")"
Let's be clearer about the approach here. We try to weaken a predicate so that it only refers to partition columns, by removing some branches of AND. Let's also mention the corner case where there is no partition column attribute in the predicate (we should return Nil in this case?).
@cloud-fan
In this place it will return (DT = 20190601 OR DT = 20190602), but the whole condition is still returned to the filter.
What I want to do is purely to avoid reading unnecessary partitions. In this case we only read the partitions dt=20190601 and dt=20190602; if we don't push this down, we read all data.
In the condition "(DT = 20190602 AND C = "TEST")", DT = 20190602 is a precondition of C = "TEST".
If the whole condition is DT = 20190601 OR (DT = 20190602 OR C = "TEST"), we should return null, since DT = 20190602 is not a constraint on C = "TEST".
@cloud-fan
In my code, the incoming predicate Set[Expression] carries an implicit AND: each Expression is constrained by the other Expressions at the same level. Then:
- If it is a combination by AND, each side can act as a constraint on the other, so if one side is tenable, it can return a tenable condition.
- If it is a combination by OR and one side is out of control (e.g. has no condition on partition columns), the whole OR Expression should return None. Only when both children of the OR are reasonable can it return a tenable OR combination.
- If it is a multilayer nested Expression combined by BinaryOperator, it visits the lowest level; if it finds that one level's OR Expression is untenable, it discards the whole Expression and returns null.
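The three rules above can be sketched as a recursive function over a toy expression AST (Expr, Attr, Eq and extractPartitionPredicate are illustrative names I've chosen, not the PR's actual code): AND may keep either branch as a weaker constraint, while OR demands a partition constraint on both sides or returns nothing.

```scala
// Illustrative sketch of the extraction rules described above,
// over a tiny hand-rolled AST (not Spark's Expression hierarchy).
sealed trait Expr
case class Attr(name: String) extends Expr
case class Eq(attr: Attr, value: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

def extractPartitionPredicate(e: Expr, partCols: Set[String]): Option[Expr] = e match {
  case Eq(Attr(n), _) =>
    if (partCols.contains(n)) Some(e) else None
  case And(l, r) =>
    // AND: either side alone is still a valid (weaker) partition constraint.
    (extractPartitionPredicate(l, partCols), extractPartitionPredicate(r, partCols)) match {
      case (Some(a), Some(b)) => Some(And(a, b))
      case (Some(a), None)    => Some(a)
      case (None, Some(b))    => Some(b)
      case (None, None)       => None
    }
  case Or(l, r) =>
    // OR: both sides must constrain partition columns, otherwise give up.
    for {
      a <- extractPartitionPredicate(l, partCols)
      b <- extractPartitionPredicate(r, partCols)
    } yield Or(a, b)
  case _ => None
}
```

On DT = 20190601 OR (DT = 20190602 AND C = "TEST") with partition column dt, this yields DT = 20190601 OR DT = 20190602; on DT = 1 OR B = 2 it yields nothing, matching the "ignore this condition" case above.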
 */
object ExtractPartitionPredicates extends Logging {
  private type ReturnType = Seq[Expression]
We don't need this type alias as Seq[Expression] is short.
Done, Thanks.
Checked previous activity for the CNF normalization in the issue below: it seems we can't make it an Optimizer rule; it should only apply at some special points such as partition predicates and filter push down.
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
Spark can't push down an OR filter condition.
For example, I have a table default.test whose partition column is "dt". If we use the query:
select * from default.test where dt = 20190625 or (dt = 20190626 and id in (1, 2, 3))
Spark resolves the OR condition as one expression, and since this expression references "id", it can't be pushed down.
In my PR, for SQL like:
select * from default.test where dt = 20190626 or (dt = 20190626 and xxx = "")
all sub-expressions about partition keys in the OR condition are extracted into an expression that contains only partition expressions, retaining the original logical relationship of And & Or, like below:
dt = 20190626 or dt = 20190626
This condition is then passed to HiveTableScanExec, and such predicate expressions can be pushed down as expected.
In short, this PR extracts the deep structure of OR expressions and pushes the resulting condition down.
How was this patch tested?
Existing unit tests.