-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-26366][SQL] ReplaceExceptWithFilter should consider NULL as False #23315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #100106 has finished for PR 23315 at commit
|
|
Is this a correctness bug? |
|
Thanks @mgaido91 ! |
|
@rxin yes, it is. We are returning wrong results in this case, despite I'd not consider it as a regression from earlier releases, as the regression mentioned in the JIRA I think has been caused by other fixes which are not wrong by themselves but made this problem visible in that case. |
|
Test build #100109 has finished for PR 23315 at commit
|
| sparkContext.parallelize(Seq(Row("0", "a"), Row("1", null))), | ||
| StructType(Seq( | ||
| StructField("a", StringType, nullable = true), | ||
| StructField("b", StringType, nullable = true)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seq("0" -> "a", "1" -> null).toDF("a", "b")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with your suggestion, the test passes always, even without the patch, because it adds extra projects for renaming the fields, so I cannot do this...
| StructField("b", StringType, nullable = true)))) | ||
|
|
||
| val exceptDF = inputDF.filter( | ||
| col("a").isin(Seq("0"): _*) or col("b").isin()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isin(Seq("0"): _*) =>isin("0")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this bug only reproducible with In?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, also with other comparisons (> for instance...). I am using > now, is that ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a major correctness bug in this rule. We are unable to rely on the InferFiltersFromConstraints to infer the IsNotNull constrains. We should fix that, since InferFiltersFromConstraints is unable to infer many cases (including OR-connected ones)
|
@gatorsmile yes, thanks for your comment, you're right. I updated the PR accordingly. May I kindly ask you to have another pass at it? Thanks. |
|
Test build #100143 has finished for PR 23315 at commit
|
|
There is another bug in this rule. The condition of the right side must be deterministic; otherwise, the value appearance possibility will be changed in the final result of EXCEPT. |
|
thanks @gatorsmile , fixed |
|
LGTM. Can you also mention the other 2 fixes in PR description? |
|
Sure, done, thanks for the note @cloud-fan , I forgot that. |
|
Test build #100239 has finished for PR 23315 at commit
|
|
retest this please |
|
@mgaido91 Could you add an end-to-end test case with all the enumerated cases? Below is just an example. test("SPARK-26366: verify ReplaceExceptWithFilter") {
Seq(true, false).foreach { enabled =>
withSQLConf(SQLConf.REPLACE_EXCEPT_WITH_FILTER.key -> enabled.toString) {
withTable("tab1") {
// TODO: use DF APIs to make it shorter
spark.sql(
"""
|CREATE TABLE tab1 (col1 INT, col2 INT, col3 INT) using PARQUET
""".stripMargin)
spark.sql("INSERT INTO tab1 VALUES (0, 3, 5)")
spark.sql("INSERT INTO tab1 VALUES (0, 3, NULL)")
spark.sql("INSERT INTO tab1 VALUES (NULL, 3, 5)")
spark.sql("INSERT INTO tab1 VALUES (0, NULL, 5)")
spark.sql("INSERT INTO tab1 VALUES (0, NULL, NULL)")
spark.sql("INSERT INTO tab1 VALUES (NULL, NULL, 5)")
spark.sql("INSERT INTO tab1 VALUES (NULL, 3, NULL)")
spark.sql("INSERT INTO tab1 VALUES (NULL, NULL, NULL)")
val df = spark.read.table("tab1")
// TODO: add more conditions
val where =
"""
|(col1 IS NULL AND col2 >= 3)
|OR (col1 IS NOT NULL AND col2 >= 0)
""".stripMargin
val df1 = df.filter(where)
val df2 = df.except(df_a)
val df3 = df.except(df_b)
// TODO: compare df1 and df3 and check if they are the same
}
}
}
} |
|
Test build #100244 has finished for PR 23315 at commit
|
|
@gatorsmile I am adding it, but I am not sure how to do that for the non-deterministic case: I'll skip that since any test I can think of for it would be either useless or flaky. Thanks. |
|
Test build #100277 has finished for PR 23315 at commit
|
## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs Closes #23315 from mgaido91/SPARK-26366. Authored-by: Marco Gaido <[email protected]> Signed-off-by: gatorsmile <[email protected]> (cherry picked from commit 834b860) Signed-off-by: gatorsmile <[email protected]>
|
LGTM Thanks! Merged to master/ 2.4 @mgaido91 Could you submit a PR to 2.3? I hit the code conflicts. |
…der NULL as False In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. added UTs Closes apache#23315 from mgaido91/SPARK-26366. Authored-by: Marco Gaido <[email protected]> Signed-off-by: gatorsmile <[email protected]>
|
sure, thanks @gatorsmile |
…der NULL as False In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. added UTs Closes apache#23315 from mgaido91/SPARK-26366. Authored-by: Marco Gaido <[email protected]> Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs Closes apache#23315 from mgaido91/SPARK-26366. Authored-by: Marco Gaido <[email protected]> Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs Closes apache#23315 from mgaido91/SPARK-26366. Authored-by: Marco Gaido <[email protected]> Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs Closes apache#23315 from mgaido91/SPARK-26366. Authored-by: Marco Gaido <[email protected]> Signed-off-by: gatorsmile <[email protected]> (cherry picked from commit 834b860) Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request? In `ReplaceExceptWithFilter` we do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred by `InferFiltersFromConstraints` are not enough, as it happens with `OR` conditions. The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output. The PR fixes these problem by: - returning False for the condition when it is Null (in this way we do return all the rows which didn't satisfy it); - avoiding any transformation when the condition is non-deterministic. ## How was this patch tested? added UTs Closes apache#23315 from mgaido91/SPARK-26366. Authored-by: Marco Gaido <[email protected]> Signed-off-by: gatorsmile <[email protected]> (cherry picked from commit 834b860) Signed-off-by: gatorsmile <[email protected]>
What changes were proposed in this pull request?
In
ReplaceExceptWithFilterwe do not consider properly the case in which the condition returns NULL. Indeed, in that case, since negating NULL still returns NULL, so it is not true the assumption that negating the condition returns all the rows which didn't satisfy it, rows returning NULL may not be returned. This happens when constraints inferred byInferFiltersFromConstraintsare not enough, as it happens withORconditions.The rule had also problems with non-deterministic conditions: in such a scenario, this rule would change the probability of the output.
The PR fixes these problem by:
How was this patch tested?
added UTs