[SPARK-26366][SQL] ReplaceExceptWithFilter should consider NULL as False #23315

mgaido91 · 2018-12-13T18:28:55Z

What changes were proposed in this pull request?

`ReplaceExceptWithFilter` does not properly handle the case in which the condition evaluates to NULL. Negating NULL still yields NULL, so the assumption that negating the condition returns all the rows which did not satisfy it does not hold: rows for which the condition is NULL may be missing from the result. This happens when the constraints inferred by `InferFiltersFromConstraints` are not sufficient, as is the case with `OR` conditions.

The rule also had problems with non-deterministic conditions: in that scenario, the rewrite would change the probability with which rows appear in the output.

The PR fixes these problems by:

- returning False for the condition when it is NULL (this way we do return all the rows which did not satisfy it);
- avoiding any transformation when the condition is non-deterministic.
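The NULL case can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the code of the patch: it models SQL three-valued logic with `Option[Boolean]` (`None` standing for NULL) and compares the naive rewrite `Filter(NOT cond)` with a variant that treats NULL as False. All names here (`NullAsFalseSketch`, `keepTrue`, `nullAsFalse`, the sample rows and the condition `a > 0 OR b > 0`) are hypothetical and chosen only for illustration.

```scala
object NullAsFalseSketch {
  // SQL three-valued logic: None stands for NULL.
  type Tri = Option[Boolean]
  type Row = (Option[Int], Option[Int])

  // `x > y` is NULL when x is NULL.
  def gt(x: Option[Int], y: Int): Tri = x.map(_ > y)

  // Kleene OR: TRUE wins over NULL, NULL wins over FALSE.
  def or(a: Tri, b: Tri): Tri = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // NOT NULL is still NULL.
  def not(a: Tri): Tri = a.map(!_)

  // Models COALESCE(cond, FALSE): NULL is treated as FALSE.
  def nullAsFalse(a: Tri): Tri = Some(a.getOrElse(false))

  // A filter keeps a row only when its condition is exactly TRUE.
  def keepTrue(rows: Seq[Row])(cond: Row => Tri): Seq[Row] =
    rows.filter(r => cond(r).contains(true))

  val rows: Seq[Row] = Seq(
    (Some(1), None),    // a > 0 OR b > 0 is TRUE
    (Some(0), Some(0)), // FALSE
    (Some(0), None),    // NULL
    (None, None))       // NULL

  val cond: Row => Tri = { case (a, b) => or(gt(a, 0), gt(b, 0)) }

  // Expected EXCEPT result: every row not selected by the right-hand filter.
  val selected: Seq[Row] = keepTrue(rows)(cond)
  val expected: Seq[Row] = rows.filterNot(r => selected.contains(r))

  // Naive rewrite, Filter(NOT cond): rows with a NULL condition are lost.
  val naive: Seq[Row] = keepTrue(rows)(r => not(cond(r)))

  // NULL-as-False variant, Filter(NOT COALESCE(cond, FALSE)).
  val fixed: Seq[Row] = keepTrue(rows)(r => not(nullAsFalse(cond(r))))
}
```

In this model `naive` keeps only the row whose condition is FALSE, while `fixed` matches `expected` and also keeps the two rows whose condition is NULL.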

How was this patch tested?

Added unit tests (UTs).

mgaido91 · 2018-12-13T18:29:07Z

cc @cloud-fan @gatorsmile

SparkQA · 2018-12-13T18:48:13Z

Test build #100106 has finished for PR 23315 at commit ab52007.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2018-12-13T18:56:51Z

Is this a correctness bug?

danosipov · 2018-12-13T19:42:28Z

Thanks @mgaido91 !

mgaido91 · 2018-12-13T19:59:25Z

@rxin yes, it is. We are returning wrong results in this case, although I would not consider it a regression from earlier releases: I think the regression mentioned in the JIRA was caused by other fixes which are not wrong in themselves but made this problem visible in that case.

SparkQA · 2018-12-13T23:46:09Z

Test build #100109 has finished for PR 23315 at commit dbc3ca0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-12-14T02:53:21Z

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

+      sparkContext.parallelize(Seq(Row("0", "a"), Row("1", null))),
+      StructType(Seq(
+        StructField("a", StringType, nullable = true),
+        StructField("b", StringType, nullable = true))))


Seq("0" -> "a", "1" -> null).toDF("a", "b")

mgaido91 · 2018-12-14

With your suggestion the test always passes, even without the patch, because it adds extra projections for renaming the fields, so I cannot do this.

cloud-fan · 2018-12-14T02:54:46Z

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

+        StructField("b", StringType, nullable = true))))
+
+    val exceptDF = inputDF.filter(
+      col("a").isin(Seq("0"): _*) or col("b").isin())


`isin(Seq("0"): _*)` => `isin("0")`

cloud-fan · 2018-12-14

Is this bug only reproducible with `In`?

mgaido91 · 2018-12-14

No, it also occurs with other comparisons (`>`, for instance). I am using `>` now, is that ok?

gatorsmile

There is a major correctness bug in this rule. We cannot rely on `InferFiltersFromConstraints` to infer the `IsNotNull` constraints. We should fix that, since `InferFiltersFromConstraints` is unable to infer them in many cases (including OR-connected ones).

mgaido91 · 2018-12-14T09:24:56Z

@gatorsmile yes, thanks for your comment, you're right. I updated the PR accordingly.

May I kindly ask you to have another pass at it? Thanks.

SparkQA · 2018-12-14T13:08:33Z

Test build #100143 has finished for PR 23315 at commit 7e747a3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-12-16T01:03:32Z

There is another bug in this rule. The condition of the right side must be deterministic; otherwise, the probability with which values appear in the final result of EXCEPT will change.
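One way to see why the probability can change is sketched below in plain Scala. The thread does not spell out the mechanism, so this is an assumed model rather than a description of Spark's behaviour: the left relation holds the same row twice, a non-deterministic condition is evaluated once per copy and is true with probability 0.5, EXCEPT keeps the row only if no copy reached the right side, and the rewritten plan is taken to be `Distinct(Filter(NOT cond))`. The object and method names are hypothetical.

```scala
object NonDeterministicExceptSketch {
  // The four equally likely outcomes of evaluating the condition on two copies of a row.
  val outcomes: Seq[Seq[Boolean]] =
    for (first <- Seq(true, false); second <- Seq(true, false)) yield Seq(first, second)

  // EXCEPT: the row survives only if no copy made it into the right side.
  def inExcept(condPerCopy: Seq[Boolean]): Boolean = !condPerCopy.exists(identity)

  // Distinct(Filter(NOT cond)): the row survives if at least one copy fails the condition.
  def inRewritten(condPerCopy: Seq[Boolean]): Boolean = condPerCopy.exists(c => !c)

  // Number of outcomes (out of four) in which the row appears in each plan.
  val exceptCount: Int = outcomes.count(inExcept)
  val rewrittenCount: Int = outcomes.count(inRewritten)
}
```

Under these assumptions the row appears in one of the four outcomes with EXCEPT and in three of the four with the rewritten plan, so the two plans are not equivalent for a non-deterministic condition.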

mgaido91 · 2018-12-17T11:00:09Z

thanks @gatorsmile , fixed

cloud-fan · 2018-12-17T13:19:17Z

LGTM. Can you also mention the other 2 fixes in PR description?

mgaido91 · 2018-12-17T13:20:57Z

Sure, done, thanks for the note @cloud-fan , I forgot that.

SparkQA · 2018-12-17T13:54:10Z

Test build #100239 has finished for PR 23315 at commit aedb572.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-12-17T13:55:45Z

retest this please

gatorsmile · 2018-12-17T18:09:20Z

@mgaido91 Could you add an end-to-end test case with all the enumerated cases?

Below is just an example.

  test("SPARK-26366: verify ReplaceExceptWithFilter") {
    Seq(true, false).foreach { enabled =>
      withSQLConf(SQLConf.REPLACE_EXCEPT_WITH_FILTER.key -> enabled.toString) {
        withTable("tab1") {
          // TODO: use DF APIs to make it shorter
          spark.sql(
            """
             |CREATE TABLE tab1 (col1 INT, col2 INT, col3 INT) using PARQUET
            """.stripMargin)
          spark.sql("INSERT INTO tab1 VALUES (0, 3, 5)")
          spark.sql("INSERT INTO tab1 VALUES (0, 3, NULL)")
          spark.sql("INSERT INTO tab1 VALUES (NULL, 3, 5)")
          spark.sql("INSERT INTO tab1 VALUES (0, NULL, 5)")
          spark.sql("INSERT INTO tab1 VALUES (0, NULL, NULL)")
          spark.sql("INSERT INTO tab1 VALUES (NULL, NULL, 5)")
          spark.sql("INSERT INTO tab1 VALUES (NULL, 3, NULL)")
          spark.sql("INSERT INTO tab1 VALUES (NULL, NULL, NULL)")

          val df = spark.read.table("tab1")

          // TODO: add more conditions 
          val where =
            """
              |(col1 IS NULL AND col2 >= 3)
              |OR (col1 IS NOT NULL AND col2 >= 0)
            """.stripMargin

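          val df1 = df.filter(where)
          // NOTE: df_a and df_b are placeholders that this sketch does not define;
          // they must be bound to filtered variants of df before this compiles.
          val df2 = df.except(df_a)
          val df3 = df.except(df_b)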

          // TODO: compare df1 and df3 and check if they are the same
        }
      }
    }
  }

SparkQA · 2018-12-17T18:09:44Z

Test build #100244 has finished for PR 23315 at commit aedb572.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-12-18T08:26:20Z

@gatorsmile I am adding it, but I am not sure how to do that for the non-deterministic case: I'll skip that since any test I can think of for it would be either useless or flaky. Thanks.

SparkQA · 2018-12-18T12:50:21Z

Test build #100277 has finished for PR 23315 at commit ab74b1f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Closes #23315 from mgaido91/SPARK-26366.

Authored-by: Marco Gaido
Signed-off-by: gatorsmile
(cherry picked from commit 834b860)
Signed-off-by: gatorsmile

gatorsmile · 2018-12-19T07:24:51Z

LGTM

Thanks! Merged to master/2.4.

@mgaido91 Could you submit a PR to 2.3? I hit code conflicts.

mgaido91 · 2018-12-19T10:24:17Z

sure, thanks @gatorsmile

[SPARK-26366][SQL] ReplaceExceptWithFilter should consider NULL as False

ab52007

fix imports

dbc3ca0

cloud-fan reviewed Dec 14, 2018

gatorsmile reviewed Dec 14, 2018

address comments and fix uts

7e747a3

address comments

aedb572

address comment

ab74b1f

asfgit closed this in 834b860 Dec 19, 2018
