@mgaido91
Contributor

@mgaido91 mgaido91 commented Dec 13, 2018

What changes were proposed in this pull request?

In ReplaceExceptWithFilter we do not properly handle the case in which the condition evaluates to NULL. Since negating NULL still returns NULL, the assumption that the negated condition returns all the rows which didn't satisfy the original one does not hold: rows for which the condition evaluates to NULL may be missing from the result. This happens when the constraints inferred by InferFiltersFromConstraints are not enough, as is the case with OR conditions.

The rule also had a problem with non-deterministic conditions: in that scenario, the rewrite would change the probability with which each row appears in the output.

The PR fixes these problems by:

  • treating the condition as False when it evaluates to NULL (this way we do return all the rows which didn't satisfy it);
  • avoiding any transformation when the condition is non-deterministic.
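To make the first fix concrete, here is a minimal sketch in plain Python (not Spark code; `None` models SQL NULL, and the predicate is a hypothetical `> 1` comparison) of why negating a three-valued condition loses NULL rows, and how coalescing NULL to False restores them:

```python
# Three-valued SQL logic, with None modeling SQL NULL.
def sql_not(x):
    return None if x is None else not x

def sql_gt(a, b):
    return None if a is None or b is None else a > b

rows = [1, 2, None]                    # values of a nullable column
cond = [sql_gt(v, 1) for v in rows]    # [False, True, None]

# EXCEPT semantics: keep the rows that did NOT satisfy the condition.
expected = [v for v, c in zip(rows, cond) if c is not True]    # [1, None]

# Buggy rewrite: Filter(Not(cond)) keeps a row only when NOT(cond) is True,
# so the NULL row (where NOT(NULL) = NULL) is silently dropped.
buggy = [v for v, c in zip(rows, cond) if sql_not(c) is True]  # [1]

# Fixed rewrite: coalesce the condition to False before negating.
fixed = [v for v, c in zip(rows, cond)
         if sql_not(c if c is not None else False) is True]    # [1, None]
```

The sketch only mirrors the intended semantics of treating NULL as False; it is not the optimizer code itself.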

How was this patch tested?

added UTs

@mgaido91
Contributor Author

cc @cloud-fan @gatorsmile

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100106 has finished for PR 23315 at commit ab52007.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Dec 13, 2018

Is this a correctness bug?

@danosipov
Contributor

Thanks @mgaido91 !

@mgaido91
Contributor Author

@rxin yes, it is. We are returning wrong results in this case. That said, I would not consider it a regression from earlier releases: the regression mentioned in the JIRA was, I think, caused by other fixes which are not wrong by themselves but made this problem visible in that case.

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100109 has finished for PR 23315 at commit dbc3ca0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sparkContext.parallelize(Seq(Row("0", "a"), Row("1", null))),
StructType(Seq(
StructField("a", StringType, nullable = true),
StructField("b", StringType, nullable = true))))
Contributor

Seq("0" -> "a", "1" -> null).toDF("a", "b")

Contributor Author

With your suggestion the test always passes, even without the patch, because it adds extra `Project` nodes for renaming the fields, so I cannot do this...

StructField("b", StringType, nullable = true))))

val exceptDF = inputDF.filter(
col("a").isin(Seq("0"): _*) or col("b").isin())
Contributor

`isin(Seq("0"): _*)` => `isin("0")`

Contributor

Is this bug only reproducible with `In`?

Contributor Author

No, also with other comparisons (`>` for instance...). I am using `>` now, is that ok?

Member

@gatorsmile gatorsmile left a comment

There is a major correctness bug in this rule. We are unable to rely on InferFiltersFromConstraints to infer the IsNotNull constraints. We should fix that, since InferFiltersFromConstraints is unable to infer them in many cases (including OR-connected conditions).

@mgaido91
Copy link
Contributor Author

@gatorsmile yes, thanks for your comment, you're right. I updated the PR accordingly.

May I kindly ask you to have another pass at it? Thanks.

@SparkQA

SparkQA commented Dec 14, 2018

Test build #100143 has finished for PR 23315 at commit 7e747a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

There is another bug in this rule. The condition on the right side must be deterministic; otherwise, the probability with which each value appears in the final result of EXCEPT will change.
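A plain-Python sketch (hypothetical data, not Spark code) of why the rewrite is unsafe for non-deterministic conditions: EXCEPT evaluates the condition once when building the right side, while the rewritten Filter(Not(cond)) re-evaluates it, so a rand()-style predicate draws independently in each plan and different rows survive:

```python
import random

# Hypothetical data and a rand()-style predicate.
rows = list(range(1000))
rng = random.Random(42)

def nondet_cond():
    return rng.random() < 0.5   # non-deterministic: a fresh draw per call

# True EXCEPT: the right side evaluates the condition once per row, and the
# result is every left row not present on the right side.
right = {r for r in rows if nondet_cond()}
true_except = [r for r in rows if r not in right]

# Rewritten plan Filter(Not(cond)): the condition is evaluated AGAIN with
# independent random draws, so a different set of rows survives.
rewritten = [r for r in rows if not nondet_cond()]
```

With the fixed rule, plans with a non-deterministic condition are simply left untransformed.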

@mgaido91
Contributor Author

thanks @gatorsmile , fixed

@cloud-fan
Contributor

LGTM. Can you also mention the other 2 fixes in the PR description?

@mgaido91
Contributor Author

mgaido91 commented Dec 17, 2018

Sure, done, thanks for the note @cloud-fan , I forgot that.

@SparkQA

SparkQA commented Dec 17, 2018

Test build #100239 has finished for PR 23315 at commit aedb572.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@gatorsmile
Member

@mgaido91 Could you add an end-to-end test case with all the enumerated cases?

Below is just an example.

  test("SPARK-26366: verify ReplaceExceptWithFilter") {
    Seq(true, false).foreach { enabled =>
      withSQLConf(SQLConf.REPLACE_EXCEPT_WITH_FILTER.key -> enabled.toString) {
        withTable("tab1") {
          // TODO: use DF APIs to make it shorter
          spark.sql(
            """
             |CREATE TABLE tab1 (col1 INT, col2 INT, col3 INT) using PARQUET
            """.stripMargin)
          spark.sql("INSERT INTO tab1 VALUES (0, 3, 5)")
          spark.sql("INSERT INTO tab1 VALUES (0, 3, NULL)")
          spark.sql("INSERT INTO tab1 VALUES (NULL, 3, 5)")
          spark.sql("INSERT INTO tab1 VALUES (0, NULL, 5)")
          spark.sql("INSERT INTO tab1 VALUES (0, NULL, NULL)")
          spark.sql("INSERT INTO tab1 VALUES (NULL, NULL, 5)")
          spark.sql("INSERT INTO tab1 VALUES (NULL, 3, NULL)")
          spark.sql("INSERT INTO tab1 VALUES (NULL, NULL, NULL)")

          val df = spark.read.table("tab1")

          // TODO: add more conditions 
          val where =
            """
              |(col1 IS NULL AND col2 >= 3)
              |OR (col1 IS NOT NULL AND col2 >= 0)
            """.stripMargin

          val df1 = df.filter(where)
          val df2 = df.except(df1)
          val df3 = df.except(df2)

          // TODO: compare df1 and df3 and check if they are the same
        }
      }
    }
  }
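The EXCEPT-vs-filter comparison in the sketch above can also be modeled outside Spark. Here is a plain-Python sketch (rows mirroring the INSERTs above, `None` for NULL; EXCEPT also de-duplicates, but these rows are already distinct so simple membership suffices) showing that the fixed rewrite agrees with true EXCEPT semantics, while the unfixed one drops every row whose condition is NULL:

```python
# Rows mirroring the INSERTs above; None models SQL NULL.
rows = [(0, 3, 5), (0, 3, None), (None, 3, 5), (0, None, 5),
        (0, None, None), (None, None, 5), (None, 3, None),
        (None, None, None)]

# Three-valued SQL logic helpers.
def sql_and(a, b):
    if a is False or b is False:
        return False
    return None if a is None or b is None else True

def sql_or(a, b):
    if a is True or b is True:
        return True
    return None if a is None or b is None else False

def ge(a, b):
    return None if a is None or b is None else a >= b

def where(col1, col2, col3):
    # (col1 IS NULL AND col2 >= 3) OR (col1 IS NOT NULL AND col2 >= 0)
    return sql_or(sql_and(col1 is None, ge(col2, 3)),
                  sql_and(col1 is not None, ge(col2, 0)))

cond = [where(*r) for r in rows]
filtered = [r for r, c in zip(rows, cond) if c is True]

# True EXCEPT: rows not present in the filtered side.
true_except = [r for r in rows if r not in filtered]

# Unfixed rewrite (three-valued NOT): drops every NULL-condition row.
buggy = [r for r, c in zip(rows, cond)
         if (None if c is None else not c) is True]

# Fixed rewrite: NOT(coalesce(cond, False)).
fixed = [r for r, c in zip(rows, cond)
         if not (c if c is not None else False)]
```

For this WHERE clause every row's condition is either True or NULL, so the unfixed rewrite returns an empty EXCEPT result.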

@SparkQA

SparkQA commented Dec 17, 2018

Test build #100244 has finished for PR 23315 at commit aedb572.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

@gatorsmile I am adding it, but I am not sure how to do it for the non-deterministic case: I'll skip that, since any test I can think of for it would be either useless or flaky. Thanks.

@SparkQA

SparkQA commented Dec 18, 2018

Test build #100277 has finished for PR 23315 at commit ab74b1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Dec 19, 2018

Closes #23315 from mgaido91/SPARK-26366.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 834b860)
Signed-off-by: gatorsmile <[email protected]>
@gatorsmile
Member

LGTM

Thanks! Merged to master/2.4.

@mgaido91 Could you submit a PR to 2.3? I hit code conflicts.

@asfgit asfgit closed this in 834b860 Dec 19, 2018
mgaido91 added a commit to mgaido91/spark that referenced this pull request Dec 19, 2018
…der NULL as False

Closes apache#23315 from mgaido91/SPARK-26366.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
@mgaido91
Contributor Author

sure, thanks @gatorsmile

mgaido91 added a commit to mgaido91/spark that referenced this pull request Dec 22, 2018
…der NULL as False

Closes apache#23315 from mgaido91/SPARK-26366.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019

Closes apache#23315 from mgaido91/SPARK-26366.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#23315 from mgaido91/SPARK-26366.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019

Closes apache#23315 from mgaido91/SPARK-26366.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 834b860)
Signed-off-by: gatorsmile <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019

Closes apache#23315 from mgaido91/SPARK-26366.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 834b860)
Signed-off-by: gatorsmile <[email protected]>