Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Jan 23, 2019

What changes were proposed in this pull request?

Problem

This is a correctness issue and should be backported as well. If dictionary encoding is used, as below, the query returns a wrong (empty) result:

// Repeat the values for dictionary encoding. PLAIN_DICTIONARY.
Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
+-----+
|value|
+-----+
+-----+

Note that if we don't use dictionary encoding, the result is correct, so it was difficult to find this issue.

// It becomes PLAIN encoding.
Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
+-----+
|value|
+-----+
| null|
+-----+

How did it happen?

This happens because Parquet's dictionary filter fails to handle nulls. The former case hits this code:

      Set<T> dictSet = expandDictionary(meta);
      if (dictSet != null && dictSet.size() == 1 && dictSet.contains(value)) {
        return BLOCK_CANNOT_MATCH;
      }

See here for the code.

The dictionary set never contains null and here contains only 'A', and value (given to the predicate) is 'A', so the row group is filtered out even though it contains null rows that satisfy the predicate.
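To illustrate the faulty logic, here is a minimal, self-contained sketch (model code, not the actual Parquet classes; the method and parameter names are mine): the buggy check drops a block whenever the dictionary's single value equals the predicate's value, ignoring that the column chunk may also contain nulls that satisfy `NOT (value <=> 'A')`.

```java
import java.util.Set;

public class DictFilterSketch {
    // Hypothetical model of Parquet's canDrop() for a notEq(value) predicate.
    // The dictionary never contains null, so "every row equals 'A'" is wrong
    // whenever the column chunk also has nulls.
    static boolean buggyCanDrop(Set<String> dict, boolean hasNulls, String value) {
        // Buggy: ignores hasNulls entirely, mirroring the Parquet 1.10 check.
        return dict.size() == 1 && dict.contains(value);
    }

    static boolean fixedCanDrop(Set<String> dict, boolean hasNulls, String value) {
        // Fixed: a block containing nulls can still match NOT (col <=> value).
        return !hasNulls && dict.size() == 1 && dict.contains(value);
    }

    public static void main(String[] args) {
        // Dictionary for the repro data [Some("A"), Some("A"), None].
        Set<String> dict = Set.of("A");
        System.out.println(buggyCanDrop(dict, true, "A")); // true: block wrongly dropped
        System.out.println(fixedCanDrop(dict, true, "A")); // false: block kept
    }
}
```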

The latter case above works fine because it hits here:

    // if the chunk has non-dictionary pages, don't bother decoding the
    // dictionary because the row group can't be eliminated.
    if (hasNonDictionaryPages(meta)) {
      return BLOCK_MIGHT_MATCH;
    }

See here for the code.

So, it does not filter the row group out.

The Parquet equality predicate handles null too (see also here). To my knowledge, the Parquet equality predicate is a null-safe equality comparison.
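For reference, null-safe equality (`<=>`) treats null as an ordinary comparable value: `null <=> null` is true, and `null <=> 'A'` is false. A minimal sketch of these semantics (the helper name is mine, not Spark's or Parquet's API):

```java
public class NullSafeEq {
    // null <=> null is true; null <=> x is false; otherwise plain equals.
    static boolean nullSafeEq(String a, String b) {
        if (a == null && b == null) return true;
        if (a == null || b == null) return false;
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Evaluating NOT (value <=> 'A') over the repro data ["A", "A", null]:
        String[] values = {"A", "A", null};
        for (String v : values) {
            System.out.println(!nullSafeEq(v, "A")); // false, false, true
        }
        // The null row satisfies the predicate, so the row group must not be dropped.
    }
}
```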

How does this PR fix it?

This PR explicitly disables dictionary filtering if the user has not set it. However, there's another problem:

We should disable parquet.filter.dictionary.enabled, but Parquet 1.10.x has a mistake: parquet.filter.stats.enabled and parquet.filter.dictionary.enabled were swapped on the Parquet side:

      useDictionaryFilter(conf.getBoolean(STATS_FILTERING_ENABLED, true));
      useStatsFilter(conf.getBoolean(DICTIONARY_FILTERING_ENABLED, true));

See here for the code.

This is fixed in 1.11. See PARQUET-1309.
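The effect of the swap can be modeled with a plain map standing in for the Hadoop `Configuration` (a sketch under that assumption, not Parquet's actual reader builder): setting the *stats* key is what actually turns the dictionary filter off on 1.10.x.

```java
import java.util.HashMap;
import java.util.Map;

public class SwappedConfigSketch {
    static final String STATS_KEY = "parquet.filter.stats.enabled";
    static final String DICT_KEY  = "parquet.filter.dictionary.enabled";

    boolean statsFilterOn;
    boolean dictFilterOn;

    // Mirrors the swapped lookups in Parquet 1.10.x: the stats key feeds the
    // dictionary filter and the dictionary key feeds the stats filter.
    void configureLikeParquet110(Map<String, String> conf) {
        dictFilterOn  = Boolean.parseBoolean(conf.getOrDefault(STATS_KEY, "true"));
        statsFilterOn = Boolean.parseBoolean(conf.getOrDefault(DICT_KEY, "true"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // The PR's workaround: set the *stats* key to turn off
        // dictionary filtering on Parquet 1.10.x.
        conf.put(STATS_KEY, "false");

        SwappedConfigSketch reader = new SwappedConfigSketch();
        reader.configureLikeParquet110(conf);
        System.out.println(reader.dictFilterOn);  // false: dictionary filtering disabled
        System.out.println(reader.statsFilterOn); // true: stats filtering untouched
    }
}
```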

Therefore, this PR explicitly sets parquet.filter.stats.enabled to false, which (because of the swap) disables dictionary filtering, unless the user has already set it.

User side workaround

Users can explicitly set parquet.filter.stats.enabled to false to disable dictionary filtering.

ETC

  • This is quite a conservative fix. This should be backported to Spark 2.4.
  • I only tested null-safe equality comparison, but regular equality comparison looks to have related issues as well.

How was this patch tested?

Unit tests were added.

@HyukjinKwon
Member Author

cc @cloud-fan and @gatorsmile - this is a correctness issue.

@HyukjinKwon
Member Author

adding @rdblue as well.

@cloud-fan
Contributor

is there a parquet JIRA to track this dictionary encoding bug?

@HyukjinKwon
Member Author

I haven't filed a JIRA for this bug at Parquet yet. I want to talk with @rdblue first anyway.

@gatorsmile
Member

I think this is pretty serious. We might need a 2.4.1 release and should send a note to the dev list.

@rdblue Could you confirm the impact of this bug?

@SparkQA

SparkQA commented Jan 23, 2019

Test build #101573 has finished for PR 23622 at commit 9798dd4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 23, 2019

Test build #101574 has finished for PR 23622 at commit eb2eb2b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jan 23, 2019

Test build #101576 has finished for PR 23622 at commit eb2eb2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Jan 23, 2019

I'll take a look at this today. Thanks for pointing it out.

@gatorsmile
Member

Thanks!

// the 'parquet.filter.stats.enabled' and 'parquet.filter.dictionary.enabled' were
// swapped mistakenly in Parquet side. It should use 'parquet.filter.dictionary.enabled'
// when Spark upgrades Parquet. See PARQUET-1309.
hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false")
Member

I think we should fix it in the Spark side. Do the swap in Spark. cc @cloud-fan @rxin @rdblue

Member Author

I can do. but let me wait for other feedback before I do since it's a Parquet's property.

Member

If we want to disable it safely, shall we use set instead of setIfUnset?

Member Author

Yea, I was thinking about that too. I wanted to consider other possibilities, e.g., users taking that risk and enabling them anyway (partly also since this is going to be backported into Spark 2.4.x).

Both configurations are swapped until PARQUET-1309, but I was wondering if there might be extreme corner cases where users know about that issue and intentionally enable or disable them. I think both configurations are pretty much for advanced users, and regular users won't set them or hit this issue anyway.

Member Author

WDYT on this @rdblue?

Contributor

I'll make sure the fix for this is in Parquet 1.10.1.

As for fixing this problem, I think that Spark should avoid pushing down notEquals expressions or rewrite them to isNull(col) or notEquals(col, "A"). That's going to be much better for performance than disabling dictionary filtering.
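The suggested rewrite can be checked row by row: `NOT (col <=> 'A')` is equivalent to `isNull(col) OR notEquals(col, 'A')`, given that SQL `notEquals` is not satisfied when the value is null. A quick sketch under that assumption (helper names are illustrative, not Spark's filter API):

```java
public class RewriteSketch {
    static boolean nullSafeEq(String a, String b) {
        if (a == null || b == null) return a == b; // true only when both null
        return a.equals(b);
    }

    // SQL notEquals: not satisfied (row filtered out) when the value is null.
    static boolean sqlNotEquals(String a, String b) {
        return a != null && !a.equals(b);
    }

    static boolean original(String a, String v)  { return !nullSafeEq(a, v); }
    static boolean rewritten(String a, String v) { return a == null || sqlNotEquals(a, v); }

    public static void main(String[] args) {
        // Both forms agree on non-null matches, non-null mismatches, and nulls.
        for (String a : new String[] {"A", "B", null}) {
            System.out.println(original(a, "A") == rewritten(a, "A")); // true for all rows
        }
    }
}
```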

Member Author

@HyukjinKwon HyukjinKwon Jan 27, 2019

Ah, so we're targeting the upgrade to Parquet 1.10.1? Yea, sounds okay to me. That way, users can also disable parquet.filter.dictionary.enabled explicitly, I guess.

BTW, is it something we should enable by default on the Parquet side, @rdblue? I see there can be a performance improvement, but I was wondering how stable dictionary filtering is.

Contributor

@rdblue rdblue Jan 27, 2019

Dictionary filtering has been available for more than 2 years and Netflix has been using it by default for almost 3 years without problems. The feature is stable.

Keep in mind what a narrow use case hits this bug. First, an entire row group in a file needs to have just one value and nulls: either the column has just one non-null value for all rows, or a sort or write pattern has clustered null rows with one other value (enough for an entire row group). Next, a query must use null-safe equality and negate it. In my experience, most people don't know what null-safe equality is. Going back to the use cases that produce data like this, the first use case -- only one non-null value -- would probably result in a filter like col IS NULL instead. The second write pattern is the problematic one, but how likely is the intersection of data with that write pattern and a search for all values except one using null-safe equality? I don't think it is likely, and I think that's why we haven't found this until now.

Member Author

Thanks for all the details. +1 for going to 1.10.1. If that's the plan here, I will close this PR in a couple of days.

Contributor

The vote has started. Feel free to test the new binaries with this repository URI: https://repository.apache.org/content/repositories/orgapacheparquet-1022/

@dongjoon-hyun
Member

cc @dbtsai

@dongjoon-hyun
Member

Hi, @rdblue. Parquet 1.10 was released in April 2018, and PARQUET-1309 seems to have been resolved on Jun 5, 2018. I'm wondering if there is a chance for us to get Parquet 1.11 before the Spark 2.4.1 release. Is there a release plan for this year?

@gatorsmile
Member

gatorsmile commented Jan 25, 2019

We will not upgrade the major version of Parquet in 2.4.1. If Parquet releases 1.10.1 with various bug fixes, we can do it after reviewing all the fixes and understanding the impact.

@rdblue
Contributor

rdblue commented Jan 25, 2019

Parquet should do a 1.10.1. I agree with Xiao that Spark shouldn't update to 1.11.0 in a patch release.

Another option is to detect and avoid pushing down this filter. But I agree it would be better to fix it on the Parquet side.

@rdblue
Contributor

rdblue commented Jan 25, 2019

Opened PARQUET-1510.

@dongjoon-hyun
Member

Thanks, @rdblue and @gatorsmile. 1.10.1 sounds perfect!

@HyukjinKwon
Member Author

Closing this. This will be fixed by upgrading to Parquet 1.10.1.
