Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Jan 23, 2019

What changes were proposed in this pull request?

Problem

This is a correctness issue and should be backported as well. If dictionary encoding is used, as below, the query returns a wrong (empty) result:

// Repeat the values for dictionary encoding. PLAIN_DICTIONARY.
Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
+-----+
|value|
+-----+
+-----+

Note that if we don't use dictionary encoding, the result is correct, so it was difficult to find this issue.

// It becomes PLAIN encoding.
Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
+-----+
|value|
+-----+
| null|
+-----+

How did it happen?

This happens because Parquet's dictionary filter fails to handle nulls. The former case hits this code:

      Set<T> dictSet = expandDictionary(meta);
      if (dictSet != null && dictSet.size() == 1 && dictSet.contains(value)) {
        return BLOCK_CANNOT_MATCH;
      }

See here for the code.

The dictionary set never contains null and here contains only 'A', and value (given to the predicate) is 'A', so the row group is filtered out even though it contains null rows that satisfy the predicate.
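To illustrate the faulty logic, here is a minimal, self-contained sketch (model code, not the actual Parquet classes; the method and parameter names are mine): the buggy check drops a block whenever the dictionary's single value equals the predicate's value, ignoring that the column chunk may also contain nulls that satisfy `NOT (value <=> 'A')`.

```java
import java.util.Set;

public class DictFilterSketch {
    // Hypothetical model of Parquet's canDrop() for a notEq(value) predicate.
    // The dictionary never contains null, so "every row equals 'A'" is wrong
    // whenever the column chunk also has nulls.
    static boolean buggyCanDrop(Set<String> dict, boolean hasNulls, String value) {
        // Buggy: ignores hasNulls entirely, mirroring the Parquet 1.10 check.
        return dict.size() == 1 && dict.contains(value);
    }

    static boolean fixedCanDrop(Set<String> dict, boolean hasNulls, String value) {
        // Fixed: a block containing nulls can still match NOT (col <=> value).
        return !hasNulls && dict.size() == 1 && dict.contains(value);
    }

    public static void main(String[] args) {
        // Dictionary for the repro data [Some("A"), Some("A"), None].
        Set<String> dict = Set.of("A");
        System.out.println(buggyCanDrop(dict, true, "A")); // true: block wrongly dropped
        System.out.println(fixedCanDrop(dict, true, "A")); // false: block kept
    }
}
```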

The latter case above works fine because it hits here:

    // if the chunk has non-dictionary pages, don't bother decoding the
    // dictionary because the row group can't be eliminated.
    if (hasNonDictionaryPages(meta)) {
      return BLOCK_MIGHT_MATCH;
    }

See here for the code.

So, it does not filter the row group out.

The Parquet equality predicate handles null too (see also here). To my knowledge, the Parquet equality predicate is a null-safe equality comparison.
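For reference, null-safe equality (`<=>`) treats null as an ordinary comparable value: `null <=> null` is true, and `null <=> 'A'` is false. A minimal sketch of these semantics (the helper name is mine, not Spark's or Parquet's API):

```java
public class NullSafeEq {
    // null <=> null is true; null <=> x is false; otherwise plain equals.
    static boolean nullSafeEq(String a, String b) {
        if (a == null && b == null) return true;
        if (a == null || b == null) return false;
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Evaluating NOT (value <=> 'A') over the repro data ["A", "A", null]:
        String[] values = {"A", "A", null};
        for (String v : values) {
            System.out.println(!nullSafeEq(v, "A")); // false, false, true
        }
        // The null row satisfies the predicate, so the row group must not be dropped.
    }
}
```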

How does this PR fix it?

This PR explicitly disables dictionary filtering if the user has not set it. However, there's another problem:

We should disable parquet.filter.dictionary.enabled, but Parquet 1.10.x has a mistake: parquet.filter.stats.enabled and parquet.filter.dictionary.enabled were swapped on the Parquet side:

      useDictionaryFilter(conf.getBoolean(STATS_FILTERING_ENABLED, true));
      useStatsFilter(conf.getBoolean(DICTIONARY_FILTERING_ENABLED, true));

See here for the code.

This is fixed in 1.11. See PARQUET-1309.
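The effect of the swap can be modeled with a plain map standing in for the Hadoop `Configuration` (a sketch under that assumption, not Parquet's actual reader builder): setting the *stats* key is what actually turns the dictionary filter off on 1.10.x.

```java
import java.util.HashMap;
import java.util.Map;

public class SwappedConfigSketch {
    static final String STATS_KEY = "parquet.filter.stats.enabled";
    static final String DICT_KEY  = "parquet.filter.dictionary.enabled";

    boolean statsFilterOn;
    boolean dictFilterOn;

    // Mirrors the swapped lookups in Parquet 1.10.x: the stats key feeds the
    // dictionary filter and the dictionary key feeds the stats filter.
    void configureLikeParquet110(Map<String, String> conf) {
        dictFilterOn  = Boolean.parseBoolean(conf.getOrDefault(STATS_KEY, "true"));
        statsFilterOn = Boolean.parseBoolean(conf.getOrDefault(DICT_KEY, "true"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // The PR's workaround: set the *stats* key to turn off
        // dictionary filtering on Parquet 1.10.x.
        conf.put(STATS_KEY, "false");

        SwappedConfigSketch reader = new SwappedConfigSketch();
        reader.configureLikeParquet110(conf);
        System.out.println(reader.dictFilterOn);  // false: dictionary filtering disabled
        System.out.println(reader.statsFilterOn); // true: stats filtering untouched
    }
}
```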

Therefore, this PR explicitly sets parquet.filter.stats.enabled to false, which (because of the swap) disables dictionary filtering, unless the user has already set it.

User side workaround

Users can explicitly set parquet.filter.stats.enabled to false to disable dictionary filtering.

ETC

  • This is quite a conservative fix. This should be backported to Spark 2.4.
  • I only tested null-safe equality comparison, but regular equality comparison looks to have related issues as well.

How was this patch tested?

Unit tests were added.

@HyukjinKwon
Member Author

cc @cloud-fan and @gatorsmile - this is a correctness issue.

@HyukjinKwon
Member Author

adding @rdblue as well.

@cloud-fan
Contributor

is there a parquet JIRA to track this dictionary encoding bug?

@HyukjinKwon
Member Author

I haven't filed a JIRA for this bug at Parquet yet. I want to talk with @rdblue first anyway.

@gatorsmile
Member

I think this is pretty serious. We might need a 2.4.1 release and should send a note to the dev list.

@rdblue Could you confirm the impact of this bug?

@SparkQA

SparkQA commented Jan 23, 2019

Test build #101573 has finished for PR 23622 at commit 9798dd4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 23, 2019

Test build #101574 has finished for PR 23622 at commit eb2eb2b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jan 23, 2019

Test build #101576 has finished for PR 23622 at commit eb2eb2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Jan 23, 2019

I'll take a look at this today. Thanks for pointing it out.

@gatorsmile
Member

Thanks!

// the 'parquet.filter.stats.enabled' and 'parquet.filter.dictionary.enabled' were
// swapped mistakenly in Parquet side. It should use 'parquet.filter.dictionary.enabled'
// when Spark upgrades Parquet. See PARQUET-1309.
hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false")
Member

I think we should fix it in the Spark side. Do the swap in Spark. cc @cloud-fan @rxin @rdblue

Member Author

I can do. but let me wait for other feedback before I do since it's a Parquet's property.

Member

If we want to disable it safely, shall we use set instead of setIfUnset?

Member Author

Yea, I was thinking about that too. I wanted to consider other possibilities, e.g., users taking that risk and enabling them anyway (partly also since this is going to be backported into Spark 2.4.x).

Both configurations are swapped until PARQUET-1309, but I was wondering if there might be extreme corner cases where users know about that issue and intentionally enable or disable them. I think both configurations are pretty much for advanced users, and regular users won't set them or hit this issue anyway.

Member Author

WDYT on this @rdblue?

Contributor

I'll make sure the fix for this is in Parquet 1.10.1.

As for fixing this problem, I think that Spark should avoid pushing down notEquals expressions or rewrite them to isNull(col) or notEquals(col, "A"). That's going to be much better for performance than disabling dictionary filtering.
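The suggested rewrite can be checked row by row: `NOT (col <=> 'A')` is equivalent to `isNull(col) OR notEquals(col, 'A')`, given that SQL `notEquals` is not satisfied when the value is null. A quick sketch under that assumption (helper names are illustrative, not Spark's filter API):

```java
public class RewriteSketch {
    static boolean nullSafeEq(String a, String b) {
        if (a == null || b == null) return a == b; // true only when both null
        return a.equals(b);
    }

    // SQL notEquals: not satisfied (row filtered out) when the value is null.
    static boolean sqlNotEquals(String a, String b) {
        return a != null && !a.equals(b);
    }

    static boolean original(String a, String v)  { return !nullSafeEq(a, v); }
    static boolean rewritten(String a, String v) { return a == null || sqlNotEquals(a, v); }

    public static void main(String[] args) {
        // Both forms agree on non-null matches, non-null mismatches, and nulls.
        for (String a : new String[] {"A", "B", null}) {
            System.out.println(original(a, "A") == rewritten(a, "A")); // true for all rows
        }
    }
}
```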

Member Author

@HyukjinKwon HyukjinKwon Jan 27, 2019

Ah, so we're targeting the upgrade to Parquet 1.10.1? Yea, sounds okay to me. That way, users can also disable parquet.filter.dictionary.enabled explicitly, I guess.

BTW, is it something we should enable by default on the Parquet side, @rdblue? I see there can be a performance improvement, but I was wondering how stable dictionary filtering is.

Contributor

@rdblue rdblue Jan 27, 2019

Dictionary filtering has been available for more than 2 years and Netflix has been using it by default for almost 3 years without problems. The feature is stable.

Keep in mind what a narrow use case hits this bug. First, an entire row group in a file needs to have just one value and nulls: either the column has just one non-null value for all rows, or a sort or write pattern has clustered null rows with one other value (enough for an entire row group). Next, a query must use null-safe equality and negate it. In my experience, most people don't know what null-safe equality is. Going back to the use cases that produce data like this, the first use case -- only one non-null value -- would probably result in a filter like col IS NULL instead. The second write pattern is the problematic one, but how likely is the intersection of data with that write pattern and a search for all values except one using null-safe equality? I don't think it is likely, and I think that's why we haven't found this until now.

Member Author

Thanks for all the details. +1 for going to 1.10.1. If that's the plan here, I will close this PR in a couple of days.

Contributor

The vote has started. Feel free to test the new binaries with this repository URI: https://repository.apache.org/content/repositories/orgapacheparquet-1022/

@dongjoon-hyun
Member

cc @dbtsai

@dongjoon-hyun
Member

Hi, @rdblue. Parquet 1.10 was released in April 2018, and PARQUET-1309 seems to have been resolved on Jun 5, 2018. I'm wondering if there is a chance for us to get Parquet 1.11 before the Spark 2.4.1 release. Is there a release plan for this year?

@gatorsmile
Member

gatorsmile commented Jan 25, 2019

We will not upgrade the major version of Parquet in 2.4.1. If Parquet releases 1.10.1 with various bug fixes, we can do it after reviewing all the fixes and understanding the impact.

@rdblue
Contributor

rdblue commented Jan 25, 2019

Parquet should do a 1.10.1. I agree with Xiao that Spark shouldn't update to 1.11.0 in a patch release.

Another option is to detect and avoid pushing down this filter. But I agree it would be better to fix it on the Parquet side.

@rdblue
Contributor

rdblue commented Jan 25, 2019

Opened PARQUET-1510.

@dongjoon-hyun
Member

Thanks, @rdblue and @gatorsmile. 1.10.1 sounds perfect!

@HyukjinKwon
Member Author

Closing this. This will be fixed by upgrading to Parquet 1.10.1.
