-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-26677][SQL] Disable dictionary filtering by default at Parquet #23622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @cloud-fan and @gatorsmile - this is a correctness issue. |
|
adding @rdblue as well. |
|
is there a parquet JIRA to track this dictionary encoding bug? |
|
I haven't filed yet for this bug at Parquet yet.. I want to talk with @rdblue first anyway. |
|
I think this is pretty serious. We might need a release of 2.4.1 and send a note to dev list. @rdblue Could you confirm the impact of this bug? |
|
Test build #101573 has finished for PR 23622 at commit
|
|
Test build #101574 has finished for PR 23622 at commit
|
|
retest this please |
|
Test build #101576 has finished for PR 23622 at commit
|
|
I'll take a look at this today. Thanks for pointing it out. |
|
Thanks! |
| // the 'parquet.filter.stats.enabled' and 'parquet.filter.dictionary.enabled' were | ||
| // swapped mistakenly in Parquet side. It should use 'parquet.filter.dictionary.enabled' | ||
| // when Spark upgrades Parquet. See PARQUET-1309. | ||
| hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should fix it in the Spark side. Do the swap in Spark. cc @cloud-fan @rxin @rdblue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do. but let me wait for other feedback before I do since it's a Parquet's property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to disable safely, shall we use set instead of setIfUnset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I was thinking about that too. I wanted to consider other possibilities like .. users take that risk and enable them (partly also since this is going to backported into Spark 2.4.x).
Both configurations are swapped as of PARQUET-1309 but was wondering if there might be extreme corner cases that users know that issue and intentionally enable and disable it. I think both configurations are pretty much for advanced users, and regular users won't set this or meet this issue anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT on this @rdblue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make sure the fix for this is in Parquet 1.10.1.
As for fixing this problem, I think that Spark should avoid pushing down notEquals expressions or rewrite them to isNull(col) or notEquals(col, "A"). That's going to be much better for performance than disabling dictionary filtering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so we're targeting the upgrade to Parquet 1.10.1? yea, sounds okay to me. Also, in that way users can also disable parquet.filter.dictionary.enabled explicitly I guess.
BTW, is it something we should enable by default at Parquet side, @rdblue? I see there can be the performance improvement but was wondering how much stable dictionary filtering it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dictionary filtering has been available for more than 2 years and Netflix has been using it by default for almost 3 years without problems. The feature is stable.
Keep in mind what a narrow use case hits this bug. First, an entire row group in a file needs to have a just one value and nulls: either the column has just one non-null value for all rows, or a sort or write pattern has clustered null rows with one other value (enough for an entire row group). Next, a query must use null-safe equality and negate it. In my experience, most people don't know what null-safe equality is. Going back to the use cases that produce data like this, the first use case -- only one non-null value -- would probably result in filter like col IS NULL instead. The second write pattern is the problematic one, but how likely is the intersection of data with that write pattern and searching for all values except one with null-safe equality? I don't think it is likely and I think that's why we haven't found this until now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all details. +1 for going 1.10.1. If that's a plan here, I will close this PR in a couple of days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The vote has started. Feel free to test the new binaries with this repository URI: https://repository.apache.org/content/repositories/orgapacheparquet-1022/
|
cc @dbtsai |
|
Hi, @rdblue . Parquet 0.10 is released on April 2018. And, PARQUET-1309 seems to be resolved Jun 5, 2018. I'm wondering if there is a chance for us to get Parquet 0.11 before Spark 2.4.1 release. Is there a release plan in this year? |
|
We will not upgrade the major version of Parquet in 2.4.1. If Parquet release 0.10.1 with various bug fixes, we can do it after reviewing all the fixes and understand the impact. |
|
Parquet should do a 1.10.1. I agree with Xiao that Spark shouldn't update to 1.11.0 in a patch release. Another option is to detect and avoid pushing down this filter. But I agree it would be better to fix it on the Parquet side. |
|
Opened PARQUET-1510. |
|
Thanks, @rdblue and @gatorsmile . 0.10.1 sounds perfect! |
|
Closing this. This will be fixed by upgrading to 1.10.1 |
What changes were proposed in this pull request?
Problem
This is a correctness issue and should be backported as well. If we use dictionary encoding as below, it hits a correctness issue as below:
Note that, if we don't use dictionary encoding it's fine. So it was difficult to find the issue.
How did it happen?
This is because Parquet side dictionary filter fails to handle
null. The former case hits here:See here for codes.
Given dictionary set does not contain
nullbut only'A'.value(given to the predicate) is'A'here. So, it filters the row group out.The latter case above works fine because it hits here:
See here for codes.
So, it does not filter the row group out.
Parquet equality predicate handles
nulltoo (see also here). Up to my knowledge, Parquet predicate is null-safe quality comparison.How does this PR fix?
This PR explicitly disables dictionary filtering if not set by default. However there's another problem:
We should disable
parquet.filter.dictionary.enabledbut Parquet 1.10.x has a mistake - theparquet.filter.stats.enabledandparquet.filter.dictionary.enabledwere swapped mistakenly in Parquet side:See here for codes.
This is fixed after 1.11. See PARQUET-1309.
Therefore, this PR explicitly disables
parquet.filter.stats.enabledto disable dictionary filtering if that's not set by default.User side workaround
They can explicitly disable
parquet.filter.stats.enabledto disable dictionary filtering.ETC
How was this patch tested?
Unit tests were added.