-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-9735][SQL]Respect the user specified schema than the infer partition schema for HadoopFsRelation #8026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @liancheng |
|
A summary of my offline discussion with @chenghao-intel: The real problem here is that the partition column types of the newly refreshed partition spec don't match those in the user specified spec. The current fix simply disables refreshing partition spec, which is not preferable. My suggestion is to factor out the partition values casting part in the |
f68d827 to
c80ff30
Compare
|
Test build #41211 has finished for PR 8026 at commit
|
|
retest this please |
1 similar comment
|
retest this please |
|
Test build #41202 has finished for PR 8026 at commit
|
|
Test build #41214 has finished for PR 8026 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, true, this is not valid any more.
|
@chenghao-intel, just to clarify: I noticed that your final approach involved pushing an expected data type down into the method named |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my other comment upthread, this is a bit confusing to me: this method is named infer, but has a mode where it won't perform inference (controlled by a boolean flag), and now has another new field which also bypasses inference and performs a cast. This is confusing to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the master branch, if typeInference == false, it means the data type of partition key will be StringType by default, otherwise, it's probably will be IntegerType, LongType etc. depends on the real value the partition key is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to pass the expect the data type down and then get the associated literal-based partition column value; and @liancheng's suggestion kind of like get the literal (maybe string based) first, and then do casting outside, however, this probably lose some data precision during the re-casting.
For example:
The path looks like, .../part1=1.000, and with the auto inference, we will get a Double for the partition column value, and it will be cast to string as 1.0 if what user expect is StringType;
However, this is totally different if we get it as StringType directly, which supposed to be 1.000.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that casting to a non string type and then converting back to a string may lose precision, but what about disabling inference when calling inferPartitionColumnValue if the user has provided a schema? In that case, it should end up just returning the string literals, which you can then cast without a loss of precision.
Sent from my phone
On Oct 20, 2015, at 8:00 AM, Cheng Hao [email protected] wrote:
In sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala:
*/private[sql] def inferPartitionColumnValue(
We need to pass the expect the data type down and then get the associated literal-based partition column value; and @liancheng's suggestion kind of like get the literal (maybe string based) first, and then do casting outside, however, this probably lose some data precision during the re-casting.expectedDT: Option[DataType],For example:
The path looks like, /part1=1.000, and with the auto inference, we will get a Double, and it will be cast to string as 1.0 if what user expect is StringType;However, this is totally different if we get it as StringType directly, which supposed to be 1.000.
—
Reply to this email directly or view it on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, I will update the code.
|
I just tried testing a build where I only re-enabled the ignored test and changed nothing else. In this case, the test still passed. This makes me wonder whether the "Partition column type casting" is an adequate regression test for this issue. Can you write a new test for this which fails without this patch? |
|
Thank you @JoshRosen , I will pick up this PR as some details I almost forgot. But definitely, the ignored test cases will fail without this PR previously, not sure if someone else fixed that in some other place. |
cda059f to
7f2da8c
Compare
|
@JoshRosen I've updated the unit test also by adding an The root reason that the previous unit test can even passed, should be solved #8035, as it will always get the latest schema from the user specified without calling the |
|
Test build #43984 has finished for PR 8026 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is AssertionError the right exception to be throwing here? I'd think that IllegalArgumentException might be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll agree we need to take the partition path validation into a separate PR, since we definitely can do more checking and also more pretty error message.
7f2da8c to
2cc93da
Compare
|
Test build #44038 has finished for PR 8026 at commit
|
|
@JoshRosen I've updated the code, should be more straightforward and clean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super-minor nit: could you explicitly name the boolean parameter here at the call-site, e.g. inferSchema = false? This is one of IntelliJ's automatic style recommendations and I'm a fan of it because it makes the code a bit easier to read. I might also just change this myself on merge.
|
Test build #44040 has finished for PR 8026 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: .map instead of using infix notation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wording of this error message might be slightly confusing to users since this branch is explicitly disabling inference. I think that it might be slightly clearer to say something like "Actual partitioning column names did not match user-specified partitioning schema; expected ... but got ...", since as far as I know the inference is really only done for the types of the columns, not their names.
|
@chenghao-intel, thanks a bunch for updating this; the current version of this patch is a lot easier to understand and I'm happy with how clean the code turned out. I left only minor style / clarity comments, which I don't mind addressing myself on merge if you're too busy. If you don't mind, though, one more round of quick updates to address my comments would be appreciated. Anyhow, the technical changes here LGTM. |
|
Thank you @JoshRosen so much for the detail review, but seems bug exists, I'd like to solve it myself soon. |
|
Test build #44066 has finished for PR 8026 at commit
|
|
@chenghao-intel, it looks like this most recent test failure is legitimate: |
|
Yes, true, actually SPARK-7749 provides an example of Hive metastore backend empty partition table, then we will not detect any partition column values. I simply removed the assertion in the code, as it's not valid in this case. |
|
Test build #44113 has finished for PR 8026 at commit
|
|
LGTM, so I'm going to merge this into master. Should this be backported to 1.5.x or any earlier releases? |
|
Is this being backported to 1.5.x? |
To enable the unit test of
hadoopFsRelationSuite.Partition column type casting. It previously threw exception like below, as we treat the auto infer partition schema with higher priority than the user specified one.