[SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat specific) in CSV/JSON timeformat options #17489
Conversation
```diff
   formats follow the formats at ``java.text.SimpleDateFormat``.
   This applies to timestamp type. If None is set, it uses the
-  default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSZZ``.
+  default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
```
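The new default can be sanity-checked against stock `SimpleDateFormat` (a quick illustration, not part of the PR; the `parses` helper is made up):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class XxxDefaultCheck {
    // Hypothetical helper: does the value parse under the pattern?
    static boolean parses(String pattern, String value) {
        try {
            new SimpleDateFormat(pattern).parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String p = "yyyy-MM-dd'T'HH:mm:ss.SSSXXX";
        // ISO 8601 extended offset parses with XXX
        System.out.println(parses(p, "2017-03-21T00:00:00.000-11:00")); // true
        // the basic "-1100" form does not
        System.out.println(parses(p, "2017-03-21T00:00:00.000-1100"));  // false
        // the UTC designator "Z" does
        System.out.println(parses(p, "2017-03-21T00:00:00.000Z"));      // true
    }
}
```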
Interestingly, the default format value in R omits the timezone. I think this is an issue we need to fix (I think we have seen a test failure reported by someone in India a while back):
https://github.com/apache/spark/blame/master/R/pkg/R/functions.R#L2844
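The worry with omitting the timezone is that parsing then falls back to the JVM default zone, so the same text maps to different instants on different machines; a minimal sketch of that drift (class name and zone choices are illustrative, not from the PR):

```java
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class DefaultZoneDrift {
    public static void main(String[] args) throws Exception {
        // A format with no timezone letter leaves the zone implicit
        String pattern = "yyyy-MM-dd HH:mm:ss";

        SimpleDateFormat inKolkata = new SimpleDateFormat(pattern);
        inKolkata.setTimeZone(TimeZone.getTimeZone("Asia/Kolkata"));

        SimpleDateFormat inUtc = new SimpleDateFormat(pattern);
        inUtc.setTimeZone(TimeZone.getTimeZone("UTC"));

        // The same text yields instants 5.5 hours apart
        long diffMillis = inUtc.parse("2017-03-21 00:00:00").getTime()
                        - inKolkata.parse("2017-03-21 00:00:00").getTime();
        System.out.println(diffMillis / 60000); // 330 minutes (+05:30)
    }
}
```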
Yeah, they look like similar instances. However, it seems the doc states:

> ... the timestamp of that moment in the current system time zone ...

> ... using the default timezone and the default locale, ...

This is probably fine, to my knowledge.
I will double check this path and related references and then open a JIRA if it looks needed.
This option is specific to CSV, not JDBC
Test build #75413 has finished for PR 17489 at commit
I see, so I think you need to check cases like "-1100" and "-11". Is there any case that worked before but not after the change?
Yes, it seems so, given my tests and my reading of the code.
Yes, it is.
To my knowledge, both cases fail with both `XXX` and `ZZ`:

```scala
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-11
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-11
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided
```

whereas fewer pattern letters accept the shorter offset forms, in both `FastDateFormat` and `SimpleDateFormat`:

```scala
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXX").parse("2017-03-21T00:00:00.000-1100")
res18: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSX").parse("2017-03-21T00:00:00.000-11")
res19: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXX").parse("2017-03-21T00:00:00.000-1100")
res20: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX").parse("2017-03-21T00:00:00.000-11")
res21: java.util.Date = Tue Mar 21 20:00:00 KST 2017
```
I understand this concern, and that is why I checked the code. They appear to use the same timezone pattern for both.
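For comparison with the transcript above, stock `SimpleDateFormat` accepts a different offset shape for each `X` count; a small illustrative check (the `parses` helper is made up):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class OffsetLetterCount {
    // Hypothetical helper: does the value parse under the pattern?
    static boolean parses(String pattern, String value) {
        try {
            new SimpleDateFormat(pattern).parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String base = "yyyy-MM-dd'T'HH:mm:ss.SSS";
        // X   -> hours-only offset, e.g. "-11"
        System.out.println(parses(base + "X", "2017-03-21T00:00:00.000-11"));
        // XX  -> basic form, e.g. "-1100"
        System.out.println(parses(base + "XX", "2017-03-21T00:00:00.000-1100"));
        // XXX -> extended form, e.g. "-11:00"
        System.out.println(parses(base + "XXX", "2017-03-21T00:00:00.000-11:00"));
    }
}
```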
I think the question is: would this cause any input that currently works to not work? If it's a corner case we need to break in order to fix bigger problems, that's possible, but just trying to quantify that. That is, maybe this is also solvable with a doc change, to describe correctly what it currently does rather than change what it does?
I tested the "-1100" cases before. I guess this was rather a bug in commons-lang that was fixed after we upgraded commons-lang in SPARK-17985. I just downloaded Spark 2.0.0 and tested the code below:

```scala
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
res2: java.util.Date = Tue Mar 21 20:00:00 KST 2017
```

Spark 2.1.0:

```scala
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided
```

I just tested what I tested before - #15147 (comment):

```scala
val path = "/tmp/abcd"
val textDf = Seq(
  "time",
  "2015-07-20T15:09:23.736-0500",
  "2015-07-20T15:10:51.687-0500",
  "2015-11-21T23:15:01.499-0600").toDF()
textDf.coalesce(1).write.text(path)
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
df.printSchema()
```

Spark 2.0.2:

Spark 2.1.0:
Let me just fix up only the documentation if you are worried about it. I am fine with that.
Otherwise, let me ask if both
OK, then "-1100" doesn't work right now, though it used to. ZZ is supposed to support "-1100" according to the Java docs, but then again that also suggests some things don't work which actually do, because we're using FastDateFormat. ZZ actually means, to FastDateFormat, what XXX means to the Java implementation, but both interpret XXX the same way? It does seem simpler to standardize on XXX because it's clearer all around, if they both behave the same way and behave how ZZ already behaves. In that case, we do want to update the doc and implementation to use XXX, I think.

The problem is that "-1100" should have worked according to the docs but didn't? I think it's OK for it to continue to not work, then. More importantly, this is supposed to be the ISO 8601 default format, right? XXX means that, not ZZ. It's a complex situation, but I think this is a good change for 2.2. Of course, this only affects the default; one can override it with any specific format.
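For what it's worth, the same `XXX` letter carries over to `java.time.DateTimeFormatter` (which Spark moved to in later versions), with the same strict ISO 8601 behaviour; an illustrative check, not from the PR:

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class JavaTimeXxx {
    public static void main(String[] args) {
        // XXX in java.time also means the ISO 8601 extended offset
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");
        OffsetDateTime t = OffsetDateTime.parse("2017-03-21T00:00:00.000-11:00", f);
        System.out.println(t.getOffset()); // -11:00
        try {
            OffsetDateTime.parse("2017-03-21T00:00:00.000-1100", f);
            System.out.println("basic form parsed");
        } catch (DateTimeParseException e) {
            System.out.println("basic form rejected");
        }
    }
}
```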
Yes. I asked a question on the commons mailing list about this for clarity. Let me update you when I have an answer.
OK, if this changes all instances of ZZ to XXX, then that sounds good for 2.2.
I tried to grep for this pattern. I think these are all of them, unless I have missed some. Thanks for approving it.
Merged to master |
Thank you for asking the details and merging it @srowen. |
I get "XXX format is invalid".
No one can know what happened from this alone. Please file a JIRA with a full reproducer, the full error messages, and any analysis you can make.
What changes were proposed in this pull request?

This PR proposes to use the `XXX` format instead of `ZZ`. `ZZ` seems to be `FastDateFormat`-specific: it supports "ISO 8601 extended format time zones", but only as a `FastDateFormat`-specific option. I misunderstood it as a format compatible with `SimpleDateFormat` when this change was introduced. Please see the SimpleDateFormat documentation and the FastDateFormat documentation.

It seems we had better replace `ZZ` with `XXX` because they appear to use the same strategy - FastDateParser.java#L930, FastDateParser.java#L932-L951 and FastDateParser.java#L596-L601. I also checked the code and manually debugged it to be sure. It seems both cases use the same pattern `( Z|(?:[+-]\\d{2}(?::)\\d{2}))`.

Note that this should rather be a documentation fix and not a behaviour change, because `ZZ` seems to be an invalid date format in `SimpleDateFormat` as documented in `DataFrameReader` etc., and both `ZZ` and `XXX` appear to work identically with `FastDateFormat`.

Current documentation is as below:
How was this patch tested?
Existing tests should cover this. Also, manually tested as below (BTW, I don't think these are worth being added as tests within Spark):
Parse
Format
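As a side note, the timezone alternation quoted in the description can be checked directly with `java.util.regex` (an illustrative sketch; the regex is copied from the description, minus the leading space):

```java
import java.util.regex.Pattern;

public class TzRegexCheck {
    public static void main(String[] args) {
        // The timezone pattern both ZZ and XXX reportedly compile to in FastDateParser
        Pattern tz = Pattern.compile("Z|(?:[+-]\\d{2}(?::)\\d{2})");
        System.out.println(tz.matcher("Z").matches());      // true
        System.out.println(tz.matcher("-11:00").matches()); // true: extended form
        System.out.println(tz.matcher("-1100").matches());  // false: basic form
    }
}
```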