[SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat specific) in CSV/JSON timeformat options #17489
Conversation
```diff
   formats follow the formats at ``java.text.SimpleDateFormat``.
   This applies to timestamp type. If None is set, it uses the
-  default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSZZ``.
+  default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
```
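The new default can be sanity-checked against stock `SimpleDateFormat` (a quick illustration, not part of the PR; the `parses` helper is made up):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class XxxDefaultCheck {
    // Hypothetical helper: does the value parse under the pattern?
    static boolean parses(String pattern, String value) {
        try {
            new SimpleDateFormat(pattern).parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String p = "yyyy-MM-dd'T'HH:mm:ss.SSSXXX";
        // ISO 8601 extended offset parses with XXX
        System.out.println(parses(p, "2017-03-21T00:00:00.000-11:00")); // true
        // the basic "-1100" form does not
        System.out.println(parses(p, "2017-03-21T00:00:00.000-1100"));  // false
        // the UTC designator "Z" does
        System.out.println(parses(p, "2017-03-21T00:00:00.000Z"));      // true
    }
}
```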
Interestingly, the default format value in R omits the timezone. I think this is an issue we need to fix (I think we have seen a test failure reported by someone in India a while back):
https://github.com/apache/spark/blame/master/R/pkg/R/functions.R#L2844
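The worry with omitting the timezone is that parsing then falls back to the JVM default zone, so the same text maps to different instants on different machines; a minimal sketch of that drift (class name and zone choices are illustrative, not from the PR):

```java
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class DefaultZoneDrift {
    public static void main(String[] args) throws Exception {
        // A format with no timezone letter leaves the zone implicit
        String pattern = "yyyy-MM-dd HH:mm:ss";

        SimpleDateFormat inKolkata = new SimpleDateFormat(pattern);
        inKolkata.setTimeZone(TimeZone.getTimeZone("Asia/Kolkata"));

        SimpleDateFormat inUtc = new SimpleDateFormat(pattern);
        inUtc.setTimeZone(TimeZone.getTimeZone("UTC"));

        // The same text yields instants 5.5 hours apart
        long diffMillis = inUtc.parse("2017-03-21 00:00:00").getTime()
                        - inKolkata.parse("2017-03-21 00:00:00").getTime();
        System.out.println(diffMillis / 60000); // 330 minutes (+05:30)
    }
}
```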
Yeah, they look like similar instances. However, it seems the doc states:

> ... the timestamp of that moment in the current system time zone ...

> ... using the default timezone and the default locale, ...

This is probably fine, to my knowledge.
I will double check this path and related references and then open a JIRA if it looks needed.
This option is specific to CSV, not JDBC
Test build #75413 has finished for PR 17489 at commit
I see, so I think you need to check cases like "-1100" and "-11". Is there any case that worked before but not after the change?
Yes, it seems so, given my tests and my reading of the code.
Yes, it is.
To my knowledge, both cases fail with both `XXX` and `ZZ`:

```scala
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-11
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-11
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided
```

whereas fewer pattern letters accept the shorter offset forms, in both `FastDateFormat` and `SimpleDateFormat`:

```scala
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXX").parse("2017-03-21T00:00:00.000-1100")
res18: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSX").parse("2017-03-21T00:00:00.000-11")
res19: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXX").parse("2017-03-21T00:00:00.000-1100")
res20: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX").parse("2017-03-21T00:00:00.000-11")
res21: java.util.Date = Tue Mar 21 20:00:00 KST 2017
```
I understand this concern, and that is why I checked the code. They appear to use the same timezone pattern for both.
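For comparison with the transcript above, stock `SimpleDateFormat` accepts a different offset shape for each `X` count; a small illustrative check (the `parses` helper is made up):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class OffsetLetterCount {
    // Hypothetical helper: does the value parse under the pattern?
    static boolean parses(String pattern, String value) {
        try {
            new SimpleDateFormat(pattern).parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String base = "yyyy-MM-dd'T'HH:mm:ss.SSS";
        // X   -> hours-only offset, e.g. "-11"
        System.out.println(parses(base + "X", "2017-03-21T00:00:00.000-11"));
        // XX  -> basic form, e.g. "-1100"
        System.out.println(parses(base + "XX", "2017-03-21T00:00:00.000-1100"));
        // XXX -> extended form, e.g. "-11:00"
        System.out.println(parses(base + "XXX", "2017-03-21T00:00:00.000-11:00"));
    }
}
```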
I think the question is: would this cause any input that currently works to not work? If it's a corner case we need to break in order to fix bigger problems, that's possible, but just trying to quantify that. That is, maybe this is also solvable with a doc change, to describe correctly what it currently does rather than change what it does?
I tested the "-1100" cases before. I guess this was rather a bug in commons-lang that was fixed after we upgraded commons-lang in SPARK-17985. I just downloaded Spark 2.0.0 and tested the code below:

```scala
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
res2: java.util.Date = Tue Mar 21 20:00:00 KST 2017
```

Spark 2.1.0:

```scala
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided
```

I just tested what I tested before - #15147 (comment):

```scala
val path = "/tmp/abcd"
val textDf = Seq(
  "time",
  "2015-07-20T15:09:23.736-0500",
  "2015-07-20T15:10:51.687-0500",
  "2015-11-21T23:15:01.499-0600").toDF()
textDf.coalesce(1).write.text(path)
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
df.printSchema()
```

Spark 2.0.2:

Spark 2.1.0:
Let me just fix up only the documentation if you are worried about it. I am fine with that.
Otherwise, let me ask if both
OK, then "-1100" doesn't work right now, though it used to. ZZ is supposed to support "-1100" according to the Java docs, but then again that also suggests some things don't work which actually do, because we're using FastDateFormat. ZZ actually means, to FastDateFormat, what XXX means to the Java implementation, but both interpret XXX the same way? It does seem simpler to standardize on XXX because it's clearer all around, if they both behave the same way and behave how ZZ already behaves. In that case, we do want to update the doc and implementation to use XXX, I think.

The problem is that "-1100" should have worked according to the docs but didn't? I think it's OK for it to continue to not work, then. More importantly, this is supposed to be the ISO 8601 default format, right? XXX means that, not ZZ. It's a complex situation, but I think this is a good change for 2.2. Of course, this only affects the default; one can override it with any specific format.
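For what it's worth, the same `XXX` letter carries over to `java.time.DateTimeFormatter` (which Spark moved to in later versions), with the same strict ISO 8601 behaviour; an illustrative check, not from the PR:

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class JavaTimeXxx {
    public static void main(String[] args) {
        // XXX in java.time also means the ISO 8601 extended offset
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX");
        OffsetDateTime t = OffsetDateTime.parse("2017-03-21T00:00:00.000-11:00", f);
        System.out.println(t.getOffset()); // -11:00
        try {
            OffsetDateTime.parse("2017-03-21T00:00:00.000-1100", f);
            System.out.println("basic form parsed");
        } catch (DateTimeParseException e) {
            System.out.println("basic form rejected");
        }
    }
}
```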
Yes. I asked a question on the commons mailing list about this for clarity. Let me update you when I have an answer.
OK, if this changes all instances of ZZ to XXX, then that sounds good for 2.2.
I tried to grep for this pattern. I think these are all of them, unless I have missed some. Thanks for approving it.
Merged to master |
Thank you for asking the details and merging it @srowen. |
I get "XXX format is invalid".
No one can know what happened from this alone. Please file a JIRA with a full reproducer, the full error messages, and any analysis you can make.
What changes were proposed in this pull request?

This PR proposes to use the `XXX` format instead of `ZZ`. `ZZ` seems to be `FastDateFormat`-specific: it supports "ISO 8601 extended format time zones", but only as a `FastDateFormat`-specific option. I misunderstood it as a format compatible with `SimpleDateFormat` when this change was introduced. Please see the SimpleDateFormat documentation and the FastDateFormat documentation.

It seems we had better replace `ZZ` with `XXX` because they appear to use the same strategy - FastDateParser.java#L930, FastDateParser.java#L932-L951 and FastDateParser.java#L596-L601. I also checked the code and manually debugged it to be sure. It seems both cases use the same pattern `( Z|(?:[+-]\\d{2}(?::)\\d{2}))`.

Note that this should rather be a documentation fix and not a behaviour change, because `ZZ` seems to be an invalid date format in `SimpleDateFormat` as documented in `DataFrameReader` etc., and both `ZZ` and `XXX` appear to work identically with `FastDateFormat`.

Current documentation is as below:
How was this patch tested?
Existing tests should cover this. Also, manually tested as below (BTW, I don't think these are worth being added as tests within Spark):
Parse
Format
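As a side note, the timezone alternation quoted in the description can be checked directly with `java.util.regex` (an illustrative sketch; the regex is copied from the description, minus the leading space):

```java
import java.util.regex.Pattern;

public class TzRegexCheck {
    public static void main(String[] args) {
        // The timezone pattern both ZZ and XXX reportedly compile to in FastDateParser
        Pattern tz = Pattern.compile("Z|(?:[+-]\\d{2}(?::)\\d{2})");
        System.out.println(tz.matcher("Z").matches());      // true
        System.out.println(tz.matcher("-11:00").matches()); // true: extended form
        System.out.println(tz.matcher("-1100").matches());  // false: basic form
    }
}
```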