Skip to content

Conversation

@HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Mar 31, 2017

What changes were proposed in this pull request?

This PR proposes to use XXX format instead of ZZ. ZZ seems a FastDateFormat specific.

ZZ supports "ISO 8601 extended format time zones" but it seems FastDateFormat specific option.
I misunderstood this is compatible format with SimpleDateFormat when this change is introduced.
Please see SimpleDateFormat documentation and FastDateFormat documentation.

It seems we better replace ZZ to XXX because they look using the same strategy - FastDateParser.java#L930, FastDateParser.java#L932-L951 and FastDateParser.java#L596-L601.

I also checked the codes and manually debugged it for sure. It seems both cases use the same pattern ( Z|(?:[+-]\\d{2}(?::)\\d{2})).

Note that this should be rather a fix about documentation and not the behaviour change because ZZ seems invalid date format in SimpleDateFormat as documented in DataFrameReader and etc, and both ZZ and XXX look identically working with FastDateFormat

Current documentation is as below:

   * <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that
   * indicates a timestamp format. Custom date formats follow the formats at
   * `java.text.SimpleDateFormat`. This applies to timestamp type.</li>

How was this patch tested?

Existing tests should cover this. Also, manually tested as below (BTW, I don't think these are worth being added as tests within Spark):

Parse

scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala>  new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017

scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00"
  at java.text.DateFormat.parse(DateFormat.java:366)
  ... 48 elided
scala>  new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z"
  at java.text.DateFormat.parse(DateFormat.java:366)
  ... 48 elided
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017

Format

scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").format(new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00"))
res6: String = 2017-03-21T20:00:00.000+09:00
scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ")
fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSZZ,ko_KR,Asia/Seoul]

scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
res1: String = 2017-03-21T20:00:00.000+09:00

scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSXXX,ko_KR,Asia/Seoul]

scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
res2: String = 2017-03-21T20:00:00.000+09:00

@HyukjinKwon
Copy link
Member Author

cc @srowen and @ueshin, could you take a look please?

@HyukjinKwon HyukjinKwon changed the title [SPARK-20166][SQL] Use XXX for ISO timezone instead of ZZ (FastDateFormat specific) in CSV/JSON timeformat options [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat specific) in CSV/JSON timeformat options Mar 31, 2017
formats follow the formats at ``java.text.SimpleDateFormat``.
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSZZ``.
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interestingly, the default format value in R omits the timezone. I think this is an issue that we need to fix (think we have seen some test failure reported by someone in India a while back)
https://github.com/apache/spark/blame/master/R/pkg/R/functions.R#L2844

Copy link
Member Author

@HyukjinKwon HyukjinKwon Apr 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, they look apparently similar instances. However, it seems the doc states

... the timestamp of that moment in the current system time zone ...

... using the default timezone and the default locale, ...

This probably is fine up to my knowledge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will double check this path and related references and then open a JIRA if it looks needed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This option is specific to CSV, not JDBC

@SparkQA
Copy link

SparkQA commented Mar 31, 2017

Test build #75413 has finished for PR 17489 at commit a0e332d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Mar 31, 2017

I see, so ZZ means something different to FastDateFormat, and what it means actually matches what XXX means to SimpleDateFormat? But FastDateFormat also supports XXX and supports it the same way? then this makes sense.

I think you need to check cases like "-1100" and "11". Is there any case that worked before but not after the change?

@HyukjinKwon
Copy link
Member Author

ZZ means something different to FastDateFormat, and what it means actually matches what XXX means to SimpleDateFormat?

Yes, it seems so given my tests and checking the codes.

But FastDateFormat also supports XXX and supports it the same way?

Yes, it is.

I think you need to check cases like "-1100" and "11".

Up to my knowledge, both cases should be XX and X for each. It seems throwing an exception as below if we use XXX or ZZ.

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-11
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-11
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

XX and X test for both as below:

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXX").parse("2017-03-21T00:00:00.000-1100")
res18: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSX").parse("2017-03-21T00:00:00.000-11")
res19: java.util.Date = Tue Mar 21 20:00:00 KST 2017
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXX").parse("2017-03-21T00:00:00.000-1100")
res20: java.util.Date = Tue Mar 21 20:00:00 KST 2017

scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX").parse("2017-03-21T00:00:00.000-11")
res21: java.util.Date = Tue Mar 21 20:00:00 KST 2017

Is there any case that worked before but not after the change?

I understand this concern and that's why I checked the codes. They look using the same timezone pattern for both ZZ and XXX - FastDateParser.java#L930, FastDateParser.java#L932-L951 and FastDateParser.java#L596-L601.

@srowen
Copy link
Member

srowen commented Mar 31, 2017

I think the question is: would this cause any input that currently works to not work?
It's not a deal-breaker but that's the question. Doesn't "-1100" work right now? it would stop working if the default is XXX. According to the docs, the current default of ZZ should allow that and it seems to work.

If it's a corner case we need to break in order to fix bigger problems, that's possible, but just trying to quantify that. That is, maybe this is also solvable with a doc change, to describe correctly what it currently does rather than change what it does?

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Mar 31, 2017

I tested "-1100" cases before with ZZ.

I guess this was rather a bug in commons-lang but fixed after we upgrade commons-lang SPARK-17985. I just downloaded Spark 2.0.0 and tested the codes below:

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
res2: java.util.Date = Tue Mar 21 20:00:00 KST 2017

Spark 2.1.0:

scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-1100")
java.text.ParseException: Unparseable date: 2017-03-21T00:00:00.000-1100
  at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:371)
  at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:545)
  ... 48 elided

I just tested what I tested before - #15147 (comment)

val path = "/tmp/abcd"
val textDf = Seq(
  "time",
  "2015-07-20T15:09:23.736-0500",
  "2015-07-20T15:10:51.687-0500",
  "2015-11-21T23:15:01.499-0600").toDF()
textDf.coalesce(1).write.text(path)
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
df.printSchema()

Spark 2.0.2:

root
 |-- time: timestamp (nullable = true)

Spark 2.1.0:

root
 |-- time: string (nullable = true)

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Mar 31, 2017

Let me just fix up only the documentation if you are worried of it. I am fine with it.

@HyukjinKwon
Copy link
Member Author

Otherwise, let me ask if both ZZ to XXX are really the same to Apache commons user mailing list.

@srowen
Copy link
Member

srowen commented Mar 31, 2017

OK, then "-1100" doesn't work right now, though it used to. ZZ is supposed to support "-1100" according to the Java docs, but then again, that also suggests some things don't work which do actually work because we're using FastDateFormat. ZZ actually means, to FastDateFormat, what XXX means to the Java implementation? but both interpret XXX the same way?

It does seem simpler to standardize on XXX because it's clearer all around, if they both behave the same way and behave how ZZ already behaves. In that case, we do want to update the doc and implementation to use XXX, I think.

The problem is that "-1100" should have worked according to docs but didn't? I think that it's OK for it to continue to not work then.

More importantly, this is supposed to be the ISO 8601 default format right? and XXX means that, not ZZ.

It's a complex situation but I think this is good change for 2.2. Of course, this only affects the default. One can override with any specific format.

@HyukjinKwon
Copy link
Member Author

Yes. I asked a question to commons mailing list about this for clarity. Let me update you when I have an answer.

@HyukjinKwon
Copy link
Member Author

@srowen, I had a great answer about ZZ - here. It seems okay to use XXX instead.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK if this changes all instances of ZZ to XXX then that sounds good for 2.2.

@HyukjinKwon
Copy link
Member Author

I tried to grep this pattern. I think these are all if I haven't missed ones. Thanks for approving it.

@srowen
Copy link
Member

srowen commented Apr 3, 2017

Merged to master

@asfgit asfgit closed this in cff11fd Apr 3, 2017
@HyukjinKwon
Copy link
Member Author

Thank you for asking the details and merging it @srowen.

@HyukjinKwon HyukjinKwon deleted the SPARK-20166 branch January 2, 2018 03:38
Copy link
Contributor

@tooptoop4 tooptoop4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get XXX format is invalid

@HyukjinKwon
Copy link
Member Author

No one knows what happened to you. Please file a JIRA with a full reproducer with full error messages with some analysis you can make.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants