-
Couldn't load subscription status.
- Fork 28.9k
[SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time #13652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @JoshRosen |
|
|
||
| test("daysToMillis and millisToDays") { | ||
| (-1001 to 2000).foreach { d => | ||
| assert(millisToDays(daysToMillis(d)) === d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may need to test this in multiple timezones, since this wouldn't have necessarily failed before in US Pacific. Switching the default timezone within a test case is a bit tricky, though: I have code for this in one of my branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See that, will pull yours here.
|
Test build #60441 has finished for PR 13652 at commit
|
|
/cc @RussellSpitzer @ckadner @adrian-wang, since I think you've all either worked on this code or have commented on issues similar to this one. |
|
I think this is unfortunately the right thing to do. I wish we didn't have to use java.sql.Date :( |
|
@RussellSpitzer Unfortunately TimeZone does not provide enough API to access the transitions, we have to probe them as a blackbox. I had pulled in the unit tests from @JoshRosen 's branch, make sure all the time zoned are covered. |
|
Test build #60511 has finished for PR 13652 at commit
|
|
Test build #3110 has finished for PR 13652 at commit
|
|
I think a lot of my own confusion here stems from trying to reason through what should happen if I take When we convert a |
|
Test build #60607 has finished for PR 13652 at commit
|
| DateTimeTestUtils.withDefaultTimeZone(tz) { | ||
| (-2000 to 2000).foreach { d => | ||
| assert(millisToDays(daysToMillis(d)) === d) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the range (-2000 to 2000) should be extended to cover more recent dates, today is day 16967 since the epoch, i.e.:
for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
DateTimeTestUtils.withDefaultTimeZone(tz) {
(-20000 to 20000).foreach { d =>
assert(millisToDays(daysToMillis(d)) === d, s"toJavaDate($d) = ${toJavaDate(d)} in ${tz}")
}
}
...which will show more problem(s):
8633 did not equal 8632 toJavaDate(8632) = 1993-08-21 in sun.util.calendar.ZoneInfo[id="Kwajalein",offset=43200000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]
... list of the incorrectly converted days, with dates in time zones:
for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
DateTimeTestUtils.withDefaultTimeZone(tz) {
(-20000 to 20000).filterNot(d => d === millisToDays(daysToMillis(d))).foreach { d =>
println(s"toJavaDate($d) = ${toJavaDate(d)} in ${tz}")
}
}
}
toJavaDate(8632) = 1993-08-21 in time zone [id="Kwajalein",offset=43200000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]
toJavaDate(15338) = 2011-12-31 in time zone [id="Pacific/Apia",offset=46800000,dstSavings=3600000,useDaylight=true,transitions=59,lastRule=java.util.SimpleTimeZone[id=Pacific/Apia,offset=46800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=2,startMonth=8,startDay=-1,startDayOfWeek=1,startTime=10800000,startTimeMode=0,endMode=3,endMonth=3,endDay=1,endDayOfWeek=1,endTime=14400000,endTimeMode=0]]
toJavaDate(9131) = 1995-01-02 in time zone [id="Pacific/Enderbury",offset=46800000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]
toJavaDate(15338) = 2011-12-31 in time zone [id="Pacific/Fakaofo",offset=46800000,dstSavings=0,useDaylight=false,transitions=4,lastRule=null]
toJavaDate(9131) = 1995-01-02 in time zone [id="Pacific/Kiritimati",offset=50400000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]
toJavaDate(8632) = 1993-08-21 in time zone [id="Pacific/Kwajalein",offset=43200000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]
toJavaDate(15338) = 2011-12-31 in time zone [id="MIT",offset=46800000,dstSavings=3600000,useDaylight=true,transitions=59,lastRule=java.util.SimpleTimeZone[id=MIT,offset=46800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=2,startMonth=8,startDay=-1,startDayOfWeek=1,startTime=10800000,startTimeMode=0,endMode=3,endMonth=3,endDay=1,endDayOfWeek=1,endTime=14400000,endTimeMode=0]]
|
Question to check my conceptual understanding: if I don't use explicit sc.parallelize(Seq(1466036527, 1466036527))
.toDF("ts")
.selectExpr("cast(ts as timestamp)")
.filter("ts < '2016-06-15 17:22:07.0'")
.collect()If I run this with |
|
It looks like there's some precedence for having the notion of separate "session" and "server" timezones in MySQL and Postgres (as well as JDBC), so it may be okay to have parsing of timestamp literals be something that's driver-time-zone sensitive. This is kind of orthogonal to this PR, though, so maybe we should spin off that discussion thread into a new JIRA ticket to discuss session-specific timezone settings. |
|
I would love to be able to just specify days since epoch rather than using java.sql.Date |
| } | ||
| } | ||
|
|
||
| /** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since 1970-01-01 00:00:00 UTC or local timezone?
|
@JoshRosen If any human time representation is involved, for example, After extending the range of test cases, just realize another fact: In some timezone, a whole day in human time could be skipped, for example, 1993-08-21 in Kwajalein timezone, here is the code to show this: So I updated the tests to ignore these days. |
|
@RussellSpitzer Unfortunately, most of the database do not support that (cast a integer into date), same here. For parquet, we use the days since epoch rather than using java.sql.Date. |
|
Test build #60647 has finished for PR 13652 at commit
|
|
Test build #60648 has finished for PR 13652 at commit
|
|
Test build #3119 has finished for PR 13652 at commit
|
|
@JoshRosen How does this look now? |
|
I think this is a good "best effort" fix for the round-trip conversion It may also be worth testing the round-trip conversion test("fromJavaDate toJavaDate in all timezones") {
val debug = true
val years = 1900 to 2020
val dates : Seq[String] = for {
year <- years
month <- 1 to 12
day <- 1 to 31
} yield new Date(year-1900, month-1, day).toString
def testDateIsEqual(date: String): Boolean = {
val d1 = Date.valueOf(date)
val d2 = toJavaDate(fromJavaDate(d1))
d2 === d1
}
def testStringEquals(date: String): Boolean = {
val d1 = Date.valueOf(date)
val d2 = toJavaDate(fromJavaDate(d1))
d2.toString === d1.toString
}
// make sure all converted Dates have a matching String representation (ignores time component)
for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
DateTimeTestUtils.withDefaultTimeZone(tz) {
assert(dates.filterNot(testStringEquals _).isEmpty)
}
}
// find Dates that don't convert to an equal Date (non-matching time component)
val nonEqualDatesWithTZ = for {
tz <- DateTimeTestUtils.ALL_TIMEZONES
badDate <- DateTimeTestUtils.withDefaultTimeZone(tz) { dates.filterNot(testDateIsEqual _) }
} yield (badDate, tz)
// scalastyle:off println
if (debug) {
println(s"${nonEqualDatesWithTZ.size} incorrect conversions ${years.start}..${years.end}")
nonEqualDatesWithTZ.foreach { case (d, t) => println(s"${d} in ${t.getID}") }
}
// scalastyle:on println
}Output when The extra debug stuff should be removed of course. |
|
Test build #60773 has finished for PR 13652 at commit
|
|
@ckadner Had updated the PR to fallback to java.sql.Timestamp to do the reverse lookup, that should have better resolution (especially when converting a timestamp to another timezone). The test you posted still failed on 52 cases, because the equality of java.sql.Date is based on unix timestamp (in UTC), but it could be multiple timestamp mapping to same human time, so the equality of java.sql.Date could have false negative. For 52 cases here, I checked that the Date objects are actually point to the start of the day (create a Timestamp objects using the unix timestamp of the two Date objects). |
|
Since we have to cut a RC for 2.0, I will merge this into master and 2.0 after it pass tests. |
|
Test build #60775 has finished for PR 13652 at commit
|
…ylight Saving Time ## What changes were proposed in this pull request? Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice. ## How was this patch tested? Added regression tests. Author: Davies Liu <[email protected]> Closes #13652 from davies/fix_timezone. (cherry picked from commit 001a589) Signed-off-by: Davies Liu <[email protected]>
…ylight Saving Time ## What changes were proposed in this pull request? Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice. ## How was this patch tested? Added regression tests. Author: Davies Liu <[email protected]> Closes #13652 from davies/fix_timezone. (cherry picked from commit 001a589) Signed-off-by: Davies Liu <[email protected]>
|
Merged into 1.6 and 2.0 |
|
Awesome, thanks! |
|
hi @davies, two tests still fail after this patch when I build locally: My time zone: |
…ylight Saving Time ## What changes were proposed in this pull request? Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice. ## How was this patch tested? Added regression tests. Author: Davies Liu <[email protected]> Closes apache#13652 from davies/fix_timezone. (cherry picked from commit 001a589) Signed-off-by: Davies Liu <[email protected]> (cherry picked from commit 41efd20)
|
Also failing here in the UK:
|
|
@lw-lin @robbinspg I submit a PR to fix the failure: #13783 . Thanks for your information! |
|
@lw-lin @robbinspg Thanks for report this, sent #13784 to fix the bug. |
…ylight Saving Time Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice. Added regression tests. Author: Davies Liu <[email protected]> Closes #13652 from davies/fix_timezone.
|
I (also) just created a PR #13785 to fix the two test cases @davies Would it not suffice to set the (thread local) time zone in the two test cases? |
|
@ckadner That could fix the flaky test, but hide the actual bug. |
|
@davies ...just to confirm the desired behavior of these two test cases
test("from UTC timestamp") {
def test(utc: String, tz: String, expected: String): Unit = {
assert(toJavaTimestamp(fromUTCTime(fromJavaTimestamp(Timestamp.valueOf(utc)), tz)).toString
=== expected)
}
for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
DateTimeTestUtils.withDefaultTimeZone(tz) {
test("2011-12-25 09:00:00.123456", "UTC", "2011-12-25 09:00:00.123456")
test("2011-12-25 09:00:00.123456", "JST", "2011-12-25 18:00:00.123456")
test("2011-12-25 09:00:00.123456", "PST", "2011-12-25 01:00:00.123456")
test("2011-12-25 09:00:00.123456", "Asia/Shanghai", "2011-12-25 17:00:00.123456")
}
}
...(*) Before your fix (without the for-loop) the third time zone involved was determined by the user's JVM default Timezone. If this 3-timezone behavior is what we want to test, then should you replace the two occurrences of ... DateTimeTestUtils.withDefaultTimeZone(TimeZone.getTimeZone("PST")) {with ... for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
DateTimeTestUtils.withDefaultTimeZone(tz) {... or would/should DateTimeTestUtils.withDefaultTimeZone(TimeZone.getTimeZone("PST")) {
// Daylight Saving Time
test("2016-03-13 01:59:59", "PST", "2016-03-13 09:59:59.0")
// 2016-03-13 02:00:00 PST does not exists
test("2016-03-13 02:00:00", "PST", "2016-03-13 10:00:00.0")
test("2016-03-13 03:00:00", "PST", "2016-03-13 10:00:00.0")
test("2016-11-06 00:59:59", "PST", "2016-11-06 07:59:59.0")
// 2016-11-06 01:00:00 PST could be 2016-11-06 08:00:00 UTC or 2016-11-06 09:00:00 UTC
test("2016-11-06 01:00:00", "PST", "2016-11-06 09:00:00.0")
test("2016-11-06 01:59:59", "PST", "2016-11-06 09:59:59.0")
test("2016-11-06 02:00:00", "PST", "2016-11-06 10:00:00.0")
} |
|
@ckadner The test function here try to similar this SQL function: cast(from_unix_timestamp(cast("2016-03-13 01:59:59" as timestamp), "PST") as string) In the beginning, I want to run all these tests in different timezone, but unfortunately some of them could not pass the tests, because of |
…ylight Saving Time Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not). This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice. Added regression tests. Author: Davies Liu <[email protected]> Closes apache#13652 from davies/fix_timezone. (cherry picked from commit db86e7f)
What changes were proposed in this pull request?
Internally, we use Int to represent a date (the days since 1970-01-01), when we convert that into unix timestamp (milli-seconds since epoch in UTC), we get the offset of a timezone using local millis (the milli-seconds since 1970-01-01 in a timezone), but TimeZone.getOffset() expect unix timestamp, the result could be off by one hour (in Daylight Saving Time (DST) or not).
This PR change to use best effort approximate of posix timestamp to lookup the offset. In the event of changing of DST, Some time is not defined (for example, 2016-03-13 02:00:00 PST), or could lead to multiple valid result in UTC (for example, 2016-11-06 01:00:00), this best effort approximate should be enough in practice.
How was this patch tested?
Added regression tests.