Conversation

@davies (Contributor) commented Jun 13, 2016

What changes were proposed in this pull request?

Internally, we use an Int to represent a date (the number of days since 1970-01-01). When we convert that into a Unix timestamp (milliseconds since the epoch in UTC), we look up the timezone offset using local millis (milliseconds since 1970-01-01 in that timezone), but TimeZone.getOffset() expects a Unix timestamp, so the result can be off by one hour (depending on whether Daylight Saving Time (DST) is in effect).

This PR changes the conversion to use a best-effort approximation of the POSIX timestamp when looking up the offset. Around a DST transition, some local times are not defined (for example, 2016-03-13 02:00:00 PST) and others map to multiple valid instants in UTC (for example, 2016-11-06 01:00:00), so this best-effort approximation should be good enough in practice.
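The idea, as a minimal sketch (a simplification, not the exact code in DateTimeUtils; guessOffset is a hypothetical helper name):

import java.util.TimeZone

object DaysToMillisSketch {
  val MILLIS_PER_DAY: Long = 24L * 60 * 60 * 1000

  // TimeZone.getOffset expects a UTC instant, but we only have local millis, so use the
  // local value as a first guess and refine it once with the offset that guess yields.
  def guessOffset(tz: TimeZone, localMillis: Long): Int = {
    val firstGuess = tz.getOffset(localMillis)   // treat local millis as if they were UTC
    tz.getOffset(localMillis - firstGuess)       // re-probe near the actual UTC instant
  }

  // Days since 1970-01-01 (local) -> milliseconds since the epoch in UTC.
  def daysToMillis(days: Int, tz: TimeZone = TimeZone.getDefault): Long = {
    val localMillis = days.toLong * MILLIS_PER_DAY
    localMillis - guessOffset(tz, localMillis)
  }
}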

How was this patch tested?

Added regression tests.

@davies (Contributor, Author) commented Jun 13, 2016

cc @JoshRosen


test("daysToMillis and millisToDays") {
  (-1001 to 2000).foreach { d =>
    assert(millisToDays(daysToMillis(d)) === d)
Contributor comment:

You may need to test this in multiple timezones, since this wouldn't necessarily have failed before in US Pacific. Switching the default timezone within a test case is a bit tricky, though: I have code for this in one of my branches.
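For reference, such a helper can be as small as the following sketch (the utility that was eventually pulled in is DateTimeTestUtils.withDefaultTimeZone; the name here just mirrors it):

import java.util.TimeZone

// Temporarily switch the JVM default time zone for the duration of a block,
// restoring the original zone even if the block throws.
def withDefaultTimeZone[T](tz: TimeZone)(block: => T): T = {
  val original = TimeZone.getDefault
  TimeZone.setDefault(tz)
  try block finally TimeZone.setDefault(original)
}

// Example: withDefaultTimeZone(TimeZone.getTimeZone("Pacific/Kwajalein")) { runTest() }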

Contributor (Author) reply:

Saw that, will pull yours in here.

@SparkQA commented Jun 14, 2016

Test build #60441 has finished for PR 13652 at commit 20904c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor)

/cc @RussellSpitzer @ckadner @adrian-wang, since I think you've all either worked on this code or have commented on issues similar to this one.

@RussellSpitzer (Member)

I think this is unfortunately the right thing to do. I wish we didn't have to use java.sql.Date :(

@davies (Contributor, Author) commented Jun 14, 2016

@RussellSpitzer Unfortunately, TimeZone does not provide enough API to access the transitions, so we have to probe them as a black box. I have pulled in the unit tests from @JoshRosen's branch to make sure all the time zones are covered.

@davies changed the title from "[SPARK-] Fix incorrect days to millis conversion" to "[SPARK-15613] [SQL] Fix incorrect days to millis conversion" on Jun 14, 2016
@SparkQA commented Jun 14, 2016

Test build #60511 has finished for PR 13652 at commit ac3d2b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies changed the title from "[SPARK-15613] [SQL] Fix incorrect days to millis conversion" to "[SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time" on Jun 15, 2016
@SparkQA commented Jun 15, 2016

Test build #3110 has finished for PR 13652 at commit ac3d2b5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor)

I think a lot of my own confusion here stems from trying to reason through what should happen if I take java.sql.Date.valueOf("2016-01-01") and convert it into Spark's internal date representation. Date.valueOf appears to be time-zone sensitive in the sense that the date's internal timestamp will be different depending on the JVM's local timezone, so the result of java.sql.Date.valueOf("2016-01-01").getTime() will vary from JVM to JVM.

When we convert a java.sql.Date into an Int to be stored in an UnsafeRow, we call DateTimeUtils.fromJavaDate, which is implemented as millisToDays(date.getTime). The value passed to millisToDays is a UTC timestamp that represents the start of that day in the user's local timezone, so I'm a bit confused about why the first step of millisToDays adds threadLocalLocalTimeZone.get().getOffset(millisUtc) before truncating to compute the day number. Why aren't we subtracting the offset in order to normalize the time back to the start of the day in UTC?
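For concreteness, a sketch of the step in question (paraphrased, not quoted from DateTimeUtils):

import java.util.TimeZone

val MILLIS_PER_DAY = 24L * 60 * 60 * 1000

// Sketch: map a UTC instant to the *local* calendar day. Adding the zone offset turns the
// UTC instant into local wall-clock millis, and flooring by MILLIS_PER_DAY then picks the
// local day. For example, in Asia/Shanghai (UTC+8), local midnight of 2016-01-01 is
// 2015-12-31 16:00 UTC; without adding the offset the floor would land on 2015-12-31
// instead of 2016-01-01.
def millisToDaysSketch(millisUtc: Long, tz: TimeZone = TimeZone.getDefault): Int = {
  val millisLocal = millisUtc + tz.getOffset(millisUtc)
  Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
}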

@SparkQA commented Jun 15, 2016

Test build #60607 has finished for PR 13652 at commit f23b545.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

DateTimeTestUtils.withDefaultTimeZone(tz) {
  (-2000 to 2000).foreach { d =>
    assert(millisToDays(daysToMillis(d)) === d)
  }
@ckadner (Member) commented Jun 16, 2016

The range (-2000 to 2000) should be extended to cover more recent dates; today is day 16967 since the epoch, i.e.:

for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
  DateTimeTestUtils.withDefaultTimeZone(tz) {
    (-20000 to 20000).foreach { d =>
      assert(millisToDays(daysToMillis(d)) === d, s"toJavaDate($d) = ${toJavaDate(d)} in ${tz}")
    }
  }
}

...which will reveal more problems:

8633 did not equal 8632 toJavaDate(8632) = 1993-08-21 in sun.util.calendar.ZoneInfo[id="Kwajalein",offset=43200000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]

...and here is the list of the incorrectly converted days, with their dates and time zones:

for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
  DateTimeTestUtils.withDefaultTimeZone(tz) {
    (-20000 to 20000).filterNot(d => d === millisToDays(daysToMillis(d))).foreach { d =>
      println(s"toJavaDate($d) = ${toJavaDate(d)} in ${tz}")
    }
  }
}

toJavaDate(8632)  = 1993-08-21 in time zone [id="Kwajalein",offset=43200000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]
toJavaDate(15338) = 2011-12-31 in time zone [id="Pacific/Apia",offset=46800000,dstSavings=3600000,useDaylight=true,transitions=59,lastRule=java.util.SimpleTimeZone[id=Pacific/Apia,offset=46800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=2,startMonth=8,startDay=-1,startDayOfWeek=1,startTime=10800000,startTimeMode=0,endMode=3,endMonth=3,endDay=1,endDayOfWeek=1,endTime=14400000,endTimeMode=0]]
toJavaDate(9131)  = 1995-01-02 in time zone [id="Pacific/Enderbury",offset=46800000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]
toJavaDate(15338) = 2011-12-31 in time zone [id="Pacific/Fakaofo",offset=46800000,dstSavings=0,useDaylight=false,transitions=4,lastRule=null]
toJavaDate(9131)  = 1995-01-02 in time zone [id="Pacific/Kiritimati",offset=50400000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]
toJavaDate(8632)  = 1993-08-21 in time zone [id="Pacific/Kwajalein",offset=43200000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]
toJavaDate(15338) = 2011-12-31 in time zone [id="MIT",offset=46800000,dstSavings=3600000,useDaylight=true,transitions=59,lastRule=java.util.SimpleTimeZone[id=MIT,offset=46800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=2,startMonth=8,startDay=-1,startDayOfWeek=1,startTime=10800000,startTimeMode=0,endMode=3,endMonth=3,endDay=1,endDayOfWeek=1,endTime=14400000,endTimeMode=0]]

@JoshRosen (Contributor) commented Jun 16, 2016

Question to check my conceptual understanding: if I don't use explicit java.sql.Date functions or conversions and don't call any timezone functions, etc., then should any time/date-related pure SQL query return the same results irrespective of my machine's local timezone? I've found a confusing counterexample:

sc.parallelize(Seq(1466036527, 1466036527))
    .toDF("ts")
    .selectExpr("cast(ts as timestamp)")
    .filter("ts < '2016-06-15 17:22:07.0'")
    .collect()

If I run this with ./bin/spark-shell --driver-java-options '-Duser.timezone=America/Los_Angeles' and ./bin/spark-shell --driver-java-options '-Duser.timezone=Europe/Moscow' then I get different results, which I find somewhat perplexing.

@JoshRosen (Contributor)

It looks like there's some precedent for the notion of separate "session" and "server" timezones in MySQL and Postgres (as well as JDBC), so it may be okay for parsing of timestamp literals to be driver-timezone-sensitive. This is kind of orthogonal to this PR, though, so maybe we should spin that discussion off into a new JIRA ticket about session-specific timezone settings.

@RussellSpitzer (Member)

I would love to be able to just specify days since epoch rather than using java.sql.Date

}
}

/**
Contributor comment:

Since 1970-01-01 00:00:00 UTC, or in the local timezone?

@davies (Contributor, Author) commented Jun 16, 2016

@JoshRosen If any human-readable time representation is involved, for example 2016-06-15 17:22:07.0, a timezone has to be applied. In your SQL query, you are comparing an absolute timestamp against a human-readable time without a timezone, so the default timezone is used, which generates different results in different timezones. If you specify the timezone in the human-readable time, for example 2016-06-15 17:22:07.0 PST, then you will get the same result in any timezone.
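As a minimal illustration (a hypothetical helper, not code from this PR), the same literal parses to different epoch millis under different default timezones:

import java.util.TimeZone
import java.sql.Timestamp

// Parse the same human-readable literal under two default time zones; Timestamp.valueOf
// uses the JVM default zone, so the resulting epoch millis differ by the zone offsets.
def epochMillisIn(tzId: String, literal: String): Long = {
  val saved = TimeZone.getDefault
  try {
    TimeZone.setDefault(TimeZone.getTimeZone(tzId))
    Timestamp.valueOf(literal).getTime
  } finally {
    TimeZone.setDefault(saved)
  }
}

val la = epochMillisIn("America/Los_Angeles", "2016-06-15 17:22:07.0")
val moscow = epochMillisIn("Europe/Moscow", "2016-06-15 17:22:07.0")
// la != moscow, which is why the filter in the earlier example behaves differently per JVM timezone.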

After extending the range of test cases, I just realized another fact: in some timezones, a whole day in human time can be skipped, for example 1993-08-21 in the Kwajalein timezone. Here is the code showing this:

scala> TimeZone.getDefault()
res5: java.util.TimeZone = sun.util.calendar.ZoneInfo[id="Kwajalein",offset=43200000,dstSavings=0,useDaylight=false,transitions=5,lastRule=null]

scala> val ts = java.sql.Date.valueOf("1993-08-19").getTime
ts: Long = 745761600000

scala> (0 to 12).foreach { h => println(new java.sql.Timestamp(ts + h * 3600L * 1000 * 4)) }
1993-08-19 00:00:00.0
1993-08-19 04:00:00.0
1993-08-19 08:00:00.0
1993-08-19 12:00:00.0
1993-08-19 16:00:00.0
1993-08-19 20:00:00.0
1993-08-21 00:00:00.0
1993-08-21 04:00:00.0
1993-08-21 08:00:00.0
1993-08-21 12:00:00.0
1993-08-21 16:00:00.0
1993-08-21 20:00:00.0
1993-08-22 00:00:00.0

So I updated the tests to ignore these days.

@davies (Contributor, Author) commented Jun 16, 2016

@RussellSpitzer Unfortunately, most databases do not support that (casting an integer into a date), and the same is true here. For Parquet, we do use the days since the epoch rather than java.sql.Date.

@SparkQA commented Jun 16, 2016

Test build #60647 has finished for PR 13652 at commit 09c99d9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 16, 2016

Test build #60648 has finished for PR 13652 at commit 9df8c84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 16, 2016

Test build #3119 has finished for PR 13652 at commit 9df8c84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor, Author) commented Jun 17, 2016

@JoshRosen How does this look now?

@ckadner (Member) commented Jun 17, 2016

I think this is a good "best effort" fix for the round-trip conversion int -> java.sql.Date -> int.

It may also be worth testing the round-trip conversion java.sql.Date -> int -> java.sql.Date:

  test("fromJavaDate toJavaDate in all timezones") {

    val debug = true
    val years = 1900 to 2020

    val dates : Seq[String] = for {
      year <- years
      month <- 1 to 12
      day <- 1 to 31
    } yield new Date(year-1900, month-1, day).toString

    def testDateIsEqual(date: String): Boolean = {
      val d1 = Date.valueOf(date)
      val d2 = toJavaDate(fromJavaDate(d1))
      d2 === d1
    }

    def testStringEquals(date: String): Boolean = {
      val d1 = Date.valueOf(date)
      val d2 = toJavaDate(fromJavaDate(d1))
      d2.toString === d1.toString
    }

    // make sure all converted Dates have a matching String representation (ignores time component)
    for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
      DateTimeTestUtils.withDefaultTimeZone(tz) {
        assert(dates.filterNot(testStringEquals _).isEmpty)
      }
    }

    // find Dates that don't convert to an equal Date (non-matching time component)
    val nonEqualDatesWithTZ = for {
      tz <- DateTimeTestUtils.ALL_TIMEZONES
      badDate <- DateTimeTestUtils.withDefaultTimeZone(tz) { dates.filterNot(testDateIsEqual _) }
    } yield (badDate, tz)

    // scalastyle:off println
    if (debug) {
      println(s"${nonEqualDatesWithTZ.size} incorrect conversions ${years.start}..${years.end}")
      nonEqualDatesWithTZ.foreach { case (d, t) => println(s"${d} in ${t.getID}") }
    }
    // scalastyle:on println
  }

Output when debug == true

135 incorrect conversions 1900..2020:

1939-11-19 in Africa/Algiers
1900-01-01 in Africa/Ceuta
1943-03-21 in Africa/Windhoek
1922-01-01 in America/Bahia_Banderas
1922-01-01 in America/Cancun
1996-10-27 in America/Chihuahua
1997-10-26 in America/Chihuahua
1981-03-29 in America/Danmarkshavn
1982-03-28 in America/Danmarkshavn
1983-03-27 in America/Danmarkshavn
1984-03-25 in America/Danmarkshavn
1985-03-31 in America/Danmarkshavn
1986-03-30 in America/Danmarkshavn
1987-03-29 in America/Danmarkshavn
1988-03-27 in America/Danmarkshavn
1989-03-26 in America/Danmarkshavn
1990-03-25 in America/Danmarkshavn
1991-03-31 in America/Danmarkshavn
1992-03-29 in America/Danmarkshavn
1993-03-28 in America/Danmarkshavn
1994-03-27 in America/Danmarkshavn
1995-03-26 in America/Danmarkshavn
1965-10-31 in America/Dawson
1969-10-26 in America/Indiana/Tell_City
1970-10-25 in America/Indiana/Tell_City
1965-10-31 in America/Inuvik
1999-10-31 in America/Iqaluit
1945-09-30 in America/Juneau
1969-10-26 in America/Juneau
1970-10-25 in America/Juneau
1971-10-31 in America/Juneau
1972-10-29 in America/Juneau
1973-10-28 in America/Juneau
1974-10-27 in America/Juneau
1975-10-26 in America/Juneau
1976-10-31 in America/Juneau
1977-10-30 in America/Juneau
1978-10-29 in America/Juneau
1979-10-28 in America/Juneau
1981-10-25 in America/Juneau
1982-10-31 in America/Juneau
1922-01-01 in America/Mexico_City
1996-10-27 in America/Ojinaga
1997-10-26 in America/Ojinaga
1945-09-30 in America/Pangnirtung
1980-10-26 in America/Pangnirtung
1981-10-25 in America/Pangnirtung
1982-10-31 in America/Pangnirtung
1983-10-30 in America/Pangnirtung
1984-10-28 in America/Pangnirtung
1985-10-27 in America/Pangnirtung
1986-10-26 in America/Pangnirtung
1987-10-25 in America/Pangnirtung
1988-10-30 in America/Pangnirtung
1989-10-29 in America/Pangnirtung
1990-10-28 in America/Pangnirtung
1991-10-27 in America/Pangnirtung
1992-10-25 in America/Pangnirtung
1993-10-31 in America/Pangnirtung
1994-10-30 in America/Pangnirtung
1999-10-31 in America/Pangnirtung
1950-04-16 in America/Santarem
1945-09-30 in America/Sitka
1969-10-26 in America/Sitka
1970-10-25 in America/Sitka
1971-10-31 in America/Sitka
1972-10-29 in America/Sitka
1973-10-28 in America/Sitka
1974-10-27 in America/Sitka
1975-10-26 in America/Sitka
1976-10-31 in America/Sitka
1977-10-30 in America/Sitka
1978-10-29 in America/Sitka
1979-10-28 in America/Sitka
1980-10-26 in America/Sitka
1981-10-25 in America/Sitka
1982-10-31 in America/Sitka
1965-10-31 in America/Whitehorse
1900-01-01 in Antarctica/Casey
1900-01-01 in Antarctica/Davis
2009-10-18 in Antarctica/Davis
2011-10-28 in Antarctica/Davis
1900-01-01 in Antarctica/DumontDUrville
1900-01-01 in Antarctica/Mawson
1900-01-01 in Antarctica/Syowa
1900-01-01 in Antarctica/Vostok
2001-09-29 in Asia/Choibalsan
2002-09-28 in Asia/Choibalsan
2003-09-27 in Asia/Choibalsan
2004-09-25 in Asia/Choibalsan
2005-09-24 in Asia/Choibalsan
2006-09-30 in Asia/Choibalsan
1916-11-01 in Atlantic/Azores
1916-11-01 in Atlantic/Madeira
1917-10-21 in Atlantic/Reykjavik
1918-11-16 in Atlantic/Reykjavik
1919-11-16 in Atlantic/Reykjavik
1921-06-23 in Atlantic/Reykjavik
1965-10-31 in Canada/Yukon
1900-01-01 in Europe/Brussels
1900-01-01 in Europe/Gibraltar
1916-10-01 in Europe/Kaliningrad
1916-10-01 in Europe/Kaliningrad
1990-07-01 in Europe/Kiev
1990-07-01 in Europe/Kiev
1922-10-08 in Europe/Luxembourg
1924-10-05 in Europe/Luxembourg
1925-10-04 in Europe/Luxembourg
1926-10-03 in Europe/Luxembourg
1927-10-02 in Europe/Luxembourg
1928-10-07 in Europe/Luxembourg
1900-01-01 in Europe/Madrid
1992-09-27 in Europe/Minsk
1911-03-11 in Europe/Paris
1990-07-01 in Europe/Uzhgorod
1990-07-01 in Europe/Uzhgorod
1917-10-21 in Iceland
1918-11-16 in Iceland
1919-11-16 in Iceland
1921-06-23 in Iceland
1900-01-01 in Indian/Kerguelen
1922-01-01 in Mexico/General
1900-01-01 in Pacific/Apia
1950-01-02 in Pacific/Apia
2011-09-25 in Pacific/Apia
1900-01-01 in Pacific/Enderbury
1979-10-02 in Pacific/Enderbury
1900-01-01 in Pacific/Fakaofo
1901-01-02 in Pacific/Fakaofo
1900-01-01 in Pacific/Kiritimati
1979-10-02 in Pacific/Kiritimati
1911-03-11 in ECT
1900-01-01 in MIT
1950-01-02 in MIT
2011-09-25 in MIT

The extra debug stuff should be removed of course.

@SparkQA commented Jun 18, 2016

Test build #60773 has finished for PR 13652 at commit fefc033.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor, Author) commented Jun 18, 2016

@ckadner I have updated the PR to fall back to java.sql.Timestamp for the reverse lookup, which should have better resolution (especially when converting a timestamp to another timezone).

The test you posted still fails on 52 cases, because the equality of java.sql.Date is based on the Unix timestamp (in UTC), but multiple timestamps can map to the same human-readable time, so java.sql.Date equality can produce false negatives.

52 incorrect conversions 1900..2020
1939-11-19 in Africa/Algiers
1900-01-01 in Africa/Ceuta
1922-01-01 in America/Cancun
1965-10-31 in America/Dawson
1965-10-31 in America/Inuvik
1999-10-31 in America/Iqaluit
1922-01-01 in America/Mexico_City
1999-10-31 in America/Pangnirtung
1950-04-16 in America/Santarem
1965-10-31 in America/Whitehorse
1900-01-01 in Antarctica/Casey
1900-01-01 in Antarctica/Davis
2009-10-18 in Antarctica/Davis
2011-10-28 in Antarctica/Davis
1900-01-01 in Antarctica/DumontDUrville
1900-01-01 in Antarctica/Mawson
1900-01-01 in Antarctica/Syowa
1900-01-01 in Antarctica/Vostok
1916-11-01 in Atlantic/Azores
1916-11-01 in Atlantic/Madeira
1917-10-21 in Atlantic/Reykjavik
1918-11-16 in Atlantic/Reykjavik
1919-11-16 in Atlantic/Reykjavik
1921-06-23 in Atlantic/Reykjavik
1965-10-31 in Canada/Yukon
1900-01-01 in Europe/Brussels
1900-01-01 in Europe/Gibraltar
1916-10-01 in Europe/Kaliningrad
1916-10-01 in Europe/Kaliningrad
1922-10-08 in Europe/Luxembourg
1924-10-05 in Europe/Luxembourg
1925-10-04 in Europe/Luxembourg
1926-10-03 in Europe/Luxembourg
1927-10-02 in Europe/Luxembourg
1928-10-07 in Europe/Luxembourg
1900-01-01 in Europe/Madrid
1992-09-27 in Europe/Minsk
1911-03-11 in Europe/Paris
1990-07-01 in Europe/Uzhgorod
1990-07-01 in Europe/Uzhgorod
1917-10-21 in Iceland
1918-11-16 in Iceland
1919-11-16 in Iceland
1921-06-23 in Iceland
1900-01-01 in Indian/Kerguelen
1922-01-01 in Mexico/General
1900-01-01 in Pacific/Apia
1900-01-01 in Pacific/Enderbury
1900-01-01 in Pacific/Fakaofo
1900-01-01 in Pacific/Kiritimati
1911-03-11 in ECT
1900-01-01 in MIT

For these 52 cases, I checked that the Date objects actually point to the start of the day (by creating Timestamp objects from the Unix timestamps of the two Date objects):

(1939-11-19 00:00:00.0,1939-11-19 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1922-01-01 00:00:00.0,1922-01-01 00:00:00.0)
(1965-10-31 00:00:00.0,1965-10-31 00:00:00.0)
(1965-10-31 00:00:00.0,1965-10-31 00:00:00.0)
(1999-10-31 00:00:00.0,1999-10-31 00:00:00.0)
(1922-01-01 00:00:00.0,1922-01-01 00:00:00.0)
(1999-10-31 00:00:00.0,1999-10-31 00:00:00.0)
(1950-04-16 00:00:00.0,1950-04-16 00:00:00.0)
(1965-10-31 00:00:00.0,1965-10-31 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(2009-10-18 00:00:00.0,2009-10-18 00:00:00.0)
(2011-10-28 00:00:00.0,2011-10-28 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1916-11-01 00:00:00.0,1916-11-01 00:00:00.0)
(1916-11-01 00:00:00.0,1916-11-01 00:00:00.0)
(1917-10-21 00:00:00.0,1917-10-21 00:00:00.0)
(1918-11-16 00:00:00.0,1918-11-16 00:00:00.0)
(1919-11-16 00:00:00.0,1919-11-16 00:00:00.0)
(1921-06-23 00:00:00.0,1921-06-23 00:00:00.0)
(1965-10-31 00:00:00.0,1965-10-31 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1916-10-01 00:00:00.0,1916-10-01 00:00:00.0)
(1916-10-01 00:00:00.0,1916-10-01 00:00:00.0)
(1922-10-08 00:00:00.0,1922-10-08 00:00:00.0)
(1924-10-05 00:00:00.0,1924-10-05 00:00:00.0)
(1925-10-04 00:00:00.0,1925-10-04 00:00:00.0)
(1926-10-03 00:00:00.0,1926-10-03 00:00:00.0)
(1927-10-02 00:00:00.0,1927-10-02 00:00:00.0)
(1928-10-07 00:00:00.0,1928-10-07 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1992-09-27 00:00:00.0,1992-09-27 00:00:00.0)
(1911-03-11 00:00:00.0,1911-03-11 00:00:00.0)
(1990-07-01 00:00:00.0,1990-07-01 00:00:00.0)
(1990-07-01 00:00:00.0,1990-07-01 00:00:00.0)
(1917-10-21 00:00:00.0,1917-10-21 00:00:00.0)
(1918-11-16 00:00:00.0,1918-11-16 00:00:00.0)
(1919-11-16 00:00:00.0,1919-11-16 00:00:00.0)
(1921-06-23 00:00:00.0,1921-06-23 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1922-01-01 00:00:00.0,1922-01-01 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
(1911-03-11 00:00:00.0,1911-03-11 00:00:00.0)
(1900-01-01 00:00:00.0,1900-01-01 00:00:00.0)
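To make the false negative concrete, a small sketch (hypothetical helper names): Date.equals compares the underlying epoch millis, while the check we actually care about is the rendered calendar day:

import java.sql.Date

// Date.equals/getTime compare the underlying UTC instant, so two Dates that render as the
// same local day (because the zone maps more than one instant to that day) can still
// compare unequal. Comparing the string form ignores the time-of-day component.
def sameInstant(a: Date, b: Date): Boolean = a.getTime == b.getTime     // what Date.equals uses
def sameCalendarDay(a: Date, b: Date): Boolean = a.toString == b.toString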

@davies (Contributor, Author) commented Jun 18, 2016

Since we have to cut an RC for 2.0, I will merge this into master and 2.0 after it passes tests.

@SparkQA commented Jun 18, 2016

Test build #60775 has finished for PR 13652 at commit 88df176.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 19, 2016

…ylight Saving Time

Author: Davies Liu <[email protected]>

Closes #13652 from davies/fix_timezone.

(cherry picked from commit 001a589)
Signed-off-by: Davies Liu <[email protected]>
asfgit pushed a commit that referenced this pull request Jun 19, 2016

…ylight Saving Time

Author: Davies Liu <[email protected]>

Closes #13652 from davies/fix_timezone.

(cherry picked from commit 001a589)
Signed-off-by: Davies Liu <[email protected]>
@davies (Contributor, Author) commented Jun 19, 2016

Merged into 1.6 and 2.0

@asfgit closed this in 001a589 on Jun 19, 2016
@4e6 commented Jun 19, 2016

Awesome, thanks!

@lw-lin (Contributor) commented Jun 19, 2016

hi @davies, two tests still fail after this patch when I build locally:

- from UTC timestamp *** FAILED ***
  "2016-03-13 0[2]:00:00.0" did not equal "2016-03-13 0[3]:00:00.0" (DateTimeUtilsSuite.scala:488)
- to UTC timestamp *** FAILED ***
  "2016-03-13 1[1]:00:00.0" did not equal "2016-03-13 1[0]:00:00.0" (DateTimeUtilsSuite.scala:506)

My time zone:

admin$ sudo systemsetup -gettimezone
Time Zone: Asia/Shanghai

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Jun 20, 2016

…ylight Saving Time

Author: Davies Liu <[email protected]>

Closes apache#13652 from davies/fix_timezone.

(cherry picked from commit 001a589)
Signed-off-by: Davies Liu <[email protected]>
(cherry picked from commit 41efd20)
@robbinspg (Member) commented Jun 20, 2016

Also failing here in the UK:

  • to UTC timestamp *** FAILED ***
    "2016-03-13 [02]:00:00.0" did not equal "2016-03-13 [10]:00:00.0" (DateTimeUtilsSuite.scala:506)

@adrian-wang (Contributor)

@lw-lin @robbinspg I submitted a PR to fix the failure: #13783. Thanks for the information!

@davies (Contributor, Author) commented Jun 20, 2016

@lw-lin @robbinspg Thanks for reporting this; I sent #13784 to fix the bug.

asfgit pushed a commit that referenced this pull request Jun 20, 2016

…ylight Saving Time

Author: Davies Liu <[email protected]>

Closes #13652 from davies/fix_timezone.
@ckadner (Member) commented Jun 20, 2016

I (also) just created PR #13785 to fix the two test cases from UTC timestamp and to UTC timestamp so they respect the time zones that are supposed to be used in the conversions. I had not seen @davies' latest PR in time.

@davies Would it not suffice to set the (thread local) time zone in the two test cases?

@davies (Contributor, Author) commented Jun 20, 2016

@ckadner That could fix the flaky test, but hide the actual bug.

@ckadner (Member) commented Jun 21, 2016

@davies ...just to confirm the desired behavior of the two test cases from UTC timestamp and to UTC timestamp. We are contending with three time zones:

  • we are parsing Timestamps from Strings in any/all time zones in the Java world (DateTimeTestUtils.ALL_TIMEZONES), so that they represent the human-readable time in each of those time zones, which we are iterating over (*)
  • but in the inner test method we are treating these timestamps (utc: String) as if they were created in UTC
  • then we are converting these timestamps (of "mistaken identity") to yet another time zone tz: String (i.e. "UTC", "JST", "PST", "Asia/Shanghai") and generating the human-readable time in the time zone of the for-loop (DateTimeTestUtils.withDefaultTimeZone(tz))
  test("from UTC timestamp") {
    def test(utc: String, tz: String, expected: String): Unit = {
       assert(toJavaTimestamp(fromUTCTime(fromJavaTimestamp(Timestamp.valueOf(utc)), tz)).toString
         === expected)
    }

    for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
      DateTimeTestUtils.withDefaultTimeZone(tz) {
        test("2011-12-25 09:00:00.123456", "UTC", "2011-12-25 09:00:00.123456")
        test("2011-12-25 09:00:00.123456", "JST", "2011-12-25 18:00:00.123456")
        test("2011-12-25 09:00:00.123456", "PST", "2011-12-25 01:00:00.123456")
        test("2011-12-25 09:00:00.123456", "Asia/Shanghai", "2011-12-25 17:00:00.123456")
      }
    }
    ...

(*) Before your fix (without the for-loop) the third time zone involved was determined by the user's JVM default Timezone.

If this 3-timezone behavior is what we want to test, then should you replace the two occurrences of ...

    DateTimeTestUtils.withDefaultTimeZone(TimeZone.getTimeZone("PST")) {

with ...

    for (tz <- DateTimeTestUtils.ALL_TIMEZONES) {
      DateTimeTestUtils.withDefaultTimeZone(tz) {

... or would/should 2016-03-13 02:00:00 PST and 2016-11-06 01:00:00 PST not work?

    DateTimeTestUtils.withDefaultTimeZone(TimeZone.getTimeZone("PST")) {
      // Daylight Saving Time
      test("2016-03-13 01:59:59", "PST", "2016-03-13 09:59:59.0")
      // 2016-03-13 02:00:00 PST does not exists
      test("2016-03-13 02:00:00", "PST", "2016-03-13 10:00:00.0")
      test("2016-03-13 03:00:00", "PST", "2016-03-13 10:00:00.0")
      test("2016-11-06 00:59:59", "PST", "2016-11-06 07:59:59.0")
      // 2016-11-06 01:00:00 PST could be 2016-11-06 08:00:00 UTC or 2016-11-06 09:00:00 UTC
      test("2016-11-06 01:00:00", "PST", "2016-11-06 09:00:00.0")
      test("2016-11-06 01:59:59", "PST", "2016-11-06 09:59:59.0")
      test("2016-11-06 02:00:00", "PST", "2016-11-06 10:00:00.0")
    }

@davies (Contributor, Author) commented Jun 21, 2016

@ckadner The test function here tries to simulate this SQL function:

cast(from_unix_timestamp(cast("2016-03-13 01:59:59" as timestamp), "PST") as string)

In the beginning, I wanted to run all these tests in different timezones, but unfortunately some of them could not pass, because of:

   * TODO: Because of DST, the conversion between UTC and human time is not exactly one-to-one
   * mapping, the conversion here may return wrong result, we should make the timestamp
   * timezone-aware.

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Jun 21, 2016

…ylight Saving Time

Author: Davies Liu <[email protected]>

Closes apache#13652 from davies/fix_timezone.

(cherry picked from commit db86e7f)