[SPARK-19561][SQL] add int case handling for TimestampType #17200
Conversation
ok to test
Test build #74178 has finished for PR 17200 at commit
python/pyspark/sql/tests.py

```python
def test_datetime_at_epoch(self):
    epoch = datetime.datetime.fromtimestamp(0)
    df = self.spark.createDataFrame([Row(date=epoch)])
    self.assertEqual(df.first()['date'], epoch)
```
This test is invalid. df.first()['date'] is None even in current master branch.
Yes, that's the bug this PR is fixing. It shouldn't be None.
Ah, the test failed on Python 3.4 only. That makes some sense: I only tested locally on 2.6, and Python 3 changed how ints and longs are handled. I'll dig in with Python 3.4 and see if I can find the cause of the test failure.
```python
def toInternal(self, dt):
    if dt is not None:
        seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
```
hmm, for the value of epoch = datetime.datetime.fromtimestamp(0), seconds is 0. What difference does it make to use int or long?
The JIRA ticket has the details: https://issues.apache.org/jira/browse/SPARK-19561. But in a nutshell, that's the point: int(0) fails but long(0) succeeds.
https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/protocol.py#L271-L275
Py4J automatically serializes any Python integer larger than 2 ^ 31 as LONG_TYPE, otherwise it's INTEGER_TYPE. Python longs are always serialized as LONG_TYPE.
I suspect my issue with Python 3 is that there is no more long, it's all just int. This may require a fix on the Scala side to accept either an int or a long to the appropriate constructor.
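For reference, a minimal sketch of the serialization rule described above; it paraphrases the linked Py4J behaviour rather than reproducing its code, and the constant and function names here are illustrative only:

```python
# Illustrative sketch, not Py4J's actual code: how a plain Python int is typed
# when sent to the JVM, per the behaviour described in the linked protocol.py.
JAVA_MIN_INT = -(2 ** 31)      # -2147483648
JAVA_MAX_INT = 2 ** 31 - 1     #  2147483647

def jvm_type_for(value):
    """Return the wire type a Python int would get (Python 2 longs always go as LONG_TYPE)."""
    if JAVA_MIN_INT <= value <= JAVA_MAX_INT:
        return "INTEGER_TYPE"  # arrives as java.lang.Integer on the JVM side
    return "LONG_TYPE"         # arrives as java.lang.Long on the JVM side

print(jvm_type_for(0))         # INTEGER_TYPE -> the failing epoch case
print(jvm_type_for(2 ** 31))   # LONG_TYPE
```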
Thanks. Interesting.
I just tested it. In Python 3, even if toInternal returns Python's long, you can still get a java.lang.Integer on the JVM side. In Python 2, however, you get a java.lang.Long.
Because Python 3 doesn't have long anymore, I think we can't solve this in Python. We need to fix this on the JVM side.
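As a quick aside (not part of the patch), a sketch of why a Python-side cast is a dead end on Python 3:

```python
# On Python 3 there is a single integer type; the Python 2 `long` builtin is gone,
# so there is no way to force a value to be "a long" before handing it to Py4J.
seconds = 0
print(type(seconds))   # <class 'int'>
try:
    long(seconds)      # would work on Python 2
except NameError:
    print("no `long` builtin on Python 3")
```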
Agreed.
@JasonMWhite Are you going to submit another PR for it?
I will, yes. I'm trying to find the appropriate location in the Scala code.
Test build #74239 has finished for PR 17200 at commit
```scala
case (c: Int, DateType) => c

case (c: Long, TimestampType) => c
case (c: Int, TimestampType) => c.toLong
```
Can you add a comment for the reason why we recognize Int as TimestampType too here?
Btw, since it is a change to SQL code, better to add
LGTM except for a minor comment.
Test build #74240 has started for PR 17200 at commit
oh, the PR description is not correct now. Can you update it too?
LGTM
retest this please.
Test build #74252 has finished for PR 17200 at commit
Test build #74254 has finished for PR 17200 at commit
## What changes were proposed in this pull request?

Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.

These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.

Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python 3 does not have a `long` type, so this approach failed on Python 3.

## How was this patch tested?

Added a new PySpark-side test that fails without the change.

The contribution is my original work and I license the work to the project under the project's open source license.

Resubmission of #16896. The original PR didn't go through Jenkins and broke the build. @davies @dongjoon-hyun @cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.

Author: Jason White <[email protected]>

Closes #17200 from JasonMWhite/SPARK-19561.

(cherry picked from commit 206030b)
Signed-off-by: Wenchen Fan <[email protected]>
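For reference, the arithmetic behind the "roughly half an hour" figure, as a back-of-the-envelope sketch (Spark stores TimestampType internally as microseconds since the epoch):

```python
# 32-bit signed int range, read as microseconds since the epoch.
MAX_INT = 2 ** 31 - 1            # 2,147,483,647 microseconds
minutes = MAX_INT / 1e6 / 60     # microseconds -> seconds -> minutes
print(round(minutes, 1))         # ~35.8 minutes on either side of the epoch
```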
thanks, merging to master/2.1!
Thanks @cloud-fan!