
Conversation

@JasonMWhite (Contributor) commented Mar 8, 2017

What changes were proposed in this pull request?

Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.

These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.
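
A quick sanity check of the "half an hour" figure (arithmetic added for context, not part of the patch): Spark stores TimestampType internally as microseconds since the epoch, so the signed 32-bit Int range covers roughly ±35.8 minutes around 1970-01-01 UTC.

# Back-of-the-envelope check (illustrative only):
JAVA_MAX_INT = 2**31 - 1            # Py4J's 32-bit boundary
print(JAVA_MAX_INT / 1e6 / 60.0)    # ~35.79 minutes of microseconds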

Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python3 does not have a `long` type, so this approach failed on Python3.

How was this patch tested?

Added a new PySpark-side test that fails without the change.

The contribution is my original work and I license the work to the project under the project’s open source license.

Resubmission of #16896. The original PR didn't go through Jenkins and broke the build. @davies @dongjoon-hyun

@cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.

@cloud-fan (Contributor) commented:

ok to test

@SparkQA commented Mar 8, 2017

Test build #74178 has finished for PR 17200 at commit 5b1dd67.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def test_datetime_at_epoch(self):
    epoch = datetime.datetime.fromtimestamp(0)
    df = self.spark.createDataFrame([Row(date=epoch)])
    self.assertEqual(df.first()['date'], epoch)
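
(For context: datetime.datetime.fromtimestamp(0) maps to an internal timestamp of 0 microseconds, squarely inside the 32-bit Int range, which is exactly the case this patch fixes.)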
Member commented:

This test is invalid. df.first()['date'] is None even on the current master branch.

Contributor Author commented:

Yes, that's the bug this PR is fixing. It shouldn't be None.

@JasonMWhite (Contributor Author) commented:

Ah, the test failed on Python 3.4 only. That makes some sense; I only tested locally on 2.6, and there are changes in how Python 3 handles ints vs longs. I'll dig in with Python 3.4 and see if I can find the cause of the test failure.
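
For context, the int/long split in play (standard Python behavior, added for illustration; not code from the PR):

import sys

# Python 2 has both int and long; Python 3 has only int, so any fix that
# relies on casting to long cannot work there.
if sys.version_info[0] == 2:
    value = long(0)   # Python 2 only; the long builtin is gone in Python 3
else:
    value = 0         # Python 3: a plain int covers all integer sizes
print(type(value))    # <type 'long'> on Python 2, <class 'int'> on Python 3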


def toInternal(self, dt):
    if dt is not None:
        seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
Member commented:

Hmm, for the value epoch = datetime.datetime.fromtimestamp(0), seconds is 0. What difference does it make whether it's an int or a long?

Contributor Author commented:

The JIRA ticket has the details: https://issues.apache.org/jira/browse/SPARK-19561. But in a nutshell, that's the point: int(0) fails but long(0) succeeds.
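
To make that concrete, a sketch of the failure mode the JIRA describes (a reconstruction for illustration, not a test from this PR; assumes Python 2 against a pre-patch build):

import datetime
from pyspark.sql import Row, SparkSession

# toInternal(epoch) returns the plain int 0; Py4J ships it as
# java.lang.Integer; EvaluatePython matched only (Long, TimestampType),
# so the value was dropped and read back as None.
spark = SparkSession.builder.master("local[1]").getOrCreate()
epoch = datetime.datetime.fromtimestamp(0)
df = spark.createDataFrame([Row(date=epoch)])
print(df.first()['date'])   # None before the fix; equal to epoch after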

Contributor Author commented:

https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/protocol.py#L271-L275

Py4J automatically serializes any Python int outside the 32-bit range (below MIN_INT or above MAX_INT) as LONG_TYPE; otherwise it's INTEGER_TYPE. Python longs are always serialized as LONG_TYPE.

I suspect my issue with Python 3 is that there is no more long; it's all just int. This may require a fix on the Scala side to accept either an int or a long to the appropriate constructor.
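
A simplified sketch of that rule (paraphrased; the helper name is invented, not py4j API):

JAVA_MIN_INT, JAVA_MAX_INT = -2**31, 2**31 - 1

def py4j_integer_tag(value):
    # Mirrors the check in the linked protocol.py: only integers outside
    # the signed 32-bit range are tagged as longs.
    if value < JAVA_MIN_INT or value > JAVA_MAX_INT:
        return "LONG_TYPE"      # arrives in the JVM as java.lang.Long
    return "INTEGER_TYPE"       # arrives in the JVM as java.lang.Integer

print(py4j_integer_tag(0))      # INTEGER_TYPE
print(py4j_integer_tag(2**31))  # LONG_TYPE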

Member commented:

Thanks. Interesting.

Member commented:

I just tested it. In Python3, even if toInternal returns Python's long, you can still get a java.lang.Integer on the JVM side.

However, in Python2, you can get java.lang.Long.

@viirya (Member) commented Mar 8, 2017:

Because Python3 doesn't have long anymore, I think we can't solve this in Python. We need to fix it on the JVM side.

Contributor Author commented:

Agreed.

Member commented:

@JasonMWhite Are you going to submit another PR for it?

Contributor Author commented:

I will, yes. I'm trying to find the appropriate location in the Scala code.

@SparkQA commented Mar 9, 2017

Test build #74239 has finished for PR 17200 at commit a1936af.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case (c: Int, DateType) => c

case (c: Long, TimestampType) => c
case (c: Int, TimestampType) => c.toLong
@viirya (Member) commented Mar 9, 2017:

Can you add a comment explaining why we also recognize Int for TimestampType here?
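
Something along these lines, perhaps (a sketch of the requested comment; the exact wording merged into EvaluatePython.scala may differ):

// Py4J serializes Python ints that fit in 32 bits as java.lang.Integer,
// so TimestampType values within ~35 minutes of the epoch arrive here
// as Int rather than Long; widen them to the expected Long.
case (c: Int, TimestampType) => c.toLong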

@viirya (Member) commented Mar 9, 2017

Btw, since this is a change to SQL code, it'd be better to add [SQL] to the title.

@JasonMWhite JasonMWhite changed the title [SPARK-19561][Python] cast TimestampType.toInternal output to long [SPARK-19561][SQL] cast TimestampType.toInternal output to long Mar 9, 2017
@viirya (Member) commented Mar 9, 2017

LGTM except for a minor comment.

@SparkQA commented Mar 9, 2017

Test build #74240 has started for PR 17200 at commit bee635a.

@JasonMWhite JasonMWhite changed the title [SPARK-19561][SQL] cast TimestampType.toInternal output to long [SPARK-19561][SQL] add int case handling for TimestampType Mar 9, 2017
@viirya (Member) commented Mar 9, 2017

Oh, the PR description is no longer accurate. Can you update it too?

@cloud-fan (Contributor) commented:

LGTM

@viirya (Member) commented Mar 9, 2017

retest this please.

@SparkQA commented Mar 9, 2017

Test build #74252 has finished for PR 17200 at commit bee635a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 9, 2017

Test build #74254 has finished for PR 17200 at commit bee635a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Mar 9, 2017

Author: Jason White <[email protected]>

Closes #17200 from JasonMWhite/SPARK-19561.

(cherry picked from commit 206030b)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan (Contributor) commented:

thanks, merging to master/2.1!

@asfgit asfgit closed this in 206030b Mar 9, 2017
@JasonMWhite JasonMWhite deleted the SPARK-19561 branch March 9, 2017 18:44
@JasonMWhite (Contributor Author) commented:

Thanks @cloud-fan!
