[SPARK-19561][SQL] add int case handling for TimestampType #17200
Conversation
ok to test
Test build #74178 has finished for PR 17200 at commit
python/pyspark/sql/tests.py

```python
def test_datetime_at_epoch(self):
    epoch = datetime.datetime.fromtimestamp(0)
    df = self.spark.createDataFrame([Row(date=epoch)])
    self.assertEqual(df.first()['date'], epoch)
```
This test is invalid. df.first()['date'] is None even in current master branch.
Yes, that's the bug this PR is fixing. It shouldn't be None.
Ah, the test failed on Python 3.4 only. That makes some sense: I only tested locally on 2.6, and Python 3 changed how ints and longs are handled. I'll dig in with Python 3.4 and see if I can find the cause of the test failure.
```python
def toInternal(self, dt):
    if dt is not None:
        seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
```
hmm, for the value of epoch = datetime.datetime.fromtimestamp(0), seconds is 0. What difference does it make to use int or long?
The JIRA ticket has the details: https://issues.apache.org/jira/browse/SPARK-19561. But in a nutshell, that's the point: int(0) fails but long(0) succeeds.
https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/protocol.py#L271-L275
Py4J automatically serializes any Python integer larger than 2 ^ 31 as LONG_TYPE, otherwise it's INTEGER_TYPE. Python longs are always serialized as LONG_TYPE.
I suspect my issue with Python 3 is that there is no more long, it's all just int. This may require a fix on the Scala side to accept either an int or a long to the appropriate constructor.
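For reference, a minimal sketch of the serialization rule described above; it paraphrases the linked Py4J behaviour rather than reproducing its code, and the constant and function names here are illustrative only:

```python
# Illustrative sketch, not Py4J's actual code: how a plain Python int is typed
# when sent to the JVM, per the behaviour described in the linked protocol.py.
JAVA_MIN_INT = -(2 ** 31)      # -2147483648
JAVA_MAX_INT = 2 ** 31 - 1     #  2147483647

def jvm_type_for(value):
    """Return the wire type a Python int would get (Python 2 longs always go as LONG_TYPE)."""
    if JAVA_MIN_INT <= value <= JAVA_MAX_INT:
        return "INTEGER_TYPE"  # arrives as java.lang.Integer on the JVM side
    return "LONG_TYPE"         # arrives as java.lang.Long on the JVM side

print(jvm_type_for(0))         # INTEGER_TYPE -> the failing epoch case
print(jvm_type_for(2 ** 31))   # LONG_TYPE
```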
Thanks. Interesting.
I just tested it. In Python 3, even if toInternal returns Python's long, you can still get a java.lang.Integer on the JVM side. In Python 2, however, you get a java.lang.Long.
Because Python 3 doesn't have long anymore, I think we can't solve this in Python. We need to fix this on the JVM side.
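As a quick aside (not part of the patch), a sketch of why a Python-side cast is a dead end on Python 3:

```python
# On Python 3 there is a single integer type; the Python 2 `long` builtin is gone,
# so there is no way to force a value to be "a long" before handing it to Py4J.
seconds = 0
print(type(seconds))   # <class 'int'>
try:
    long(seconds)      # would work on Python 2
except NameError:
    print("no `long` builtin on Python 3")
```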
Agreed.
@JasonMWhite Are you going to submit another PR for it?
I will, yes. I'm trying to find the appropriate location in the Scala code.
Test build #74239 has finished for PR 17200 at commit
```scala
case (c: Int, DateType) => c

case (c: Long, TimestampType) => c
case (c: Int, TimestampType) => c.toLong
```
Can you add a comment for the reason why we recognize Int as TimestampType too here?
Btw, since it is a change to SQL code, better to add
LGTM except for a minor comment.
Test build #74240 has started for PR 17200 at commit
oh, the PR description is not correct now. Can you update it too?
LGTM
retest this please.
Test build #74252 has finished for PR 17200 at commit
Test build #74254 has finished for PR 17200 at commit
## What changes were proposed in this pull request?

Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.

These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.

Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python 3 does not have a `long` type, so this approach failed on Python 3.

## How was this patch tested?

Added a new PySpark-side test that fails without the change.

The contribution is my original work and I license the work to the project under the project's open source license.

Resubmission of #16896. The original PR didn't go through Jenkins and broke the build. @davies @dongjoon-hyun @cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.

Author: Jason White <[email protected]>

Closes #17200 from JasonMWhite/SPARK-19561.

(cherry picked from commit 206030b)
Signed-off-by: Wenchen Fan <[email protected]>
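For reference, the arithmetic behind the "roughly half an hour" figure, as a back-of-the-envelope sketch (Spark stores TimestampType internally as microseconds since the epoch):

```python
# 32-bit signed int range, read as microseconds since the epoch.
MAX_INT = 2 ** 31 - 1            # 2,147,483,647 microseconds
minutes = MAX_INT / 1e6 / 60     # microseconds -> seconds -> minutes
print(round(minutes, 1))         # ~35.8 minutes on either side of the epoch
```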
thanks, merging to master/2.1!
Thanks @cloud-fan!