[SPARK-24976][PYTHON] Allow None for Decimal type conversion (specific to PyArrow 0.9.0) #21928
Conversation
@HyukjinKwon, thanks! I am a bot who has found some folks who might be able to help with the review: @gatorsmile, @JoshRosen and @mateiz

cc @ueshin, @icexelloss and @BryanCutler

Test build #93821 has finished for PR 21928 at commit

I wonder if we could tune the bot suggestions to more recent contributions/contributors?

Yea.. this also triggered me to send an email to the mailing list - http://apache-spark-developers-list.1001551.n3.nabble.com/Review-notification-bot-tc24133.html

yea, it doesn't seem very useful to ping matei on every single PR ;)
```python
        LooseVersion("0.9.0") <= LooseVersion(pa.__version__) < LooseVersion("0.10.0"):
    # TODO: see ARROW-2432. Remove when the minimum PyArrow version becomes 0.10.0.
    return pa.Array.from_pandas(s.apply(
        lambda v: decimal.Decimal('NaN') if v is None else v), mask=mask, type=t)
```
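For context, a minimal standalone sketch of what this branch does (the series contents here are made up for illustration; assumes pandas and PyArrow are installed):

```python
import decimal

import pandas as pd
import pyarrow as pa

s = pd.Series([decimal.Decimal("3.14"), None])
mask = s.isnull()
t = pa.decimal128(3, 2)

# PyArrow 0.9.0 rejects None inside a decimal series (ARROW-2432), so the
# workaround replaces None with Decimal('NaN') before conversion; the mask
# still marks that slot as null in the resulting Arrow array.
arr = pa.Array.from_pandas(
    s.apply(lambda v: decimal.Decimal("NaN") if v is None else v),
    mask=mask, type=t)
print(arr)  # the second slot comes out as NA (null)
```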
add test?
The existing test `test_vectorized_udf_null_decimal` should cover this. It fails without the current change when PyArrow 0.9.0 is used.
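That test lives in Spark's own suite; for readers who want a standalone repro, here is a rough sketch of the same kind of check (illustrative only, not the actual `test_vectorized_udf_null_decimal` code; assumes a local PySpark with Arrow support and PyArrow installed):

```python
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.master("local[2]").getOrCreate()

# A decimal column containing a null, which is what trips ARROW-2432
# when the Python worker converts results back to Arrow on PyArrow 0.9.0.
schema = StructType([StructField("v", DecimalType(38, 18))])
df = spark.createDataFrame([(Decimal("3.14"),), (None,)], schema)

# Identity pandas UDF over decimals; collecting its result exercises the
# Arrow conversion path patched in this PR.
identity = pandas_udf(lambda s: s, DecimalType(38, 18))

res = df.select(identity(df.v).alias("v"))
assert df.collect() == res.collect()
```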
```python
    return pa.Array.from_pandas(s.apply(
        lambda v: v.decode("utf-8") if isinstance(v, str) else v), mask=mask, type=t)
elif t is not None and pa.types.is_decimal(t) and \
        LooseVersion("0.9.0") <= LooseVersion(pa.__version__) < LooseVersion("0.10.0"):
```
consider a single place to check pyarrow versions?
Yea, but not sure if I am aware of other issues specific to PyArrow versions. Will make a single place if I happen to fix things specific to PyArrow versions for sure.
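If the checks ever do get consolidated, one possible shape for it (a hypothetical helper, not Spark's actual code) would be something like:

```python
from distutils.version import LooseVersion

import pyarrow as pa


def _pyarrow_version_between(lower, upper):
    """Return True if the installed PyArrow version is in [lower, upper)."""
    version = LooseVersion(pa.__version__)
    return LooseVersion(lower) <= version < LooseVersion(upper)


# The call site from the diff above would then read:
# needs_decimal_nan_workaround = _pyarrow_version_between("0.9.0", "0.10.0")
```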
@HyukjinKwon arrow 0.10.0 release is around the corner. I think Spark 2.4 will very likely ship with 0.10.0 (where I believe this issue has been fixed, @BryanCutler can you confirm?). I am not sure if it's necessary to have this patch just for pyarrow 0.9.0 ...

@icexelloss you mean we should change the minimum PyArrow version as well?

I think we shouldn't change the minimum PyArrow version in 2.4.0, and the upgrade doesn't require changing the minimum as far as I remember. We should backport this anyway even if we change the minimum version for 2.4.0. Let's go ahead.
I see. Yeah sounds good to me.
BryanCutler left a comment:
LGTM, from what I recall this was the only issue with pyarrow 0.9.0.
…c to PyArrow 0.9.0)

## What changes were proposed in this pull request?

See [ARROW-2432](https://jira.apache.org/jira/browse/ARROW-2432). Seems using `from_pandas` to convert decimals fails if it encounters a value of `None`:

```python
import pyarrow as pa
import pandas as pd
from decimal import Decimal

pa.Array.from_pandas(pd.Series([Decimal('3.14'), None]), type=pa.decimal128(3, 2))
```

**Arrow 0.8.0**

```
<pyarrow.lib.Decimal128Array object at 0x10a572c58>
[
  Decimal('3.14'),
  NA
]
```

**Arrow 0.9.0**

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
  File "array.pxi", line 177, in pyarrow.lib.array
  File "error.pxi", line 77, in pyarrow.lib.check_status
  File "error.pxi", line 77, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal
```

This PR proposes to work around this via Decimal NaN:

```python
pa.Array.from_pandas(pd.Series([Decimal('3.14'), Decimal('NaN')]), type=pa.decimal128(3, 2))
```

```
<pyarrow.lib.Decimal128Array object at 0x10ffd2e68>
[
  Decimal('3.14'),
  NA
]
```

## How was this patch tested?

Manually tested:

```bash
SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests ScalarPandasUDFTests
```

**Before**

```
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4672, in test_vectorized_udf_null_decimal
    self.assertEquals(df.collect(), res.collect())
  File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
    sock_info = self._jdf.collectToPython()
  File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o51.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/pyspark/worker.py", line 320, in main
    process()
  File "/.../spark/python/pyspark/worker.py", line 315, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
    batch = _create_batch(series, self._timezone)
  File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
    arrs = [create_array(s, t) for s, t in series]
  File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
    return pa.Array.from_pandas(s, mask=mask, type=t)
  File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
  File "array.pxi", line 177, in pyarrow.lib.array
  File "error.pxi", line 77, in pyarrow.lib.check_status
  File "error.pxi", line 77, in pyarrow.lib.check_status
ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal
```

**After**

```
Running tests...
----------------------------------------------------------------------
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
.......S.............................
----------------------------------------------------------------------
Ran 37 tests in 21.980s
```

Author: hyukjinkwon <[email protected]>

Closes #21928 from HyukjinKwon/SPARK-24976.

(cherry picked from commit f4772fd)

Signed-off-by: Bryan Cutler <[email protected]>
merged to master and branch-2.3, thanks @HyukjinKwon!
Thank you @felixcheung, @icexelloss and @BryanCutler. |
What changes were proposed in this pull request?
See ARROW-2432. Seems using `from_pandas` to convert decimals fails on Arrow 0.9.0 if it encounters a value of `None` (Arrow 0.8.0 handles it fine). This PR proposes to work around this via Decimal NaN, as in the snippet below.
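The failing call and the workaround (the same snippet shown in the commit message above; the failure reproduces only on PyArrow 0.9.x):

```python
import pandas as pd
import pyarrow as pa
from decimal import Decimal

# Fails on Arrow 0.9.0 with "ArrowInvalid: Error converting from Python
# objects to Decimal: Got Python object of type NoneType ...":
# pa.Array.from_pandas(pd.Series([Decimal('3.14'), None]), type=pa.decimal128(3, 2))

# Workaround: substitute Decimal('NaN') for None; the slot still comes out as NA.
pa.Array.from_pandas(pd.Series([Decimal('3.14'), Decimal('NaN')]), type=pa.decimal128(3, 2))
```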
How was this patch tested?
Manually tested via `SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests ScalarPandasUDFTests`. Before the change, the suite fails with `ArrowInvalid: Error converting from Python objects to Decimal` when PyArrow 0.9.0 is installed; after the change, all 37 tests pass (full output is in the commit message above).