Conversation

@HyukjinKwon (Member) commented Jul 31, 2018

What changes were proposed in this pull request?

See ARROW-2432. It seems using from_pandas to convert decimals fails if it encounters a value of None:

import pyarrow as pa
import pandas as pd
from decimal import Decimal

pa.Array.from_pandas(pd.Series([Decimal('3.14'), None]), type=pa.decimal128(3, 2))

Arrow 0.8.0

<pyarrow.lib.Decimal128Array object at 0x10a572c58>
[
  Decimal('3.14'),
  NA
]

Arrow 0.9.0

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
  File "array.pxi", line 177, in pyarrow.lib.array
  File "error.pxi", line 77, in pyarrow.lib.check_status
  File "error.pxi", line 77, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal

This PR proposes to work around this via Decimal('NaN'):

pa.Array.from_pandas(pd.Series([Decimal('3.14'), Decimal('NaN')]), type=pa.decimal128(3, 2))
<pyarrow.lib.Decimal128Array object at 0x10ffd2e68>
[
  Decimal('3.14'),
  NA
]
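
For reference, a minimal, self-contained sketch of the version-gated substitution is below. The helper name to_arrow_decimal_array is illustrative only; the actual change is applied inside create_array in python/pyspark/serializers.py.

```python
# Illustrative sketch, assuming pandas and pyarrow are installed.
import decimal
from distutils.version import LooseVersion

import pandas as pd
import pyarrow as pa


def to_arrow_decimal_array(s, t, mask=None):
    # Hypothetical helper mirroring the workaround: only on PyArrow 0.9.x and only
    # for decimal types, replace None with Decimal('NaN') before conversion, since
    # 0.9.x rejects None for decimals (ARROW-2432) but maps Decimal('NaN') to null.
    if t is not None and pa.types.is_decimal(t) and \
            LooseVersion("0.9.0") <= LooseVersion(pa.__version__) < LooseVersion("0.10.0"):
        s = s.apply(lambda v: decimal.Decimal('NaN') if v is None else v)
    return pa.Array.from_pandas(s, mask=mask, type=t)


print(to_arrow_decimal_array(pd.Series([decimal.Decimal('3.14'), None]), pa.decimal128(3, 2)))
```

On PyArrow 0.10.0 and later the branch is skipped entirely, which is why the TODO in the patch says to remove it once the minimum PyArrow version becomes 0.10.0.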

How was this patch tested?

Manually tested:

SPARK_TESTING=1 ./bin/pyspark pyspark.sql.tests ScalarPandasUDFTests

Before

Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/tests.py", line 4672, in test_vectorized_udf_null_decimal
    self.assertEquals(df.collect(), res.collect())
  File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
    sock_info = self._jdf.collectToPython()
  File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o51.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/pyspark/worker.py", line 320, in main
    process()
  File "/.../spark/python/pyspark/worker.py", line 315, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
    batch = _create_batch(series, self._timezone)
  File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
    arrs = [create_array(s, t) for s, t in series]
  File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
    return pa.Array.from_pandas(s, mask=mask, type=t)
  File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
  File "array.pxi", line 177, in pyarrow.lib.array
  File "error.pxi", line 77, in pyarrow.lib.check_status
  File "error.pxi", line 77, in pyarrow.lib.check_status
ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal

After

Running tests...
----------------------------------------------------------------------
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
.......S.............................
----------------------------------------------------------------------
Ran 37 tests in 21.980s

@holdensmagicalunicorn

@HyukjinKwon, thanks! I am a bot who has found some folks who might be able to help with the review: @gatorsmile, @JoshRosen and @mateiz

@HyukjinKwon (Member, Author)

cc @ueshin, @icexelloss and @BryanCutler

@HyukjinKwon HyukjinKwon changed the title [SPARK-24976][PYTHON] Allow None for Decimal type conversion (specific to Arrow 0.9.0) [SPARK-24976][PYTHON] Allow None for Decimal type conversion (specific to PyArrow 0.9.0) Jul 31, 2018
@SparkQA commented Jul 31, 2018

Test build #93821 has finished for PR 21928 at commit 652afd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

I wonder if we could tune the bot suggestions to more recent contributions/contributors?

@HyukjinKwon (Member, Author)

Yea.. this also prompted me to send an email to the mailing list - http://apache-spark-developers-list.1001551.n3.nabble.com/Review-notification-bot-tc24133.html

@felixcheung (Member)

yea, it doesn't seem very useful to ping matei on every single PR ;)

LooseVersion("0.9.0") <= LooseVersion(pa.__version__) < LooseVersion("0.10.0"):
# TODO: see ARROW-2432. Remove when the minimum PyArrow version becomes 0.10.0.
return pa.Array.from_pandas(s.apply(
lambda v: decimal.Decimal('NaN') if v is None else v), mask=mask, type=t)

Member

add test?

@HyukjinKwon (Member, Author) commented Jul 31, 2018

The existing test test_vectorized_udf_null_decimal should cover this. It fails without the current change when PyArrow 0.9.0 is used.
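
For context, a rough sketch of the kind of round trip test_vectorized_udf_null_decimal exercises is below; the data, schema, and names are approximate, and an active SparkSession bound to spark is assumed (the real test lives in python/pyspark/sql/tests.py).

```python
# Approximate sketch only; not the actual test code.
from decimal import Decimal

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import DecimalType, StructField, StructType

data = [(Decimal("3.0"),), (Decimal("5.0"),), (None,)]
schema = StructType([StructField("d", DecimalType(38, 18))])
df = spark.createDataFrame(data, schema)

# A scalar pandas UDF that returns its input unchanged; the None row has to
# survive the pandas-to-Arrow conversion for the comparison to pass on PyArrow 0.9.x.
identity = pandas_udf(lambda s: s, DecimalType(38, 18))
res = df.select(identity(col("d")).alias("d"))
assert df.collect() == res.collect()
```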

    return pa.Array.from_pandas(s.apply(
        lambda v: v.decode("utf-8") if isinstance(v, str) else v), mask=mask, type=t)
elif t is not None and pa.types.is_decimal(t) and \
        LooseVersion("0.9.0") <= LooseVersion(pa.__version__) < LooseVersion("0.10.0"):

Member

consider a single place to check pyarrow versions?

@HyukjinKwon (Member, Author)

Yea, but I am not aware of other issues specific to particular PyArrow versions at the moment. I will consolidate the checks into a single place for sure if I end up fixing more version-specific things.
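
As a rough illustration of the suggestion (the helper below is hypothetical and not part of this patch), the version comparison could be centralized like this:

```python
# Hypothetical helper, not part of this patch: keep PyArrow version checks in one place.
from distutils.version import LooseVersion

import pyarrow as pa


def pyarrow_version_between(lower, upper):
    """Return True when lower <= installed PyArrow version < upper."""
    return LooseVersion(lower) <= LooseVersion(pa.__version__) < LooseVersion(upper)


# The decimal workaround site could then read:
#   if t is not None and pa.types.is_decimal(t) and pyarrow_version_between("0.9.0", "0.10.0"):
#       ...
```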

@icexelloss (Contributor)

@HyukjinKwon the arrow 0.10.0 release is around the corner. I think Spark 2.4 will very likely ship with 0.10.0 (where I believe this issue has been fixed, @BryanCutler can you confirm?)

I am not sure it's necessary to have this patch just for pyarrow 0.9.0 ...

@HyukjinKwon (Member, Author)

@icexelloss you mean we should change the minimum PyArrow version as well?

@HyukjinKwon (Member, Author) commented Jul 31, 2018

I think we shouldn't change the minimum PyArrow version in 2.4.0, and as far as I remember the upgrade doesn't require changing the minimum.

We should backport this anyway even if we change the minimum version for 2.4.0. Let's go ahead.

@icexelloss (Contributor) commented Jul 31, 2018 via email

@BryanCutler (Member) left a comment

LGTM, from what I recall this was the only issue with pyarrow 0.9.0.

asfgit pushed a commit that referenced this pull request Aug 1, 2018
[SPARK-24976][PYTHON] Allow None for Decimal type conversion (specific to PyArrow 0.9.0)

Author: hyukjinkwon <[email protected]>

Closes #21928 from HyukjinKwon/SPARK-24976.

(cherry picked from commit f4772fd)
Signed-off-by: Bryan Cutler <[email protected]>
@BryanCutler (Member)

Merged to master and branch-2.3, thanks @HyukjinKwon!

@asfgit closed this in f4772fd Aug 1, 2018
@HyukjinKwon (Member, Author)

Thank you @felixcheung, @icexelloss and @BryanCutler.

@HyukjinKwon deleted the SPARK-24976 branch October 16, 2018 12:45