From 735cf0564c8ba2ce726701c61902c456f556716c Mon Sep 17 00:00:00 2001
From: Bryan Cutler
Date: Mon, 7 Oct 2019 11:28:31 -0700
Subject: [PATCH 1/2] Add section on compatibility for Arrow 0.15.0 to SQL prog guide

---
 docs/sql-pyspark-pandas-with-arrow.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/docs/sql-pyspark-pandas-with-arrow.md b/docs/sql-pyspark-pandas-with-arrow.md
index 25112d1d2a833..6a51e45c5338f 100644
--- a/docs/sql-pyspark-pandas-with-arrow.md
+++ b/docs/sql-pyspark-pandas-with-arrow.md
@@ -219,3 +219,14 @@ Note that a standard UDF (non-Pandas) will load timestamp data as Python datetim
 different than a Pandas timestamp. It is recommended to use Pandas time series functionality when
 working with timestamps in `pandas_udf`s to get the best performance, see
 [here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
+
+### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
+
+Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set in
+Spark so that PySpark maintains compatibility with versions of PyArrow 0.15.0 and above. The following can be added to `conf/spark-env.sh` to use the legacy IPC format:
+
+```
+ARROW_PRE_0_15_IPC_FORMAT=1
+```
+
+This will instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that is in Spark 2.3.x and 2.4.x.

From 82d5f5420e365b9f407dfc9f5cf2f8ac9406bc78 Mon Sep 17 00:00:00 2001
From: Bryan Cutler
Date: Wed, 9 Oct 2019 13:54:45 -0700
Subject: [PATCH 2/2] Added clarification, links to jira and arrow blog

---
 docs/sql-pyspark-pandas-with-arrow.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/docs/sql-pyspark-pandas-with-arrow.md b/docs/sql-pyspark-pandas-with-arrow.md
index 6a51e45c5338f..7f01483d40583 100644
--- a/docs/sql-pyspark-pandas-with-arrow.md
+++ b/docs/sql-pyspark-pandas-with-arrow.md
@@ -222,11 +222,17 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 
 ### Compatibility Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
 
-Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set in
-Spark so that PySpark maintains compatibility with versions of PyArrow 0.15.0 and above. The following can be added to `conf/spark-env.sh` to use the legacy IPC format:
+Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be set
+in order to remain compatible with previous versions of Arrow <= 0.14.1. This is only necessary
+for PySpark users on Spark 2.3.x and 2.4.x who have manually upgraded PyArrow to 0.15.0. The
+following can be added to `conf/spark-env.sh` to use the legacy Arrow IPC format:
 
 ```
 ARROW_PRE_0_15_IPC_FORMAT=1
 ```
 
-This will instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that is in Spark 2.3.x and 2.4.x.
+This will instruct PyArrow >= 0.15.0 to use the legacy IPC format with the older Arrow Java that
+is in Spark 2.3.x and 2.4.x. Not setting this environment variable will lead to an error similar
+to the one described in [SPARK-29367](https://issues.apache.org/jira/browse/SPARK-29367) when
+running `pandas_udf`s or `toPandas()` with Arrow enabled. More information about the Arrow IPC
+change can be read on the Arrow 0.15.0 release [blog](http://arrow.apache.org/blog/2019/10/06/0.15.0-release/#columnar-streaming-protocol-change-since-0140).
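
For clusters where editing `conf/spark-env.sh` is inconvenient, the same variable can be set from a driver script instead. Below is a minimal sketch, assuming a Spark 2.3.x/2.4.x deployment whose cluster manager honors the standard `spark.executorEnv.*` configs so the Python workers on the executors see the variable too; the app name is a hypothetical placeholder.

```
import os

from pyspark.sql import SparkSession

# PyArrow >= 0.15.0 reads this variable when producing IPC data, so set it
# before the session starts and any Arrow serialization happens on the driver.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark = (
    SparkSession.builder
    .appName("arrow-legacy-ipc")  # hypothetical name
    # Propagate the same variable to the Python worker processes on executors.
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .getOrCreate()
)
```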
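
Continuing from the session above, a short sketch of the Arrow-backed paths the new doc section refers to, i.e. the `pandas_udf` and `toPandas()` calls that hit the error described in SPARK-29367 when the variable is missing. `spark.sql.execution.arrow.enabled` is the Spark 2.3.x/2.4.x config name, and `plus_one` is an illustrative UDF, not part of the patch.

```
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Enable Arrow-based columnar data transfer (Spark 2.3.x/2.4.x config name).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# A scalar pandas UDF; its input and output Series travel over Arrow IPC.
@pandas_udf("long", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

df = spark.range(10)
df.select(plus_one(df["id"])).show()  # Arrow IPC: JVM -> Python worker -> JVM
pdf = df.toPandas()                   # Arrow IPC: JVM -> driver Python
```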