
Commit fa2ae9d

BryanCutler authored and HyukjinKwon committed
[SPARK-24392][PYTHON] Label pandas_udf as Experimental
## What changes were proposed in this pull request?

The pandas_udf functionality was introduced in 2.3.0, but is not completely stable and still evolving. This adds a label to indicate it is still an experimental API.

## How was this patch tested?

NA

Author: Bryan Cutler <[email protected]>

Closes #21435 from BryanCutler/arrow-pandas_udf-experimental-SPARK-24392.
1 parent de01a8d commit fa2ae9d

File tree

5 files changed (+12, -0 lines)


docs/sql-programming-guide.md

Lines changed: 4 additions & 0 deletions
@@ -1827,6 +1827,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see

  - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. This means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, while in Spark 2.4 it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
  - In version 2.3 and earlier, CSV rows are considered malformed if at least one column value in the row is malformed. The CSV parser drops such rows in DROPMALFORMED mode or outputs an error in FAILFAST mode. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed values for the columns requested from the CSV data source; other values can be ignored. As an example, a CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selecting the id column yields a row with the single column value 1234, but in Spark 2.3 and earlier the result is empty in DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.

+## Upgrading From Spark SQL 2.3.0 to 2.3.1 and above
+
+  - As of version 2.3.1, Arrow functionality, including `pandas_udf` and `toPandas()`/`createDataFrame()` with `spark.sql.execution.arrow.enabled` set to `True`, has been marked as experimental. These are still evolving and not currently recommended for use in production.
+
## Upgrading From Spark SQL 2.2 to 2.3

  - Since Spark 2.3, queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()` are disallowed. Instead, you can cache or save the parsed results and then run the same query. For example, run `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
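For orientation (not part of the diff), a minimal PySpark sketch of toggling the configurations these migration notes mention; the application name is an assumption for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("migration-config-demo").getOrCreate()

# Opt in to the experimental Arrow-based transfer path (disabled by default).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Restore pre-2.4 behavior for ORC Hive tables and CSV column pruning.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
```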

python/pyspark/sql/dataframe.py

Lines changed: 2 additions & 0 deletions
@@ -1975,6 +1975,8 @@ def toPandas(self):
        .. note:: This method should only be used if the resulting pandas DataFrame is expected
            to be small, as all the data is loaded into the driver's memory.

+        .. note:: Usage with spark.sql.execution.arrow.enabled=True is experimental.
+
        >>> df.toPandas()  # doctest: +SKIP
           age   name
        0    2  Alice
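To illustrate the note above (not part of the diff), a minimal sketch of `toPandas()` under the experimental Arrow path; the session and data are assumptions for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("topandas-demo").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # experimental

df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ("age", "name"))

# Collects the entire DataFrame to the driver; with Arrow enabled the
# transfer uses columnar batches instead of pickled rows, but the result
# must still fit in driver memory.
pdf = df.toPandas()
print(pdf)
```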

python/pyspark/sql/functions.py

Lines changed: 2 additions & 0 deletions
@@ -2456,6 +2456,8 @@ def pandas_udf(f=None, returnType=None, functionType=None):
    :param functionType: an enum value in :class:`pyspark.sql.functions.PandasUDFType`.
                         Default: SCALAR.

+    .. note:: Experimental
+
    The function type of the UDF can be one of the following:

    1. SCALAR
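For illustration, a minimal sketch of a SCALAR `pandas_udf` matching the signature above; the column name and data are assumptions for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# A SCALAR pandas_udf receives one or more pandas.Series and must return
# a pandas.Series of the same length.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

df = spark.range(3).select(col("id").cast("double").alias("v"))
df.select(plus_one("v").alias("v_plus_one")).show()
```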

python/pyspark/sql/group.py

Lines changed: 2 additions & 0 deletions
@@ -236,6 +236,8 @@ def apply(self, udf):
        into memory, so the user should be aware of the potential OOM risk if data is skewed
        and certain groups are too large to fit in memory.

+       .. note:: Experimental
+
        :param udf: a grouped map user-defined function returned by
            :func:`pyspark.sql.functions.pandas_udf`.
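A minimal sketch of `groupby().apply()` with a GROUPED_MAP `pandas_udf`; the schema and data are assumptions for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("grouped-map-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ("id", "v"))

# A GROUPED_MAP pandas_udf receives each group as a pandas.DataFrame and
# returns a pandas.DataFrame conforming to the declared schema.
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```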

python/pyspark/sql/session.py

Lines changed: 2 additions & 0 deletions
@@ -584,6 +584,8 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
        .. versionchanged:: 2.1
           Added verifySchema.

+       .. note:: Usage with spark.sql.execution.arrow.enabled=True is experimental.
+
        >>> l = [('Alice', 1)]
        >>> spark.createDataFrame(l).collect()
        [Row(_1=u'Alice', _2=1)]
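And a minimal sketch of the Arrow path in the other direction, building a Spark DataFrame from a pandas DataFrame; the data is an assumption for the example:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("createdataframe-demo").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # experimental

pdf = pd.DataFrame({"name": ["Alice"], "age": [1]})

# With Arrow enabled, the pandas data is shipped to the JVM as columnar
# batches rather than being pickled row by row.
sdf = spark.createDataFrame(pdf)
sdf.show()
```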
