Skip to content

Commit ff980cc

Browse files
zhengruifengHyukjinKwon
authored andcommitted
[SPARK-52940][PYTHON][TESTS] Add arrow aggregation unit tests for complex return types
### What changes were proposed in this pull request? Pandas UDF for aggregation only support primitive return type, see https://apache.github.io/spark/api/python/tutorial/sql/arrow_pandas.html#series-to-scalar While Arrow UDF can support complex type with `pa.Scalar`, this PR adds unit tests to guard it, e.g. ```py In [1]: import pyarrow as pa ...: from pyspark.sql import Window ...: from pyspark.sql import functions as sf ...: from pyspark.sql.pandas.functions import arrow_udf In [2]: df = spark.createDataFrame([(1, 1), (2, 2), (3, 5)], ("id", "v")) In [3]: arrow_udf("struct<m1: int, m2:int>") ...: def arrow_collect_min_max(id: pa.Array, v: pa.Array) -> dict[int, int]: ...: assert isinstance(id, pa.Array), str(type(id)) ...: assert isinstance(v, pa.Array), str(type(v)) ...: m1 = pa.compute.min(id) ...: m2 = pa.compute.max(v) ...: t = pa.struct([pa.field("m1", pa.int32()), pa.field("m2", pa.int32())]) ...: return pa.scalar(value={"m1": m1.as_py(), "m2": m2.as_py()}, type=t) ...: ...: result1 = df.select( ...: arrow_collect_min_max("id", "v").alias("struct"), ...: ) In [4]: result1.show() +------+ |struct| +------+ |{1, 5}| +------+ ``` ### Why are the changes needed? to improve test coverage ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #51649 from zhengruifeng/arrow_agg_nested. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
1 parent a82b415 commit ff980cc

File tree

1 file changed

+163
-70
lines changed

1 file changed

+163
-70
lines changed

0 commit comments

Comments
 (0)