Commit ff980cc
[SPARK-52940][PYTHON][TESTS] Add arrow aggregation unit tests for complex return types
### What changes were proposed in this pull request?
Pandas UDFs for aggregation only support primitive return types, see https://apache.github.io/spark/api/python/tutorial/sql/arrow_pandas.html#series-to-scalar; a rough sketch of that Series-to-Scalar pattern follows below.
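For contrast, this is roughly what the documented Series-to-Scalar pattern looks like with a primitive return type. The sketch is illustrative only (the `mean_udf` name and the reuse of the toy DataFrame are not part of this PR); it assumes an active SparkSession `spark`.
```py
import pandas as pd
from pyspark.sql.pandas.functions import pandas_udf

df = spark.createDataFrame([(1, 1), (2, 2), (3, 5)], ("id", "v"))

@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    # Series-to-Scalar: the whole column (or group/window partition)
    # is reduced to a single primitive value.
    return v.mean()

df.select(mean_udf("v").alias("mean_v")).show()
```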
Arrow UDFs, by contrast, can return complex types via `pa.Scalar`. This PR adds unit tests to guard that behavior, e.g.
```py
In [1]: import pyarrow as pa
...: from pyspark.sql import Window
...: from pyspark.sql import functions as sf
...: from pyspark.sql.pandas.functions import arrow_udf
In [2]: df = spark.createDataFrame([(1, 1), (2, 2), (3, 5)], ("id", "v"))
In [3]: @arrow_udf("struct<m1: int, m2:int>")
...: def arrow_collect_min_max(id: pa.Array, v: pa.Array) -> dict[int, int]:
...: assert isinstance(id, pa.Array), str(type(id))
...: assert isinstance(v, pa.Array), str(type(v))
...: m1 = pa.compute.min(id)
...: m2 = pa.compute.max(v)
...: t = pa.struct([pa.field("m1", pa.int32()), pa.field("m2", pa.int32())])
...: return pa.scalar(value={"m1": m1.as_py(), "m2": m2.as_py()}, type=t)
...:
...: result1 = df.select(
...: arrow_collect_min_max("id", "v").alias("struct"),
...: )
In [4]: result1.show()
+------+
|struct|
+------+
|{1, 5}|
+------+
```
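The `Window` import above hints that the same UDF is also exercised as a grouped and a windowed aggregation. The snippet below is a hedged sketch of that usage, mirroring the pandas_udf grouped-aggregate API (the names `result2`, `result3`, and `w` are illustrative, not from this PR); it reuses `arrow_collect_min_max` and `df` from the session above.
```py
# Grouped aggregation: one struct per group.
result2 = df.groupby("id").agg(arrow_collect_min_max("id", "v").alias("struct"))

# Window aggregation over an unbounded frame: the struct is attached to every row.
w = Window.partitionBy("id").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing
)
result3 = df.select("id", "v", arrow_collect_min_max("id", "v").over(w).alias("struct"))
```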
### Why are the changes needed?
to improve test coverage
### Does this PR introduce _any_ user-facing change?
no, test-only
### How was this patch tested?
new tests
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #51649 from zhengruifeng/arrow_agg_nested.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
File tree (1 file changed, +163 −70 lines):
- python/pyspark/sql/tests/arrow