Conversation

@pull pull bot commented Sep 13, 2022

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

### What changes were proposed in this pull request?
Support NumPy ndarray in built-in functions (`pyspark.sql.functions`) by introducing the Py4J input converter `NumpyArrayConverter`. The converter converts an ndarray to a Java array.

The mapping between ndarray dtypes and Java primitive types is defined as below:
```py
            np.dtype("int64"): gateway.jvm.long,
            np.dtype("int32"): gateway.jvm.int,
            np.dtype("int16"): gateway.jvm.short,
            # Mapping to gateway.jvm.byte causes
            #   TypeError: 'bytes' object does not support item assignment
            np.dtype("int8"): gateway.jvm.short,
            np.dtype("float32"): gateway.jvm.float,
            np.dtype("float64"): gateway.jvm.double,
            np.dtype("bool"): gateway.jvm.boolean,
```
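
For illustration, a minimal sketch of what such a Py4J input converter could look like (not the exact implementation: the dtype lookup is simplified to `double`, and the gateway is assumed to be the one held by `SparkContext._gateway`):

```py
import numpy as np
from py4j.protocol import register_input_converter


class NumpyArrayConverter:
    """Sketch: convert a 1-D ndarray into a Java primitive array."""

    def can_convert(self, obj):
        # Only claim ndarrays; everything else falls through to other converters.
        return isinstance(obj, np.ndarray) and obj.ndim == 1

    def convert(self, ndarray, gateway_client):
        from pyspark import SparkContext

        gateway = SparkContext._gateway  # assumption: reuse the active gateway
        # In practice the Java element type comes from the dtype mapping above;
        # double is hard-coded here only to keep the sketch short.
        jarr = gateway.new_array(gateway.jvm.double, len(ndarray))
        for i, value in enumerate(ndarray.tolist()):
            jarr[i] = value
        return jarr


# Registering the converter lets Py4J apply it whenever an ndarray is passed
# as an argument to a JVM call.
register_input_converter(NumpyArrayConverter())
```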

### Why are the changes needed?
As part of [SPARK-39405](https://issues.apache.org/jira/browse/SPARK-39405) for NumPy support in SQL.

### Does this PR introduce _any_ user-facing change?
Yes. NumPy ndarray is supported in built-in functions.

Take `lit` for example,
```py
>>> spark.range(1).select(lit(np.array([1, 2], dtype='int16'))).dtypes
[('ARRAY(1S, 2S)', 'array<smallint>')]
>>> spark.range(1).select(lit(np.array([1, 2], dtype='int32'))).dtypes
[('ARRAY(1, 2)', 'array<int>')]
>>> spark.range(1).select(lit(np.array([1, 2], dtype='float32'))).dtypes
[("ARRAY(CAST('1.0' AS FLOAT), CAST('2.0' AS FLOAT))", 'array<float>')]
>>> spark.range(1).select(lit(np.array([]))).dtypes
[('ARRAY()', 'array<double>')]
```

### How was this patch tested?
Unit tests.

Closes #37635 from xinrong-meng/builtin_ndarray.

Authored-by: Xinrong Meng <[email protected]>
Signed-off-by: Xinrong Meng <[email protected]>
ELHoussineT and others added 3 commits September 12, 2022 20:46
### What changes were proposed in this pull request?

Use `bool` instead of `np.bool`, as `np.bool` is deprecated (see: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations).

Using `np.bool` generates this warning:

```
UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect on failures in the middle of computation.
3070E                     `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
3071E                   Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
```
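
The fix itself is a one-for-one replacement; a minimal illustrative sketch (not the actual diff):

```py
import numpy as np

# Before (deprecated alias, warns on NumPy >= 1.20):
# mask = np.array([True, False], dtype=np.bool)

# After (builtin bool; behavior is unchanged):
mask = np.array([True, False], dtype=bool)

# If the NumPy scalar type is specifically needed, np.bool_ is the alternative.
mask2 = np.array([True, False], dtype=np.bool_)
```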

### Why are the changes needed?
Deprecation soon: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations.

### Does this PR introduce _any_ user-facing change?
The warning will be suppressed.

### How was this patch tested?
Existing tests should suffice.

Closes #37817 from ELHoussineT/patch-1.

Authored-by: ELHoussineT <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…Suite

### What changes were proposed in this pull request?

Currently, when calling the method `checkError` from `SparkFunSuite` to check the QueryContext, we need to include the trait `QueryErrorsSuiteBase` to use `ExpectedContext`.
This is not convenient. Let's simply migrate the trait `QueryErrorsSuiteBase` into `SparkFunSuite`.

### Why are the changes needed?

Simplify test framework

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA

Closes #37858 from gengliangwang/minorRefactor.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to install `openpyxl` in the PySpark test environments to re-enable the `to_excel` tests.

### Why are the changes needed?

For better test coverage

### Does this PR introduce _any_ user-facing change?

No, it's test only

### How was this patch tested?

Enabling the existing skipped tests related to `openpyxl`.
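
For context, a minimal sketch of the kind of call those tests exercise (the file path is hypothetical); `openpyxl` must be installed for the Excel writer to work:

```py
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
# Requires openpyxl in the test environment; otherwise the test is skipped.
psdf.to_excel("/tmp/example.xlsx")
```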

Closes #37671 from itholic/SPARK-40229.

Lead-authored-by: itholic <[email protected]>
Co-authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
zhengruifeng and others added 5 commits September 13, 2022 11:43
### What changes were proposed in this pull request?
1. Add a dedicated expression for `DataFrame.cov`;
2. Add the missing parameter `ddof` to `DataFrame.cov`.

### Why are the changes needed?
For API coverage.

### Does this PR introduce _any_ user-facing change?
Yes, API change.

```
        >>> np.random.seed(42)
        >>> df = ps.DataFrame(np.random.randn(1000, 5),
        ...                   columns=['a', 'b', 'c', 'd', 'e'])
        >>> df.cov()
                  a         b         c         d         e
        a  0.998438 -0.020161  0.059277 -0.008943  0.014144
        b -0.020161  1.059352 -0.008543 -0.024738  0.009826
        c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
        d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
        e  0.014144  0.009826 -0.000271 -0.013692  0.977795
        >>> df.cov(ddof=2)
                  a         b         c         d         e
        a  0.999439 -0.020181  0.059336 -0.008952  0.014159
        b -0.020181  1.060413 -0.008551 -0.024762  0.009836
        c  0.059336 -0.008551  1.011683 -0.001487 -0.000271
        d -0.008952 -0.024762 -0.001487  0.922220 -0.013705
        e  0.014159  0.009836 -0.000271 -0.013705  0.978775
        >>> df.cov(ddof=-1)
                  a         b         c         d         e
        a  0.996444 -0.020121  0.059158 -0.008926  0.014116
        b -0.020121  1.057235 -0.008526 -0.024688  0.009807
        c  0.059158 -0.008526  1.008650 -0.001483 -0.000270
        d -0.008926 -0.024688 -0.001483  0.919456 -0.013664
        e  0.014116  0.009807 -0.000270 -0.013664  0.975842
```
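
As a side note, `ddof` ("delta degrees of freedom") presumably follows the NumPy/pandas convention of dividing by N - ddof, which is consistent with the numbers above; a quick NumPy sanity check (illustrative only):

```py
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Sample covariance (divide by N - 1) vs. population covariance (divide by N).
print(np.cov(x, y, ddof=1)[0, 1])  # 3.333...
print(np.cov(x, y, ddof=0)[0, 1])  # 2.5
```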

### How was this patch tested?
added tests

Closes #37829 from zhengruifeng/ps_cov_ddof.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?

This PR is a follow-up for SPARK-40107. It updates the way we check for the `empty2null` expression in a V1 write query plan. Previously, we only searched for this expression in Project, but the optimizer can change the position of this expression, for example by collapsing projects with aggregates. As a result, we need to search the entire plan to see if `empty2null` has been added by `V1Writes`.

### Why are the changes needed?

To prevent unnecessary `empty2null` projections from being added in FileFormatWriter.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes #37856 from allisonwang-db/spark-40107-followup.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…th the behavior of log4j1

### What changes were proposed in this pull request?
As Kimahriman mentioned, the default logging now goes to stdout instead of stderr, so this PR changes it back to stderr.

### Why are the changes needed?
Keep consistent with the behavior of log4j1.

Refer to

https://github.com/apache/spark/blob/78a5825fe266c0884d2dd18cbca9625fa258d7f7/core/src/main/resources/org/apache/spark/log4j-defaults.properties#L18-L23

and the `log4j2.properties.template` also points to `SYSTEM_ERR`.

https://github.com/apache/spark/blob/78d492c1b153240dddc636ec6002e7bfc6b94b3b/conf/log4j2.properties.template#L20-L32

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #37854 from LuciferYang/SPARK-40406.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
…tion with input

### What changes were proposed in this pull request?
Refactor the expanding and rolling tests for functions with input.

### Why are the changes needed?
Refactor the expanding and rolling tests for functions with input:

```python
# Before
self._test_groupby_rolling_func("count")

# After
# A str function name can still be accepted
self._test_groupby_rolling_func("count")
# A lambda can also be accepted, to support more function styles
self._test_groupby_expanding_func(
    lambda x: x.quantile(0.5), lambda x: x.quantile(0.5, "lower")
)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed
- Cherry-picked e22ee1c and tested manually.

Closes #37835 from Yikun/SPARK-40327.

Authored-by: Yikun Jiang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…t trait

### What changes were proposed in this pull request?

This PR proposes to factor the common attributes out from `FlatMapGroupsWithStateExec` to `FlatMapGroupsWithStateExecBase`.

### Why are the changes needed?

There is a lot of shared logic if you implement another version of `FlatMapGroupsWithStateExec`, so it is better to factor it out.
This is also part of #37285, which demonstrates how the refactored trait is used.

### Does this PR introduce _any_ user-facing change?

No, this is refactoring-only.

### How was this patch tested?

Existing test cases should cover it.

Closes #37859 from HyukjinKwon/SPARK-40411.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
zhengruifeng and others added 2 commits September 13, 2022 14:44
…ort missing values and `min_periods `

### What changes were proposed in this pull request?
Refactor `pearson` correlation in `DataFrame.corr` to:

1. support missing values;
2. add the parameter `min_periods`;
3. enable Arrow execution, since it no longer depends on `VectorUDT`;
4. support lazy evaluation.

Before:
```
In [1]: import pyspark.pandas as ps

In [2]: df = ps.DataFrame([[1,2], [3,None]])

In [3]: df

   0    1
0  1  2.0
1  3  NaN

In [4]: df.corr()
22/09/09 16:53:18 ERROR Executor: Exception in task 9.0 in stage 5.0 (TID 24)
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (VectorAssembler$$Lambda$2660/0x0000000801215840: (struct<0_double_VectorAssembler_0915f96ec689:double,1:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
```

After:
```
In [1]: import pyspark.pandas as ps

In [2]: df = ps.DataFrame([[1,2], [3,None]])

In [3]: df.corr()

     0   1
0  1.0 NaN
1  NaN NaN

In [4]: df.to_pandas().corr()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[4]:
     0   1
0  1.0 NaN
1  NaN NaN
```
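
A small usage sketch of the refactored API described above (illustrative; the exact `min_periods` semantics of marking under-populated entries as NaN follow the pandas convention and are an assumption here):

```py
import pyspark.pandas as ps

df = ps.DataFrame([[1, 2], [3, None], [5, 6]], columns=["a", "b"])

# Pearson correlation now tolerates missing values instead of failing in a UDF.
print(df.corr())

# With min_periods, entries backed by too few non-null observation pairs become NaN.
print(df.corr(min_periods=3))
```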

### Why are the changes needed?
For API coverage, and to support common cases containing missing values.

### Does this PR introduce _any_ user-facing change?
Yes, API change: a new parameter is supported.

### How was this patch tested?
Added unit tests.

Closes #37845 from zhengruifeng/ps_df_corr_missing_value.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
In the PR, I propose to remove the error class `INDEX_OUT_OF_BOUNDS` from `error-classes.json` and the exception `SparkIndexOutOfBoundsException`, and to replace the latter with a `SparkException` with the error class `INTERNAL_ERROR`, because the exception should not be raised in regular cases.

`ArrayDataIndexedSeq` throws the exception from `apply()`, and `ArrayDataIndexedSeq` can be created from `ArrayData.toSeq` only. The latter is invoked from two places:

1. The `Slice` expression (or the `slice` function):
https://github.com/apache/spark/blob/443eea97578c41870c343cdb88cf69bfdf27033a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1600-L1601

where any access to the produced array is guarded:
```sql
spark-sql> set spark.sql.ansi.enabled=true;
spark.sql.ansi.enabled	true
Time taken: 2.415 seconds, Fetched 1 row(s)
spark-sql> SELECT slice(array(1, 2, 3, 4), 2, 2)[4];
...
org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX] The index 4 is out of bounds. The array has 2 elements. Use the SQL function `get()` to tolerate accessing element at invalid index and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
== SQL(line 1, position 8) ==
SELECT slice(array(1, 2, 3, 4), 2, 2)[4]
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	at org.apache.spark.sql.errors.QueryExecutionErrors$.invalidArrayIndexError(QueryExecutionErrors.scala:239)
	at org.apache.spark.sql.catalyst.expressions.GetArrayItem.nullSafeEval(complexTypeExtractors.scala:271)
```
see
https://github.com/apache/spark/blob/a9bb924480e4953457dad680c15ca346f71a26c8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L268-L271

2. `MapObjects.convertToSeq`:
https://github.com/apache/spark/blob/5b96e82ad6a4f5d5e4034d9d7112077159cf5044/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L886

where any access to the produced `IndexedSeq` is guarded via map-style access in
https://github.com/apache/spark/blob/5b96e82ad6a4f5d5e4034d9d7112077159cf5044/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L864-L867

### Why are the changes needed?
To improve code maintenance.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt "core/testOnly *SparkThrowableSuite"
$ build/sbt "test:testOnly *ArrayDataIndexedSeqSuite"
```

Closes #37857 from MaxGekk/rm-INDEX_OUT_OF_BOUNDS.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
@pull pull bot merged commit 1439d9b into wangyum:master Sep 13, 2022
wangyum pushed a commit that referenced this pull request Jan 9, 2023
### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support Project with alias.

Refer to https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96

This PR makes it work well with aliases.

**The first example:**
the original plan is shown below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If we can completely push down the aggregate, then the plan will be:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If we can partially push down the aggregate, then the plan will be:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```

**The second example:**
the original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If we can completely push down the aggregate, then the plan will be:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If we can partially push down the aggregate, then the plan will be:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```
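
For reference, the query shape behind the second example roughly corresponds to the following DataFrame code (a sketch; the table and column names are taken from the plans above, `spark` is assumed to be an active SparkSession with the JDBC catalog configured, and whether the aggregate is fully or partially pushed down depends on the underlying data source):

```py
from pyspark.sql import functions as F

df = (
    spark.read.table("test.employee")
    .select(F.col("DEPT").alias("myDept"), F.col("SALARY").alias("mySalary"))
    .groupBy("myDept")
    .agg(F.sum("mySalary").alias("total"))
)
df.explain()
```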

### Why are the changes needed?
Supporting aliases makes aggregate push-down more useful.

### Does this PR introduce _any_ user-facing change?
Yes.
Users can see that DS V2 aggregate push-down supports Project with alias.

### How was this patch tested?
New tests.

Closes apache#35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit f327dad)
Signed-off-by: Wenchen Fan <[email protected]>
pull bot pushed a commit that referenced this pull request Jul 21, 2025
…ingBuilder`

### What changes were proposed in this pull request?

This PR aims to improve `toString` by using JEP-280 instead of `ToStringBuilder`. In addition, `Scalastyle` and `Checkstyle` rules are added to prevent a future regression.

### Why are the changes needed?

Since Java 9, `String Concatenation` has been handled better by default.

| ID | DESCRIPTION |
| - | - |
| JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280) |

For example, this PR improves `OpenBlocks` like the following. Both Java source code and byte code are simplified a lot by utilizing JEP-280 properly.

**CODE CHANGE**
```java

- return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
-   .append("appId", appId)
-   .append("execId", execId)
-   .append("blockIds", Arrays.toString(blockIds))
-   .toString();
+ return "OpenBlocks[appId=" + appId + ",execId=" + execId + ",blockIds=" +
+     Arrays.toString(blockIds) + "]";
```

**BEFORE**
```
  public java.lang.String toString();
    Code:
       0: new           #39                 // class org/apache/commons/lang3/builder/ToStringBuilder
       3: dup
       4: aload_0
       5: getstatic     #41                 // Field org/apache/commons/lang3/builder/ToStringStyle.SHORT_PREFIX_STYLE:Lorg/apache/commons/lang3/builder/ToStringStyle;
       8: invokespecial #47                 // Method org/apache/commons/lang3/builder/ToStringBuilder."<init>":(Ljava/lang/Object;Lorg/apache/commons/lang3/builder/ToStringStyle;)V
      11: ldc           #50                 // String appId
      13: aload_0
      14: getfield      #7                  // Field appId:Ljava/lang/String;
      17: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      20: ldc           #55                 // String execId
      22: aload_0
      23: getfield      #13                 // Field execId:Ljava/lang/String;
      26: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      29: ldc           #56                 // String blockIds
      31: aload_0
      32: getfield      #16                 // Field blockIds:[Ljava/lang/String;
      35: invokestatic  #57                 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
      38: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      41: invokevirtual #61                 // Method org/apache/commons/lang3/builder/ToStringBuilder.toString:()Ljava/lang/String;
      44: areturn
```

**AFTER**
```
  public java.lang.String toString();
    Code:
       0: aload_0
       1: getfield      #7                  // Field appId:Ljava/lang/String;
       4: aload_0
       5: getfield      #13                 // Field execId:Ljava/lang/String;
       8: aload_0
       9: getfield      #16                 // Field blockIds:[Ljava/lang/String;
      12: invokestatic  #39                 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
      15: invokedynamic #43,  0             // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
      20: areturn
```

### Does this PR introduce _any_ user-facing change?

No. This is a `toString` implementation improvement.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51572 from dongjoon-hyun/SPARK-52880.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
pull bot pushed a commit that referenced this pull request Aug 19, 2025
…onicalized expressions

### What changes were proposed in this pull request?

Make `PullOutNonDeterministic` use canonicalized expressions to dedup group and aggregate expressions. This affects PySpark UDFs in particular. Example:

```
from pyspark.sql.functions import col, avg, udf

pythonUDF = udf(lambda x: x).asNondeterministic()

spark.range(10)\
.selectExpr("id", "id % 3 as value")\
.groupBy(pythonUDF(col("value")))\
.agg(avg("id"), pythonUDF(col("value")))\
.explain(extended=True)
```

Currently results in a plan like this:

```
Aggregate [_nondeterministic#15], [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14]
+- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15]
   +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L]
      +- Range (0, 10, step=1, splits=Some(2))
```

and then it throws:

```
[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803
```

- How canonicalization fixes this:
  - Nondeterministic PythonUDF expressions always have distinct resultIds per UDF.
  - The fix is to canonicalize the expressions when matching. Canonicalized means that we set the resultIds to -1, allowing us to dedup the PythonUDF expressions.
- For deterministic UDFs, this rule does not apply, and the "Post Analysis" batch extracts and deduplicates the expressions, as expected.

### Why are the changes needed?

- The output of the query with the fix applied still makes sense: the nondeterministic UDF is invoked only once, in the Project.

### Does this PR introduce _any_ user-facing change?

Yes, it's additive, it enables queries to run that previously threw errors.

### How was this patch tested?

- added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic.

Authored-by: Ben Hurdelhey <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
pull bot pushed a commit that referenced this pull request Nov 3, 2025
…building

### What changes were proposed in this pull request?

This PR aims to add `libwebp-dev` to fix `dev/infra/Dockerfile` building.

### Why are the changes needed?

To fix `build_infra_images_cache` GitHub Action job
- https://github.com/apache/spark/actions/workflows/build_infra_images_cache.yml

<img width="545" height="88" alt="Screenshot 2025-11-02 at 14 56 19" src="https://github.com/user-attachments/assets/f70d6093-6574-40f3-a097-ba5c9086f3c1" />

The root cause is identical to the other Dockerfile failures.
```
#13 578.4 -------------------------- [ERROR MESSAGE] ---------------------------
#13 578.4 <stdin>:1:10: fatal error: ft2build.h: No such file or directory
#13 578.4 compilation terminated.
#13 578.4 --------------------------------------------------------------------
#13 578.4 ERROR: configuration failed for package 'ragg'
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs, especially the `Cache base image` test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#52840 from dongjoon-hyun/SPARK-54141.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>