
@pull pull bot commented Oct 19, 2022

See Commits and Changes for more details.


Created by pull[bot]


SandishKumarHN and others added 11 commits October 19, 2022 12:04
… to_protobuf

From SandishKumarHN (sanysandish@gmail.com) and Mohan Parthasarathy (mposdev21@gmail.com)

This PR follows main PR #37972

The following is an example of how to use from_protobuf and to_protobuf in PySpark.

```python
import tempfile

from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf

data = [("1", (2, "Alice", 109200))]
ddl_schema = "key STRING, value STRUCT<age: INTEGER, name: STRING, score: LONG>"
df = spark.createDataFrame(data, ddl_schema)

# Hex dump of a protobuf descriptor, generated from
# connector/protobuf/src/test/resources/protobuf/pyspark_test.proto
desc_hex = ('0ACE010A41636F6E6E6563746F722F70726F746F6275662F7372632F746573742F726'
            '5736F75726365732F70726F746F6275662F7079737061726B5F746573742E70726F746F121D6F72672E61'
            '70616368652E737061726B2E73716C2E70726F746F627566224B0A0D53696D706C654D657373616765121'
            '00A03616765180120012805520361676512120A046E616D6518022001280952046E616D6512140A057363'
            '6F7265180320012803520573636F72654215421353696D706C654D65737361676550726F746F736206707'
            '26F746F33')

# Write the protobuf descriptor into a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    desc_file_path = "%s/pyspark_test.desc" % tmp_dir
    with open(desc_file_path, "wb") as f:
        f.write(bytearray.fromhex(desc_hex))
    message_name = "SimpleMessage"
    proto_df = df.select(
        to_protobuf(df.value, desc_file_path, message_name).alias("value"))
    proto_df.show(truncate=False)
    # +----------------------------------------+
    # |value                                   |
    # +----------------------------------------+
    # |[08 02 12 05 41 6C 69 63 65 18 90 D5 06]|
    # +----------------------------------------+
    proto_df = proto_df.select(
        from_protobuf(proto_df.value, desc_file_path, message_name).alias("value"))
    proto_df.show(truncate=False)
    # +------------------+
    # |value             |
    # +------------------+
    # |{2, Alice, 109200}|
    # +------------------+
```

### Tests Covered
- from_protobuf / to_protobuf (functions.py)

Closes #38212 from SandishKumarHN/PYSPARK_PROTOBUF.

Authored-by: SandishKumarHN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

This adds additional checkpoint rename file check.

### Why are the changes needed?

We recently encountered an issue where a customer's Structured Streaming job failed to read a delta file.

The temporary file existed, but it had not been successfully renamed to the final delta file path.

We currently don't check whether the renamed file exists but assume the rename succeeded. As a result, failing to read a delta file that is assumed to have been committed in the last batch makes re-triggering the job impossible.

Checking that the renamed checkpoint file actually exists lets us catch this problem in advance.
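The rename-then-verify pattern can be sketched in a few lines of Python (a hedged illustration of the idea only — Spark's actual checkpoint code is Scala, and the function name here is hypothetical):

```python
import os
import tempfile

def commit_checkpoint(tmp_path, final_path):
    """Rename a temporary checkpoint file and verify the rename took effect.

    On some file systems a rename can fail without raising, so checking the
    destination afterwards surfaces the failure at commit time instead of at
    the next (impossible-to-recover) read.
    """
    os.rename(tmp_path, final_path)
    if not os.path.exists(final_path):
        raise IOError("Failed to rename %s to %s" % (tmp_path, final_path))
    return final_path

with tempfile.TemporaryDirectory() as d:
    tmp = os.path.join(d, "0.delta.tmp")
    final = os.path.join(d, "0.delta")
    with open(tmp, "wb") as f:
        f.write(b"state")
    commit_checkpoint(tmp, final)
    assert os.path.exists(final) and not os.path.exists(tmp)
```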

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #38291 from viirya/add_file_check.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…alArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand

### What changes were proposed in this pull request?
This PR aims to use SparkIllegalArgumentException instead of IllegalArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand.

### Why are the changes needed?
While working on https://issues.apache.org/jira/browse/SPARK-40790,
I found that when `location` is empty, the DDL commands (CreateDatabaseCommand & AlterDatabaseSetLocationCommand) throw a plain IllegalArgumentException, which does not fit into the new error framework.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #38274 from panbingkun/setNamespaceLocation_error.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…tion`

### What changes were proposed in this pull request?
In the PR, I propose to port the following test suites onto `checkError` to check the valuable error parts instead of error messages:
- JDBCWriteSuite
- JDBCTableCatalogSuite
- JdbcUtilsSuite
- CreateTableAsSelectSuite
- SparkScriptTransformationSuite
- SQLViewSuite
- InsertSuite

### Why are the changes needed?
Migrating to `checkError()` makes the tests independent of the text of error messages.
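The idea behind `checkError()` can be pictured with a small Python sketch (class and field names here are hypothetical; Spark's real implementation is Scala): assert on the stable error class and message parameters, not the rendered message text.

```python
class SparkError(Exception):
    """Hypothetical structured error carrying an error class and parameters."""
    def __init__(self, error_class, message_parameters):
        self.error_class = error_class
        self.message_parameters = message_parameters
        # The human-readable text may be reworded without breaking tests.
        super().__init__("[%s] %s" % (error_class, message_parameters))

def check_error(exc, error_class, parameters):
    # Compare only the stable, machine-readable parts of the error.
    assert exc.error_class == error_class
    assert exc.message_parameters == parameters

err = SparkError("UNSUPPORTED_DATATYPE", {"typeName": "INTERVAL"})
check_error(err, "UNSUPPORTED_DATATYPE", {"typeName": "INTERVAL"})
```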

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "test:testOnly *SQLViewSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalogSuite"
$ build/sbt "test:testOnly *JdbcUtilsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SparkScriptTransformationSuite"
```

Closes #38267 from MaxGekk/parseexception-checkError.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…onnect expression

### What changes were proposed in this pull request?

This PR is a follow-up of #38270 that changes `__lt__` to use `<`, the less-than comparison.

### Why are the changes needed?

To make the less-than comparison use `<` properly.
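The fix can be pictured with a toy expression class (a hedged sketch; the real Spark Connect `Column` and expression classes differ): `__lt__` must build a `<` comparison node, not some other operator.

```python
class BinaryExpression:
    """Toy comparison node recording the operator and its operands."""
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

class Column:
    """Toy column: Python's __lt__ dunder must map to the "<" operator."""
    def __init__(self, name):
        self.name = name

    def __lt__(self, other):
        return BinaryExpression("<", self, other)

expr = Column("age") < 21
assert expr.op == "<"
assert expr.left.name == "age"
```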

### Does this PR introduce _any_ user-facing change?

No, the original change is not released yet.

### How was this patch tested?

Unit test was added.

Closes #38303 from HyukjinKwon/SPARK-40538-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Fix `test_connect_column_expressions.py`.

### Why are the changes needed?
The master branch is broken:

```
ERROR [0.006s]: test_column_expressions (pyspark.sql.tests.connect.test_connect_column_expressions.SparkConnectColumnExpressionSuite)
Test a more complex combination of expressions and their translation into
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_column_expressions.py", line 80, in test_column_expressions
    mod_fun.unresolved_function.arguments[0].unresolved_attribute.parts, ["id"]
AttributeError: parts
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested locally.

Closes #38308 from zhengruifeng/connect_hot_fix_expressions.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…t anchor

### What changes were proposed in this pull request?

This PR proposes to disable some tests for `DataFrame.corrwith`, since there is a regression in pandas 1.5.0.

We should re-enable them after it is fixed in a future pandas release.

### Why are the changes needed?

There is a bug in pandas' `DataFrame.corrwith` in some cases, so we should skip these tests for now.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing CI should pass

Closes #38292 from itholic/SPARK-40779.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

This PR decouples `LIMIT` and `OFFSET` into two plan nodes, which matches the DataFrame semantics where `OFFSET` is an independent operation and can happen before a `LIMIT` (or any other DataFrame operation).

### Why are the changes needed?

Improve the support for LIMIT and OFFSET.
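The decoupling can be sketched with hypothetical Python dataclasses (not the actual Connect proto): since each node just wraps a child plan, `Offset` can sit below or above `Limit` freely.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Relation:
    rows: List[int]
    def eval(self):
        return self.rows

@dataclass
class Limit:
    child: object
    n: int
    def eval(self):
        # Keep the first n rows of the child plan.
        return self.child.eval()[: self.n]

@dataclass
class Offset:
    child: object
    n: int
    def eval(self):
        # Skip the first n rows of the child plan.
        return self.child.eval()[self.n :]

# OFFSET applied before LIMIT -- possible because they are separate nodes.
plan = Limit(Offset(Relation([1, 2, 3, 4, 5]), n=1), n=2)
assert plan.eval() == [2, 3]
```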

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #38275 from amaliujia/add_limit_offset_to_dsl.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Delete the directories that were created in advance in the test cases, and add checks for whether the directories exist.

### Why are the changes needed?
The implementation classes of `ExternalCatalog` perform folder operations for operations such as create/drop database/table/partition. The test cases created the folders in advance when obtaining the DB/partition path URIs, but in unit tests the external catalog does not actually create files, so the results of the test cases were not convincing enough.
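The point can be sketched in Python (the function here is a hypothetical stand-in; Spark's catalog code is Scala): a convincing test asserts on the directory the catalog itself creates, rather than pre-creating it.

```python
import os
import tempfile

def create_database(base_dir, name):
    # Hypothetical stand-in for an ExternalCatalog implementation:
    # creating a database should create its directory on disk.
    path = os.path.join(base_dir, name + ".db")
    os.makedirs(path)
    return path

with tempfile.TemporaryDirectory() as warehouse:
    db_path = create_database(warehouse, "testdb")
    # Assert on the side effect instead of pre-creating the folder.
    assert os.path.isdir(db_path)
```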

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

tests were added

Closes #38206 from huangxiaopingRD/fix_catalog_ut.

Authored-by: huangxiaoping <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ect *

### What changes were proposed in this pull request?

1. Sync the newest proto to the Python client.
2. Update Aggregate to match the proto change.
3. Change `select` to accept both `Column` and `str`.
4. Make sure `*` passes through the entire path, which has been implemented on the server side in #38023.

### Why are the changes needed?

Update python client side to match the change in connect proto.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

UT

Closes #38218 from amaliujia/select_start_in_python.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Remove unused apt lists cache

### Why are the changes needed?
Clean cache to reduce docker image size.

This is also [recommended](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run) by the Docker community:

```
$ docker run --user 0:0 -ti apache/spark bash
root@5d1ca347279e:/opt/spark/work-dir# ls /var/lib/apt/lists/
auxfiles								     lock
deb.debian.org_debian_dists_bullseye-updates_InRelease			     partial
deb.debian.org_debian_dists_bullseye-updates_main_binary-arm64_Packages.lz4  security.debian.org_debian-security_dists_bullseye-security_InRelease
deb.debian.org_debian_dists_bullseye_InRelease				     security.debian.org_debian-security_dists_bullseye-security_main_binary-arm64_Packages.lz4
deb.debian.org_debian_dists_bullseye_main_binary-arm64_Packages.lz4
root@5d1ca347279e:/opt/spark/work-dir# du --max-depth=1 -h /var/lib/apt/lists/
4.0K	/var/lib/apt/lists/partial
4.0K	/var/lib/apt/lists/auxfiles
17M	/var/lib/apt/lists/
```

### Does this PR introduce _any_ user-facing change?
Yes, to some extent: the image size is reduced.

### How was this patch tested?
K8s CI passed

Closes #38298 from Yikun/SPARK-40513.

Authored-by: Yikun Jiang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to flip the default value of Kafka offset fetching config (`spark.sql.streaming.kafka.useDeprecatedOffsetFetching`) from `true` to `false`, which enables AdminClient based offset fetching by default.

### Why are the changes needed?

We have encountered several production issues with the old offset fetching (e.g. hangs, issues with Kafka consumer group rebalance) which could be mitigated with the new offset fetching. Despite the breaking change on the ACL side, there is no need for most users to suffer from the old way.

The discussion went through the dev. mailing list: https://lists.apache.org/thread/spkco94gw33sj8355mhlxz1vl7gl1g5c

### Does this PR introduce _any_ user-facing change?

Yes, especially for users who rely on Kafka ACLs based on consumer groups. They need to either adjust the ACL to a topic-based one, or set `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` to `true` to use the old approach.

### How was this patch tested?

Existing UTs.

Closes #38306 from HeartSaVioR/SPARK-40844.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
LuciferYang and others added 2 commits October 20, 2022 09:16
…first

### What changes were proposed in this pull request?
This PR aims to pin the Java versions used by GitHub Actions so that the GA tests pass first:

- Java 8 pin to 8.0.345
- Java 11 pin to 11.0.16
- Java 17 pin to 17.0.4

This change should be reverted after the root cause is found.

### Why are the changes needed?
Make GA pass first.

The following tests failed with 8u352/11.0.17/17.0.5:

```
[info] *** 12 TESTS FAILED ***
[error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5
[error] Failed tests:
[error] 	org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite
[error] 	org.apache.spark.sql.catalyst.util.TimestampFormatterSuite
[error] 	org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite
[error] 	org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite
[error] 	org.apache.spark.sql.catalyst.expressions.TryCastSuite
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #38311 from LuciferYang/java-version.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…nnect

### What changes were proposed in this pull request?

Add an initial Read API for Spark Connect that allows setting schema, format, option and path, and then reading files into a DataFrame.
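The shape of such a Read API can be sketched as a small builder (the names below are hypothetical; the actual client assembles a `Read` relation in the Connect plan):

```python
class DataFrameReader:
    """Toy builder accumulating schema/format/options before load()."""
    def __init__(self):
        self._format = None
        self._schema = None
        self._options = {}

    def format(self, source):
        self._format = source
        return self

    def schema(self, ddl):
        self._schema = ddl
        return self

    def option(self, key, value):
        self._options[key] = value
        return self

    def load(self, path):
        # In the real client this would become a Read relation in the plan.
        return {"format": self._format, "schema": self._schema,
                "options": self._options, "path": path}

plan = (DataFrameReader().format("parquet")
        .schema("id LONG, name STRING")
        .option("mergeSchema", "true")
        .load("/tmp/data"))
assert plan["format"] == "parquet"
```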

### Why are the changes needed?

PySpark readwriter API parity for Spark Connect

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

UT

Closes #38086 from amaliujia/SPARK-40539.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@wangyum wangyum merged commit 01c7a46 into wangyum:master Oct 20, 2022
pull bot pushed a commit that referenced this pull request Jul 21, 2025
…ingBuilder`

### What changes were proposed in this pull request?

This PR aims to improve `toString` by using `JEP-280` instead of `ToStringBuilder`. In addition, `Scalastyle` and `Checkstyle` rules are added to prevent a future regression.

### Why are the changes needed?

Since Java 9, `String Concatenation` has been handled better by default.

| ID | DESCRIPTION |
| - | - |
| JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280) |

For example, this PR improves `OpenBlocks` like the following. Both Java source code and byte code are simplified a lot by utilizing JEP-280 properly.

**CODE CHANGE**
```java

- return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
-   .append("appId", appId)
-   .append("execId", execId)
-   .append("blockIds", Arrays.toString(blockIds))
-   .toString();
+ return "OpenBlocks[appId=" + appId + ",execId=" + execId + ",blockIds=" +
+     Arrays.toString(blockIds) + "]";
```

**BEFORE**
```
  public java.lang.String toString();
    Code:
       0: new           #39                 // class org/apache/commons/lang3/builder/ToStringBuilder
       3: dup
       4: aload_0
       5: getstatic     #41                 // Field org/apache/commons/lang3/builder/ToStringStyle.SHORT_PREFIX_STYLE:Lorg/apache/commons/lang3/builder/ToStringStyle;
       8: invokespecial #47                 // Method org/apache/commons/lang3/builder/ToStringBuilder."<init>":(Ljava/lang/Object;Lorg/apache/commons/lang3/builder/ToStringStyle;)V
      11: ldc           #50                 // String appId
      13: aload_0
      14: getfield      #7                  // Field appId:Ljava/lang/String;
      17: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      20: ldc           #55                 // String execId
      22: aload_0
      23: getfield      #13                 // Field execId:Ljava/lang/String;
      26: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      29: ldc           #56                 // String blockIds
      31: aload_0
      32: getfield      #16                 // Field blockIds:[Ljava/lang/String;
      35: invokestatic  #57                 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
      38: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      41: invokevirtual #61                 // Method org/apache/commons/lang3/builder/ToStringBuilder.toString:()Ljava/lang/String;
      44: areturn
```

**AFTER**
```
  public java.lang.String toString();
    Code:
       0: aload_0
       1: getfield      #7                  // Field appId:Ljava/lang/String;
       4: aload_0
       5: getfield      #13                 // Field execId:Ljava/lang/String;
       8: aload_0
       9: getfield      #16                 // Field blockIds:[Ljava/lang/String;
      12: invokestatic  #39                 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
      15: invokedynamic #43,  0             // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
      20: areturn
```

### Does this PR introduce _any_ user-facing change?

No. This is a `toString` implementation improvement.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51572 from dongjoon-hyun/SPARK-52880.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>