
@pull pull bot commented Oct 19, 2022

See Commits and Changes for more details.


Created by pull[bot]


SandishKumarHN and others added 11 commits October 19, 2022 12:04
… to_protobuf

From SandishKumarHN (sanysandish@gmail.com) and Mohan Parthasarathy (mposdev21@gmail.com)

This PR follows main PR #37972

The following is an example of how to use from_protobuf and to_protobuf in PySpark.

```python
import tempfile

from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf

data = [("1", (2, "Alice", 109200))]
ddl_schema = "key STRING, value STRUCT<age: INTEGER, name: STRING, score: LONG>"
df = spark.createDataFrame(data, ddl_schema)

# Hex dump of a protobuf descriptor, generated from
# connector/protobuf/src/test/resources/protobuf/pyspark_test.proto
desc_hex = ('0ACE010A41636F6E6E6563746F722F70726F746F6275662F7372632F746573742F726'
            '5736F75726365732F70726F746F6275662F7079737061726B5F746573742E70726F746F121D6F72672E61'
            '70616368652E737061726B2E73716C2E70726F746F627566224B0A0D53696D706C654D657373616765121'
            '00A03616765180120012805520361676512120A046E616D6518022001280952046E616D6512140A057363'
            '6F7265180320012803520573636F72654215421353696D706C654D65737361676550726F746F736206707'
            '26F746F33')

# Write the protobuf descriptor into a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    desc_file_path = "%s/pyspark_test.desc" % tmp_dir
    with open(desc_file_path, "wb") as f:
        f.write(bytearray.fromhex(desc_hex))
    message_name = "SimpleMessage"
    proto_df = df.select(
        to_protobuf(df.value, desc_file_path, message_name).alias("value"))
    proto_df.show(truncate=False)
    # +----------------------------------------+
    # |value                                   |
    # +----------------------------------------+
    # |[08 02 12 05 41 6C 69 63 65 18 90 D5 06]|
    # +----------------------------------------+
    proto_df = proto_df.select(
        from_protobuf(proto_df.value, desc_file_path, message_name).alias("value"))
    proto_df.show(truncate=False)
    # +------------------+
    # |value             |
    # +------------------+
    # |{2, Alice, 109200}|
    # +------------------+
```

### Tests Covered
- from_protobuf / to_protobuf (functions.py)

Closes #38212 from SandishKumarHN/PYSPARK_PROTOBUF.

Authored-by: SandishKumarHN <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

This adds additional checkpoint rename file check.

### Why are the changes needed?

We recently encountered an issue where a customer's Structured Streaming job failed to read a delta file.

The temporary file existed, but it had not been successfully renamed to the final delta file path.

We currently don't check whether the renamed file exists but assume the rename succeeded. As a result, failing to read a delta file that is assumed to have been committed in the last batch makes re-triggering the job impossible.

Checking that the renamed checkpoint file actually exists lets us catch this problem in advance.
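The rename-then-verify pattern can be sketched in a few lines of Python (a hedged illustration of the idea only — Spark's actual checkpoint code is Scala, and the function name here is hypothetical):

```python
import os
import tempfile

def commit_checkpoint(tmp_path, final_path):
    """Rename a temporary checkpoint file and verify the rename took effect.

    On some file systems a rename can fail without raising, so checking the
    destination afterwards surfaces the failure at commit time instead of at
    the next (impossible-to-recover) read.
    """
    os.rename(tmp_path, final_path)
    if not os.path.exists(final_path):
        raise IOError("Failed to rename %s to %s" % (tmp_path, final_path))
    return final_path

with tempfile.TemporaryDirectory() as d:
    tmp = os.path.join(d, "0.delta.tmp")
    final = os.path.join(d, "0.delta")
    with open(tmp, "wb") as f:
        f.write(b"state")
    commit_checkpoint(tmp, final)
    assert os.path.exists(final) and not os.path.exists(tmp)
```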

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #38291 from viirya/add_file_check.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…alArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand

### What changes were proposed in this pull request?
This PR aims to use SparkIllegalArgumentException instead of IllegalArgumentException in CreateDatabaseCommand & AlterDatabaseSetLocationCommand.

### Why are the changes needed?
While working on https://issues.apache.org/jira/browse/SPARK-40790,
I found that when `location` is empty, the DDL commands (CreateDatabaseCommand & AlterDatabaseSetLocationCommand) throw a plain IllegalArgumentException, which does not fit into the new error framework.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #38274 from panbingkun/setNamespaceLocation_error.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…tion`

### What changes were proposed in this pull request?
In the PR, I propose to port the following test suites onto `checkError` to check the valuable error parts instead of error messages:
- JDBCWriteSuite
- JDBCTableCatalogSuite
- JdbcUtilsSuite
- CreateTableAsSelectSuite
- SparkScriptTransformationSuite
- SQLViewSuite
- InsertSuite

### Why are the changes needed?
Migrating to `checkError()` makes the tests independent of the text of error messages.
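The idea behind `checkError()` can be pictured with a small Python sketch (class and field names here are hypothetical; Spark's real implementation is Scala): assert on the stable error class and message parameters, not the rendered message text.

```python
class SparkError(Exception):
    """Hypothetical structured error carrying an error class and parameters."""
    def __init__(self, error_class, message_parameters):
        self.error_class = error_class
        self.message_parameters = message_parameters
        # The human-readable text may be reworded without breaking tests.
        super().__init__("[%s] %s" % (error_class, message_parameters))

def check_error(exc, error_class, parameters):
    # Compare only the stable, machine-readable parts of the error.
    assert exc.error_class == error_class
    assert exc.message_parameters == parameters

err = SparkError("UNSUPPORTED_DATATYPE", {"typeName": "INTERVAL"})
check_error(err, "UNSUPPORTED_DATATYPE", {"typeName": "INTERVAL"})
```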

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "test:testOnly *SQLViewSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalogSuite"
$ build/sbt "test:testOnly *JdbcUtilsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SparkScriptTransformationSuite"
```

Closes #38267 from MaxGekk/parseexception-checkError.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…onnect expression

### What changes were proposed in this pull request?

This PR is a follow-up of #38270 that changes `__lt__` to use `<`, the less-than comparison.

### Why are the changes needed?

To make the less-than comparison use `<` properly.
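The fix can be pictured with a toy expression class (a hedged sketch; the real Spark Connect `Column` and expression classes differ): `__lt__` must build a `<` comparison node, not some other operator.

```python
class BinaryExpression:
    """Toy comparison node recording the operator and its operands."""
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

class Column:
    """Toy column: Python's __lt__ dunder must map to the "<" operator."""
    def __init__(self, name):
        self.name = name

    def __lt__(self, other):
        return BinaryExpression("<", self, other)

expr = Column("age") < 21
assert expr.op == "<"
assert expr.left.name == "age"
```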

### Does this PR introduce _any_ user-facing change?

No, the original change is not released yet.

### How was this patch tested?

Unit test was added.

Closes #38303 from HyukjinKwon/SPARK-40538-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Fix `test_connect_column_expressions.py`.

### Why are the changes needed?
The master branch is broken:

```
ERROR [0.006s]: test_column_expressions (pyspark.sql.tests.connect.test_connect_column_expressions.SparkConnectColumnExpressionSuite)
Test a more complex combination of expressions and their translation into
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_column_expressions.py", line 80, in test_column_expressions
    mod_fun.unresolved_function.arguments[0].unresolved_attribute.parts, ["id"]
AttributeError: parts
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested locally.

Closes #38308 from zhengruifeng/connect_hot_fix_expressions.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…t anchor

### What changes were proposed in this pull request?

This PR proposes to disable some tests for `DataFrame.corrwith`, since there is a regression in pandas 1.5.0.

We should re-enable them after it is fixed in a future pandas release.

### Why are the changes needed?

There is a bug in pandas' `DataFrame.corrwith` in some cases, so we should skip these tests for now.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The existing CI should pass

Closes #38292 from itholic/SPARK-40779.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

This PR decouples `LIMIT` and `OFFSET` into two plan nodes, which matches the DataFrame semantics where `OFFSET` is an independent operation and can happen before a `LIMIT` (or any other DataFrame operation).

### Why are the changes needed?

Improve the support for LIMIT and OFFSET.
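The decoupling can be sketched with hypothetical Python dataclasses (not the actual Connect proto): since each node just wraps a child plan, `Offset` can sit below or above `Limit` freely.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Relation:
    rows: List[int]
    def eval(self):
        return self.rows

@dataclass
class Limit:
    child: object
    n: int
    def eval(self):
        # Keep the first n rows of the child plan.
        return self.child.eval()[: self.n]

@dataclass
class Offset:
    child: object
    n: int
    def eval(self):
        # Skip the first n rows of the child plan.
        return self.child.eval()[self.n :]

# OFFSET applied before LIMIT -- possible because they are separate nodes.
plan = Limit(Offset(Relation([1, 2, 3, 4, 5]), n=1), n=2)
assert plan.eval() == [2, 3]
```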

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #38275 from amaliujia/add_limit_offset_to_dsl.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Delete the directories that were created in advance in the test cases, and add checks for whether the directories exist.

### Why are the changes needed?
The implementation classes of `ExternalCatalog` perform folder operations for operations such as create/drop database/table/partition. The test cases created the folders in advance when obtaining the DB/partition path URIs, but in unit tests the external catalog does not actually create files, so the results of the test cases were not convincing enough.
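The point can be sketched in Python (the function here is a hypothetical stand-in; Spark's catalog code is Scala): a convincing test asserts on the directory the catalog itself creates, rather than pre-creating it.

```python
import os
import tempfile

def create_database(base_dir, name):
    # Hypothetical stand-in for an ExternalCatalog implementation:
    # creating a database should create its directory on disk.
    path = os.path.join(base_dir, name + ".db")
    os.makedirs(path)
    return path

with tempfile.TemporaryDirectory() as warehouse:
    db_path = create_database(warehouse, "testdb")
    # Assert on the side effect instead of pre-creating the folder.
    assert os.path.isdir(db_path)
```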

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

tests were added

Closes #38206 from huangxiaopingRD/fix_catalog_ut.

Authored-by: huangxiaoping <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ect *

### What changes were proposed in this pull request?

1. Sync the newest proto to the Python client.
2. Update Aggregate to match the proto change.
3. Change `select` to accept both `Column` and `str`.
4. Make sure `*` passes through the entire path, which has been implemented on the server side in #38023.

### Why are the changes needed?

Update python client side to match the change in connect proto.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

UT

Closes #38218 from amaliujia/select_start_in_python.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Remove unused apt lists cache

### Why are the changes needed?
Clean cache to reduce docker image size.

This is also [recommended](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run) by the Docker community:

```
$ docker run --user 0:0 -ti apache/spark bash
root@5d1ca347279e:/opt/spark/work-dir# ls /var/lib/apt/lists/
auxfiles								     lock
deb.debian.org_debian_dists_bullseye-updates_InRelease			     partial
deb.debian.org_debian_dists_bullseye-updates_main_binary-arm64_Packages.lz4  security.debian.org_debian-security_dists_bullseye-security_InRelease
deb.debian.org_debian_dists_bullseye_InRelease				     security.debian.org_debian-security_dists_bullseye-security_main_binary-arm64_Packages.lz4
deb.debian.org_debian_dists_bullseye_main_binary-arm64_Packages.lz4
root@5d1ca347279e:/opt/spark/work-dir# du --max-depth=1 -h /var/lib/apt/lists/
4.0K	/var/lib/apt/lists/partial
4.0K	/var/lib/apt/lists/auxfiles
17M	/var/lib/apt/lists/
```

### Does this PR introduce _any_ user-facing change?
Yes, to some extent: the image size is reduced.

### How was this patch tested?
K8s CI passed

Closes #38298 from Yikun/SPARK-40513.

Authored-by: Yikun Jiang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR proposes to flip the default value of Kafka offset fetching config (`spark.sql.streaming.kafka.useDeprecatedOffsetFetching`) from `true` to `false`, which enables AdminClient based offset fetching by default.

### Why are the changes needed?

We have encountered several production issues with the old offset fetching (e.g. hangs, issues with Kafka consumer group rebalance) which could be mitigated with the new offset fetching. Despite the breaking change on the ACL side, there is no need for most users to suffer from the old way.

The discussion went through the dev. mailing list: https://lists.apache.org/thread/spkco94gw33sj8355mhlxz1vl7gl1g5c

### Does this PR introduce _any_ user-facing change?

Yes, especially for users who rely on Kafka ACLs based on consumer groups. They need to either adjust the ACL to a topic-based one, or set `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` to `true` to use the old approach.

### How was this patch tested?

Existing UTs.

Closes #38306 from HeartSaVioR/SPARK-40844.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
LuciferYang and others added 2 commits October 20, 2022 09:16
…first

### What changes were proposed in this pull request?
This PR aims to pin the Java versions used by GitHub Actions so that the GA tests pass first:

- Java 8 pin to 8.0.345
- Java 11 pin to 11.0.16
- Java 17 pin to 17.0.4

This change should be reverted after the root cause is found.

### Why are the changes needed?
Make GA pass first.

The following tests failed with 8u352/11.0.17/17.0.5:

```
[info] *** 12 TESTS FAILED ***
[error] Failed: Total 6746, Failed 12, Errors 0, Passed 6734, Ignored 5
[error] Failed tests:
[error] 	org.apache.spark.sql.catalyst.expressions.CastWithAnsiOffSuite
[error] 	org.apache.spark.sql.catalyst.util.TimestampFormatterSuite
[error] 	org.apache.spark.sql.catalyst.expressions.CastWithAnsiOnSuite
[error] 	org.apache.spark.sql.catalyst.util.RebaseDateTimeSuite
[error] 	org.apache.spark.sql.catalyst.expressions.TryCastSuite
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #38311 from LuciferYang/java-version.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…nnect

### What changes were proposed in this pull request?

Add an initial Read API for Spark Connect that allows setting schema, format, option and path, and then reading files into a DataFrame.
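The shape of such a Read API can be sketched as a small builder (the names below are hypothetical; the actual client assembles a `Read` relation in the Connect plan):

```python
class DataFrameReader:
    """Toy builder accumulating schema/format/options before load()."""
    def __init__(self):
        self._format = None
        self._schema = None
        self._options = {}

    def format(self, source):
        self._format = source
        return self

    def schema(self, ddl):
        self._schema = ddl
        return self

    def option(self, key, value):
        self._options[key] = value
        return self

    def load(self, path):
        # In the real client this would become a Read relation in the plan.
        return {"format": self._format, "schema": self._schema,
                "options": self._options, "path": path}

plan = (DataFrameReader().format("parquet")
        .schema("id LONG, name STRING")
        .option("mergeSchema", "true")
        .load("/tmp/data"))
assert plan["format"] == "parquet"
```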

### Why are the changes needed?

PySpark readwriter API parity for Spark Connect

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

UT

Closes #38086 from amaliujia/SPARK-40539.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@wangyum wangyum merged commit 01c7a46 into wangyum:master Oct 20, 2022
pull bot pushed a commit that referenced this pull request Jul 21, 2025
…ingBuilder`

### What changes were proposed in this pull request?

This PR aims to improve `toString` by using `JEP-280` instead of `ToStringBuilder`. In addition, `Scalastyle` and `Checkstyle` rules are added to prevent a future regression.

### Why are the changes needed?

Since Java 9, `String Concatenation` has been handled better by default.

| ID | DESCRIPTION |
| - | - |
| JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280) |

For example, this PR improves `OpenBlocks` like the following. Both Java source code and byte code are simplified a lot by utilizing JEP-280 properly.

**CODE CHANGE**
```java

- return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
-   .append("appId", appId)
-   .append("execId", execId)
-   .append("blockIds", Arrays.toString(blockIds))
-   .toString();
+ return "OpenBlocks[appId=" + appId + ",execId=" + execId + ",blockIds=" +
+     Arrays.toString(blockIds) + "]";
```

**BEFORE**
```
  public java.lang.String toString();
    Code:
       0: new           #39                 // class org/apache/commons/lang3/builder/ToStringBuilder
       3: dup
       4: aload_0
       5: getstatic     #41                 // Field org/apache/commons/lang3/builder/ToStringStyle.SHORT_PREFIX_STYLE:Lorg/apache/commons/lang3/builder/ToStringStyle;
       8: invokespecial #47                 // Method org/apache/commons/lang3/builder/ToStringBuilder."<init>":(Ljava/lang/Object;Lorg/apache/commons/lang3/builder/ToStringStyle;)V
      11: ldc           #50                 // String appId
      13: aload_0
      14: getfield      #7                  // Field appId:Ljava/lang/String;
      17: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      20: ldc           #55                 // String execId
      22: aload_0
      23: getfield      #13                 // Field execId:Ljava/lang/String;
      26: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      29: ldc           #56                 // String blockIds
      31: aload_0
      32: getfield      #16                 // Field blockIds:[Ljava/lang/String;
      35: invokestatic  #57                 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
      38: invokevirtual #51                 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
      41: invokevirtual #61                 // Method org/apache/commons/lang3/builder/ToStringBuilder.toString:()Ljava/lang/String;
      44: areturn
```

**AFTER**
```
  public java.lang.String toString();
    Code:
       0: aload_0
       1: getfield      #7                  // Field appId:Ljava/lang/String;
       4: aload_0
       5: getfield      #13                 // Field execId:Ljava/lang/String;
       8: aload_0
       9: getfield      #16                 // Field blockIds:[Ljava/lang/String;
      12: invokestatic  #39                 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
      15: invokedynamic #43,  0             // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
      20: areturn
```

### Does this PR introduce _any_ user-facing change?

No. This is a `toString` implementation improvement.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51572 from dongjoon-hyun/SPARK-52880.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>