[SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWithStateExec operator #13

ericm-db · 2024-06-27T00:01:36Z

What changes were proposed in this pull request?

This PR introduces the OperatorStateMetadataV2 format and writes it out through the OperatorStateMetadataLog. The metadata file holds a pointer to the state schema file and is written during the planning phase.

Why are the changes needed?

The new format lets us keep arbitrary operator properties as part of the OperatorStateMetadata type, and the metadata file lets us locate and read the latest state schema file.
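
As a rough illustration of the idea only (this is not Spark's actual schema or API), the sketch below models a metadata entry that carries arbitrary operator properties plus a pointer to a state schema file, and a log from which the latest pointer can be read. Every name here (`OperatorMetadataSketch`, `MetadataLogSketch`, `state_schema_file_path`, the example paths) is hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, Optional


@dataclass
class OperatorMetadataSketch:
    """Hypothetical stand-in for a V2 metadata entry (not Spark's real schema)."""
    operator_name: str
    operator_properties: Dict[str, str]  # arbitrary key/value operator properties
    state_schema_file_path: str          # pointer to the state schema file

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "OperatorMetadataSketch":
        return OperatorMetadataSketch(**json.loads(text))


class MetadataLogSketch:
    """In-memory stand-in for a metadata log keyed by batch id."""

    def __init__(self) -> None:
        self._entries: Dict[int, str] = {}

    def add(self, batch_id: int, entry: OperatorMetadataSketch) -> bool:
        # Refuse to overwrite an existing batch, as an append-only log would.
        if batch_id in self._entries:
            return False
        self._entries[batch_id] = entry.to_json()
        return True

    def latest(self) -> Optional[OperatorMetadataSketch]:
        if not self._entries:
            return None
        return OperatorMetadataSketch.from_json(self._entries[max(self._entries)])


def latest_state_schema_path(log: MetadataLogSketch) -> Optional[str]:
    """Follow the pointer in the newest metadata entry, if there is one."""
    entry = log.latest()
    return entry.state_schema_file_path if entry else None
```

In this toy model, a reader only needs the newest log entry to find the current schema file, which is the property the description above relies on.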

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests to TransformWithStateSuite.

Was this patch authored or co-authored using generative AI tooling?

No

…ingBuilder`

### What changes were proposed in this pull request?

This PR aims to improve `toString` by `JEP-280` instead of `ToStringBuilder`. In addition, `Scalastyle` and `Checkstyle` rules are added to prevent a future regression.

### Why are the changes needed?

Since Java 9, `String Concatenation` has been handled better by default.

| ID | DESCRIPTION |
| - | - |
| JEP-280 | [Indify String Concatenation](https://openjdk.org/jeps/280) |

For example, this PR improves `OpenBlocks` like the following. Both Java source code and byte code are simplified a lot by utilizing JEP-280 properly.

**CODE CHANGE**

```java
- return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
-   .append("appId", appId)
-   .append("execId", execId)
-   .append("blockIds", Arrays.toString(blockIds))
-   .toString();
+ return "OpenBlocks[appId=" + appId + ",execId=" + execId + ",blockIds=" +
+     Arrays.toString(blockIds) + "]";
```

**BEFORE**

```
public java.lang.String toString();
  Code:
     0: new #39 // class org/apache/commons/lang3/builder/ToStringBuilder
     3: dup
     4: aload_0
     5: getstatic #41 // Field org/apache/commons/lang3/builder/ToStringStyle.SHORT_PREFIX_STYLE:Lorg/apache/commons/lang3/builder/ToStringStyle;
     8: invokespecial #47 // Method org/apache/commons/lang3/builder/ToStringBuilder."<init>":(Ljava/lang/Object;Lorg/apache/commons/lang3/builder/ToStringStyle;)V
    11: ldc #50 // String appId
    13: aload_0
    14: getfield #7 // Field appId:Ljava/lang/String;
    17: invokevirtual #51 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    20: ldc #55 // String execId
    22: aload_0
    23: getfield #13 // Field execId:Ljava/lang/String;
    26: invokevirtual #51 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    29: ldc #56 // String blockIds
    31: aload_0
    32: getfield #16 // Field blockIds:[Ljava/lang/String;
    35: invokestatic #57 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
    38: invokevirtual #51 // Method org/apache/commons/lang3/builder/ToStringBuilder.append:(Ljava/lang/String;Ljava/lang/Object;)Lorg/apache/commons/lang3/builder/ToStringBuilder;
    41: invokevirtual #61 // Method org/apache/commons/lang3/builder/ToStringBuilder.toString:()Ljava/lang/String;
    44: areturn
```

**AFTER**

```
public java.lang.String toString();
  Code:
     0: aload_0
     1: getfield #7 // Field appId:Ljava/lang/String;
     4: aload_0
     5: getfield #13 // Field execId:Ljava/lang/String;
     8: aload_0
     9: getfield #16 // Field blockIds:[Ljava/lang/String;
    12: invokestatic #39 // Method java/util/Arrays.toString:([Ljava/lang/Object;)Ljava/lang/String;
    15: invokedynamic #43, 0 // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
    20: areturn
```

### Does this PR introduce _any_ user-facing change?

No. This is a `toString` implementation improvement.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#51572 from dongjoon-hyun/SPARK-52880.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

…onicalized expressions

### What changes were proposed in this pull request?

Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects pyspark udfs in particular. Example:

```
from pyspark.sql.functions import col, avg, udf

pythonUDF = udf(lambda x: x).asNondeterministic()

spark.range(10)\
    .selectExpr("id", "id % 3 as value")\
    .groupBy(pythonUDF(col("value")))\
    .agg(avg("id"), pythonUDF(col("value")))\
    .explain(extended=True)
```

Currently results in a plan like this:

```
Aggregate [_nondeterministic#15], [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14]
+- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15]
   +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L]
      +- Range (0, 10, step=1, splits=Some(2))
```

and then it throws:

```
[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803
```

- how canonicalized fixes this:
  - nondeterministic PythonUDF expressions always have distinct resultIds per udf
  - The fix is to canonicalize the expressions when matching. Canonicalized means that we're setting the resultIds to -1, allowing us to dedup the PythonUDF expressions.
  - for deterministic UDFs, this rule does not apply and "Post Analysis" batch extracts and deduplicates the expressions, as expected

### Why are the changes needed?

- the output of the query with the fix applied still makes sense
- the nondeterministic UDF is invoked only once, in the project.

### Does this PR introduce _any_ user-facing change?

Yes, it's additive, it enables queries to run that previously threw errors.

### How was this patch tested?

- added unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic.

Authored-by: Ben Hurdelhey <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
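
The dedup-by-canonical-form idea described in this commit message can be sketched with a toy model. This is not Catalyst code: `UdfCall`, `canonicalized`, `pull_out`, and the generated alias names are hypothetical, and the only detail taken from the message is that canonicalizing sets the result id to -1 so that otherwise identical UDF calls match.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UdfCall:
    """Toy expression: a UDF applied to one column, tagged with a result id."""
    name: str
    arg: str
    result_id: int

    def canonicalized(self) -> "UdfCall":
        # Erase the result id so structurally identical calls compare equal.
        return replace(self, result_id=-1)


def pull_out(exprs: List[UdfCall]) -> Tuple[Dict[UdfCall, str], List[str]]:
    """Assign one alias per distinct canonical expression.

    Returns the alias table and, for each input, the alias it resolves to.
    """
    aliases: Dict[UdfCall, str] = {}
    resolved: List[str] = []
    for expr in exprs:
        key = expr.canonicalized()
        if key not in aliases:
            aliases[key] = f"_nondeterministic_{len(aliases)}"
        resolved.append(aliases[key])
    return aliases, resolved


# The grouping and aggregate expressions differ only in their result ids.
grouping = UdfCall("pythonUDF", "value", result_id=7)
aggregate = UdfCall("pythonUDF", "value", result_id=8)
alias_table, resolved_aliases = pull_out([grouping, aggregate])
```

Matching on the raw expressions would treat `grouping` and `aggregate` as different and produce two aliases. Matching on the canonical form maps both to the same alias, so in this toy model the UDF is projected once.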

ericm-db force-pushed the op-state-metadata branch from fd4396f to 2c35d5f Compare June 27, 2024 00:02

github-actions bot added SQL STRUCTURED STREAMING labels Jun 27, 2024

ericm-db force-pushed the op-state-metadata branch 3 times, most recently from 37362b2 to 5458d30 Compare July 2, 2024 23:17

ericm-db changed the title ~~Op state metadata~~ Introducing OperatorStateMetadataV2 for TransformWithState operator Jul 2, 2024

ericm-db force-pushed the op-state-metadata branch 2 times, most recently from 5a3fab4 to 9f24341 Compare July 3, 2024 22:02

jingz-db added 4 commits July 8, 2024 10:23

a base change a draft suite

4849f20

working version, will write test suites and test for composite types

2bbd2ce

a suite with composite type, why key encoder spec overwritten

4f5185a

fix suites & add TTL suites

00741ff

ericm-db force-pushed the op-state-metadata branch from f48db5b to 44da39a Compare July 9, 2024 16:31

ericm-db added 16 commits July 9, 2024 09:48

feedback

3691a16

creating operatorstatemetadata log

0ad3679

removing ': Array[StateStoreMetadata]'

ef86e37

adding operatorProperties as a metadata column

06940c3

changing the order of the metadata

a668b77

tests pass

a057166

test case

07ccd55

rebase

03265af

Feedback

22b8b0a

files written correctly

37392bf

tests minus purging

cbbd47f

tests pass

81e1fb1

tests pass

99609ee

hdfsmetadatalog

77ffe95

feedback

6c90c9f

checking the OperatorStateMetadata log for the state schema file

b638592

ericm-db added 2 commits July 9, 2024 09:53

adding todo

58e1947

removing println, test passes

6ff37f4

ericm-db force-pushed the op-state-metadata branch from 44da39a to 6ff37f4 Compare July 9, 2024 17:43

ericm-db changed the title ~~Introducing OperatorStateMetadataV2 for TransformWithState operator~~ [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWithStateExec operator Jul 9, 2024

ericm-db closed this Jul 9, 2024
