[WIP] OperatorStateMetadata validation #14

ericm-db · 2024-07-16T03:09:43Z

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

…onicalized expressions ### What changes were proposed in this pull request? Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects pyspark udfs in particular. Example: ``` from pyspark.sql.functions import col, avg, udf pythonUDF = udf(lambda x: x).asNondeterministic() spark.range(10)\ .selectExpr("id", "id % 3 as value")\ .groupBy(pythonUDF(col("value")))\ .agg(avg("id"), pythonUDF(col("value")))\ .explain(extended=True) ``` Currently results in a plan like this: ``` Aggregate [_nondeterministic#15](#15), [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14](#15%20AS%20dummyNondeterministicUDF(value)#12,%20avg(id#0L)%20AS%20avg(id)#13,%20dummyNondeterministicUDF(value#6L)#8%20AS%20dummyNondeterministicUDF(value)#14) +- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15](#0L,%20value#6L,%20dummyNondeterministicUDF(value#6L)#7%20AS%20_nondeterministic#15) +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L](#0L,%20(id#0L%20%%20cast(3%20as%20bigint))%20AS%20value#6L) +- Range (0, 10, step=1, splits=Some(2)) ``` and then it throws: ``` [[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803 ``` - how canonicalized fixes this: - nondeterministic PythonUDF expressions always have distinct resultIds per udf - The fix is to canonicalize the expressions when matching. Canonicalized means that we're setting the resultIds to -1, allowing us to dedup the PythonUDF expressions. - for deterministic UDFs, this rule does not apply and "Post Analysis" batch extracts and deduplicates the expressions, as expected ### Why are the changes needed? - the output of the query with the fix applied still makes sense - the nondeterministic UDF is invoked only once, in the project. ### Does this PR introduce _any_ user-facing change? Yes, it's additive, it enables queries to run that previously threw errors. ### How was this patch tested? - added unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic. Authored-by: Ben Hurdelhey <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

github-actions bot added SQL STRUCTURED STREAMING labels Jul 16, 2024

ericm-db changed the base branch from op-state-metadata to metadata-v2-purging July 16, 2024 03:11

ericm-db changed the base branch from metadata-v2-purging to op-state-metadata July 16, 2024 03:11

ericm-db changed the base branch from op-state-metadata to metadata-v2-purging July 16, 2024 04:20

ericm-db added 14 commits July 15, 2024 21:21

feedback

0fd4413

removing println, test passes

bbb7091

keySchema

efd1a09

rebase

47ea28b

keySchema

fbe62d7

rebase

43bf78f

writing schema

c3fc316

purging

2dcd90e

init

4ba3d58

added test case and error msg

bc24c1f

it works

e0d8ffb

it compiles lol

948b2f7

removing unnecessary files

405eb55

validation

9688905

ericm-db force-pushed the metadata-v2-validation branch from 96dbc6d to 9688905 Compare July 16, 2024 04:22

removing old code

654133c

ericm-db closed this Jul 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[WIP] OperatorStateMetadata validation #14

[WIP] OperatorStateMetadata validation #14

Uh oh!

ericm-db commented Jul 16, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

[WIP] OperatorStateMetadata validation #14

[WIP] OperatorStateMetadata validation #14

Uh oh!

Conversation

ericm-db commented Jul 16, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant