Skip to content

Conversation

@ulysses-you
Copy link
Contributor

@ulysses-you ulysses-you commented Mar 21, 2022

What changes were proposed in this pull request?

  • Ensure output ordering using requiredOrdering
  • Override outputOrdering in AdaptiveSparkPlanExec

Why are the changes needed?

AdaptiveSparkPlanExec should ensure the output ordering is the requiredOrdering, so we leverage the EnsureRequirements to add sort if need.

FileFormatWriter will check and add an implicit sort for dynamic partition columns or bucket columns according to the input physical plan. The check became always failure since AQE AdaptiveSparkPlanExec has no outputOrdering.

That casues a redundant sort if user has specified a sort which satisfies the required ordering (dynamic partition and bucket columns).

Does this PR introduce any user-facing change?

no, improve performance

How was this patch tested?

add test

CREATE TABLE t1 (c int) USING PARQUET PARTITIONED BY(p string);
CREATE TABLE t2 USING PARQUET AS SELECT 1 c, 'a' p;
INSERT INTO TABLE t1 PARTITION(p) select c, p from t2 order by p;

Before:
image

After:
image

@github-actions github-actions bot added the SQL label Mar 21, 2022
@ulysses-you
Copy link
Contributor Author

cc @maryannxue @cloud-fan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to override outputPartitioning as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can, but there is no requirements about outputPartitioning

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we can remove this now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OptimizeSkewedJoin still use this. I considered unify them, but seems OptimizeSkewedJoin does not affect the required output ordering.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should refine the test name if we don't really test table insertion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, refined it

@ulysses-you ulysses-you changed the title [SPARK-38578][SQL] Avoid unnecessary sort in FileFormatWriter if user has specified sort in AQE [SPARK-38578][SQL] AdaptiveSparkPlanExec should ensure user-specified ordering Mar 23, 2022
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// User-specified repartition is only effective when it's the root node, or under
// User-specified sort is only effective when it's the root node, or under

@ulysses-you
Copy link
Contributor Author

The failed test is irrelevant

ReportSinkMetricsSuite.test ReportSinkMetrics
org.scalatest.exceptions.TestFailedException: Expected null, but got {"metrics-1"="value-1", "metrics-2"="value-2"}

@cloud-fan
Copy link
Contributor

Can we rerun the tests?

@ulysses-you
Copy link
Contributor Author

rebased since that flaky test has been fixed

@cloud-fan
Copy link
Contributor

After a second thought, what will happen if we just do a one-line fix override def outputOrdering: Seq[SortOrder] = requiredOrdering? AQE optimization may remove user-specified sort, but it doesn't matter as the file writing will add the necessary sort.

@ulysses-you
Copy link
Contributor Author

file writing will add the necessary sort.

it does not. FileFormatWriter just checks the initialization SparkPlan if the ordering satisfies the required. If AQE changes the ordering at runtime, FileFormatWriter can not aware the change. So if we override the outputOrdering, we must make sure AQE won't change it.

@ulysses-you
Copy link
Contributor Author

closed, in favor of #34568

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants